As part of the Digital Collections & Access team, Collections Data Manager Gareth Watkins supports kaimahi across the museum to input and maintain accurate information in EMu, Te Papa’s collection management system, ensuring it supports the documentation and management of more than a million taonga, collection objects, and natural history specimens. Here, Gareth details how regular, automated data checks are helping teams strengthen the quality, consistency, and discoverability of our collections data – and how Python and AI tools are making that work faster and smarter.
In April 2024, I introduced a set of quarterly collections data checks that can be run automatically in batches with minimal human input. The checks generate spreadsheets listing records that require review, such as those with duplicate registration numbers, missing titles or locations, or dates that unexpectedly fall in the future.
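To give a flavour of what one of these checks looks like, here is a minimal sketch in Python. It assumes records have been exported from EMu to rows of data; the field names (`irn`, `title`, `date_made`) are illustrative only, not Te Papa’s actual schema.

```python
# A minimal sketch of one automated data check. Field names ("irn",
# "title", "date_made") are illustrative, not EMu's actual schema.
import csv
from datetime import date

def find_problem_records(rows, today=None):
    """Return rows missing a title or carrying a date in the future."""
    today = today or date.today()
    problems = []
    for row in rows:
        issues = []
        if not row.get("title", "").strip():
            issues.append("missing title")
        made = row.get("date_made", "")
        if made and date.fromisoformat(made) > today:
            issues.append("date in the future")
        if issues:
            problems.append({**row, "issues": "; ".join(issues)})
    return problems

def write_review_sheet(problems, path):
    """Write flagged records to a spreadsheet (CSV) for staff review."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(problems[0]))
        writer.writeheader()
        writer.writerows(problems)
```

Each quarterly run simply points a batch of functions like these at a fresh export and writes out one review spreadsheet per check.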
The quality of our collections data is crucial because it underpins every aspect of museum work, both now and into the future. Collection care, exhibitions, valuations, and online public access all depend on accurate, authoritative data. And with the growing use of AI tools that draw on large datasets, having robust source data has become more important than ever to ensure reliable outputs.

Smarter tools for cleaner data
Each collections data test is bespoke and can range from querying a single field to interrogating multiple fields with if/then logic and querying internet resources. For example, a test might check whether a person record has a related Wikidata identifier, or whether the current location field is empty.
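As a sketch of what that multi-field logic can look like, the example below flags person records that have a biography but no Wikidata identifier, and builds the Wikidata API URL that a check could call to verify a recorded identifier. The field names are assumptions for illustration; `wbgetentities` is Wikidata’s real public API action.

```python
# Sketch of a multi-field if/then check plus a Wikidata lookup URL.
# The EMu field names ("biography", "wikidata_id") are assumptions.
import urllib.parse

WIKIDATA_API = "https://www.wikidata.org/w/api.php"

def needs_wikidata_review(person):
    """If/then logic across fields: a person with a biography but no
    Wikidata identifier is a candidate for linking."""
    has_bio = bool(person.get("biography", "").strip())
    has_qid = bool(person.get("wikidata_id", "").strip())
    return has_bio and not has_qid

def wikidata_lookup_url(qid):
    """Build the API URL a check could request to verify an identifier."""
    params = {"action": "wbgetentities", "ids": qid,
              "format": "json", "props": "labels"}
    return WIKIDATA_API + "?" + urllib.parse.urlencode(params)
```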
The aim is to help teams across the museum identify and prioritise work on collections data by surfacing and quantifying issues. But the tests are not only about finding problems — they also highlight opportunities. For example, one test checks whether publicly available images on Collections Online can be published to a higher resolution to enable greater access. An example of this is the highly zoomable Muir & Moodie image of the Waimangu Geyser erupting in 1903. Another data check highlights iwi connections to taonga across the collections.

Scaling up the data checks
I began with 21 tests a year ago, and this has since grown to more than 60. All are written in Python, one of the world’s most widely used programming languages.
Python is a free, open-source language that is developed and maintained by a global community. It’s a very friendly language, reading almost like plain English. If you’ve never programmed before, Python is both straightforward and powerful.

My Python journey started a few years ago when I attended an introductory session through the annual ResBaz (Research Bazaar) event – an amazing free series of online webinars and workshops that bring together researchers from around the country to develop skills in digital tools and research practices.
From there, I signed up to learn more about Python from The Carpentries, a non-profit organisation that teaches foundational coding and data science skills to researchers worldwide. There are numerous courses freely available, including an introduction to Python for library and information workers.
More recently, I’ve been using artificial intelligence as my coding buddy. Many generative AI platforms – like Copilot, Cursor, Claude, Gemini, and ChatGPT – offer help with coding and troubleshooting. Without sharing private or sensitive information (like passwords or file paths), you can ask AI to show you example code or to test your own code for vulnerabilities. AI is also great at explaining what each line of code does and what error messages mean.

What the checks are revealing
So, what have the collections data checks uncovered?
So far, the tests have picked up collection taonga and objects missing core data elements such as titles, object classifications, and location information – all of which can be easily updated. In Natural History, they have also identified records where the collection event notes a region (for example, Wellington) but the geo-coordinates map elsewhere. We’ve also been able to identify cases where a specimen and its associated tissue samples have been given different taxonomic identifications. This sometimes happens when the main specimen is reidentified but the tissue sample records aren’t updated.
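Both of those Natural History checks can be sketched in a few lines of Python. The bounding box below is a rough illustration of the Wellington region, not an authoritative boundary, and the field names are assumptions.

```python
# Sketch of two checks described above. The bounding box is a rough
# illustration only; field names ("taxon") are assumptions.
REGION_BOUNDS = {
    # region: (min_lat, max_lat, min_lon, max_lon) – approximate only
    "Wellington": (-41.6, -40.6, 174.6, 176.3),
}

def coordinates_match_region(region, lat, lon):
    """Return True if (lat, lon) falls inside the stated region's box,
    or if we have no box to test against."""
    bounds = REGION_BOUNDS.get(region)
    if bounds is None:
        return True
    min_lat, max_lat, min_lon, max_lon = bounds
    return min_lat <= lat <= max_lat and min_lon <= lon <= max_lon

def taxonomy_mismatches(specimen, tissue_samples):
    """Flag tissue samples identified differently from their specimen."""
    return [t for t in tissue_samples
            if t.get("taxon") and t["taxon"] != specimen.get("taxon")]
```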
The data checks don’t just highlight issues; they also create an audit trail – snapshots of our collections data that let us track improvements over time. They’ve also been useful in showing where we should focus our EMu training. If you’d like to know more about the full suite of quarterly collections data checks, feel free to get in touch: gareth.watkins@tepapa.govt.nz
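Tracking improvements between quarterly snapshots can be as simple as comparing issue counts from one run to the next. A minimal sketch, assuming each snapshot is a mapping of check name to issue count (names are hypothetical):

```python
# Compare issue counts between two quarterly runs. Check names are
# hypothetical; negative deltas mean the data has improved.
def snapshot_delta(previous, current):
    """Return the change in issue count for every check seen in
    either snapshot."""
    return {check: current.get(check, 0) - previous.get(check, 0)
            for check in set(previous) | set(current)}
```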

I think that is a very good point to make that “having robust source data has become more important than ever to ensure reliable outputs.” But AI alone cannot do that. It takes proper curation of specimens and their labels. I read about a troubling case of original labels for snail specimens in Te Papa being removed and replaced. That does not seem like good practice, and results in less than ‘robust source data’ for anyone to work on. I hope that practice of removing original labels has been discontinued, for the sake of the collections and their significance for biodiversity research.
Kia ora John, thanks for your comment. I’m not aware of the case you refer to, but I will pass your comment on to the Natural History teams. In general though, our collections management system (EMu) does allow previous numbers, inscriptions, notes, and so on to be recorded in the database.