How do you embed 14,500 images into a spreadsheet?

How do you embed 14,500 images into a spreadsheet?

We look after a large collection here at Te Papa and with that comes a lot of data. With many of our teams wanting various information about our collections as a spreadsheet – preferably with thumbnail images – we’re always looking at how we can streamline requests. Collections Data Manager Gareth Watkins answers the question: how do you embed 14,500 images into a spreadsheet in less than 10 minutes?

A black and white scenic image of an island rising up from the sea with 8 lines of computer code overlaid at an angle
A modified scenic image overlaid with Python code. [Kapiti coast], 1920s to 1930s, by Roland Searle. Purchased 1999 with New Zealand Lottery Grants Board funds. Te Papa (A.019230)
One of the most enjoyable parts about my role as Collections Data Manager is the variety of requests that come in from teams across Te Papa. From visualising the gender balance of artists in the Art collection to creating spreadsheets that list the value vs. exhibition history of a collection item – there are always new ways to analyse Te Papa’s collections.

Our collection data is managed in EMu – a collection management system originally designed in Australia and now used internationally by many top collecting institutions. At Te Papa, the relational database holds information on over 1.2 million Humanities objects and Natural History specimens collected over 150+ years.

I recently received a request to create a spreadsheet of collection items for an upcoming valuation round. Te Papa has a regular cycle of independent valuations for insurance, auditing, and collection care purposes. Often the valuers require the collection information in a spreadsheet format for easy reference. This includes detailed information about the collection items plus images so they can be easily identified.

A six-line spreadsheet with thumbnail images in one column and data about each image in the other columns. Each line is alternating shades of blue.
An example of a valuation/audit spreadsheet for part of the International History collection.

Embedding images into a spreadsheet is quite straightforward when there are a small number of images, but it gets trickier when a collection contains over 14,000 images.

Fortunately, there are tools freely available online to help accomplish this task – including the Python programming language.

I was introduced to Python via a number of online workshops. Firstly via VALA – an Australian based not-for-profit organisation that promotes the use of information technology in the GLAM (Galleries, Libraries, Archives and Museums) sector. And then through ResBaz (Research Bazaar) Aotearoa – the local version of a worldwide festival promoting digital literacy in research settings with free workshops.

At the same time, our technology team repurposed a server specifically to do data manipulation. They named it the DataCruncher500 and installed a number of free tools for data transformation, including Open Refine and Python.

This python won’t constrict you

The Python programming language was created in the late 1980s by Guido van Rossum.

Colour photo of a head and shoulders of a man with a beard and glasses looking off to his right.
Photograph of Guido Van Rossum by Daniel Stroud, via Wikimedia, CC BY-SA 4.0

It has developed into one of the most popular and widely used languages in the world. Its design encourages logical coding that can easily be read and understood by humans. It can be downloaded for free and runs in both Mac and PC environments. The Python code you write is saved as a text file. You can write it using a simple text editor (e.g. Notepad), or by using an Integrated Development Environment (IDE) like Spyder.

Python has a very active online community which have created many additional code libraries that can be easily incorporated into your own programs. For the task of embedding images, I was able to utilise the xlsxwriter spreadsheet module and the Pillow module for image manipulation. Both contain many in-built functions that simplify your own coding.

A requirement for the spreadsheet was that the images should be thumbnail size, constrained to a maximum dimension and be able to respond to other spreadsheet columns being filtered. Another consideration was that the overall file size of the spreadsheet needed to be kept manageable.

Firstly the original images were exported from our collection management system as access/viewing copy JPGs. Then to achieve the desired image transformations, the Python script read the original image into memory and turned it into a smaller image using the thumbnail method. The method requires a maximum width and height and then reduces the image to fit while retaining the original aspect ratio.

im = Image.open(full_filename)
im.thumbnail(max_image_size)

I found that sometimes the original image was encoded in the RGBA colour space which actually caused the script to fail at a subsequent step. So I used the convert method to make sure that it was in the RGB colour space:

if im.mode in ("RGBA", "P"):
   im = im.convert("RGB")

The thumbnail RGB image was then saved into memory with a lower quality encoding to lessen the file size (the smaller the quality number the more blocky and low-resolution the image will become).

this_modified_image = BytesIO()
im.save(this_modified_image, format="JPEG", quality=50)

The final part of the Python script embedded the image into the spreadsheet using the insert image method. A property (object position) is set in the spreadsheet to allow for the image to be moved and resized. Doing this allows the end-user to use filters on the spreadsheet with the images showing/hiding correctly.

worksheet.insert_image(this_image_cell, this_filename, {'image_data':this_modified_image, 'object_position': 1, 'x_offset': 2, 'y_offset': 2})

Prior to utilising Python, the task of embedding large amounts of images into a spreadsheet could take days and was fraught with software glitches. Now, when I run the Python script it takes nine minutes to embed 14,500 thumbnail (90×90) images. The total file size is around 16MB (the original images total 247MB).

This is just a very small example of what can be achieved with Python. But it’s also a very good example of how Python developers have been freely contributing their knowledge and code libraries for decades on the internet so that we all may benefit.

The full code is shown below and is available (with clearer indentation) on GitHub. Thanks to Tom Moynihan for assistance with formatting the code.

If you have any questions, please get in touch or comment below.

Related Links

Thumbnail Python script

#!/usr/bin/env python3

"""

Takes a directory of .jpg images and embeds 90x90 thumbnails into a .xlsx spreadsheet.

Author: Gareth Watkins, Collections Data Manager @ Te Papa Tongarewa, Museum of New Zealand

This example code is provided without license from the author. Elements of the code utilise external modules which have their own licenses.

"""

# Import modules
import os
import xlsxwriter
from io import BytesIO
from PIL import Image

# Define our folder that contains the source images and the name of the output file (note each time the script is run
# it will overwrite the output file).
# The folder where the images are located:
working_folder = "C:\\python_scripts\\image_encode\\"

# The name of the output file. It will be created in the same folder as the source images:
out_file = working_folder + 'result.xlsx'

# Define the size of the image cell in the spreadsheet and the slightly smaller max size of the thumbnail
cell_size = 95
max_image_size = (cell_size-5, cell_size-5)

# Generate a list of jpg images from the working folder (and any child folders)
file_list = []
for root, dirs, files in os.walk(working_folder):
  for file in files:
    if file.endswith(".jpg"):
      file_list.append(os.path.join(root, file))

# Create a workbook. Col A will contain filenames, Col B will contain images
workbook = xlsxwriter.Workbook(out_file)
worksheet = workbook.add_worksheet()

# Size the image column (B) in the worksheet
worksheet.set_column_pixels('B:B', cell_size)

# Add column headers to the worksheet
worksheet.write('A1', 'Filename')
worksheet.write('B1', 'Image')

# Define which row we start at (row 1 being the column headers)
row = 2

# Main loop to iterate through the file_list and add each image to the worksheet
for full_filename in file_list:

  # Get the name of the current image
  this_filename = os.path.basename(full_filename)

  # As a courtesy, output the filename of the current image to the console window so that we can track progress
  print("Processing " + this_filename)

  # Define the worksheet cells we are about populate
  this_filename_cell = 'A' + str(row)
  this_image_cell = 'B' + str(row)

  # Set the size of the current row in the worksheet.
  # Note the row has an offset of -1, e.g. 0 actually equals the first row in the spreadsheet
  worksheet.set_row_pixels(row - 1, cell_size)

  # Write the current image filename into column A of the worksheet
  worksheet.write(this_filename_cell, this_filename)

  # Load the current image into memory
  im = Image.open(full_filename)

  # Reduce the image using the thumbnail method
  im.thumbnail(max_image_size)

  # Sometimes the files may be encoded in the RGBA colour space so we re-encode as RGB
  if im.mode in ("RGBA", "P"):
    im = im.convert("RGB")

  # Save the modified image into memory
  this_modified_image = BytesIO()
  im.save(this_modified_image, format="JPEG", quality=50)

  # Write the modified image into Col B of the worksheet.
  # We offset the image slightly (2px) so that it falls within the cell
  # Note: setting object_position to 1 sets the image property in the worksheet so that it 
  # will move and resize (allowing for user filtering)
  worksheet.insert_image(this_image_cell, this_filename, {'image_data': this_modified_image, 'object_position': 1,
  'x_offset': 2, 'y_offset': 2})

  # Advance the row pointer by 1
  row = row + 1

# Close the workbook once the loop has finished
workbook.close()

Leave a Reply

Your email address will not be published.