Looking for Signal, Listening for Noise: Using a Convolutional Neural Network to Identify Musical Instruments in 4th c. BCE Apulian Vases

Welcome to the repo for my project on identifying stringed instruments in 4th century BCE Apulian vase-painting.

Why ancient pottery + computer vision?

At the heart of this project is the question of whether a computer can be trained to 'read' the rich and varied imagery ('iconography') on ancient, figure-decorated pottery. As photographic and other visual archives digitize their vast collections, is there a place for computer vision algorithms to assist the stewards and curators of these materials in generating more informative and inclusive metadata that accelerates research and increases discoverability?

This project sits at the crossroads of my past work as a Mediterranean Archaeologist and Digital Humanities specialist and my current professional interests in deep learning, explainable ML, and ethical AI. By blending my experience in executing hands-on, archival research with my developing understanding of Convolutional Neural Networks (CNNs) for image recognition, I aim to explore the potential of computer vision in making archival collections not only more available, but also more accessible, online.

The data

For this project, I revisited the data I collected while conducting research at the A.D. Trendall Research Centre for Ancient Mediterranean Studies for my dissertation on the representation of music and musicians in 4th c. BCE Apulian Red-Figure Vase-Painting. My dissertation dataset included thousands of images of South Italian vases (like this one) collected by Arthur Dale Trendall, the foremost 20th-century specialist in South Italian pottery, throughout his career. To narrow the dataset down to a scale I could experiment with over the course of a few weeks, I focused on just over 400 images containing representations of a subset of stringed instruments (specifically, those in the lyre family). N.B. As part of my dissertation research, I had already manually labeled the type and number of instruments on each vase, as well as the photographs associated with those vases.

The approach

The project can be broken down into three parts: image pre-processing, model training, and setup for experimentation.

Image pre-processing

To work with the archival images I had digitized for my research, I took inspiration from Chris Birchall's 2017 There's Wally experiment, which built a CNN to find Wally. (Many thanks to Michael Holloway for sharing this project!) I used ImageJ to isolate the stringed instruments on each photograph and wrote a macro to export the x and y coordinates of the bounding box around each instrument to a CSV. I then wrote a set of functions to chop each archival photograph into instrument-sized squares, automatically identifying any that included part of an instrument; a sketch of this tiling step follows the list below. In addition, I extracted a 'clean', cropped image of each musical instrument, automatically writing the instrument and non-instrument crops to their own folders. This pre-processing generated:

  • 410 images of stringed instruments
  • 362,980 images with no stringed instrument
  • 6,299 images of partial stringed instruments (which have been excluded from the modelling process)
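
As a rough illustration of that tiling step (not the original notebook code), the sketch below assumes a CSV of ImageJ bounding boxes with hypothetical column names and sorts each tile by whether it overlaps a labeled instrument; the real pipeline also separated partial-instrument tiles into their own category, which this sketch folds into a single overlap check.

```python
# Sketch of the tiling step; CSV column names and folder layout are hypothetical.
import csv
from pathlib import Path
from PIL import Image

def load_boxes(csv_path):
    """Map each image filename to its list of instrument bounding boxes."""
    boxes = {}
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            box = (int(row["x"]), int(row["y"]),
                   int(row["x"]) + int(row["width"]),
                   int(row["y"]) + int(row["height"]))
            boxes.setdefault(row["filename"], []).append(box)
    return boxes

def overlaps(tile, box):
    """True if the tile rectangle and the bounding box intersect."""
    return not (tile[2] <= box[0] or tile[0] >= box[2] or
                tile[3] <= box[1] or tile[1] >= box[3])

def tile_image(img_path, boxes, tile_size, out_dir):
    """Chop one photograph into squares and sort them by instrument overlap."""
    img = Image.open(img_path)
    for left in range(0, img.width - tile_size + 1, tile_size):
        for top in range(0, img.height - tile_size + 1, tile_size):
            tile = (left, top, left + tile_size, top + tile_size)
            label = "instrument" if any(overlaps(tile, b) for b in boxes) else "no_instrument"
            dest = Path(out_dir) / label
            dest.mkdir(parents=True, exist_ok=True)
            img.crop(tile).save(dest / f"{Path(img_path).stem}_{left}_{top}.jpg")
```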

Due to the imbalanced nature of the resulting dataset (c. 0.1% of all images are musical instruments), I opted to undersample the no-instrument images, resulting in a dataset of 410 stringed instruments and 5,000 no-instrument images. Since the overall dataset is relatively small for deep learning, I split the images up in two ways: a 75/25 train/test split and a 60/20/20 train/validation/test split.
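A minimal sketch of the undersampling and the 60/20/20 split, assuming the crops were written to instrument/ and no_instrument/ folders (paths and seeds are hypothetical); scikit-learn's train_test_split handles the stratified version.

```python
# Sketch of undersampling the majority class and making a stratified 60/20/20 split.
import random
from pathlib import Path
from sklearn.model_selection import train_test_split

instrument = list(Path("data/instrument").glob("*.jpg"))        # ~410 crops
no_instrument = list(Path("data/no_instrument").glob("*.jpg"))  # ~363k crops

random.seed(42)
no_instrument = random.sample(no_instrument, 5000)  # undersample the no-instrument tiles

paths = instrument + no_instrument
labels = [1] * len(instrument) + [0] * len(no_instrument)

# 60% train, 20% validation, 20% test, stratified to preserve the class balance
train_p, rest_p, train_y, rest_y = train_test_split(
    paths, labels, test_size=0.4, stratify=labels, random_state=42)
val_p, test_p, val_y, test_y = train_test_split(
    rest_p, rest_y, test_size=0.5, stratify=rest_y, random_state=42)
```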

Model training

While I had initially intended to build a CNN from scratch, the relatively small volume of stringed-instrument images available in my dataset - as well as the very helpful NSS DS4 capstone roundtable discussions with Tim Blass, Jason King, Ashutosh Singhal, and David Tinsley - put me on the path toward a transfer learning approach for v1 of this project. I opted to leverage the InceptionV3 architecture, with weights pretrained on ImageNet, as the base for four of the five models I trained (notebooks 03a, 03b, 03c, and 03e). In addition, I tested the newer EfficientNetB7 architecture in one of my models.
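The general shape of that transfer-learning setup in Keras looks roughly like the sketch below: an InceptionV3 base with ImageNet weights, frozen, topped with the three custom layers described in the results section (global average pooling, a 1,024-unit ReLU dense layer, and a two-class softmax). This is a minimal reconstruction from this README, not the notebook code; the input size and optimizer are assumptions.

```python
# Minimal Keras sketch of the transfer-learning setup described in this README.
from tensorflow.keras.applications import InceptionV3
from tensorflow.keras.layers import Dense, GlobalAveragePooling2D
from tensorflow.keras.models import Model

base = InceptionV3(weights="imagenet", include_top=False, input_shape=(299, 299, 3))
base.trainable = False  # v1: train only the custom head on top of the frozen base

x = GlobalAveragePooling2D()(base.output)
x = Dense(1024, activation="relu")(x)
outputs = Dense(2, activation="softmax")(x)  # instrument vs. no-instrument probabilities

model = Model(inputs=base.input, outputs=outputs)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```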

Beyond testing these two base architectures, the models I trained differ in three main ways: the train/test split (either 75/25 or 60/20/20, with a validation set), the types of image augmentation/transformation applied to the training images (rescaling only, minimal augmentation, or maximal augmentation), and, for the InceptionV3 models, how much of the network was trained (only the three additional custom layers, or the top two InceptionV3 blocks plus the three additional layers). Examples of the image transformations applied may be seen in my visualizations notebook.
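Continuing the sketch above, those two knobs could be expressed roughly as follows; the specific augmentation parameters and the layer cutoff for "the top two blocks" are illustrative, not the values used in the notebooks.

```python
# Continues the previous sketch: training-set augmentation variants and partial unfreezing.
# Parameter values are illustrative.
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Rescaling only (the minimal case); the heavier variants add more transforms.
rescale_only = ImageDataGenerator(rescale=1.0 / 255)
augmented = ImageDataGenerator(
    rescale=1.0 / 255,
    rotation_range=20,
    width_shift_range=0.1,
    height_shift_range=0.1,
    zoom_range=0.2,
    horizontal_flip=True,
)

# To also fine-tune the top two InceptionV3 blocks, unfreeze the tail of the base
# model and recompile with a low learning rate. The 249-layer cutoff follows the
# Keras InceptionV3 fine-tuning example for the top two inception blocks.
base.trainable = True
for layer in base.layers[:249]:
    layer.trainable = False
model.compile(optimizer=Adam(learning_rate=1e-5),
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```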

Setup for experimentation

As mentioned above, all of the original images in my dataset were labeled and contained at least one stringed instrument (though the processed crops included both stringed-instrument and no-instrument sections). In the wild, however, we would not know where (or whether) an instrument appears, so we could not chop each image down to instrument size before running it through the model. After a brief analysis of the quartile and decile breakdowns of the proportion of an image's x and y dimensions that an instrument typically occupies, I opted to pre-process entirely unknown images by chopping them into squares sized at the 25th, 50th, and 75th percentiles of those instrument proportions. Demo notebook 02 then walks through the process of applying two of the pretrained models to the unknown image pieces to return a prediction.
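A rough sketch of that inference-time tiling is below, assuming a trained two-class model and tile proportions derived from the percentile analysis; the proportion values, stride, threshold, and function name are illustrative, not the demo notebook's code.

```python
# Sketch of scanning an unlabeled photograph at three window sizes and flagging
# tiles the model scores as likely instruments. Values here are illustrative.
import numpy as np
from PIL import Image

def scan_image(model, img_path, proportions=(0.15, 0.25, 0.40), stride_frac=0.5,
               input_size=299, instrument_threshold=0.5):
    """Return (box, probability) pairs for tiles predicted to contain an instrument."""
    img = Image.open(img_path).convert("RGB")
    hits = []
    for prop in proportions:  # e.g. 25th/50th/75th percentile of typical instrument size
        tile = int(min(img.width, img.height) * prop)
        stride = max(1, int(tile * stride_frac))
        for left in range(0, img.width - tile + 1, stride):
            for top in range(0, img.height - tile + 1, stride):
                box = (left, top, left + tile, top + tile)
                crop = img.crop(box).resize((input_size, input_size))
                x = np.asarray(crop, dtype="float32")[None] / 255.0
                p_instrument = model.predict(x, verbose=0)[0][1]  # class 1 = instrument
                if p_instrument >= instrument_threshold:
                    hits.append((box, float(p_instrument)))
    return hits
```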

The results

First, a note on what I believe is important in assessing the efficacy of any model that seeks to teach a computer how to read vase-painting iconography. A good model that could support archivists and curators in identifying images with musical instruments needs to correctly identify as many instruments as possible to gain the confidence of the end users. It should also reduce the manual labor of finding the musical instruments in the first place by flagging only actual representations of musical instruments. Finally, the model should be quite confident about the difference between a musical-instrument and a no-instrument image. As a result, I would suggest three axes of measurement as appropriate ways to evaluate a model: precision, recall, and AUC.

By the metrics outlined above, the most performant models I achieved are in notebook 03e. Both are based on the InceptionV3 architecture with pretrained ImageNet weights. The top two blocks of InceptionV3, as well as the three additional custom layers (a GlobalAveragePooling2D layer, a 1,024-unit ReLU dense layer, and a final softmax layer outputting two class probabilities), were trained for ten epochs on 60% of the 5,410 images in my dataset, validated on 20%, and tested on the remaining 20%. The threshold for no-instrument categorization was set high, at 95%.
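In practice that evaluation might look like the sketch below: apply the 95% no-instrument threshold to the softmax outputs, then score precision, recall, and AUC with scikit-learn. This is a minimal illustration, not the notebook code; `model`, `x_test`, and `y_test` (integer labels, 1 = instrument) are assumed to exist.

```python
# Sketch of the evaluation: a tile counts as "no instrument" only when the model
# is at least 95% confident; otherwise it is flagged as an instrument candidate.
import numpy as np
from sklearn.metrics import precision_score, recall_score, roc_auc_score

probs = model.predict(x_test)                # shape (n, 2): [p_no_instrument, p_instrument]
y_pred = (probs[:, 0] < 0.95).astype(int)    # 1 = instrument unless no-instrument prob >= 0.95

precision = precision_score(y_test, y_pred)  # of flagged tiles, how many are real instruments
recall = recall_score(y_test, y_pred)        # of real instruments, how many were flagged
auc = roc_auc_score(y_test, probs[:, 1])     # separation between the two classes
```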

Both models score over 0.99 in AUC, indicating that there is very little overlap between what they identify as an instrument vs. not an instrument. Their precision (the percentage of actual instruments among those identified as instruments) and recall (the percentage of actual instruments that were identified by the model), however, vary; the side-by-side comparison is in notebook 03e.

The difference between these two models is the amount and type of image augmentation applied to the training dataset (see notebook 03e for the details). Both perform well on the 20% holdout test data, but in testing on entirely novel images, the higher recall of model 5b proves useful in identifying musical instruments.

What's next?

This project is a prototype of a broader vision: training an algorithm to generate valuable metadata for photographic archives. I would love to expand this project to include other iconographical features in South Italian vases (e.g. other musical instruments, more specific sub-types of musical instruments, figure types, etc.), and I look forward to testing different image recognition architectures to improve its performance. Most of all, however, I look forward to experimenting with model explainability approaches such as ProtoPNet to understand what is actually happening under the hood of these image recognition models, so that I can make better-informed decisions about how to implement this sort of technology responsibly.

Postscript: contents of this repo

This repo contains three folders: macros, notebooks, and resources. The macros folder contains a short macro to facilitate image labeling using ImageJ. The resources folder contains a .yaml file that explicitly states the versions of the primary packages used in this project. The notebooks folder contains notebooks for image pre-processing, data sampling, and model building, as well as a couple of notebooks specifically for live demonstration. Out of respect for the rights of the A.D. Trendall Research Centre for Ancient Mediterranean Studies to the images I worked with throughout this project, the data folder has not been shared (though the data folder structure may be inferred from the notebooks).
