This repository contains the code and data for the research presented in "Quantifying iconicity in 1.1M online circulations of 26 iconic photographs" (Smits & Ros, 2020). Find the paper here.
Included in this repository are scripts for parsing data outputted by the Google Cloud Vision API and scripts used for the analysis of textual and visual material.
In the paper we use clustered document embeddings to map the different type of textual context around online reproductions of iconic images. The method is described in the paper. It is based on the Top2Vec method of embeddings clustering.
Besides the code and data related to our paper, this repo also contains code for:
We train separate topic models for every set of URLs associated with a specific iconic image. Although we are aware of the work in multilingual topic modelling, a large share of our corpus (around 70%) consists of English-language URLs, and we therefore choose to report in this article only the results for this subset. Preprocessing entails tokenization, lemmatization and the removal of stopwords. We use a combination of qualitative evaluation and coherence measures,based on normalized pointwise mutual information and cosine similarity, to determine a number of topics for every subset.
Iconic images are frequently supplemented with text for various reasons. To map this aspect of the online reproduction, we run Optical Character Recognition on all the images, using the powerful Tesseract OCR engine.
Online reproductions of iconic images often use specific parts of the original image. To detect crops and regions of interest, we use Scale-Invariant Feature Transform (SIFT) to identify the keypoints of the original image and inspect which part of the original image is used in a reproduction. We first preprocessed our corpus by converting the image to grayscale, downsampling the images and applying a Gaussian blur.
The context of online republications of iconic images is not only textual. We aggregate the visual context of online reproduction by identifying and clustering all other images published on a webpage. We also look at images that combine the iconic image with another image (in a single image file).