https://expfactory-data.github.io/cookies
This is an example dataset distributed using the JSON API specification. The entire site is rendered statically using Jekyll on GitHub Pages.
You can parse the entire dataset programmatically by starting at the base datasets JSON.
The dataset can be viewed in its entirety on GitHub.
Each dataset is a separate folder within _datasets, which means that you can add a new dataset by simply adding a folder with the appropriate subfolders and metadata.
A dataset folder includes top-level files for metadata, images, and texts, and subfolders with the actual images and texts. The organization might look like the following:
_datasets
  cookie-1
    metadata.txt
    images.txt
    texts.txt
    images
      image1.txt
      image1/image1.jpg
    texts
      text1.txt
      text1/text1.csv
In the above example, we have an entity named "cookie-1" with a metadata.txt file that will be rendered as JSON at the URL /datasets/cookie-1/metadata. This metadata file has an includes section that indicates whether we have images, texts, both, or neither, and links to /datasets/cookie-1/images and/or /datasets/cookie-1/texts. Details about the metadata, images, and texts files are below. Note that the folder name cookie-1 coincides with the dataset-id.
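The mapping from folder name to URLs described above can be sketched in a few lines of Python. This is only an illustration: the function name `dataset_urls` is hypothetical, and the URL pattern is taken from the cookie-1 example rather than from the site's templates.

```python
def dataset_urls(dataset_id, includes=()):
    """Return the site-relative URLs for a dataset folder named `dataset_id`.

    Assumes the /datasets/<dataset-id>/<section> pattern shown in the
    cookie-1 example; only sections named in `includes` get a link.
    """
    base = f"/datasets/{dataset_id}"
    urls = {"metadata": f"{base}/metadata"}
    for section in ("images", "texts"):
        if section in includes:
            urls[section] = f"{base}/{section}"
    return urls


print(dataset_urls("cookie-1", includes=["images", "texts"]))
# {'metadata': '/datasets/cookie-1/metadata', 'images': '/datasets/cookie-1/images', 'texts': '/datasets/cookie-1/texts'}
```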
metadata.txt should be a text file located at the top level of the dataset folder. Note that the dataset-id coincides with the folder name for the dataset. The metadata.txt file includes the fields specified in meta.yml, organized according to whether they are required. We can look at an example:
---
title: Cookie Tumor 1
type: entity
dataset-id: cookie-1
hidden: false
description: This is a cookie tumor. I am describing the cookie tumor!
license: This is a license for this dataset.
attributes:
- cookie_type: sugar
- cookie_age: 2
- cookie_candy: m&ms
includes:
- images
- texts
---
Any features of the dataset should be put in the list of attributes. The includes section indicates that the entity has subfolders "images" and "texts," and images.txt and texts.txt files to describe their contents. The metadata.txt file could be very minimal, and perhaps only have the following:
---
type: entity
dataset-id: cookie-1
includes:
- images
---
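If you want to check a metadata file locally before publishing, a small reader is enough for the simple structure shown here. The following is a minimal sketch, not the parser Jekyll uses: `parse_front_matter` is a hypothetical helper that understands only `key: value` pairs and `key:` followed by `- item` lines, and it keeps every value as a string.

```python
def parse_front_matter(text):
    """Parse the simple front matter shown above into a dict.

    Handles only `key: value` pairs and `key:` followed by `- item`
    lines. This is not a general YAML parser; values stay strings.
    """
    data = {}
    current = None  # key whose list items are currently being collected
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line == "---":
            continue  # skip blank lines and the front matter delimiters
        if line.startswith("- "):
            if current is None:
                raise ValueError(f"list item without a key: {line!r}")
            item = line[2:].strip()
            key, sep, value = item.partition(": ")
            # "- cookie_type: sugar" becomes {"cookie_type": "sugar"},
            # "- images" stays the plain string "images"
            data[current].append({key: value.strip()} if sep else item)
        else:
            key, _, value = line.partition(":")
            key, value = key.strip(), value.strip()
            if value:
                data[key] = value
                current = None
            else:
                data[key] = []  # a list follows on the next lines
                current = key
    return data


minimal = """---
type: entity
dataset-id: cookie-1
includes:
- images
---"""
print(parse_front_matter(minimal))
# {'type': 'entity', 'dataset-id': 'cookie-1', 'includes': ['images']}
```

For anything beyond this flat layout, a real YAML library would be the safer choice.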
The images.txt and texts.txt files in a dataset folder simply need a list of the files that you want published, with type "images" for images and "texts" for texts:
---
type: images
dataset-id: cookie-1
images:
- image1
---
As a reminder, in the example above, we have a folder that looks like this, and we are viewing the images.txt file:
images.txt
images
  image1.txt
  image1/image1.jpg
Within the images folder, we should have a text file (here image1.txt) for each image that we want to serve, and include in this text file metadata (features or attributes) specific to the image:
---
type: image
dataset-id: image1
files:
- image1.jpg
attributes:
- EXIF SubjectDistance: 0
- EXIF SceneType: 0
- Image Resolution: 768/17
- EXIF FlashEnergy: 1800
---
We have also added a list of files, each expected to be located within a subdirectory named by the image id. In the example above, image1.jpg, described in the file images/image1.txt, would be located at images/image1/image1.jpg.
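The path convention just described (section folder, then item id, then filename) can be checked with a short script. This is a sketch under the assumption that the layout is exactly the one in the example; `expected_files` and `missing_files` are hypothetical helper names, not part of the repository.

```python
from pathlib import Path, PurePosixPath


def expected_files(section, item_id, files):
    """Paths of an item's files, relative to the dataset folder."""
    return [str(PurePosixPath(section) / item_id / name) for name in files]


def missing_files(dataset_dir, section, item_id, files):
    """Expected paths that do not exist under `dataset_dir`."""
    root = Path(dataset_dir)
    return [
        rel
        for rel in expected_files(section, item_id, files)
        if not (root / rel).is_file()
    ]


print(expected_files("images", "image1", ["image1.jpg"]))
# ['images/image1/image1.jpg']
```

Running `missing_files` against a dataset folder before committing gives a quick list of files that the item's text file names but that are not yet on disk.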
While these variables could be sniffed programmatically, it is important that you are able to include a data object in a repository but turn its "published" status on or off. If an image is not included in the list above, it will not be rendered in the JSON data structure for the API.
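To illustrate the published/unpublished distinction, the sketch below separates items that exist in the repository from those actually listed for publication. The helper name `split_published` and the second image id `image2` are hypothetical and used only for the example.

```python
def split_published(on_disk, listed):
    """Split item ids found on disk into published and unpublished.

    An id is published only if it also appears in the list from
    images.txt (or texts.txt); everything else stays in the repository
    but is left out of the rendered JSON.
    """
    listed_set = set(listed)
    published = [item for item in on_disk if item in listed_set]
    unpublished = [item for item in on_disk if item not in listed_set]
    return published, unpublished


# "image2" is a made-up id that exists on disk but is not listed.
published, unpublished = split_published(["image1", "image2"], ["image1"])
print(published)    # ['image1']
print(unpublished)  # ['image2']
```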
The richness of data comes from its metadata, meaning labels and attributes about the image. Thus, we want to represent this metadata on the same level as the image.