Training a multi-label FastXML classifier on the OpenImages dataset

Used architecture

Instructions

If you have further questions about the instructions or in case of errors, please open an issue!

Download images, prepare repository

Download the VGG models, annotations and initialize the submodules:

./bootstrap.sh

See bootstrap.sh for the available options.

Download images

Download the images from this torrent to a $IMAGES_FOLDER of your choice.

Extract image ids

Create the images.txt file with the image ids by running:

# $IMAGES_FOLDER is the folder where you downloaded the torrent
# $DATA_PATH is the folder the features are extracted to in the next step
find "$IMAGES_FOLDER" -name '*.jpg' > "$DATA_PATH/images.txt"
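If find is not available, the same listing can be produced from Python. This is only a sketch: the function names are made up for illustration, and unlike find it sorts the paths so the output is reproducible.

```python
from pathlib import Path


def list_jpgs(images_folder):
    """Return the sorted paths of all .jpg files below images_folder."""
    return sorted(str(p) for p in Path(images_folder).rglob("*.jpg"))


def write_image_list(images_folder, out_file):
    """Write one image path per line and return the number of images."""
    paths = list_jpgs(images_folder)
    Path(out_file).write_text("".join(p + "\n" for p in paths))
    return len(paths)
```

Calling write_image_list(images_folder, out_file) with the two folders from the command above writes the same kind of file.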

VGG16 feature extraction

Run the VGG16 feature extraction with:

./tmp/cluster_scripts/run-vgg-extraction.sh fc6,fc7

You have to adapt the script first!

The first parameter of the run-vgg-extraction.sh script specifies the layers to extract, separated by commas (e.g. fc6 and fc7 in the command above).

After the script has run, the features reside in the data folder (see the script). The file is named features.LAYERS_TO_BE_EXTRACTED.txt. It is a CSV file in which the first column is the file id and the remaining columns are the features.
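The feature file can be read back with a few lines of Python. The sketch below assumes a comma as delimiter and no header row, neither of which is confirmed here; the sample values are made up.

```python
import csv
import io


def read_features(lines):
    """Yield (file_id, [float, ...]) for every row of the feature CSV."""
    for row in csv.reader(lines):
        if not row:
            continue  # skip blank lines
        yield row[0], [float(v) for v in row[1:]]


# Tiny made-up sample with three features per image.
sample = io.StringIO("img_a,0.0,1.5,0.25\nimg_b,2.0,0.0,0.0\n")
features = dict(read_features(sample))
```

For a real file, pass an open file handle instead of the StringIO sample.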

Format change

The VGG16 feature extraction produces a CSV with the image ids and the features.

To train the FastXML classifier with these features, they have to be converted to the sparse matrix format that FastXML expects.

First, extract the annotations:

./dataset/extract-csv-columns.sh dataset/download/human_ann_2016_08/validation/labels.csv 1,3
grep -v ",0\.0" dataset/download/human_ann_2016_08/validation/labels_1_3.csv > dataset/download/human_ann_2016_08/validation/labels_1_3_correct.csv

The imageid/labels pairs will now reside in dataset/download/human_ann_2016_08/validation/labels_1_3_correct.csv.
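The intent of this step, keeping only the imageid/label pairs that were not annotated with confidence 0.0, can be sketched in Python. This is not the repository's script: it assumes the four-column header (ImageID, Source, LabelName, Confidence) listed in the Dataset section, and the sample rows are made up.

```python
import csv
import io


def positive_pairs(lines):
    """Yield (image_id, label) for rows whose confidence is not 0.0."""
    for row in csv.DictReader(lines):
        if float(row["Confidence"]) != 0.0:
            yield row["ImageID"], row["LabelName"]


# Made-up sample rows, for illustration only.
sample = io.StringIO(
    "ImageID,Source,LabelName,Confidence\n"
    "img_a,human,/m/036qh8,1.0\n"
    "img_a,human,/m/0abc,0.0\n"
    "img_b,human,/m/0abc,1.0\n"
)
pairs = list(positive_pairs(sample))
```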

Now add the classes to the features:

./tmp/format_convert_scripts/add_classes_to_features.py \
    --features-file $IN_FEATURES_FILE \
    --features-labels-file $OUT_FEATURES_WITH_CLASSES \
    --labels-npy dataset/download/human_ann_2016_08/validation/labels_1_3_correct.csv.npy \
    --labels dataset/download/human_ann_2016_08/validation/labels_1_3_correct.csv

$IN_FEATURES_FILE: the features from the feature extraction step (e.g. features.LAYERS_TO_BE_EXTRACTED.txt).

$OUT_FEATURES_WITH_CLASSES: the file the imageid/features/classes CSV is written to.

See the script ./tmp/format_convert_scripts/add_classes_to_features.py for more options and explanations.
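Conceptually, this step joins the feature rows with the labels by image id. The sketch below shows one way to do that join; it is not the logic of add_classes_to_features.py. The function names are hypothetical, and skipping images without labels is an assumption.

```python
from collections import defaultdict


def group_labels(pairs):
    """Collect all labels of each image id from (image_id, label) pairs."""
    grouped = defaultdict(list)
    for image_id, label in pairs:
        grouped[image_id].append(label)
    return dict(grouped)


def attach_labels(features, labels_by_image):
    """Yield (image_id, feature_vector, labels) for images that have labels.

    Assumption: images without any label are skipped.
    """
    for image_id, vector in features:
        if image_id in labels_by_image:
            yield image_id, vector, labels_by_image[image_id]


# Made-up sample data, for illustration only.
labels = group_labels([("img_a", "/m/01"), ("img_a", "/m/02"), ("img_b", "/m/01")])
rows = list(attach_labels([("img_a", [0.0, 1.5]), ("img_c", [2.0, 0.0])], labels))
```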

Extract the classes from the annotations:

cut -d , -f 1 dataset/download/human_ann_2016_08/validation/labels.csv | sort | uniq > dataset/download/human_ann_2016_08/validation/labels.sorted.csv

Now the format can be changed to the FastXML format:

./tmp/format_convert_scripts/cpp_fastxml_format.py  \
    --features-labels-file $OUT_FEATURES_WITH_CLASSES \
    --classes-sorted-file dataset/download/human_ann_2016_08/validation/labels.sorted.csv \
    --features-out-file $OUT_FEATURES_FASTXML \
    --classes-out-file $OUT_CLASSES_FASTXML

See the script for an explanation.
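To give an idea of what a sparse matrix text format looks like, here is a minimal sketch. It assumes a layout with a header line holding the row and column counts, followed by one line per row with index:value pairs for the non-zero entries, and it assumes indices start at 0. Check the output of cpp_fastxml_format.py for the actual layout. The function name is hypothetical.

```python
def to_sparse_lines(rows, num_cols):
    """Return the lines of a sparse text matrix for the given dense rows."""
    # Header line: number of rows and number of columns.
    lines = ["%d %d" % (len(rows), num_cols)]
    for row in rows:
        # Keep only non-zero entries; indices are 0-based (assumption).
        entries = ["%d:%r" % (i, v) for i, v in enumerate(row) if v != 0]
        lines.append(" ".join(entries))
    return lines


# Made-up dense feature rows, for illustration only.
sparse = to_sparse_lines([[0.0, 1.5, 0.25], [2.0, 0.0, 0.0]], 3)
```

Sparse storage is worthwhile here when most entries of a row are zero, because only the non-zero entries are written.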

To convert to the MULAN format:

./tmp/format_convert_scripts/convert_to_mulan.py \
    --classes-in-file $OUT_CLASSES_FASTXML \
    --features-in-file $OUT_FEATURES_FASTXML \
    --out-file $OUT_MULAN

See the script for an explanation.

Train FastXML classifier

Now train the C++ classifier:

./tmp/cluster_scripts/run-fastxml-cluster.sh \
    $DATA_DIR \
    $NUM_THREADS \
    $NUM_THREADS_TEST \
    $START_TREE \
    $NUM_TREE \
    $BIAS \
    $LOG_LOSS_COEFF \
    $MAX_LEAF \
    $LBL_PER_LEAF

$DATA_DIR: the directory where the classes and features reside (e.g. $OUT_FEATURES_FASTXML and $OUT_CLASSES_FASTXML from the last step).

See the report for an explanation of the other FastXML hyperparameters.

See the script ./tmp/cluster_scripts/run-fastxml-cluster.sh for details on what it does.

Dataset

images

CSV Headers
- 1 ImageID
- 2 Subset
- 3 OriginalURL
- 4 OriginalLandingURL
- 5 License
- 6 AuthorProfileURL
- 7 Author
- 8 Title
- 9 OriginalSize
- 10 OriginalMD5
- 11 Thumbnail300KURL
train
- Total: 9,011,220
- Size: 18.3 TB
validation
- Total: 167,057
- Size: 309.9 GB

annotations

CSV Headers
- 1 ImageID
- 2 Source
- 3 LabelName
- 4 Confidence
MISPREDICTED
- Ratio: 31% (false positives)
- Most: "produce, flower, plant, food, sports, shrub, human body"
human
- validation
  - Total: 1,741,385
  - Confidences:
    - 0.0 31.5%
    - 1.0 68.5%
  - Images that have no positive label: approx. 2,000
machine
- validation
  - Total: 2,060,221
  - Confidences:
    - 0.5 14.5%
    - 0.6 22.2%
    - 0.7 20.5%
    - 0.8 17.4%
    - 0.9 18.4%
    - 1.0 7.0%
- train
  - Total: 79,196,416
  - Confidences:
    - 0.5 12.3%
    - 0.6 22.0%
    - 0.7 21.3%
    - 0.8 20.5%
    - 0.9 19.8%
    - 1.0 4.0%
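Confidence shares like the ones above can be recomputed from the Confidence column of an annotation file. The following is a sketch of one way to do it, not necessarily how the numbers in this section were produced; the function name and the sample values are made up.

```python
from collections import Counter


def confidence_shares(confidences):
    """Map each confidence value to its share of all annotations in percent."""
    counts = Counter(confidences)
    total = sum(counts.values())
    if total == 0:
        return {}  # no annotations, nothing to report
    return {
        value: round(100.0 * n / total, 1)
        for value, n in sorted(counts.items())
    }


# Made-up confidence column, for illustration only.
shares = confidence_shares(["1.0", "0.0", "1.0", "1.0"])
```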

Useful commands

Headers

head -n 1 "$FILE"

Line count

wc -l "$FILE"

Extract 4th column

cut -d , -f 4 "$FILE" > "$NEW_FILE"

Pretty-print a CSV as aligned columns (from working-with-data-on-the-command-line)

sed -e 's/,,/, ,/g' file.csv | column -s, -t

Top 10 mispredicted

head -n 10 download/human_ann_2016_08/validation/labels_mispredicted_wc.csv | cut -d , -f 1 | ./labelnames.sh

grep -E '/m/036qh8.*,0\.0' labels.csv > labels_mispredicted_wc2.csv

davidgengenbach/openimages-fastxml-classification
