If you have questions about these instructions or run into errors, please open an issue!
Download the VGG models and annotations, and initialize the submodules:
./bootstrap.sh
See bootstrap.sh for options.
Download the images from this torrent to a $IMAGES_FOLDER of your choice.
Create images.txt, which lists the image ids, with:
# $IMAGES_FOLDER is the folder where you downloaded the torrent
# $DATA_PATH is the folder where in the next step the features get extracted to
find "$IMAGES_FOLDER" -name '*.jpg' > "$DATA_PATH/images.txt"
Run the VGG16 feature extraction with:
./tmp/cluster_scripts/run-vgg-extraction.sh fc6,fc7
You have to adapt the script first!
The first parameter to the run-vgg-extraction.sh script specifies the layers to extract, separated by commas (e.g. fc6 and fc7 in the example above).
After the script has run, the features reside in the data folder (see the script).
The file is named features.LAYERS_TO_BE_EXTRACTED.txt.
It is a CSV file: the first column is the file id (the image id), and the remaining columns are the features.
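As a rough illustration of that CSV layout, the following Python sketch reads such a file into a dictionary that maps each file id to its feature vector. The function name read_features and the sample rows are hypothetical, and the sketch assumes there is no header row and that every feature column is numeric, which the extraction script may not guarantee.

```python
import csv
import io

def read_features(lines):
    """Parse rows of the form file_id,f1,f2,... into {file_id: [floats]}."""
    features = {}
    for row in csv.reader(lines):
        if not row:  # skip blank lines
            continue
        file_id, values = row[0], row[1:]
        features[file_id] = [float(v) for v in values]
    return features

# Hypothetical sample standing in for features.LAYERS_TO_BE_EXTRACTED.txt
sample = io.StringIO("img_a,0.0,1.5,0.25\nimg_b,2.0,0.0,0.0\n")
features = read_features(sample)
print(features["img_a"])
```

For a real file, pass an open file handle instead of the in-memory sample.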
To train the FastXML classifier with these features, they have to be converted to the sparse matrix format that FastXML expects.
First you have to extract the annotations:
./dataset/extract-csv-columns.sh dataset/download/human_ann_2016_08/validation/labels.csv 1,3
grep -v ",0\.0" dataset/download/human_ann_2016_08/validation/labels_1_3.csv > dataset/download/human_ann_2016_08/validation/labels_1_3_correct.csv
The imageid/label pairs will now reside in dataset/download/human_ann_2016_08/validation/labels_1_3_correct.csv.
Now add the classes to the features:
./tmp/format_convert_scripts/add_classes_to_features.py \
--features-file $IN_FEATURES_FILE \
--features-labels-file $OUT_FEATURES_WITH_CLASSES \
--labels-npy dataset/download/human_ann_2016_08/validation/labels_1_3_correct.csv.npy \
--labels dataset/download/human_ann_2016_08/validation/labels_1_3_correct.csv
$IN_FEATURES_FILE: the features from the feature extraction step (e.g. features.LAYERS_TO_BE_EXTRACTED.txt).
$OUT_FEATURES_WITH_CLASSES: the file the imageid/features/classes CSV is written to.
See the script ./tmp/format_convert_scripts/add_classes_to_features.py for more options and explanations.
Extract the classes from the annotations:
cut -d , -f 1 dataset/download/human_ann_2016_08/validation/labels.csv | sort | uniq > dataset/download/human_ann_2016_08/validation/labels.sorted.csv
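The same extraction can be sketched in Python, which may be handy when it has to happen inside a conversion script. The function name unique_first_column and the sample rows are hypothetical. Unlike cut, the csv module also handles quoted fields, and Python sorts by code point while sort depends on the locale, so the order can differ for non-ASCII values.

```python
import csv
import io

def unique_first_column(lines):
    """Return the sorted, de-duplicated values of the first CSV column."""
    return sorted({row[0] for row in csv.reader(lines) if row})

# Hypothetical rows; only the first column matters here
sample = io.StringIO("/m/02,x\n/m/01,y\n/m/02,z\n")
classes = unique_first_column(sample)
print(classes)
```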
Now the format can be changed to the FastXML format:
./tmp/format_convert_scripts/cpp_fastxml_format.py \
--features-labels-file $OUT_FEATURES_WITH_CLASSES \
--classes-sorted-file dataset/download/human_ann_2016_08/validation/labels.sorted.csv \
--features-out-file $OUT_FEATURES_FASTXML \
--classes-out-file $OUT_CLASSES_FASTXML
See the script for an explanation.
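To give an idea of what the sparse matrix format looks like, here is a minimal Python sketch. It assumes the layout used by the reference FastXML C++ release: a header line with the number of rows and columns, followed by one line per row of space-separated index:value pairs with zero-based indices. This is an assumption about the target format, not a description of what cpp_fastxml_format.py does; the function name to_sparse_lines and the sample data are hypothetical.

```python
def to_sparse_lines(rows):
    """Format dense rows as sparse text: 'n_rows n_cols', then 'idx:val' pairs."""
    n_cols = len(rows[0]) if rows else 0
    lines = [f"{len(rows)} {n_cols}"]
    for row in rows:
        # Keep only non-zero entries; indices are zero-based (assumption).
        lines.append(" ".join(f"{i}:{v}" for i, v in enumerate(row) if v != 0))
    return lines

# Hypothetical dense feature rows
dense = [[0.0, 1.5, 0.25], [2.0, 0.0, 0.0]]
sparse = to_sparse_lines(dense)
print("\n".join(sparse))
```

Under the same assumption, the classes file would use the same layout, with label indices in place of feature indices.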
To convert to the MULAN format:
./tmp/format_convert_scripts/convert_to_mulan.py \
--classes-in-file $OUT_CLASSES_FASTXML \
--features-in-file $OUT_FEATURES_FASTXML \
--out-file $OUT_MULAN
See the script for an explanation.
Now train the C++ FastXML classifier:
./tmp/cluster_scripts/run-fastxml-cluster.sh \
$DATA_DIR \
$NUM_THREADS \
$NUM_THREADS_TEST \
$START_TREE \
$NUM_TREE \
$BIAS \
$LOG_LOSS_COEFF \
$MAX_LEAF \
$LBL_PER_LEAF
$DATA_DIR: the directory where the classes and features reside (e.g. $OUT_FEATURES_FASTXML and $OUT_CLASSES_FASTXML from the FastXML conversion step).
See the report for explanations of the other FastXML hyperparameters.
See the script ./tmp/cluster_scripts/run-fastxml-cluster.sh for details on what it does.
- CSV Headers
  - 1 ImageID
  - 2 Subset
  - 3 OriginalURL
  - 4 OriginalLandingURL
  - 5 License
  - 6 AuthorProfileURL
  - 7 Author
  - 8 Title
  - 9 OriginalSize
  - 10 OriginalMD5
  - 11 Thumbnail300KURL
- train
  - Total: 9.011.220
  - Size: 18.3 TB
- validation
  - Total: 167.057
  - Size: 309.9 GB
- CSV Headers
  - 1 ImageID
  - 2 Source
  - 3 LabelName
  - 4 Confidence
- MISPREDICTED
  - Ratio: 31% (false positives)
  - Most: "produce, flower, plant, food, sports, shrub, human body"
- human
  - validation
    - Total: 1.741.385
    - Confidences:
      - 0.0 31.5%
      - 1.0 68.5%
    - Images that have no positive label: ca. 2000
- machine
  - validation
    - Total: 2.060.221
    - Confidences:
      - 0.5 14.5%
      - 0.6 22.2%
      - 0.7 20.5%
      - 0.8 17.4%
      - 0.9 18.4%
      - 1.0 7.0%
  - train
    - Total: 79.196.416
    - Confidences:
      - 0.5 12.3%
      - 0.6 22.0%
      - 0.7 21.3%
      - 0.8 20.5%
      - 0.9 19.8%
      - 1.0 4.0%
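Confidence percentages like the ones listed above can be recomputed from a labels file with a few lines of Python. This is a sketch only: it assumes the file starts with the header row ImageID,Source,LabelName,Confidence, and the function name confidence_distribution and the sample rows are hypothetical.

```python
import csv
import io
from collections import Counter

def confidence_distribution(lines):
    """Return {confidence: percentage of rows}, based on the Confidence column."""
    counts = Counter(row["Confidence"] for row in csv.DictReader(lines))
    total = sum(counts.values())
    return {conf: 100.0 * n / total for conf, n in counts.items()}

# Hypothetical sample in the labels.csv layout
sample = io.StringIO(
    "ImageID,Source,LabelName,Confidence\n"
    "a,human,/m/01,1.0\n"
    "a,human,/m/02,0.0\n"
    "b,human,/m/01,1.0\n"
    "b,human,/m/03,1.0\n"
)
dist = confidence_distribution(sample)
print(dist)
```

The confidence values are kept as strings, so 1.0 and 1 would be counted separately if a file mixed both spellings.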
Headers
head -n 1 "$FILE"
Line count
wc -l "$FILE"
Extract the 4th column
cut -d , -f 4 "$FILE" > "$NEW_FILE"
Pretty-print a CSV as a table (from working-with-data-on-the-command-line)
sed -e 's/,,/, ,/g' file.csv | column -s, -t
Top 10 mispredicted
head -n 10 download/human_ann_2016_08/validation/labels_mispredicted_wc.csv | cut -d , -f 1 | ./labelnames.sh
grep -E '/m/036qh8.*,0\.0' labels.csv > labels_mispredicted_wc2.csv
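The counting behind a list of the most frequently mispredicted labels can also be sketched in Python. The sketch assumes that a misprediction is a row in the human-verified labels file with confidence 0.0 and that the file has the header row ImageID,Source,LabelName,Confidence. The function name top_mispredicted and the sample rows are hypothetical, and the result contains label ids, not the readable names that labelnames.sh produces.

```python
import csv
import io
from collections import Counter

def top_mispredicted(lines, n=10):
    """Return the n label ids most often annotated with confidence 0.0."""
    counts = Counter(
        row["LabelName"]
        for row in csv.DictReader(lines)
        if float(row["Confidence"]) == 0.0
    )
    return counts.most_common(n)

# Hypothetical sample in the labels.csv layout
sample = io.StringIO(
    "ImageID,Source,LabelName,Confidence\n"
    "a,human,/m/01,0.0\n"
    "a,human,/m/02,0.0\n"
    "b,human,/m/02,0.0\n"
    "b,human,/m/03,1.0\n"
)
top = top_mispredicted(sample)
print(top)
```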