In this repo, we provide code and pretrained models for the paper "Improving Semantic Video Retrieval models by Training with a Relevance-aware Online Mining strategy", which is under journal review. The code also covers the implementation of a preliminary version of this work, called "Learning video retrieval models with relevance-aware online mining", which was accepted for presentation at the 21st International Conference on Image Analysis and Processing (ICIAP).
Requirements: Python 3, allennlp 2.8.0, h5py 3.6.0, pandas 1.3.5, spacy 2.3.5, torch 1.7.0 (also tested with 1.8)
# clone the repository
cd ranp
export PYTHONPATH=$(pwd):${PYTHONPATH}
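The pinned dependencies listed above can be collected in a requirements file (a sketch: the file name is a convention, the pins are taken from the requirements list, and torch is pinned to the primary tested version):

```
# requirements.txt (versions from the requirements list above)
allennlp==2.8.0
h5py==3.6.0
pandas==1.3.5
spacy==2.3.5
torch==1.7.0
```

Install with `pip install -r requirements.txt`.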
Features:
Additional:
- pre-extracted annotations for EPIC-Kitchens-100 and MSR-VTT
- split folders for EPIC-Kitchens-100 and MSR-VTT
- GloVe checkpoints for EPIC-Kitchens-100 and MSR-VTT
- HowTo100M weights for EAO
To launch a training, first select a configuration file (e.g. prepare_mlmatch_configs_EK100_TBN_thrPos_hardPos.py) and execute the following:
python t2vretrieval/driver/configs/prepare_mlmatch_configs_EK100_TBN_thrPos_hardPos.py .
This will return a folder name (where the config, models, logs, etc. will be saved). Let that folder be $resdir. Then, execute the following to start a training:
python t2vretrieval/driver/multilevel_match.py $resdir/model.json $resdir/path.json --is_train --load_video_first --resume_file glove_checkpoint_path
Replace multilevel_match.py with eao_match.py to use Everything-at-once (txt-vid version) in place of HGR.
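The two steps above (prepare the config, then train) can be wrapped in a small helper. This is a sketch, not part of the repo: it assumes the prepare script prints the result folder as its last line of stdout, as suggested by "This will return a folder name":

```shell
#!/bin/sh
# Sketch: prepare a config, capture the result folder, then launch training.
# Assumption: the prepare script prints the result folder on its last stdout line.
set -e

prepare_and_train() {
  # $1: config-preparation script, $2: GloVe checkpoint to resume from
  resdir=$(python "$1" . | tail -n 1)
  python t2vretrieval/driver/multilevel_match.py \
    "$resdir/model.json" "$resdir/path.json" \
    --is_train --load_video_first --resume_file "$2"
}
```

Usage, e.g.: `prepare_and_train t2vretrieval/driver/configs/prepare_mlmatch_configs_EK100_TBN_thrPos_hardPos.py glove_checkpoint_path` (swap in eao_match.py inside the function for EAO).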
To automatically check for the best checkpoint (after a training run):
python t2vretrieval/driver/multilevel_match.py $resdir/model.json $resdir/path.json --eval_set tst
To resume one of the checkpoints provided:
python t2vretrieval/driver/multilevel_match.py $resdir/model.json $resdir/path.json --eval_set tst --resume_file checkpoint.th
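Since the two evaluation commands above differ only in the optional --resume_file flag, they can be combined in one helper (a sketch; file names taken from the commands above):

```shell
#!/bin/sh
# Sketch: evaluate on the test set, optionally resuming a given checkpoint.
set -e

evaluate() {
  # $1: result folder; $2 (optional): checkpoint file, e.g. checkpoint.th
  if [ -n "${2:-}" ]; then
    python t2vretrieval/driver/multilevel_match.py \
      "$1/model.json" "$1/path.json" --eval_set tst --resume_file "$2"
  else
    python t2vretrieval/driver/multilevel_match.py \
      "$1/model.json" "$1/path.json" --eval_set tst
  fi
}
```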
Results on EPIC-Kitchens-100:
- HGR: (35.9 nDCG, 39.5 mAP)
- HGR with Triplet-RANP: thr=0.15 (58.8 nDCG, 47.2 mAP)
- EAO: (34.5 nDCG, 35.0 mAP)
- EAO with Triplet-RANP: thr=0.10 (59.5 nDCG, 45.1 mAP)
Results on MSR-VTT:
- HGR: (26.7 nDCG)
- HGR with Triplet-RANP: thr=0.10 (35.4 nDCG)
- EAO: (24.8 nDCG)
- EAO with Triplet-RANP: thr=0.10 (34.4 nDCG)
- EAO with Triplet-RANP (+HowTo100M PT): thr=0.10 (35.6 nDCG)
We thank the authors of Chen et al. (CVPR, 2020) (github), Wray et al. (ICCV, 2019) (github), Wray et al. (CVPR, 2021) (github), Shvetsova et al. (CVPR, 2022) (github) for the release of their codebases.
If you use this code as part of any published research, we'd really appreciate it if you could cite the following paper:
@inproceedings{falcon2022learning,
  title={Learning video retrieval models with relevance-aware online mining},
  author={Falcon, Alex and Serra, Giuseppe and Lanz, Oswald},
  booktitle={International Conference on Image Analysis and Processing},
  pages={182--194},
  year={2022},
  organization={Springer}
}
MIT License