Final project (Fachpraktikum) for the Human-Computer Interaction course

Cross-modal retrieval: can we retrieve a text that describes a given image?

This work focuses on finding intermodal correspondences between the textual and visual modalities, incorporating gaze information as a regularization technique.

Datasets

  • MS-COCO: download and place it under the "dataset" folder.
  • SALICON: download and place it under the "dataset" folder.
  • Run python utils.py (a quick layout check is sketched below).

Or download from here.
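
Before running utils.py, it can help to confirm that both datasets actually landed in the expected place. The check below is a minimal sketch: the "dataset" folder comes from the instructions above, but the subfolder names are assumptions for illustration and may need adjusting to your actual layout.

import os

# Minimal sketch: verify the dataset folders exist before preprocessing.
# "dataset" is from the README; the subfolder names are assumed, not taken
# from the repository, so adjust them to match your download.
for folder in ("dataset/mscoco", "dataset/salicon"):
    print(folder, "found" if os.path.isdir(folder) else "MISSING")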

Training

Depending on which sampling algorithm you choose:

  • batch sampling: python train_two_encoders.py
  • queue sampling: python train_queue.py
  • rerank sampling: python train_rerank.py

Note: training requires a lot of GPU memory, so use either a single GPU with more than 10 GB of memory or two GPUs. In the single-GPU case, set the multi argument to 0 (see the sketch below for checking what you have available).
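
A quick way to check your GPUs before picking a value for multi; this is a minimal sketch that assumes a PyTorch environment (an assumption, not stated in the README):

import torch

# Minimal sketch (assumes PyTorch is installed): list the available GPUs and
# their memory so you can decide whether to run on a single >10 GB GPU
# (multi set to 0) or on two GPUs.
if not torch.cuda.is_available():
    print("No CUDA device found")
for i in range(torch.cuda.device_count()):
    total_gb = torch.cuda.get_device_properties(i).total_memory / 1e9
    print(f"GPU {i}: {total_gb:.1f} GB")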

Inference

See inference_with_two_enc.py.
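
Conceptually, two-encoder retrieval embeds the query image and the candidate texts into a shared space and ranks the texts by similarity. The sketch below illustrates that idea only; the function and the encoder objects are placeholders and do not reflect the actual interface of inference_with_two_enc.py.

import torch
import torch.nn.functional as F

# Illustrative sketch of two-encoder retrieval (placeholder encoders, not the
# repository's actual API): embed the query image and all candidate captions,
# then rank the captions by cosine similarity in the shared embedding space.
def retrieve_captions(image, captions, image_encoder, text_encoder, top_k=5):
    with torch.no_grad():
        img_emb = F.normalize(image_encoder(image), dim=-1)   # shape (1, d)
        txt_emb = F.normalize(text_encoder(captions), dim=-1) # shape (N, d)
    scores = img_emb @ txt_emb.T                               # shape (1, N)
    k = min(top_k, len(captions))
    best = scores.squeeze(0).topk(k).indices.tolist()
    return [captions[i] for i in best]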

Visualization

You can visualize the results with TensorBoard; specify the path to the logs:

tensorboard --logdir=path/to/the/log

Examples

Examples are available in debug_post_training.ipynb. Set the name (timeline) of your pre-trained model:

timeline = "20200130-130202"

Models

We provide a sample model pre-trained with the queue sampling algorithm (queue size 0.3). Download from here.
