scutcyr/dstc11-simmc2.1-scut-bds-lab

Team: scut-bds-lab

Recent Update

Overview

The SIMMC2.1 challenge aims to lay the foundations for real-world assistant agents that can handle multimodal inputs and perform multimodal actions. It has four subtasks: Ambiguous Candidate Identification, Multimodal Coreference Resolution, Multimodal Dialog State Tracking, and Response Generation. We take the joint input of textual context, tokenized objects, and scene as the multimodal input, and compare the performance of single-task training and multi-task joint training. For Subtask 4, we also use the system belief state (act and slot values) as the prompt for response generation. Non-visual metadata is also considered by adding its embedding to the object representation.
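
As a rough illustration of this input construction, the sketch below shows how a subtask-specific prompt, dialog context, tokenized objects, and scene name could be concatenated into a single model input. The special tokens, prompt wording, and scene name here are placeholders rather than the exact format produced by our preprocessing scripts.

# Illustrative sketch only: the special tokens and prompt wording are placeholders,
# not the exact format produced by the preprocessing scripts in this repository.
def build_model_input(context_turns, object_tokens, scene_name, subtask_prompt):
    """Concatenate dialog context, tokenized objects and scene into one sequence."""
    context = " ".join(context_turns)   # e.g. the last N user/system turns
    objects = " ".join(object_tokens)   # e.g. "<OBJ_12> <OBJ_15>"
    return f"{subtask_prompt} <CTX> {context} <OBJ> {objects} <SCENE> {scene_name}"

# Different subtasks use different prompts so the shared encoder-decoder
# knows which output (ambiguity, coreference, DST or response) to produce.
example = build_model_input(
    context_turns=["User: Do you have that jacket in red?"],
    object_tokens=["<OBJ_12>", "<OBJ_15>"],
    scene_name="example_scene_name",
    subtask_prompt="Track the dialog state:",
)
print(example)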

About Folders and Files

Clone Our Project:

cd ~
git clone https://github.com/scutcyr/dstc11-simmc2.1-scut-bds-lab.git

Requirements and Installation

You can create the Python environment for running this project using conda:

conda create -n py38 python=3.8
conda activate py38
cd ~/dstc11-simmc2.1-scut-bds-lab
pip install -r requirements.txt

The Python packages required for running this project are listed below:

python>=3.7
accelerate==0.11.0
attrs
chardet==5.0.0
datargs==0.11.0
future==0.18.2
gdown==4.5.1
imagesize==1.4.1
ipdb==0.13.9
matplotlib==3.5.2
nltk==3.7
notebook
opencv-python==4.6.0.66
opencv-python-headless==4.6.0.66
pandas==1.4.3
parlai==1.6.0
pytorch-ignite==0.4.8
sacremoses==0.0.53
scikit-learn==1.1.1
sentencepiece==0.1.96
setuptools==59.5.0
sklearn
tensorflow==2.10.0
torch==1.12.0
torchaudio==0.12.0
torchtext==0.13.0
torchvision==0.13.0
tqdm==4.62.3
transformers==4.22.2
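
After installation, an optional sanity check like the following confirms that the key packages and GPU support are visible in the environment:

# Optional sanity check for the environment created above.
import torch
import transformers

print("torch:", torch.__version__)                  # expected: 1.12.0
print("transformers:", transformers.__version__)    # expected: 4.22.2
print("CUDA available:", torch.cuda.is_available())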

Dataset Preprocessing

(1) Download SIMMC2.1

Download the SIMMC2.1 dataset from https://github.com/facebookresearch/simmc2 with git:

cd ~
git lfs install
git clone https://github.com/facebookresearch/simmc2.git

(2) Copy the Dataset and Unzip the .zip Files

There are five .zip files:

  • simmc2_scene_images_dstc10_public_part1.zip
  • simmc2_scene_images_dstc10_public_part2.zip
  • simmc2_scene_images_dstc10_teststd.zip
  • simmc2_scene_jsons_dstc10_public.zip
  • simmc2_scene_jsons_dstc10_teststd.zip
cp -rf ~/simmc2/data ~/dstc11-simmc2.1-scut-bds-lab/
cd ~/dstc11-simmc2.1-scut-bds-lab/data
ls # list the file

The files in `~/dstc11-simmc2.1-scut-bds-lab/data` are shown as follows:

fashion_prefab_metadata_all.json    simmc2.1_dials_dstc11_dev.json      simmc2_scene_images_dstc10_public_part1.zip  simmc2_scene_jsons_dstc10_teststd.zip
furniture_prefab_metadata_all.json  simmc2.1_dials_dstc11_devtest.json  simmc2_scene_images_dstc10_public_part2.zip
scut-bds-lab_PREPOCESS.md                simmc2.1_dials_dstc11_mini.json     simmc2_scene_images_dstc10_teststd.zip
README.md                           simmc2.1_dials_dstc11_train.json    simmc2_scene_jsons_dstc10_public.zip

Then unzip the .zip files to the current path:

cd ~/dstc11-simmc2.1-scut-bds-lab/data
unzip simmc2_scene_images_dstc10_public_part1.zip  # --> ./simmc2_scene_images_dstc10_public_part1
unzip simmc2_scene_images_dstc10_public_part2.zip  # --> ./simmc2_scene_images_dstc10_public_part2
# Merge part1 and part2 files into ./simmc2_scene_images_dstc10_public
mkdir simmc2_scene_images_dstc10_public
cp simmc2_scene_images_dstc10_public_part1/* simmc2_scene_images_dstc10_public
cp simmc2_scene_images_dstc10_public_part2/* simmc2_scene_images_dstc10_public
rm -rf simmc2_scene_images_dstc10_public_part1
rm -rf simmc2_scene_images_dstc10_public_part2

unzip simmc2_scene_images_dstc10_teststd.zip       # --> ./simmc2_scene_images_dstc10_teststd
unzip simmc2_scene_jsons_dstc10_public.zip         # --> ./public
mkdir simmc2_scene_jsons_dstc10_public
cp public/* simmc2_scene_jsons_dstc10_public
rm -rf public

unzip simmc2_scene_jsons_dstc10_teststd.zip        # --> ./simmc2_scene_jsons_dstc10_teststd
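
After unzipping, an optional check such as the one below helps catch a missing or misnamed folder before preprocessing:

# Optional check that the expected folders exist after unzipping.
import os

data_dir = os.path.expanduser("~/dstc11-simmc2.1-scut-bds-lab/data")
expected = [
    "simmc2_scene_images_dstc10_public",
    "simmc2_scene_images_dstc10_teststd",
    "simmc2_scene_jsons_dstc10_public",
    "simmc2_scene_jsons_dstc10_teststd",
]
for name in expected:
    status = "OK" if os.path.isdir(os.path.join(data_dir, name)) else "MISSING"
    print(f"{name}: {status}")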

(3) Preprocess the dataset

cd ~/dstc11-simmc2.1-scut-bds-lab/scripts
./0_dataset_preprocessing.sh
./0_dataset_preprocessing_predict_with_sys_state.sh
./0_dataset_preprocessing_for_task4.sh

The above preprocessing scripts take about two days to run, because they generate samples for different numbers of conversation turns; you can simplify them if needed.
After preprocessing the dataset, you can find the preprocessed files in ~/dstc11-simmc2.1-scut-bds-lab/data_convert.

Note: We provide two different preprocessed dataset formats. Files such as simmc2.1_dials_dstc11_train_ctxlen2_sysana_for_task4.txt gather all the data and labels required for training into one file, separated by tabs. The other format consists of multiple line-aligned files.
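
As an illustration, the single-file format can be read roughly as follows; the exact number and order of the tab-separated fields are defined by the preprocessing scripts, so the unpacking here is only an assumption:

# Illustrative reader for the single-file format: one tab-separated sample per line.
# The exact field order is defined by the preprocessing scripts.
def read_single_file_format(path):
    samples = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            samples.append(line.rstrip("\n").split("\t"))
    return samples

samples = read_single_file_format(
    "data_convert/simmc2.1_dials_dstc11_train_ctxlen2_sysana_for_task4.txt"
)
print(len(samples), "samples,", len(samples[0]), "fields in the first sample")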

Model

All the models are defined in the folder ./models. We provide interface code for the following pre-trained models for fine-tuning on SIMMC2.1:
(a) Encoder-Decoder

(b) Multi-Modal Encoder

(c) Multi-Modal Encoder-Decoder

Note: You can specify different pre-trained models for further fine-tuning by modifying the arguments --model_type and --model_name_or_path. For T5-11B and UL2, you can enable model pipeline parallelism by specifying the --model_parallel flag.
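
For reference, the sketch below shows roughly what these arguments control when loading a model with transformers; it is a simplified illustration, not the actual training code in this repository:

# Simplified sketch of what --model_name_or_path and --model_parallel control;
# the real logic lives in the training code under ./models and ./scripts.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name_or_path = "facebook/bart-large"   # e.g. "t5-11b" or "google/ul2" for larger models
model_parallel = False                       # corresponds to the --model_parallel flag

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name_or_path)

if model_parallel and hasattr(model, "parallelize"):
    # T5-style models in transformers support naive pipeline parallelism,
    # spreading the layers across all visible GPUs.
    model.parallelize()
else:
    model.to("cuda" if torch.cuda.is_available() else "cpu")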

Training

You can find the training script examples in the folder ./scripts. Before running a script, you need to modify some arguments, mainly: the source .bashrc path, the conda Python environment, WORK_DIR, INIT_DATA_DIR, PREPROCESS_DATA_DIR, and --model_name_or_path. All the pretrained models can be downloaded from 🤗 Transformers - Hugging Face.
For example, you can download the BART-large model with:

cd ~
mkdir pretraining_model
cd pretraining_model
git lfs install
git clone https://huggingface.co/facebook/bart-large
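
If git-lfs is not convenient, the same model can also be fetched with the huggingface_hub Python package (an alternative to the git clone above):

# Alternative to the git clone above: download the model with huggingface_hub.
from huggingface_hub import snapshot_download

local_path = snapshot_download(repo_id="facebook/bart-large")
print("model files in:", local_path)   # pass this path as --model_name_or_path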

After modifying the above arguments and downloading the pretrained model, you can run the bash script to fine-tune the model:

cd ~/dstc11-simmc2.1-scut-bds-lab/scripts
./run_train_model_simmc21_bart_20221017_1040.sh

View the training process of the model via TensorBoard:

cd ~/dstc11-simmc2.1-scut-bds-lab
tensorboard --logdir=./runs --port=6666 --bind_all

Then open the URL shown in the terminal prompt in your browser:

TensorBoard 2.10.1 at http://<your_server_ip_or_name>:6666/ (Press CTRL+C to quit)

Note: If you use the preprocessed data converted by 0_dataset_preprocessing_for_task4.sh, you only need to specify --train_input_file and --eval_input_file, as in run_train_model_simmc21_ofa_20221013_0930.sh:

INIT_DATA_DIR=~/dstc11-simmc2.1-scut-bds-lab/data
PREPROCESS_DATA_DIR=~/dstc11-simmc2.1-scut-bds-lab/data_convert
CONTEXT_LENGTH=6 # 2,4,6,8
# Single file input format
    --train_input_file=$PREPROCESS_DATA_DIR/simmc2.1_dials_dstc11_train_ctxlen${CONTEXT_LENGTH}_sysana_for_task4.txt \
    --eval_input_file=$PREPROCESS_DATA_DIR/simmc2.1_dials_dstc11_devtest_ctxlen${CONTEXT_LENGTH}_sysana_for_task4.txt \
# Multiple files input format
    --train_input_file=$PREPROCESS_DATA_DIR/simmc2.1_dials_dstc11_train_predict_ctxlen${CONTEXT_LENGTH}.txt \
    --train_target_file=$PREPROCESS_DATA_DIR/simmc2.1_dials_dstc11_train_target_ctxlen${CONTEXT_LENGTH}.txt  \
    --disambiguation_file=$PREPROCESS_DATA_DIR/simmc2.1_dials_dstc11_train_disambiguation_label.txt \
    --response_file=$PREPROCESS_DATA_DIR/simmc2.1_dials_dstc11_train_response.txt \
    --eval_input_file=$PREPROCESS_DATA_DIR/simmc2.1_dials_dstc11_devtest_predict_ctxlen${CONTEXT_LENGTH}.txt \
    --eval_target_file=$PREPROCESS_DATA_DIR/simmc2.1_dials_dstc11_devtest_target_ctxlen${CONTEXT_LENGTH}.txt \
# If the model needs images
    --train_image_path_file=$PREPROCESS_DATA_DIR/simmc2.1_dials_dstc11_train_scene_name.txt \
    --train_image_dir=$INIT_DATA_DIR/simmc2_scene_images_dstc10_public \
    --eval_image_path_file=$PREPROCESS_DATA_DIR/simmc2.1_dials_dstc11_devtest_scene_name.txt \
    --eval_image_dir=$INIT_DATA_DIR/simmc2_scene_images_dstc10_public \

Evaluation

We provide the code eval_model.py for evaluating the model.

You can download our models from https://huggingface.co/scutcyr/dstc11-simmc2.1-scut-bds-lab using the following script:

cd ~
mkdir pretrained_model
cd pretrained_model
git lfs install
git clone https://huggingface.co/scutcyr/dstc11-simmc2.1-scut-bds-lab

Then change --model_dir to point to the desired checkpoint, e.g.:

--model_dir=~/pretrained_model/dstc11-simmc2.1-scut-bds-lab/mt-bart/checkpoint-12

in the bash script file run_test_model_simmc21_bart_20221020_2000_use_focalloss_exp3.sh

or

--model_dir=~/pretrained_model/dstc11-simmc2.1-scut-bds-lab/mt-bart-sys/checkpoint-11

in the bash script file run_infer_model_simmc21_bart_20221020_1800.sh

or

--model_dir=~/pretrained_model/dstc11-simmc2.1-scut-bds-lab/mt-bart-sys-nvattr/checkpoint-15

in the bash script file run_test_model_simmc21_bart_sys_state_attr_ctxlen6_20221025_0100_use_focalloss.sh
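
For a quick qualitative check, a downloaded checkpoint directory can also be loaded directly with transformers, as sketched below. The checkpoints are produced by this repository's own model classes, so loading them with the generic Auto classes is only an assumption; eval_model.py remains the official evaluation entry point.

# Minimal sketch for a quick qualitative check of a downloaded checkpoint;
# use eval_model.py and the provided bash scripts for the official evaluation.
import os
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_dir = os.path.expanduser(
    "~/pretrained_model/dstc11-simmc2.1-scut-bds-lab/mt-bart/checkpoint-12"
)
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForSeq2SeqLM.from_pretrained(model_dir)

# The input text must follow the same format as the preprocessed training data.
inputs = tokenizer("User: Do you have that jacket in red?", return_tensors="pt")
outputs = model.generate(**inputs, max_length=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))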

Results

devtest result

| Model | Subtask-1 Amb. Candi. F1 | Subtask-2 MM Coref F1 | Subtask-3 MM DST Slot F1 | Subtask-3 MM DST Intent F1 | Subtask-4 Response Gen. BLEU-4 |
| --- | --- | --- | --- | --- | --- |
| mt-bart-ensemble | 0.68466 | 0.77860 | 0.91816 | 0.97828 | 0.34496 |
| mt-bart-dstcla | 0.67589 | 0.78407 | 0.92013 | 0.97468 | |
| mt-bart-dstcla-ensemble | 0.67777 | 0.78640 | 0.92055 | 0.97456 | |
| mt-bart-sys | | | | | 0.39064 |
| mt-bart-sys-2 | | | | | 0.3909 |
| mt-bart-sys-ensemble | | | | | 0.3894 |
| mt-bart-sys-nvattr | | | | | 0.38995 |

teststd result

The teststd results are provided in ./results/teststd-result; each subfolder corresponds to one model.

References

@inproceedings{chen-etal-2023-exploring-prompt,
    title = "Exploring Prompt-based Multi-task Learning for Multimodal Dialog State Tracking and Immersive Multimodal Conversation",
    author = "Chen, Yirong  and
      Li, Ya  and
      Wang, Tao  and
      Xing, Xiaofen  and
      Xu, Xiangmin  and
      Liu, Quan  and
      Liu, Cong  and
      Hu, Guoping",
    booktitle = "Proceedings of The Eleventh Dialog System Technology Challenge",
    month = sep,
    year = "2023",
    address = "Prague, Czech Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.dstc-1.1",
    pages = "1--8",
    abstract = "With the rise of the metaverse, immersive multimodal conversation has attracted more and more researchers{'} attention. Multimodal contexts will become more important for human-computer interaction in the metaverse, especially in shopping domain. Unlike traditional conversation tasks, immersive multimodal conversation has challenges such as multimodal ambiguous candidate identification and multimodal coreference resolution, which makes it more difficult to dialog state tracking and response generation, as described in SIMMC 2.1 challenge, a part of DSTC11. In particular, as the number of objects in the scene increases, the difficulty will increase dramatically. We proposed a prompt-based multi-task learning Encoder-Decoder, in which different subtasks use different prompts to make the model tend to focus on the current subtask. We achieve the winner in ambiguous candidates indentification and runner-up in multimodal coreference resolution (MM-Coref), multimodal dialog state tracking (MM-DST) and assistant response generation. Our code and model are made publicly available at https://github.com/scutcyr/dstc11-simmc2.1-scut-bds-lab.",
}


@inproceedings{kottur-etal-2021-simmc,
    title = "{SIMMC} 2.0: A Task-oriented Dialog Dataset for Immersive Multimodal Conversations",
    author = "Kottur, Satwik  and
      Moon, Seungwhan  and
      Geramifard, Alborz  and
      Damavandi, Babak",
    booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2021",
    address = "Online and Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.emnlp-main.401",
    doi = "10.18653/v1/2021.emnlp-main.401",
    pages = "4903--4912",
}

@inproceedings{lee-etal-2022-learning,
    title = "Learning to Embed Multi-Modal Contexts for Situated Conversational Agents",
    author = "Lee, Haeju  and
      Kwon, Oh Joon  and
      Choi, Yunseon  and
      Park, Minho  and
      Han, Ran  and
      Kim, Yoonhyung  and
      Kim, Jinhyeon  and
      Lee, Youngjune  and
      Shin, Haebin  and
      Lee, Kangwook  and
      Kim, Kee-Eung",
    booktitle = "Findings of the Association for Computational Linguistics: NAACL 2022",
    month = jul,
    year = "2022",
    address = "Seattle, United States",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.findings-naacl.61",
    doi = "10.18653/v1/2022.findings-naacl.61",
    pages = "813--830",
}

Acknowledgements

License

The project is provided under the Apache-2.0 License.