About the source code

This repository contains a PyTorch implementation of the ICASSP 2023 paper "Prefix tuning for automated audio captioning".
[project page] [paper]

Model preparation

Downloading the audio encoder pre-trained on AudioSet

Move to AAC_Prefix/PANNs
Type in the command below

gdown https://drive.google.com/file/d/1O-rPXe_anLArvRG4Z3-nLdBYNO_JaFYL/view?usp=sharing --fuzzy
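The download commands in this README pass a full Google Drive share link together with `--fuzzy`, which tells gdown to pull the file ID out of the link. The sketch below illustrates that idea with the standard library only. The helper name `drive_file_id` and the two URL patterns are assumptions made for illustration; gdown's own matching logic may differ.

```python
import re
from typing import Optional

# Two common Google Drive link shapes. These patterns are an assumption
# for illustration and are not taken from gdown's source code.
_DRIVE_ID_PATTERNS = (
    re.compile(r"/file/d/([A-Za-z0-9_-]+)"),   # .../file/d/<id>/view?...
    re.compile(r"[?&]id=([A-Za-z0-9_-]+)"),    # ...?id=<id>
)


def drive_file_id(url: str) -> Optional[str]:
    """Return the file ID embedded in a Drive share link, or None if absent."""
    for pattern in _DRIVE_ID_PATTERNS:
        match = pattern.search(url)
        if match:
            return match.group(1)
    return None


if __name__ == "__main__":
    # Share link for the audio encoder checkpoint, copied from this README.
    link = ("https://drive.google.com/file/d/"
            "1O-rPXe_anLArvRG4Z3-nLdBYNO_JaFYL/view?usp=sharing")
    print(drive_file_id(link))
```

Without `--fuzzy`, gdown would need a bare file ID or a direct download URL rather than the share link.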

Downloading the pre-trained GPT2 from Hugging Face

Move to ClipCap_forAAC
Type in the command below

gdown https://drive.google.com/file/d/15ASmIoWg0ac6qm0ixdiVwh88e8EA2MZ7/view?usp=share_link --fuzzy

Unzip the zip file
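The README does not name the downloaded archives, so a minimal sketch of the unzip step is to extract every `.zip` file found in the current directory. The helper `extract_archives` is hypothetical and not part of this repository; the command-line `unzip` tool works equally well.

```python
import zipfile
from pathlib import Path
from typing import List


def extract_archives(directory: str) -> List[str]:
    """Extract every .zip file directly inside `directory`, in place.

    Returns the names of the archives that were extracted.
    """
    extracted = []
    for archive in sorted(Path(directory).glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            # Files are unpacked next to the archive itself.
            zf.extractall(archive.parent)
        extracted.append(archive.name)
    return extracted


if __name__ == "__main__":
    # Run from the folder that holds the downloaded archive.
    print(extract_archives("."))
```

The same helper applies to the other "Unzip the zip file" steps in this README when run from the corresponding folder.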

Downloading the pre-trained model for this work

Move to ClipCap_forAAC
Type in the command below

gdown https://drive.google.com/file/d/1y2yeK7eO5DFY8n9l9QfiVRwv6GZLEnFA/view?usp=share_link --fuzzy

Unzip the zip file

Dataset download

Download the Clotho dataset

Move to Clotho/clotho_audio_files
Type in the command below

gdown https://drive.google.com/file/d/1kOuZrOs1yuOwlOky7ZohVVeiVwYQg1V0/view?usp=sharing --fuzzy

Unzip the zip file

Download the AudioCaps dataset

Move to AudioCaps
Type in the command below

gdown https://drive.google.com/file/d/15ODyZmXDu_gwl-GcgQ6i_dBIeLKPG5-S/view?usp=sharing --fuzzy

Unzip the zip file

Download the audio caption evaluation tools

Move to coco_caption
Type in the command below

sh get_stanford_models.sh

Train the model

# If you are using the GPT2 tokenizer
python3 Experiment_AudioCaps.py <Experiment_name> # AudioCaps Dataset
python3 Experiment_Clotho.py <Experiment_name> # Clotho Dataset
python3 Experiment_FusionDataset.py <Experiment_name> # AudioCaps&Clotho Dataset

# If you are using a custom tokenizer
python3 Experiment_AudioCaps.py <Experiment_name> <vocab_size> # AudioCaps Dataset
python3 Experiment_Clotho.py <Experiment_name> <vocab_size> # Clotho Dataset
python3 Experiment_FusionDataset.py <Experiment_name> # AudioCaps&Clotho Dataset
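When launching several runs, it can help to build the training command programmatically. The sketch below mirrors the command lines listed above and nothing more: the helper `build_train_command`, the dataset keys, and the example vocabulary size are hypothetical. Because the listing above shows no `<vocab_size>` argument for the fusion script, the helper rejects that combination instead of guessing.

```python
from typing import List, Optional

# Script names as listed in this README; the dictionary keys are illustrative.
TRAIN_SCRIPTS = {
    "audiocaps": "Experiment_AudioCaps.py",
    "clotho": "Experiment_Clotho.py",
    "fusion": "Experiment_FusionDataset.py",
}


def build_train_command(dataset: str, experiment_name: str,
                        vocab_size: Optional[int] = None) -> List[str]:
    """Return the argv list for one training run.

    Omit `vocab_size` for the GPT2 tokenizer; pass it for a custom tokenizer.
    """
    if dataset not in TRAIN_SCRIPTS:
        raise ValueError(f"unknown dataset: {dataset!r}")
    command = ["python3", TRAIN_SCRIPTS[dataset], experiment_name]
    if vocab_size is not None:
        if dataset == "fusion":
            # The README lists no <vocab_size> argument for this script.
            raise ValueError("no <vocab_size> argument is listed for the fusion script")
        command.append(str(vocab_size))
    return command


if __name__ == "__main__":
    # "my_experiment" and 5000 are placeholder values.
    print(" ".join(build_train_command("clotho", "my_experiment")))
    print(" ".join(build_train_command("audiocaps", "my_experiment", 5000)))
```

The returned list can be passed to `subprocess.run` from the repository root.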

Evaluate the model

Update (23.12.6): Please use Evaluation_network.ipynb for evaluation; the evaluation methods have been incorporated into that notebook.

# If you use the pre-trained GPT2 from Hugging Face
python3 Evaluation_AudioCaps.py <model_name> <epoch_number>
python3 Evaluation_Clotho.py <model_name> <epoch_number>

# If you use the custom tokenizer that we trained
python3 Evaluation_AudioCaps.py <model_name> <epoch_number> <vocab_size>
python3 Evaluation_Clotho.py <model_name> <epoch_number> <vocab_size>

Inference (generate a caption using the model in the paper's Table 1)

python3 Inference.py <table_num> <setting_num> <audio_file_path>

# table_num = 1 : Evaluation on Clotho
# table_num = 2 : Evaluation on AudioCaps

# setting_num = 1 : train dataset == test dataset
# setting_num = 2 : train dataset != test dataset
# setting_num = 3 : overall datasets (Clotho & AudioCaps) <- must be tested using compressed audio

# Example
python3 Inference.py 1 1 ./test.wav
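The two numeric arguments are easy to mix up, so a small wrapper can validate them before calling the script. This is a sketch only: `build_inference_command` and the lookup tables are hypothetical names, and the accepted values are taken from the comments above rather than from `Inference.py` itself.

```python
from typing import List

# Allowed values and their meanings, copied from the comments in this README.
TABLES = {
    1: "Evaluation on Clotho",
    2: "Evaluation on AudioCaps",
}
SETTINGS = {
    1: "train dataset == test dataset",
    2: "train dataset != test dataset",
    3: "overall datasets (Clotho & AudioCaps)",
}


def build_inference_command(table_num: int, setting_num: int,
                            audio_file_path: str) -> List[str]:
    """Validate the arguments and return the argv list for Inference.py."""
    if table_num not in TABLES:
        raise ValueError(f"table_num must be one of {sorted(TABLES)}")
    if setting_num not in SETTINGS:
        raise ValueError(f"setting_num must be one of {sorted(SETTINGS)}")
    return ["python3", "Inference.py", str(table_num), str(setting_num),
            audio_file_path]


if __name__ == "__main__":
    # Same arguments as the README example.
    command = build_inference_command(1, 1, "./test.wav")
    print(TABLES[1], "/", SETTINGS[1])
    print(" ".join(command))
```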

Citation

@inproceedings{kim2023prefix,
  title={Prefix tuning for automated audio captioning},
  author={Kim, Minkyu and Sung-Bin, Kim and Oh, Tae-Hyun},
  booktitle={ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={1--5},
  year={2023},
  organization={IEEE}
}
