VGS-dataset-metadata

Japanese dataset and metadata

The Japanese dataset we used for our experiments is now available on Zenodo (DOI). The dataset consists of synthetically spoken captions for the STAIR dataset. Following the same methodology as Chrupała et al. (see article | dataset | code), we generated speech for each caption of the STAIR dataset using Google's Text-to-Speech API.
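
For reference, below is a minimal sketch of synthesising one caption with the google-cloud-texttospeech Python client. The Japanese voice selection, example caption, and output path are illustrative assumptions, not necessarily the exact configuration used to build the dataset.

from google.cloud import texttospeech

# Synthesise a single caption to MP3 with Google's Text-to-Speech API.
# Voice, caption text and file name below are illustrative only.
client = texttospeech.TextToSpeechClient()

response = client.synthesize_speech(
    input=texttospeech.SynthesisInput(text="猫がソファーの上で寝ている。"),
    voice=texttospeech.VoiceSelectionParams(language_code="ja-JP"),
    audio_config=texttospeech.AudioConfig(
        audio_encoding=texttospeech.AudioEncoding.MP3
    ),
)

with open("123456_789.mp3", "wb") as out:  # imageID_captionID naming scheme
    out.write(response.audio_content)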

This dataset was used for visually grounded speech experiments (see our article published at ICASSP 2019).

@INPROCEEDINGS{8683069,
  author={W. N. {Havard} and J. {Chevrot} and L. {Besacier}},
  booktitle={ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  title={Models of Visually Grounded Speech Signal Pay Attention to Nouns: A Bilingual Experiment on English and Japanese},
  year={2019},
  volume={},
  number={},
  pages={8618-8622},
  keywords={information retrieval;natural language processing;neural nets;speech processing;word processing;artificial neural attention;human attention;monolingual models;part-of-speech tags;nouns;neural models;visually grounded speech signal;English language;Japanese language;word endings;cross-lingual speech-to-speech retrieval;grounded language learning;attention mechanism;cross-lingual speech retrieval;recurrent neural networks.},
  doi={10.1109/ICASSP.2019.8683069},
  ISSN={2379-190X},
  month={May},
}

The dataset comprises the following files:

  • mp3-stair.tar.gz : MP3 files of each caption in the STAIR dataset. Filenames follow the pattern imageID_captionID, where both imageID and captionID correspond to those provided in the original dataset (see annotation format here).
  • dataset.mfcc.npy : Numpy array with MFCC vectors for each caption. MFCCs were extracted using python_speech_features with its default configuration. To map each MFCC vector back to its caption, use the files dataset.words.txt and dataset.ids.txt (see the loading sketch after this list).
  • dataset.words.txt : Captions corresponding to each MFCC vector (line number = position in Numpy array, starting from 0)
  • dataset.ids.txt : IDs of the captions (imageID_captionID) corresponding to each MFCC vector (line number = position in Numpy array, starting from 0)
  • Splits (these exactly correspond to the splits used by Chrupała et al.)
    • test
      • test.txt : captions comprising the test split
      • test_ids.txt: IDs of the captions in the test split
      • test_tagged.txt : tagged version of the test split
      • test-alignments.json.zip : Forced alignments of all the captions in the test split (a dictionary whose keys are the caption IDs in the STAIR dataset). Due to an unknown error during upload, the JSON file had to be zipped.
    • train
      • train.txt : captions comprising the train split
      • train_ids.txt : IDs of the captions in the train split
      • train_tagged.txt : tagged version of the train split
    • val
      • val.txt : captions comprising the val split
      • val_ids.txt : IDs of the captions in the val split
      • val_tagged.txt : tagged version of the val split
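
A minimal loading sketch, assuming dataset.mfcc.npy stores one variable-length (frames x 13) MFCC matrix per caption as a Numpy object array and that all three files sit in the current directory:

import numpy as np

# Load the MFCC features and the parallel caption/ID lists.
# Assumption: the .npy file is an object array (one matrix per caption),
# so it must be loaded with allow_pickle=True.
mfccs = np.load("dataset.mfcc.npy", allow_pickle=True)

with open("dataset.ids.txt", encoding="utf-8") as f:
    ids = [line.strip() for line in f]        # imageID_captionID per line
with open("dataset.words.txt", encoding="utf-8") as f:
    captions = [line.strip() for line in f]   # caption text per line

assert len(mfccs) == len(ids) == len(captions)

# Line number == position in the Numpy array (starting from 0),
# so index i links features, caption ID and caption text together.
i = 0
image_id, caption_id = ids[i].split("_")
print(ids[i], captions[i], mfccs[i].shape)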

Synthetically spoken COCO metadata

Metadata of the synthetically spoken COCO dataset (by Chrupała et al.) can be found in this repository: synth-coco-metadata.json

Metadata format

The metadata is structured as follows:

Each JSON file contains a dict whose keys are the caption IDs of the corresponding dataset. The value associated with each key is structured as follows:

English

{
 'audio_length': 5.0, # length of the WAV file (in seconds)
 'forced_alignments': {
                       # alignments at phone level
                       'phones': [{'end': 0.03, 'label': 'sil', 'start': 0.0},
                                  ...
                                  {'end': 3.46, 'label': 'sp', 'start': 3.19}],
                                  
                       # alignments at word level
                       'words': [{'end': 0.18, 'label': 'a', 'start': 0.03},
                                 ...
                                 {'end': 3.19, 'label': 'wall', 'start': 2.85}],
                                 
                       # POS tag of each word
                       'tags': [{'end': 0.18, 'label': 'DT', 'start': 0.03},
                                ...
                                {'end': 3.19, 'label': 'NN', 'start': 2.85}]}
}

Japanese

{
 'audio_length': 3.744, # length of the WAV file (in seconds)
 'forced_alignments': {
                       # alignments at phone level
                       'phones': [{'end': 0.18, 'label': 'i', 'start': 0.0},
                                  ...
                                  {'end': 3.46, 'label': 'u', 'start': 3.38}],
                       # alignments at word level (X-SAMPA)
                       'words': [{'end': 0.51, 'label': 'imai', 'start': 0.0},
                                 ...
                                 {'end': 3.46, 'label': 'ru', 'start': 3.31}],
                       
                       # Hiragana transcription of each word          
                       'hiragana': [{'end': 0.51, 'label': 'いまい', 'start': 0.0},
                                    ...
                                    {'end': 3.46, 'label': 'る', 'start': 3.31}],
                                    
                       # Original text (Kanji/Hiragana/Katakana/Romaji)
                       'kanji': [{'end': 0.51, 'label': '今井', 'start': 0.0},
                                 ...
                                 {'end': 3.46, 'label': 'る', 'start': 3.31}],
                                 
                       # POS tag of each word (Kytea)
                       'tags': [{'end': 0.51, 'label': 'N', 'start': 0.0},
                                ...
                                {'end': 3.46, 'label': 'TAIL', 'start': 3.31}]}
}
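
A minimal sketch of reading these alignments in Python (the file name below is illustrative; for the Japanese test split the JSON first needs to be extracted from test-alignments.json.zip):

import json

# Assumption: the metadata has been unzipped to a plain JSON file next to
# this script; the path is illustrative, not part of the dataset itself.
with open("test-alignments.json", encoding="utf-8") as f:
    metadata = json.load(f)

caption_id, entry = next(iter(metadata.items()))  # key = imageID_captionID
alignments = entry["forced_alignments"]

print(caption_id, entry["audio_length"], "seconds")

# Word-level segments and their POS tags share order and time stamps.
for word, tag in zip(alignments["words"], alignments["tags"]):
    print(f"{word['start']:.2f}-{word['end']:.2f}  {word['label']:<12} {tag['label']}")

# Example: keep only noun segments ('NN*' in English, 'N' in the KyTea tag set).
nouns = [w for w, t in zip(alignments["words"], alignments["tags"])
         if t["label"].startswith("N")]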
