Datasheet

This datasheet is inspired by the Datasheets for Datasets paper.

Motivation

Q1) For what purpose was the dataset created ? Was there a specific task in mind ? Was there a specific gap that needed to be filled ?

Ans. This is a dataset for intent classification from (Indian English) speech, covering 14 coarse-grained intents from the banking domain. While other datasets have approached this task, here we provide a much larger training dataset (at least 650 samples per intent) for training models in an end-to-end fashion. We also provide anonymised speaker information to help answer questions around model robustness and bias.

Q2) Who created the dataset and on behalf of which entity ?

Ans. The (internal) Operations team at Skit was involved in the generation of the dataset, and provided their information for (anonymous) release. Unnati was involved in the curation of utterance templates, and Kriti and Manas were involved in the planning and collection of utterances - using an internal tool called sandbox. These contributors worked on this dataset as part of the Conversational UX and ML teams at Skit.

Q3) Who funded the creation of the dataset ?

Ans. Skit funded the creation of this dataset.

Composition

Q4) What do the instances that comprise the dataset consist of ?

Ans. The intent dataset is split across train.csv and test.csv. In both, individual instances consist of the following fields:

  • id
  • intent_class
  • template
  • audio_path
  • speaker_id

More information on each intent can be looked up in intent_info.csv, using the shared intent_class field:

  • intent_class
  • intent_name
  • description

More information on each speaker can be looked up in speaker_info.csv, using the shared speaker_id field:

  • speaker_id
  • native_language
  • languages_spoken
  • places_lived
  • gender
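The lookups described above can be sketched with the standard library alone. This is a minimal sketch, assuming each file is a comma-separated CSV whose header row uses exactly the field names listed above; the helper names load_rows, index_by and enrich are illustrative and not part of the dataset.

```python
import csv


def load_rows(path):
    """Read a CSV file into a list of dicts keyed by column name."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


def index_by(rows, key):
    """Build a lookup table from the value of a key column to its row."""
    return {row[key]: row for row in rows}


def enrich(split_rows, intent_rows, speaker_rows):
    """Attach intent and speaker metadata to every instance of a split."""
    intents = index_by(intent_rows, "intent_class")
    speakers = index_by(speaker_rows, "speaker_id")
    enriched = []
    for row in split_rows:
        merged = dict(row)
        # The only overlapping columns are the shared keys, which hold
        # the same value on both sides, so nothing is overwritten.
        merged.update(intents[row["intent_class"]])
        merged.update(speakers[row["speaker_id"]])
        enriched.append(merged)
    return enriched
```

For example, enrich(load_rows("train.csv"), load_rows("intent_info.csv"), load_rows("speaker_info.csv")) would return one dict per training instance carrying intent_name, description and the speaker fields alongside the original columns. The file paths here assume the CSVs sit in the working directory.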

Q5) How many instances are there in total (of each type, if appropriate) ?

Ans. In all there are 11845 samples, across the train and test splits:

  • test.csv has a total of 1400 samples, with exactly 100 samples per intent
  • train.csv has a total of 10445 samples, with at least 650 samples per intent

The 11 speakers are distributed across the dataset, but unequally. However:

  • each intent has data from all speakers
  • the speakers are stratified across the train and test split - for each intent independently
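The counts and the speaker coverage stated above can be checked after download. The following is a minimal sketch, assuming rows are dicts with intent_class and speaker_id keys (for instance, as produced by csv.DictReader); the function names are illustrative.

```python
from collections import Counter, defaultdict


def samples_per_intent(rows):
    """Count instances for each intent_class value."""
    return Counter(row["intent_class"] for row in rows)


def speakers_per_intent(rows):
    """Collect the set of speaker_id values seen for each intent_class."""
    seen = defaultdict(set)
    for row in rows:
        seen[row["intent_class"]].add(row["speaker_id"])
    return dict(seen)


def check_split(rows, expected_total, min_per_intent):
    """Return a list of problems found; an empty list means every check passed."""
    problems = []
    if len(rows) != expected_total:
        problems.append(f"expected {expected_total} rows, found {len(rows)}")
    for intent, count in sorted(samples_per_intent(rows).items()):
        if count < min_per_intent:
            problems.append(f"intent {intent}: only {count} samples")
    # Every speaker present anywhere in these rows should appear in each intent.
    all_speakers = {row["speaker_id"] for row in rows}
    for intent, speakers in sorted(speakers_per_intent(rows).items()):
        missing = all_speakers - speakers
        if missing:
            problems.append(f"intent {intent}: missing speakers {sorted(missing)}")
    return problems
```

With the figures given above, the calls would be check_split(test_rows, 1400, 100) and check_split(train_rows, 10445, 650). The per-intent check is a lower bound only, so confirming "exactly 100" for the test split would additionally need an equality test on samples_per_intent.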

Some statistics on the speakers are provided below. More granular information can be found in speaker_info.csv:

  • Native languages: Hindi(4), Bengali(3), Kannada(2), Malayalam(1), Punjabi(1)
  • Languages spoken: Hindi, English, Bengali, Odia, Kannada, Punjabi, Malayalam, Bihari, Marathi
  • Indian states lived in: Bihar, Odisha, Karnataka, West Bengal, Punjab, Kerala, Jharkhand, Maharashtra

Q6) Does the dataset contain all possible instances or is it a sample (not necessarily random) of instances from a larger set ?

Ans. For each intent, our Conversational UX team generated a list of templates. These are meant to be a (satisfactory) representation of the variations in utterances seen in human speech. Speakers used these templates as a guide when generating data. So, this dataset is limited to the templates and the variations that speakers (spontaneously) added.

Q7) Are there recommended data splits (e.g., training, development/validation, testing) ?

Ans. The recommended split into train and test sets is provided as train.csv and test.csv respectively.

Q8) Are there any errors, sources of noise, or redundancies in the dataset?

Ans. There could be channel noise present in the dataset, because the data was generated through telephone calls. However, background noise will not be as prevalent as in real-world scenarios, since these telephone calls were made in a semi-controlled environment.

Q9) Other comments.

Ans. Speakers were responsible for generating variations in utterances, using the template field as a guide. So, there could be some unintentional overlap across the content of utterances.

Collection Process

Q10) How was the data associated with each instance acquired ?

Ans. Members of the (internal) Operations team generated each utterance, using the associated template field as a guide and injecting their own variations into the speech utterance.

Q11) Who was involved in the data collection process and how were they compensated ?

Ans. The data was generated by the (internal) Operations team and they are/were full-time employees.

Q12) Over what timeframe was the data collected ?

Ans. This data was collected over a time period of 1 month.

Q13) Was any preprocessing/cleaning/labelling of the data done ?

Ans. Audio instances in the dataset were auto-labelled with their associated intent and template fields. For more information on this, refer to the documentation of sandbox.

Recommended Uses

Q14) Has the dataset been used for any tasks already ?

Ans. It has been used to benchmark models for the task of intent classification from speech.

Q15) What (other) tasks could the dataset be used for ?

Ans. Since we provide speaker characteristics, this dataset could also be used for other classification tasks from speech, such as gender or native language identification.
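As an illustration of such a task, the labels can be derived by looking up each instance's speaker_id in speaker_info.csv. This is a minimal sketch, assuming rows are dicts keyed by the column names listed under Q4; labelled_examples is a hypothetical helper, not part of the dataset.

```python
def labelled_examples(split_rows, speaker_rows, label_field):
    """Pair each audio path with a speaker attribute used as the label."""
    label_of = {s["speaker_id"]: s[label_field] for s in speaker_rows}
    return [(row["audio_path"], label_of[row["speaker_id"]]) for row in split_rows]
```

Calling it with label_field set to "gender" or "native_language" yields (audio_path, label) pairs for the corresponding task. Because the same 11 speakers appear in both train.csv and test.csv, results on the provided split would not measure generalisation to unseen speakers; a speaker-disjoint split may be preferable for these tasks.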

Distribution and Maintenance

Q16) Will the dataset be distributed under a copyright or other intellectual property (IP) license ?

Ans. This dataset is being distributed under a CC BY-NC license.

Q17) Who will be maintaining the dataset ?

Ans. The research team at Skit will be maintaining the dataset. They can be contacted by sending an email to ml-research@skit.ai.

Q18) Will the dataset be updated in the future (e.g., to correct labelling errors, add new instances, delete instances) ?

Ans. In case there are errors, we will try to collate and share an updated version every 3 months. We also plan to add more instances and variations to the dataset, to make it more robust.