Skip to content

Latest commit

 

History

History
278 lines (206 loc) · 11.6 KB

CHANGELOG.md

File metadata and controls

278 lines (206 loc) · 11.6 KB

Changelog

Current release:

Dataset Changes

  • added unvalidated_sentences.tsv and validated_sentences.tsv
  • unvalidated_sentences.tsv contains sentences that do not have any votes yet, the columns are: sentence_id, sentence, sentence_domain and source
  • validated_sentences.tsvcontains sentences that have at two up votes, it has two additional columns: is_used and clips_count
  • is_used: indicates whether or not the sentence is used on the speak page
  • clips_count: the number of clips that are associated with the sentence
  • add sentence_id and sentence_domain to the Corpora Creator files
  • the following sentence domains are supported

Past releases

Dataset Changes

  • changed times.txt to clip_durations.tsv for consistency
  • clip_durations.tsv contains two columns: clip and duration[ms]

Dataset Changes

  • added times.txt containing mp3 filename and duration in ms

Dataset Changes

Dataset Changes

  • introduced delta segments
  • delta segment tar file naming is cv-corpus-{releaseNumber}-delta-{YYYY-MM-DD}-{locale}.tar.gz
  • delta segments contain the same files except for the training splits, i.e. dev.tsv, test.tsv, train.tsv

Regularly scheduled dataset release Q1 2022.

  • Date released: 27 April 2022
  • Clip cut-off date: 07 April 2022
  • Total hours: 20,217
  • Total validated hours: 14,973
  • Number of languages: 93

New languages since last major release: Tigre, Taiwanese (Minnan), Meadow Mari, Bengali, Toki Pona and Cantonese.

Regularly scheduled dataset release.

  • Date released: 26 January 2022
  • Clip cut-off date: 19 January 2022
  • Total hours: 18,243
  • Total validated hours: 14,122
  • Number of languages: 87

New languages since last major release: Igbo, Marathi, Danish, Norwegian Nynorsk, Central Kurdish, Malayalam, Swahili, Erzya, Moksha, Macedonian and Santali (Ol Chiki).

Note: minor variations in the validated hours of minor dot releases reflects the fact that labeling/validation happens on a different schedule than recording. In the timespan between dot releases the community will usually have performed additional validations, even if the clip cut-off date remains the same.

Regularly scheduled dataset release for H1 of 2021.

  • Date released: 28 July 2021
  • Clip cut-off date: 21 July 2021
  • Total hours: 13,905
  • Total validated hours: 11,192
  • Number of languages: 76

New languages since last major release: Basaa, Slovak, Northern Kurdish, Bulgarian, Kazakh, Bashkir, Galician, Uyghur, Armenian, Belarusian, Urdu, Guarani, Serbian, Uzbek, Azerbaijani, Hausa

Dataset Changes

  • changed tar file naming from cv-corpus-{releaseNumber}-{YYYY-MM-DD}_{locale}.tar.tar cv-corpus-{releaseNumber}-{YYYY-MM-DD}_cv-corpus-{releaseNumber}-{YYYY-MM-DD}-{locale}.tar.gz, e.g. cv-corpus-7.0-2021-07-21_cv-corpus-7.0-2021-07-21-tr.tar.gz`

Update to Singleword Segment 6.1

  • Date released: 28 July 2021
  • Clip cut-off date: 21 July 2021
  • Total hours: 141
  • Total validated hours: 82
  • Number of languages: 34

Correction to Corpus 6.0, which had a bug that did not properly attribute demographics information.

  • Date released: 22 Dec 2020
  • Clip cut-off date: 11 Dec 2020
  • Total hours: 9,283
  • Total validated hours: 7,335
  • Number of languages: 60

Correction to Corpus 6.0, which had a bug that did not properly attribute demographics information.

  • Date released: 22 Dec 2020
  • Clip cut-off date: 11 Dec 2020
  • Total hours: 131
  • Total validated hours: 77
  • Number of languages: 31

Regularly scheduled dataset release for H2 of 2020.

  • Date released: 22 Dec 2020
  • Clip cut-off date: 11 Dec 2020
  • Total hours: 9,261
  • Total validated hours: 7,327
  • Number of languages: 60

New languages since last major release: Hindi, Lithuanian, Luganda, Thai, Finnish, Hungarian

Update to Singleword Segment 5.1

  • Date released: 22 Dec 2020
  • Clip cut-off date: 11 Dec 2020
  • Total hours: 131
  • Total validated hours: 77
  • Number of languages: 31

Correction to Corpus 5.0, which unintentionally altered the column order of the test/train/dev sets, and included some redundant metadata entries for clips that didn’t actually have valid audio.

  • Date released: 14 July 2020
  • Clip cut-off date: 22 June 2020
  • Total hours: 7,226
  • Total validated hours: 5,671*
  • Number of languages: 54

Correction to Singleword Segment 5.0, which was still optimizing for no repeated sentences during segmentation and thus resulted in disproportionately small test/dev/train sets.

  • Date released: 16 September 2020
  • Clip cut-off date: 22 June 2020
  • Total hours: 120
  • Total validated hours: 64
  • Number of languages: 18

Regularly scheduled dataset release for H1 of 2020. This release introduced sha256 checksum values for each dataset, which you can find on the datasets page for each language, or in the datasheet files.

  • Date released: 30 June 2020
  • Clip cut-off date: 22 June 2020
  • Total hours: 7,226
  • Total validated hours: 5,591
  • Number of languages: 7,226

New languages since last major release: Upper Sorbian, Romanian, Frisian, Czech, Greek, Romansh Vallader, Polish, Assamese, Ukranian, Maltese, Georgian, Punjabi, Odia, and Vietnamese

Dataset Changes

  • changed archive folder structure: dataset release archive contains now a locale folder before:
      cv-corpus-3_tr
      ├── clips
      ├── dev.tsv
      ├── invalidated.tsv
      ├── other.tsv
      ├── test.tsv
      ├── train.tsv
      └── validated.tsv
    
    now:
    cv-corpus-5.1-2020-06-22
    └── tr
        ├── clips
        ├── dev.tsv
        ├── invalidated.tsv
        ├── other.tsv
        ├── reported.tsv
        ├── test.tsv
        ├── train.tsv
        └── validated.tsv
    
  • added reported.tsv containing sentences that have been reported by the community
  • added locale and segment columns to the Corpora Creator files

This contains all of the voice data collected as part of the [Common Voice pilot target segment effort] collecting single-word utterances for a benchmark experiment.

  • Date released: 30 June 2020
  • Clip cut-off date: 22 June 2020
  • Total hours: 120
  • Total validated hours: 64
  • Number of languages: 18

Regularly scheduled dataset release for H2 of 2019.

  • Date released: 14 Jan 2020
  • Clip cut-off date: 10 Dec 2019
  • Total hours: 4,257
  • Total validated hours: 3,401
  • Number of languages: 40

New languages since last major release: Abkhazian, Arabic, Chinese (Hong Kong), Indonesian, Interlingua, Japanese, Latvian, Portuguese, Romansh (Sursilvan), Tamil, and Votic.

Dataset Changes

  • changed tar file naming from cv-corpus-{releaseNumber}_{locale}.tar.tar to cv-corpus-{releaseNumber}-{YYYY-MM-DD}_{locale}.tar.tar, e.g. cv-corpus-4-2019-12-10_tr.tar.tar`

Minor update to Corpus 2 to correct an issue with file-naming.

  • Date released: 24 June 2019
  • Clip cut-off date: 24 June 2019 (est)
  • Total hours: 2,454
  • Total validated hours: 1,979
  • Number of languages: 29

New languages since last major release: Persian

Regularly scheduled dataset release for H1 of 2019.

  • Date released: 11 June 2019
  • Clip cut-off date: 11 June 2019 (est)
  • Total hours: 2,366
  • Total validated hours: 1,872
  • Number of languages: 28

New languages since last major release: Basque, Spanish, Chinese (Mandarin), Mongolian, Yakut, Divehi, Kinyarwandan, Swedish, Russian

First multilingual release.

  • Date released: 25 February 2019
  • Clip cut-off date: 25 Feburary 2019 (est)
  • Total hours: 1,368
  • Total validated hours: 1,096
  • Number of languages: 19

New languages since last major release: German, French, Welsh, Breton, Chuvash, Turkish, Tatar, Kyrgyz, Irish, Kabyle, Catalan, Chinese (Taiwan), Slovenian, Italian, Dutch, Hakka Chin, Esperanto, Estonian

Dataset Structure

  • the dataset release folder structure is as follows:
      cv-corpus-1_tr
      ├── clips
      ├── dev.tsv
      ├── invalidated.tsv
      ├── other.tsv
      ├── test.tsv
      ├── train.tsv
      └── validated.tsv
    
  • to get more information about the files included in the dataset release, please see Corpora Creator
  • in general the files dev.tsv, test.tsv, train.tsv, validated.tsv, invalidated.tsv and other.tsv are generated by the Corpora Creator
  • they contain the following columns: client_id, path, sentence, up_votes, down_votes, age, gender, accent

English Corpus 1

  • Date released:
  • Clip cut-off date:
  • Total hours:
  • Total validated hours:
  • Number of languages: 1