SENTENCES.md

Sentences on Common Voice

As Common Voice is a read dataset, sentences are our currency. You can help by adding new sentences to our dataset for other contributors to read, helping with bulk sentence extractions, or reporting problematic sentences.

In a few words...

📝 The Sentence Collector is a website where contributors upload public domain sentences. The sentences are reviewed there and then exported to the Common Voice database. Once imported into the Common Voice website, they are shown to contributors to read aloud. This is a good place to start for newcomers to the project.

📘 Contributors who want to bulk upload thousands of sentences, for example from books, should follow the Bulk Submission guidelines below. There is no dedicated repository for this.

🖥️ For automatic extraction, the Sentence Extractor pulls sentences from data sources such as Wikipedia, Wikisource, or raw files.

Sentence Collector

The Sentence Collector is a website for crowdsourcing sentences for Common Voice. You can either:

  • Add sentences for your language
  • Validate sentences that other contributors have added

Each sentence requires at least two upvotes from human validation to be considered valid.

Every week, validated sentences from the Sentence Collector are exported and added to the Common Voice repository. They become available with the next release of the Common Voice website.

For more detailed explanations, see the README file of the Sentence Collector.

Automatic extraction

The Sentence Extractor is a tool that can scrape public domain data sources for sentences. There are multiple sources integrated into the Sentence Extractor, such as Wikipedia and Wikisource. Please see this post for detailed guidance on how to use the Sentence Extractor.

Bulk submission

If you know of a public domain corpus of sentences with more than 10k sentences, you can manually submit a pull request to add this as a bulk dataset. However, you will need to manually perform QA (quality assurance) to make sure the sentences are valid and high-quality.

This Discourse post has a more detailed guide for how to do manual QA, but in brief:

  • You need 2-3 native speakers to review a random sample of sentences and verify their correctness.
  • The sentences should be spelled correctly.
  • The sentences should be grammatically correct.
  • The sentences should be speakable (this also means avoiding uncommon non-native words).

We're looking for an error rate below 5% on the random sample. To determine the sample size you need to review, you can use this tool with a confidence level of 99% and a margin of error of 2%.
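If you want to sanity-check the number the tool gives you, the calculation can be sketched in a few lines of Python. This is a minimal sketch that assumes Cochran's formula with a finite population correction and a worst-case proportion of 0.5. The linked tool may use a different formula, and the function name `sample_size` is hypothetical.

```python
import math
from statistics import NormalDist
from typing import Optional


def sample_size(
    confidence: float,
    margin: float,
    population: Optional[int] = None,
    proportion: float = 0.5,
) -> int:
    """Estimate how many sentences to review (Cochran's formula).

    Assumption: proportion=0.5 is the most conservative choice and
    yields the largest sample size.
    """
    # Two-sided z-score for the requested confidence level.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Sample size for an effectively unlimited corpus.
    n = z ** 2 * proportion * (1 - proportion) / margin ** 2
    if population is not None:
        # Finite population correction for a corpus of known size.
        n = n / (1 + (n - 1) / population)
    return math.ceil(n)


if __name__ == "__main__":
    # 99% confidence, 2% margin of error, corpus of 10,000 sentences.
    print(sample_size(0.99, 0.02, population=10_000))
```

Passing the corpus size as `population` reduces the required sample for smaller corpora. Without it, the function returns the size for an unlimited corpus.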

Feel free to set up this QA however makes most sense for you, but here's a sample Google Spreadsheets template.

Once the review is complete, submit a pull request that states the number of sentences submitted, links to the manual QA results, and gives the error rate as a percentage. Here's an example PR. Please make sure the sentences are in a plain .txt file with one sentence per line.
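The numbers requested for the pull request can be gathered with a short script. The following is a rough sketch, not part of the Common Voice tooling: the helper names (`load_sentences`, `draw_sample`, `error_rate_percent`) are hypothetical, and it assumes a UTF-8 file with one sentence per line and one pass/fail verdict per reviewed sentence.

```python
import random
from pathlib import Path
from typing import List, Sequence


def load_sentences(path: str) -> List[str]:
    """Read a plain .txt file with one sentence per line, skipping blank lines."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


def draw_sample(sentences: Sequence[str], size: int, seed: int = 0) -> List[str]:
    """Draw a reproducible random sample for the reviewers."""
    rng = random.Random(seed)
    return rng.sample(list(sentences), min(size, len(sentences)))


def error_rate_percent(verdicts: Sequence[bool]) -> float:
    """Percentage of reviewed sentences marked invalid (False)."""
    if not verdicts:
        raise ValueError("no verdicts to evaluate")
    errors = sum(1 for ok in verdicts if not ok)
    return 100.0 * errors / len(verdicts)
```

A fixed seed lets other contributors reproduce the same sample from the same file, which makes the linked QA results easier to verify.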

QA that applies (or not) to the different inputs and outputs

Depending on the process, different automated transformations are applied:

These resources may help you, for example with the 'not supervised, do your own QA' bulk submission.

Correcting existing data

Not every submission method goes through automated cleanup, validation, or rules, and the methods are not unified. There is therefore a process for removing existing data that needs to be discarded.

Flagging (and removing) problematic sentences already in the Common Voice database

If you notice sentences that need to be deleted, first check what the source of the sentence is.

Search for the sentence within the data folder to find its source. Folders are split up by language. If the sentence is found in sentence-collector.txt, that means it was automatically exported from the Sentence Collector. In that case, please file an issue with a plaintext file of all problematic sentences, listed one sentence per line. Note that in this case removing the sentence from the file through a pull request will not help, because it will automatically be added again with the next export.

If the sentence is from a different source, you can file a pull request that modifies the text file directly. If possible, also attach a separate plaintext file that has all of the problem sentences, with one sentence per line.
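Locating the source file as described above can be sketched as follows. This is an illustration under assumptions, not an official script: `find_sources` and `needs_issue` are hypothetical names, `data_dir` stands for wherever the data folder lives in your checkout, and the sketch assumes the .txt files sit directly inside each language folder.

```python
from pathlib import Path
from typing import List


def find_sources(data_dir: str, language: str, sentence: str) -> List[str]:
    """Return the names of .txt files in the language folder that contain the sentence.

    Assumption: files hold one sentence per line, so an exact line match
    (after stripping whitespace) identifies the sentence.
    """
    lang_dir = Path(data_dir) / language
    target = sentence.strip()
    hits = []
    for txt in sorted(lang_dir.glob("*.txt")):
        with txt.open(encoding="utf-8") as fh:
            if any(line.strip() == target for line in fh):
                hits.append(txt.name)
    return hits


def needs_issue(hits: List[str]) -> bool:
    """True if the sentence came from the Sentence Collector export.

    In that case a pull request would be undone by the next export,
    so an issue should be filed instead.
    """
    return "sentence-collector.txt" in hits
```

If `needs_issue` returns `False` and the sentence was found, the file names in `hits` are the ones to edit in a pull request.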