Common voice wrong metadata added to supervision set #1325

Roagen7 · 2024-04-18T17:30:35Z

In common_voice.py, _parse_utterance sets SupervisionSegment.text to audio_info[2] (the sentence ID) rather than audio_info[3] (the sentence).

For reference: recently downloaded common_voice_pl dataset has the following columns (in .tsv files):

client_id
path
sentence_id
sentence
sentence_domain
up_votes
down_votes
age
gender
accents
variant
locale
segment

Is this a bug or am I missing something?
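To make the mismatch concrete, here is a minimal sketch that maps the column names listed above to their zero-based positions. The `header` string is typed in by hand from that list, not read from the actual file, so treat it as an assumption about the layout.

```python
# Header of the recently downloaded common_voice_pl .tsv files,
# reconstructed by hand from the column list in this issue (an assumption).
header = (
    "client_id\tpath\tsentence_id\tsentence\tsentence_domain\tup_votes\t"
    "down_votes\tage\tgender\taccents\tvariant\tlocale\tsegment"
)

columns = header.split("\t")
# Map each column name to its zero-based index.
index_of = {name: i for i, name in enumerate(columns)}

print(index_of["sentence_id"], index_of["sentence"])
```

With this layout, index 2 is the sentence ID and index 3 is the sentence text, which is why reading audio_info[2] puts the ID into the text field.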


daniel-dona · 2024-04-19T23:49:46Z

I found the same problem and fixed it by changing the _parse_utterance function. Probably some release of the corpus changed the number of columns.

def _parse_utterance(
    lang_path: Path,
    language: str,
    audio_info: str,
) -> Optional[Tuple[Recording, SupervisionSegment]]:
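    # Column layout of the new .tsv files: 0 client_id, 1 path, 2 sentence_id,
    # 3 sentence, 4 sentence_domain, 5 up_votes, 6 down_votes, 7 age, 8 gender, 9 accents
    audio_info = audio_info.split("\t", -1)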
    audio_path = lang_path / "clips" / audio_info[1]

    if not audio_path.is_file():
        logging.info(f"No such file: {audio_path}")
        return None

    recording_id = Path(audio_info[1]).stem
    recording = Recording.from_file(path=audio_path, recording_id=recording_id)

    segment = SupervisionSegment(
        id=recording_id,
        recording_id=recording_id,
        start=0.0,
        duration=recording.duration,
        channel=0,
        language=language,
        speaker=audio_info[0],
        text=audio_info[3].strip(),
        gender=audio_info[8],
        custom={
            "age": audio_info[7],
            "accents": audio_info[9],
        },
    )
    return recording, segment

Roagen7 · 2024-04-21T17:10:18Z

OK, so I guess it's broken at the moment. For now I work around it by dropping the sentence_id column (the third field) from the .tsv file:

cut -f 3 --complement file.tsv

Although a PR with your change will fix the inconvenience, the column order of the dataset might change again in the future, and the fix would then have to be repeated each time. Wouldn't it be better to make it parameterized?

pzelasko · 2024-04-23T14:10:16Z

Thank you for your help in fixing this. I will merge the fix as soon as the PR is ready. Parsing the rows into dicts and referring to the fields by column name is definitely the way to go.
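The header-based approach could look roughly like the sketch below. It is not the code from the merged PR; `iter_rows` and the sample data are hypothetical, and the sketch only assumes that the first line of each .tsv file is a tab-separated header.

```python
import csv
import io
from typing import Dict, Iterator, TextIO


def iter_rows(tsv_file: TextIO) -> Iterator[Dict[str, str]]:
    """Yield each data row as a dict keyed by the header's column names."""
    # QUOTE_NONE keeps quotation marks inside sentences as literal text.
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        yield row


# Hypothetical two-line sample in the column layout reported in this issue.
sample = (
    "client_id\tpath\tsentence_id\tsentence\tsentence_domain\tup_votes\t"
    "down_votes\tage\tgender\taccents\tvariant\tlocale\tsegment\n"
    "abc123\tclip_0001.mp3\tsid42\tHello world\t\t2\t0\ttwenties\tfemale\t\t\tpl\t\n"
)

rows = list(iter_rows(io.StringIO(sample)))
text = rows[0]["sentence"].strip()
# .get() tolerates corpus releases where a column is missing.
gender = rows[0].get("gender")
print(text, gender)
```

Because fields are looked up by name, inserting or reordering columns in a later corpus release would not change which value ends up in the text field.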

daniel-dona mentioned this issue Apr 23, 2024

In CommonVoice corpus, use .tsv headers to parse and not column index #1328

Merged
