Common voice wrong metadata added to supervision set #1325

Open
Roagen7 opened this issue Apr 18, 2024 · 3 comments

@Roagen7

Roagen7 commented Apr 18, 2024

In common_voice.py, _parse_utterance sets SupervisionSegment.text to audio_info[2] (the sentence ID) rather than audio_info[3] (the sentence).

For reference, a recently downloaded common_voice_pl dataset has the following columns (in its .tsv files):

  1. client_id
  2. path
  3. sentence_id
  4. sentence
  5. sentence_domain
  6. up_votes
  7. down_votes
  8. age
  9. gender
  10. accents
  11. variant
  12. locale
  13. segment

Is this a bug or am I missing something?
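One way to confirm which index each field occupies in a given release is to enumerate the header row of the .tsv file. The following is only a minimal sketch: the helper name `column_positions` is hypothetical, and the sample header is a shortened version of the column list above rather than a full file.

```python
import csv
import io


def column_positions(tsv_text: str) -> dict:
    """Map each header name to its zero-based column index."""
    # Read only the first (header) row of the tab-separated text.
    header = next(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
    return {name: idx for idx, name in enumerate(header)}


# Shortened header for illustration; a real file has all 13 columns.
sample = "client_id\tpath\tsentence_id\tsentence\tsentence_domain\n"
positions = column_positions(sample)
print(positions["sentence_id"], positions["sentence"])
```

With this layout, sentence_id sits at index 2 and sentence at index 3, which matches the mismatch described above.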

@daniel-dona
Contributor

I found the same problem and fixed it by changing the _parse_utterance function. They probably changed the number of columns in some release of the corpus.

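import logging
from pathlib import Path
from typing import Optional, Tuple

from lhotse.audio import Recording
from lhotse.supervision import SupervisionSegment


def _parse_utterance(
    lang_path: Path,
    language: str,
    audio_info: str,
) -> Optional[Tuple[Recording, SupervisionSegment]]:
    # Split one .tsv row into its columns (0-based):
    # 0 client_id, 1 path, 2 sentence_id, 3 sentence,
    # 7 age, 8 gender, 9 accents.
    audio_info = audio_info.split("\t", -1)
    audio_path = lang_path / "clips" / audio_info[1]

    # Skip rows whose audio clip is missing on disk.
    if not audio_path.is_file():
        logging.info(f"No such file: {audio_path}")
        return None

    recording_id = Path(audio_info[1]).stem
    recording = Recording.from_file(path=audio_path, recording_id=recording_id)

    segment = SupervisionSegment(
        id=recording_id,
        recording_id=recording_id,
        start=0.0,
        duration=recording.duration,
        channel=0,
        language=language,
        speaker=audio_info[0],
        # Index 3 is the sentence; index 2 is now the sentence ID.
        text=audio_info[3].strip(),
        gender=audio_info[8],
        custom={
            "age": audio_info[7],
            "accents": audio_info[9],
        },
    )
    return recording, segment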

@Roagen7
Author

Roagen7 commented Apr 21, 2024

OK, so I guess it's broken at the moment. For now I just use this workaround, which drops the third column (sentence_id) from the .tsv file:

cut -f 3 --complement file.tsv

Although a PR with your change will fix the inconvenience, I suppose they might change the column order of the dataset again in the future, and the fix would have to be redone each time. Wouldn't it be better to make it parameterized?

@pzelasko
Collaborator

Thank you for your help in fixing this. I will merge the fix as soon as the PR is ready. Parsing the rows into dicts and referring to them by column names is definitely the way to go.
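To illustrate the dict-based approach, here is a minimal sketch using the standard library's csv.DictReader. It is not the fix that was merged: the helper names `iter_rows` and `extract_fields` are hypothetical, the sample rows are made up, and quoting is disabled on the assumption that the files should be read the same way as the plain split("\t") above.

```python
import csv
import io
from typing import Dict, Iterator, Optional


def iter_rows(tsv_text: str) -> Iterator[Dict[str, str]]:
    """Yield each data row as a dict keyed by the header's column names."""
    reader = csv.DictReader(
        io.StringIO(tsv_text),
        delimiter="\t",
        quoting=csv.QUOTE_NONE,  # treat quote characters as ordinary text
    )
    yield from reader


def extract_fields(row: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Pick the fields of interest by name, independent of column order."""
    return {
        "speaker": row["client_id"],
        "path": row["path"],
        "text": row["sentence"].strip(),
        # .get() tolerates releases that lack an optional column.
        "age": row.get("age"),
        "gender": row.get("gender"),
        "accents": row.get("accents"),
    }


# Two made-up layouts: one with the extra sentence_id column, one without.
new_layout = (
    "client_id\tpath\tsentence_id\tsentence\tage\tgender\n"
    "spk1\tclip1.mp3\tabc123\t Dzień dobry \ttwenties\tfemale\n"
)
old_layout = (
    "client_id\tpath\tsentence\tage\tgender\n"
    "spk1\tclip1.mp3\tDzień dobry\ttwenties\tfemale\n"
)

new_fields = extract_fields(next(iter_rows(new_layout)))
old_fields = extract_fields(next(iter_rows(old_layout)))
print(new_fields["text"], new_fields == old_fields)
```

Because the fields are looked up by name, both layouts yield the same values, so an added or reordered column would not require another index change.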
