Common voice wrong metadata added to supervision set #1325
Comments
I ran into the same problem and fixed it by changing the _parse_utterance function. The number of columns in the .tsv files probably changed in some release of the corpus.

def _parse_utterance(
    lang_path: Path,
    language: str,
    audio_info: str,
) -> Optional[Tuple[Recording, SupervisionSegment]]:
    # One tab-separated row of the .tsv file, split into its columns.
    audio_info = audio_info.split("\t", -1)
    audio_path = lang_path / "clips" / audio_info[1]

    if not audio_path.is_file():
        logging.info(f"No such file: {audio_path}")
        return None

    recording_id = Path(audio_info[1]).stem
    recording = Recording.from_file(path=audio_path, recording_id=recording_id)

    segment = SupervisionSegment(
        id=recording_id,
        recording_id=recording_id,
        start=0.0,
        duration=recording.duration,
        channel=0,
        language=language,
        speaker=audio_info[0],
        # Column 3 holds the sentence; column 2 is the sentence id.
        text=audio_info[3].strip(),
        gender=audio_info[8],
        custom={
            "age": audio_info[7],
            "accents": audio_info[9],
        },
    )
    return recording, segment
Ok so I guess it's broken at the moment. For now I just use this
A PR with your change would fix the immediate problem, but I suppose the column order of the dataset might change again in the future, and the fix would have to be repeated each time. Wouldn't it be better to make it parameterized?
Thank you for your help in fixing this. I will merge the fix as soon as the PR is ready. Parsing the rows into dicts and referring to the fields by column name is definitely the way to go.
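For anyone picking this up, here is a minimal sketch of the dict-based idea, not the actual fix that was merged. It assumes the .tsv files start with a header row and that the columns are named client_id, path, sentence, age, gender and accents; those names, and the helpers iter_tsv_rows and row_to_fields, are hypothetical and should be checked against the downloaded files.

```python
import csv
import io
from typing import Any, Dict, Iterator, TextIO


def iter_tsv_rows(tsv_file: TextIO) -> Iterator[Dict[str, str]]:
    """Yield each data row as a dict keyed by the header's column names."""
    # QUOTE_NONE keeps quote characters inside sentences as literal text.
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        yield row


def row_to_fields(row: Dict[str, str]) -> Dict[str, Any]:
    """Pick fields by name, so the column order no longer matters.

    The column names used here are assumptions about the corpus layout.
    """
    return {
        "speaker": row["client_id"],
        "audio_file": row["path"],
        "text": (row.get("sentence") or "").strip(),
        "gender": row.get("gender") or None,
        "custom": {
            "age": row.get("age") or None,
            "accents": row.get("accents") or None,
        },
    }


# Two made-up headers with different column orders give the same result.
OLD_LAYOUT = "client_id\tpath\tsentence\tage\tgender\taccents\n" \
             "spk1\tclip1.mp3\t Hello world \ttwenties\tfemale\t\n"
NEW_LAYOUT = "client_id\tpath\tsentence_id\tsentence\tage\tgender\taccents\n" \
             "spk1\tclip1.mp3\tid123\t Hello world \ttwenties\tfemale\t\n"

old_fields = [row_to_fields(r) for r in iter_tsv_rows(io.StringIO(OLD_LAYOUT))]
new_fields = [row_to_fields(r) for r in iter_tsv_rows(io.StringIO(NEW_LAYOUT))]
```

With this shape, an extra column such as a sentence id shifts nothing, because the text is looked up as row["sentence"] rather than by position. Missing optional columns come back as None instead of raising an IndexError.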
In common_voice.py, in _parse_utterance, SupervisionSegment.text is being set to audio_info[2] (sentence id) rather than audio_info[3] (sentence).
For reference: a recently downloaded common_voice_pl dataset has the following columns (in .tsv files):
Is this a bug or am I missing something?