Problem with CutSet.from_manifests #1240

Open
juliendespres opened this issue Dec 18, 2023 · 6 comments

@juliendespres

Hi,
I'm having a problem with the from_manifests function in the CutSet class.

I decomposed a CutSet manifest with CutSet.decompose() to obtain the three files "features", "recordings" and "supervisions", with the aim of modifying the "supervisions" file and then regenerating the CutSet file.

The problem occurs when I try to recombine these three files with CutSet.from_manifests. I get the following error:
Traceback (most recent call last):
  File "local/recompose_manifest.py", line 97, in <module>
    main()
  File "local/recompose_manifest.py", line 86, in main
    cut_set = CutSet.from_manifests(recordings=recordings, supervisions=supervisions, features=features)
  File "/home/despres/miniconda3/envs/k2_2312/lib/python3.8/site-packages/lhotse/cut/set.py", line 352, in from_manifests
    return create_cut_set_eager(
  File "/home/despres/miniconda3/envs/k2_2312/lib/python3.8/site-packages/lhotse/cut/set.py", line 3003, in create_cut_set_eager
    recording=recordings[feats.recording_id] if rec_ok else None,
  File "/home/despres/miniconda3/envs/k2_2312/lib/python3.8/site-packages/lhotse/audio/recording_set.py", line 389, in __getitem__
    return next(
StopIteration

This function works without a problem if I pass any subset of only two files as parameters ("supervisions+features", "features+recordings", "supervisions+recordings").

Is it a bug, or is this function simply not designed for it?

If not, is there another way of regenerating this CutSet file without having to regenerate the features?

Thank you very much for your time.

@pzelasko
Collaborator

I don't think decompose was ever tested in this way, although I would have expected it to work. I'm afraid I don't have enough time right now to look into it myself. Generally you should be able to create a CutSet from 2 components (e.g. features + supervisions) and then manually attach the third one (e.g. recordings) in a for loop. If you happen to find what the issue is, please share it with us.
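
A minimal sketch of the "attach the third one in a for loop" idea, for illustration only. The helper name attach_recordings and the Fake* stand-in classes are hypothetical and exist only so the snippet runs without Lhotse. It assumes each cut exposes features.recording_id and a recording field, and that cuts are dataclasses so dataclasses.replace can copy them; this has not been tested against real Lhotse objects.

```python
from dataclasses import dataclass, replace
from typing import Any, Iterable, List, Optional, Tuple


def attach_recordings(
    cuts: Iterable[Any], recordings: Iterable[Any]
) -> Tuple[List[Any], List[str]]:
    """Attach a recording to each cut, matching on features.recording_id.

    Returns the updated cuts plus the recording IDs that had no match,
    instead of raising on the first missing ID.
    """
    by_id = {rec.id: rec for rec in recordings}
    updated, missing = [], []
    for cut in cuts:
        rec_id = cut.features.recording_id
        rec = by_id.get(rec_id)
        if rec is None:
            # Keep the cut unchanged and remember which ID was not found.
            missing.append(rec_id)
            updated.append(cut)
        else:
            updated.append(replace(cut, recording=rec))
    return updated, missing


# Hypothetical stand-ins for Lhotse's Recording, Features and cut objects.
@dataclass
class FakeRecording:
    id: str


@dataclass
class FakeFeatures:
    recording_id: str


@dataclass
class FakeCut:
    id: str
    features: FakeFeatures
    recording: Optional[FakeRecording] = None


demo_cuts = [FakeCut("c1", FakeFeatures("r1")), FakeCut("c2", FakeFeatures("r2"))]
demo_updated, demo_missing = attach_recordings(demo_cuts, [FakeRecording("r1")])
```

With Lhotse, the cuts would presumably come from CutSet.from_manifests(features=features, supervisions=supervisions), and the updated list could be wrapped again with CutSet.from_cuts. A non-empty list of missing IDs would point at the recording IDs that the failing lookup in the traceback could not find.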

@juliendespres
Author

Thank you for your response.
I'm not sufficiently proficient in Python to do this kind of trick, but I ended up easily replacing the content of the text tag in the jsonl manifest with a simple perl script.
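
For readers who prefer Python over Perl, the same kind of in-place text replacement can be sketched with the standard library alone. This is not the script used above. It assumes each JSONL line holds one cut with a "supervisions" list whose entries carry "id" and "text" keys, and that the manifest is gzip-compressed; the function names are hypothetical.

```python
import gzip
import json
from typing import Dict, Iterable, Iterator


def replace_texts(lines: Iterable[str], new_texts: Dict[str, str]) -> Iterator[str]:
    """Yield JSONL lines with supervision texts replaced, keyed by supervision ID."""
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        cut = json.loads(line)
        for sup in cut.get("supervisions", []):
            if sup.get("id") in new_texts:
                sup["text"] = new_texts[sup["id"]]
        yield json.dumps(cut, ensure_ascii=False)


def rewrite_manifest(src: str, dst: str, new_texts: Dict[str, str]) -> None:
    """Read a gzipped JSONL cut manifest and write a copy with updated texts."""
    with gzip.open(src, "rt", encoding="utf-8") as fin, gzip.open(
        dst, "wt", encoding="utf-8"
    ) as fout:
        for out_line in replace_texts(fin, new_texts):
            fout.write(out_line + "\n")
```

Everything other than the "text" values is passed through unchanged, so the stored features do not need to be regenerated.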

However, this feature seems to me to be essential to avoid having to regenerate features every time you change a comma in the supervision texts, and it would be interesting to be able to do this simply in future Lhotse developments.

@pzelasko
Collaborator

Thanks, you're right. I'll keep the issue open for now.

@pzelasko reopened this Dec 24, 2023

@RuABraun
Contributor

I have the same issue. I'm doing this for the purpose of undoing trim_to_supervisions.

@RuABraun
Contributor

RuABraun commented Jan 10, 2024

Seems to be because features doesn't have a recording_id (or anything else that knows what cut it was a part of).

@pzelasko
Collaborator

Features does have a recording_id field. If you can provide some way to reproduce this with a small dataset like yesno or mini Librispeech, I can look into it.
