General and speech recognition enhancement #140

Open
JorisCos opened this issue Nov 15, 2020 · 2 comments

@JorisCos

About this issue

I am opening this issue to start a discussion about some limitations I have encountered when using Scaper for speech recognition in noisy environments.

General limitations

  • It would be nice to be able to use multiple background files to cover the soundscape duration, i.e., a succession of background files instead of the same file repeated over and over.

  • When the background duration is shorter than the soundscape duration, the background is duplicated. The duplication is done "roughly" using the numpy.tile function. A smoother way could be to apply an ascending Hann window to the new background file and a descending one to the ending background file.

  • A sampling without replacement method could be really useful for choosing background files and source files. By that, I mean that if I have 200 background files and 200 source files and I generate 200 soundscapes then each background file and source file should only be used once.

  • Being able to provide a glob object for choosing those files could be a plus.

  • The file generation function is tied to the writing on disk function which means you can't have a dynamic generator. Having the possibility to generate the soundscape without writing it to the disk would solve that.
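The crossfade idea from the second bullet could be sketched roughly as follows. This is only an illustration of the technique, not Scaper's implementation: the function name `tile_with_crossfade` and its parameters are hypothetical, and it assumes a mono signal held in a 1-D NumPy array with a fade no longer than half the background.

```python
import numpy as np


def tile_with_crossfade(background, target_len, fade_len):
    """Repeat a 1-D signal up to target_len samples, crossfading each junction.

    Hypothetical helper for illustration; not part of Scaper.
    """
    background = np.asarray(background, dtype=float)
    n = len(background)
    if not 0 < 2 * fade_len <= n:
        raise ValueError("fade_len must be positive and at most len(background) / 2")

    # Half-Hann ramps: fade_in rises from ~0 to ~1, fade_out is its complement.
    # They sum to 1 at every sample, so a constant signal stays constant.
    t = (np.arange(fade_len) + 0.5) / fade_len
    fade_in = np.sin(0.5 * np.pi * t) ** 2
    fade_out = 1.0 - fade_in

    # Each copy starts fade_len samples before the previous copy ends.
    hop = n - fade_len
    n_copies = max(1, int(np.ceil((target_len - fade_len) / hop)))
    out = np.zeros(hop * n_copies + fade_len)

    for i in range(n_copies):
        chunk = background.copy()
        if i > 0:
            chunk[:fade_len] *= fade_in    # ascending window on the new copy
        if i < n_copies - 1:
            chunk[-fade_len:] *= fade_out  # descending window on the ending copy
        out[i * hop:i * hop + n] += chunk

    return out[:target_len]
```

Compared with numpy.tile, each repetition overlaps the previous one by `fade_len` samples, so slightly more copies are needed to reach the same duration.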

Speech recognition related

As mentioned in #1 the duration of a soundscape is fixed. This is a real limitation for speech recognition purposes, as utterances have variable durations. Using Scaper forces you to generate soundscapes as long as your longest utterance. Having a parameter that forces the soundscape duration to match the utterance duration would avoid post-processing.

Contribution

As mentioned in #92, a tutorial for source separation could be a cool thing to have. I would be glad to contribute a tutorial that uses Scaper to generate data, Asteroid to perform source separation, and ESPNet to perform speech recognition.

@justinsalamon
Owner

justinsalamon commented Nov 16, 2020

Thanks @JorisCos for your feedback!

Let's discuss:

The file generation function is tied to the writing on disk function which means you can't have a dynamic generator. Having the possibility to generate the soundscape without writing it to the disk would solve that.

Since 1.5.0 Scaper can return audio/annotations in memory (we're on 1.6.4 now).

I would be glad to contribute a tutorial that uses Scaper to generate data, Asteroid to perform source separation, and ESPNet to perform speech recognition.

We presented a source separation tutorial using Scaper (as a real-time data generator) at ISMIR 2020, and wrote it up as an online book: https://source-separation.github.io/tutorial. It focuses on music source separation, but I believe it could be applied directly to speech with little or no modification.

It would be nice to be able to use multiple background files to cover the soundscape duration, i.e., a succession of background files instead of the same file repeated over and over.

When the background duration is shorter than the soundscape duration, the background is duplicated. The duplication is done "roughly" using the numpy.tile function. A smoother way could be to apply an ascending Hann window to the new background file and a descending one to the ending background file.

We've had a few requests for more background event controls, some of which are documented in #47. Can you give it a quick look and then add/elaborate on your required functionality? There are some non-trivial considerations to keep in mind, e.g., how we handle/determine ref_db, but we can map out a solution for this.

As mentioned in #1 the duration of a soundscape is fixed.

Yup, this has been a constraint since the beginning because it simplifies a lot of the (fairly complex) logic around soundscape generation. That said, I agree it'd be nice to have more flexibility here. Happy to discuss further via #1.

A sampling without replacement method could be really useful for choosing background files and source files. By that, I mean that if I have 200 background files and 200 source files and I generate 200 soundscapes then each background file and source file should only be used once.

Right now we support this within a single soundscape via allow_repeated_source in generate(), but we don't support it across soundscapes. At a high level this might fall under issue #35 on high-level controls for Scaper. Please give that issue a quick look to determine whether you want to add to the existing thread or open a separate issue.
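As a possible stopgap, the cross-soundscape constraint can be enforced outside Scaper by drawing the files up front. The sketch below assumes file selection happens before each soundscape is specified. The helper name `pair_without_replacement` is hypothetical and not part of Scaper.

```python
import random


def pair_without_replacement(background_files, source_files, n_soundscapes, seed=None):
    """Return n_soundscapes (background, source) pairs in which no file is reused.

    Hypothetical helper for illustration; not part of Scaper.
    """
    if n_soundscapes > min(len(background_files), len(source_files)):
        raise ValueError("not enough files to avoid reuse")

    rng = random.Random(seed)
    # random.sample draws without replacement, so every file appears at most once.
    backgrounds = rng.sample(list(background_files), n_soundscapes)
    sources = rng.sample(list(source_files), n_soundscapes)
    return list(zip(backgrounds, sources))
```

Each pair could then be pinned for one soundscape with a constant distribution tuple, e.g. source_file=('const', path), and the input lists could come from glob.glob.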

Well, that's quite a bit to unpack! How would you prioritize these issues? We have limited cycles so it would make sense to tackle this by priority.

The way we work on issues is to first discuss them in an issue until we reach consensus about (1) the problem, (2) the high-level solution, and (3) how to implement the solution. Once 1-3 are complete, someone opens a PR.

Cheers

@JorisCos
Author

Thank you for your quick reply!
I think the issue of constraints across soundscapes does fall under #35; I will add to the existing thread.

IMO the priority goes to the duration issue, which we will discuss further in #1.
Next comes adding more control over soundscape generation.
The background-related issue is indeed non-trivial and overall less important.
