General and speech recognition enhancement #140

JorisCos · 2020-11-15T21:43:17Z

About this issue

I'm opening this issue to start a discussion about some limitations I have encountered when using Scaper for speech recognition in noisy environments.

General limitations

It would be nice to be able to use several background files to cover the soundscape duration. By that, I mean a succession of different background files instead of the same file repeated over and over.
When the background is shorter than the soundscape, the background is duplicated. The duplication is done "roughly" using the numpy.tile function. A smoother approach would be to apply an ascending Hann window to the incoming background file and a descending one to the ending background file.
Sampling without replacement would be really useful for choosing background files and source files. By that, I mean that if I have 200 background files and 200 source files and I generate 200 soundscapes, then each background file and source file should be used only once.
Being able to provide a glob pattern for choosing those files would be a plus.
The file generation function is tied to the function that writes to disk, which means you can't have a dynamic generator. Being able to generate the soundscape without writing it to disk would solve that.
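To make the crossfade idea concrete, here is a minimal NumPy sketch. It is not how Scaper builds its backgrounds; the function name `tile_with_crossfade` and the `fade_len` parameter are hypothetical and exist only for illustration. It assumes a mono (1-D) signal and uses the two halves of a periodic Hann window, so the fade-in and fade-out gains sum to one at every sample of the overlap.

```python
import numpy as np


def tile_with_crossfade(background, target_len, fade_len):
    """Repeat a 1-D signal until it covers target_len samples,
    crossfading each junction instead of hard-tiling it.

    Hypothetical helper for illustration; not part of Scaper.
    """
    if not 0 < fade_len < len(background):
        raise ValueError("fade_len must be positive and shorter than the background")

    # Periodic Hann window of length 2 * fade_len: the rising half and the
    # falling half add up to exactly 1 at every sample of the overlap.
    window = np.hanning(2 * fade_len + 1)[:-1]
    fade_in, fade_out = window[:fade_len], window[fade_len:]

    segment = np.asarray(background, dtype=float)
    out = segment.copy()
    while len(out) < target_len:
        # Overlap-add: the tail of the current signal fades out while
        # the head of the next copy fades in.
        out[-fade_len:] = out[-fade_len:] * fade_out + segment[:fade_len] * fade_in
        out = np.concatenate([out, segment[fade_len:]])
    return out[:target_len]
```

Each repetition adds `len(background) - fade_len` samples, so more copies are needed than with plain tiling. To chain different background files rather than repeat one, the next file would take the place of `segment` inside the loop.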

Speech recognition related

As mentioned in #1, the duration of a soundscape is fixed. This is a real limitation for speech recognition purposes, as utterances have variable durations. Using Scaper forces you to generate soundscapes as long as your longest utterance. Having a parameter to force the soundscape duration to match the utterance duration would avoid post-processing.

Contribution

As mentioned in #92, a tutorial for source separation could be a cool thing to have. I would be glad to contribute a tutorial that uses Scaper to generate data, Asteroid to perform source separation, and ESPnet to perform speech recognition.

justinsalamon · 2020-11-16T17:58:30Z

Thanks @JorisCos for your feedback!

Let's discuss:

The file generation function is tied to the function that writes to disk, which means you can't have a dynamic generator. Being able to generate the soundscape without writing it to disk would solve that.

Since 1.5.0 Scaper can return audio/annotations in memory (we're on 1.6.4 now).
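A minimal sketch of what an on-the-fly generator could look like with in-memory generation, assuming Scaper 1.5.0 or later. The wrapper name `soundscape_generator`, the folder arguments, and all distribution values (ref_db, SNR range, event timing) are placeholders chosen for illustration, not recommended settings.

```python
def soundscape_generator(fg_path, bg_path, duration=10.0, n_soundscapes=5, seed=0):
    """Yield (audio, annotation_list) pairs without writing anything to disk.

    Hypothetical wrapper; fg_path and bg_path must follow Scaper's
    one-subfolder-per-label layout.
    """
    # Imported lazily so the sketch can be read and loaded without Scaper installed.
    import scaper

    for i in range(n_soundscapes):
        sc = scaper.Scaper(duration, fg_path, bg_path, random_state=seed + i)
        sc.ref_db = -20  # placeholder reference loudness

        # Pick any background label and file, starting at the beginning of the file.
        sc.add_background(label=("choose", []),
                          source_file=("choose", []),
                          source_time=("const", 0))

        # One foreground event that always fits inside the soundscape.
        sc.add_event(label=("choose", []),
                     source_file=("choose", []),
                     source_time=("const", 0),
                     event_time=("uniform", 0, duration / 2),
                     event_duration=("const", duration / 2),
                     snr=("uniform", 5, 15),
                     pitch_shift=None,
                     time_stretch=None)

        # With no audio_path or jams_path, nothing is written to disk and
        # the results are only returned in memory.
        audio, jam, annotation_list, event_audio_list = sc.generate(
            audio_path=None, jams_path=None)
        yield audio, annotation_list
```

Iterating with `for audio, annotations in soundscape_generator("foreground", "background"):` would then feed a training loop directly; the two folder names here are assumptions.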

I would be glad to contribute a tutorial that uses Scaper to generate data, Asteroid to perform source separation, and ESPnet to perform speech recognition.

We gave a source separation tutorial using Scaper (as a real-time data generator) at ISMIR 2020, and wrote it up as an online book: https://source-separation.github.io/tutorial. It focuses on music source separation, but I believe it could be applied to speech with little or no modification.

It would be nice to be able to use several background files to cover the soundscape duration. By that, I mean a succession of different background files instead of the same file repeated over and over.

When the background is shorter than the soundscape, the background is duplicated. The duplication is done "roughly" using the numpy.tile function. A smoother approach would be to apply an ascending Hann window to the incoming background file and a descending one to the ending background file.

We've had a few requests for more background event controls, some of which are documented in #47. Can you give it a quick look and then add/elaborate on your required functionality? There are some non-trivial considerations to keep in mind, e.g., how we handle/determine ref_db, but we can map out a solution for this.

As mentioned in #1 the duration of a soundscape is fixed.

Yup, this has been a constraint since the beginning because it simplifies a lot of the (fairly complex) logic around soundscape generation. That said, I agree it'd be nice to have more flexibility here. Happy to discuss further via #1.

Sampling without replacement would be really useful for choosing background files and source files. By that, I mean that if I have 200 background files and 200 source files and I generate 200 soundscapes, then each background file and source file should be used only once.

Right now we support this within a single soundscape via allow_repeated_source in generate(), but we don't support it across soundscapes. At a high level this might fall under issue #35 on high-level controls for Scaper. Please give that issue a quick look to determine whether you want to add to the existing thread or open a separate issue.
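Until something like this exists in Scaper, sampling without replacement across soundscapes can be handled outside the library. The sketch below uses only the standard library; `pair_without_replacement` is a hypothetical helper, not a Scaper function. It shuffles both file lists once and pairs them, so no file is reused.

```python
import random


def pair_without_replacement(background_files, source_files, seed=None):
    """Return (background, source) pairs in which every file appears at most once.

    Hypothetical helper for illustration; not part of Scaper.
    """
    rng = random.Random(seed)
    backgrounds = list(background_files)
    sources = list(source_files)
    # Shuffling each list once and zipping is equivalent to drawing
    # without replacement from both pools.
    rng.shuffle(backgrounds)
    rng.shuffle(sources)
    # zip stops at the shorter list, so leftover files are simply unused.
    return list(zip(backgrounds, sources))
```

The input lists could come from `glob.glob` with whatever pattern matches your dataset, and each pair could then be passed to one Scaper instance by using a `("const", path)` tuple for `source_file`.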

Well, that's quite a bit to unpack! How would you prioritize these issues? We have limited cycles so it would make sense to tackle this by priority.

The way we work on issues is to first discuss them in an issue until we reach consensus about (1) the problem, (2) the high-level solution and (3) how to implement the solution. Once we complete 1-3, someone opens a PR.

Cheers

JorisCos · 2020-11-17T13:00:04Z

Thank you for your quick reply!
I think the constraint issue across soundscapes does fall under #35, so I will add to the existing thread.

IMO the priority goes to the duration issue, which we will discuss further in #1.
Then comes adding more control over soundscape generation.
The background-related issue is indeed non-trivial and overall less important.
