SincNet vs. STFT #74
Comments
Hi Alex,
I guess at some point in the future I will report 3 experiments:
Would be cool to compare
@snakers4 I would be very interested to see your results. In particular, I was recently thinking about how one could adapt SincNet to apply the sinc operation on a spectrogram instead of directly on the waveform. This would be similar to the melspectrogram function of PyTorch (https://pytorch.org/audio/transforms.html#melspectrogram), but the filterbanks would change depending on the tunable params. My current main issue is scaling SincNet to larger inputs (15 seconds of waveform). It takes way too much memory even to get to the max pool layer. I asked for suggestions on the PyTorch forum recently (still waiting for a reply): https://discuss.pytorch.org/t/attempting-to-reduce-memory-consumption-by-fusing-conv1d-with-maxpool1d/61448
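For what it's worth, one generic way to cut that memory peak in recent PyTorch is gradient checkpointing, which avoids caching the large pre-pool activation. A minimal sketch with made-up layer sizes (not the actual SincNet configuration):

```python
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint

# Hypothetical SincNet-like front end; the memory peak comes from holding the
# huge pre-pool activation (batch, n_filters, ~240k samples) for backward.
front = nn.Sequential(
    nn.Conv1d(1, 80, kernel_size=251),
    nn.MaxPool1d(3),
)

wav = torch.randn(4, 1, 15 * 16000)  # 15 s of 16 kHz audio, assumed

# Checkpointing recomputes the front end during backward instead of caching
# its activations, trading extra compute for memory.
feats = checkpoint(front, wav, use_reentrant=False)
```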
Hi,
In the last few days I have worked a bit on something related to that. I have modified the filter-bank implementation that you find within torch.audio to make the central frequency and the band of the filters learnable. The differences with SincNet are the following:
1- We operate on the spectrogram rather than directly on the raw speech samples.
2- We compute the weighted average after processing the spectrum with the triangular filters (as done for standard FBANKs).
From my very preliminary experiments, it emerges that the model is extremely compact and better than standard FBANKs, but SincNet still performs better. One reason could be that with SincNet we are not performing the average, and we thus have a much higher temporal resolution. Also, SincNet doesn't discard the phase. This could be helpful, but I don't think it can explain the performance difference.
The implementation will be available in the next months under the SpeechBrain project (https://speechbrain.github.io/).
Best,
Mirco
Hi,
This is exactly what the
This is because the stride is very small. I can suggest using analytic filterbanks (from which a shift-invariant representation can be computed) with a larger stride to solve this problem.
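A minimal sketch of that idea, with made-up sizes: quadrature (cos/sin) kernel pairs form an analytic filterbank, and the modulus of each pair is a smooth envelope that tolerates a large stride while staying approximately shift invariant:

```python
import math
import torch
import torch.nn.functional as F

# Analytic filterbank as paired cos/sin convolutions; sizes are illustrative.
n_filters, kernel_size, stride = 64, 400, 200
t = torch.arange(kernel_size, dtype=torch.float32)
freqs = torch.linspace(0.01, 0.49, n_filters)  # center freqs in cycles/sample
phase = 2 * math.pi * freqs[:, None] * t[None, :]
kernels = torch.cat([torch.cos(phase), torch.sin(phase)]).unsqueeze(1)

wav = torch.randn(1, 1, 16000)
out = F.conv1d(wav, kernels, stride=stride)
real, imag = out[:, :n_filters], out[:, n_filters:]
envelope = torch.sqrt(real ** 2 + imag ** 2)  # shift-invariant magnitude
```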
@mpariente, thank you for your comment and links. Do you have any examples of how one can convert a waveform speech signal to features? I was trying to check out your latest project code but could not figure out how to use it.
@antimora Yes, the code for the filterbanks is complete, but I should write how to use it somewhere, sorry.
I'll soon work on the docs, but don't hesitate to open an issue there.
We are finishing the current set of experiments, and dedicating a couple of weeks to running the comparison of the "first convolutions" seems alluring, because I have not done it at all yet =) Right now we are just using this STFT implementation (for some reason it works faster on CPU than librosa; it also makes it easier for our models to work with TTS voices, because you can just train Tacotron to output STFT and omit the vocoder for now =) ) with these params:
This produces 161-sized feature maps. Many thanks for your code and paper. Meanwhile, a couple of basic dumb questions so I can do some quick and dirty experiments.
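(For reference, the exact parameters are not shown above; one torch.stft setting consistent with 161 frequency bins, assuming 16 kHz audio, is n_fft=320, since 161 = 320 // 2 + 1:)

```python
import torch

# Illustrative STFT call in recent PyTorch; all parameters are assumptions.
wav = torch.randn(1, 16000)
spec = torch.stft(
    wav,
    n_fft=320,          # 20 ms window at 16 kHz -> 161 frequency bins
    hop_length=160,     # 10 ms hop, assumed
    win_length=320,
    window=torch.hann_window(320),
    return_complex=True,
)
mag = spec.abs()        # shape: (1, 161, n_frames)
```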
Or I guess I can also use this constructor you provided here, right?
Could you please give a very rough and dirty memory and compute comparison of the filter-banks available in your repo? I.e. how to make this table correct?
Also maybe could you provide some guidance on the default params I should start with for each filter-bank given our params above?
Are you referring to the stride used in SincNet by default?
Judging from the code, FreeFB is just Conv1D and STFTFB is just Conv1D initialized with STFT.
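That observation can be made concrete with a small sketch: the same Conv1d layer, differing only in initialization. Windowing and normalization are omitted here, so this is only an approximation of what asteroid actually does:

```python
import math
import torch
from torch import nn

# Random init for a "free" filterbank, Fourier (cos/sin) kernels for an STFT one.
n_filters, kernel_size, stride = 512, 400, 100
free_fb = nn.Conv1d(1, n_filters, kernel_size, stride=stride, bias=False)

stft_fb = nn.Conv1d(1, n_filters, kernel_size, stride=stride, bias=False)
t = torch.arange(kernel_size, dtype=torch.float32)
k = torch.arange(n_filters // 2, dtype=torch.float32)
phase = 2 * math.pi * k[:, None] * t[None, :] / kernel_size
basis = torch.cat([torch.cos(phase), torch.sin(phase)])  # real + imaginary parts
with torch.no_grad():
    stft_fb.weight.copy_(basis.unsqueeze(1))
```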
Also, I forgot to ask.
The STFT implementation in asteroid is roughly equivalent. A few points where it differs:
Overall, it is very similar.
It is not trainable.
Yes, exactly.
This will produce 322 features (the output shape is in
You can use exactly the same parameters for all filterbanks, and it makes sense; that's the nice thing about it. So you can have a fair comparison between these filterbanks.
Yes.
Regarding the table, if the same values are used for
Following this logic, if I take the nvidia implementation or your implementation and substitute
This is zero because of
The best practical models that we have use a constant width of 512, so it would hardly make a difference in terms of complexity. But for some reason, in all the STT implementations I have seen, the phases were always discarded. I wonder why?
Correct.
Correct. Note that the nvidia implementation will behave weirdly in this case for an over-complete STFT (i.e. n_filters > kernel_size), because the filters are padded with a lot of zeros, and these zeros will become parameters, so your effective kernel_size will be much bigger in this case. If you always use kernel_size = n_filters, it will be fine.
Because the phase has some non-local properties: it is invariant under global and 2π rotations, which are difficult to model with a DNN. The paper I linked to shows that modern DNNs can benefit from using the complex representation (depending on the size of the window), because the modelling capabilities are higher now.
Many thanks for your explanations. After we complete the current batch of experiments, I guess I will add the following experiments to the queue:
Each experiment is best run for at least ~24 hours, so it will take between 1 and 2 weeks to test all of this. It will also be very interesting to see how transferring the first convolution from CPU to GPU would affect speed and IO - right now my networks are a bit slower than my IO, but I guess there will be some trade-offs here.
@snakers4, if you want, I can also share with you the tunable filter-banks that I'm designing for the SpeechBrain project. They are extremely compact, and these days I'm running several tests to check their performance.
Yeah, why not, since I will be doing a grid-search anyway.
A friend of mine will probably also contribute something wavelet-related @pollytur
Sure, it would be great for me if you could share the results of your grid search when you have something.
Please contact me privately (mirco.ravanelli at gmail.com) for the fbank code!
Just to confirm that I am using your fbanks correctly
Also, I have 2 types of augmentations for spectrograms: Do I understand correctly that I can use (1) with any type of your fbanks, but (0) only with STFT?
Most of it looks fine; the last filterbank definition should be
to take into account the filterbank type, right?
Correct, just a typo.
Hm, correct me here, but if I always discard the "upper" half, i.e.
Yes, you're right, I meant the learnable STFT, not the fixed one, sorry. I edited the answer above to correct this.
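(For reference, the redundancy of the "upper" half for real signals can be checked directly; for a real input the spectrum is conjugate-symmetric, so rfft keeps only the n // 2 + 1 unique bins:)

```python
import torch

# For a real signal, X[n - k] == conj(X[k]), so the upper half of the FFT
# carries no extra information.
x = torch.randn(320)
full = torch.fft.fft(x)   # 320 complex bins
half = torch.fft.rfft(x)  # 161 complex bins (320 // 2 + 1)
assert torch.allclose(full[:161], half, atol=1e-4)
assert torch.allclose(full[161:], full[1:160].flip(0).conj(), atol=1e-4)
```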
I have completed my tests and started the learnable frontend experiments. It is too early to draw any definitive conclusions yet, but:
@mpariente Also, there is a small issue (?) with your implementation
@snakers4 Thanks for the update! I think it is related to this issue, which was fixed. The code is a bit different. Could you try with the current version and tell me if there is still a problem, please?
I could run the above experiment for ~5-10 additional hours, but it did not change much. I also tried a few other things:
- with / without this normalization (which works well with STFT);
- with varying LR / optimizer.
For any combination, the network exploded either outright or after several hundred batches. I mostly did not change the rest of the pipeline, so maybe the current pipeline is over-optimized for STFT and I need to start with some simpler experiments. I have no direction other than to try all of these experiments on some very easy dataset (e.g. some small validation subset).
Thanks a lot for the detailed experiments and results.
The symptom is obviously NaN losses, but I had no time to check what exactly is causing it.
There is gradient clipping in my pipeline; I set it to some high value (100-200), otherwise the network converges slowly (I arrived at this value after grid-searching). I work around NaNs in the losses, but not explicitly in the gradients. Could you maybe share a snippet of how you do it for the gradients? Many thanks!
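For reference, one common pattern for this (an assumption about the setup, not necessarily what asteroid does) is to clip the global gradient norm and skip the optimizer step entirely whenever the norm comes back non-finite, so one bad batch cannot poison the weights:

```python
import torch
from torch import nn

model = nn.Linear(161, 512)                 # stand-in model, for illustration
optimizer = torch.optim.Adam(model.parameters())

x, y = torch.randn(8, 161), torch.randn(8, 512)
loss = nn.functional.mse_loss(model(x), y)
loss.backward()

# clip_grad_norm_ returns the pre-clipping total norm; skip the update if it
# is NaN/inf so the weights never see a poisoned gradient.
grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=100.0)
if torch.isfinite(grad_norm):
    optimizer.step()
optimizer.zero_grad()
```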
Yeah, basically we are taking 1/2 of the convolution, but I believe the root cause is elsewhere.
Our data obviously is "in the wild" and noisy.
Hi,
We are doing STT / TTS for the Russian language.
We mostly used STFT due to our ignorance of DSP and our understanding that the MFCC filters used by everyone may be a bit over-engineered (I have seen no papers actually comparing them properly vs. STFT and similar features for morphologically rich languages).
So, my question is as follows. I understand that your filters contain an order of magnitude fewer parameters than ordinary CNN layers, and in essence are just DSP-inspired frequency filters.
In our experience, we have tried a lot of separable convolutions, and we mostly agree with this paper (i.e. the success of mobile networks shows that convolutions are overrated, but a mix layer + shift layer can do the job).
Here STFT is implemented as a convolution (they inherit their kernels from numpy, but I guess they are similar in essence to the triangular filters from MFCC). So, my questions are:
Best,
Alex