fftpack → fft #381

Open · wants to merge 10 commits into main

Conversation

@bmcfee (Collaborator) commented Apr 9, 2024

Fixes #373

This PR implements a number of small improvements to the separation module:

  • Migrate to the newer scipy.fft API, instead of fftpack
  • Replace fft() by rfft()
  • Use the built-in helper for computing fast FFT lengths instead of brute-forcing 2**k
  • Remove padding when we can

This should altogether get us about a 50% speedup in separation metrics and bring down the test build times considerably. I don't think this will affect accuracy at all. However, we should probably also roll in #376 once the dust settles.


This PR bumps the minimum supported scipy to 1.5.4 (from 1.4.0) to avoid a bug in the next_fast_len implementation for real-valued inputs. 1.5.4 was released in Nov. 2020, and 1.4.0 in Dec. 2019. Both are far enough in the past that I don't consider it a big deal, but I am noting it here in case it warrants discussion.
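
For reference, here's a minimal sketch of the kind of change involved (the function and variable names are illustrative, not the actual code in separation.py):

import numpy as np
from scipy.fft import rfft, irfft, next_fast_len

def fft_convolve(x, y):
    # Old pattern: scipy.fftpack.fft at a brute-forced power-of-two length,
    # e.g. n_fft = int(2 ** np.ceil(np.log2(len(x) + len(y) - 1)))
    n_full = len(x) + len(y) - 1

    # New pattern: real-valued transforms at a fast (not necessarily 2**k) length.
    # next_fast_len(..., real=True) is the call that motivates the scipy 1.5.4 bump.
    n_fft = next_fast_len(n_full, real=True)

    out = irfft(rfft(x, n=n_fft) * rfft(y, n=n_fft), n=n_fft)
    return out[:n_full]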

@bmcfee (Collaborator) commented Apr 9, 2024

This is interesting. Tests are essentially all passing, except for framewise image on dataset number 5. I believe this case has been a heisenbug in the past, at least on OSX, but it's now affecting all platforms.

What's strange is that the deviation is extremely non-uniform: on some platforms it's off by about 7 dB, on some 10, and on some 23. This leads me to believe the underlying cause is numerical instability in a least squares estimation, so that we're highly sensitive to random initialization and floating point ops, and this example just happens to be particularly good at triggering it. I'm not yet 100% certain, but that seems like the most likely explanation here, since all other parts are passing with the specified 0.01 dB tolerance.

@bmcfee (Collaborator) commented Apr 9, 2024

Interestingly, dataset06 also fails on my dev machine, but only by 0.02 dB (the #376 issue).

@bmcfee (Collaborator) commented Apr 10, 2024

Ok, I've dug into this failure a little more. It's definitely a numerical stability issue.

The offending case is ref07 (numbering doesn't line up with fixture numbering since we removed a couple of test cases some years back). In this data, the second reference source goes dead silent in the center of the recording, but with a negative DC:
[figure: waveform of the second reference source, which goes silent with a negative DC offset in the middle of the recording]

When divided up into frames, we get numerical equivalency in framewise metrics everywhere except for one frame:

(proposed rfft branch)

>>> imageframe_scores[key]
[[1.756411590484389,
  1.7597473682072609,
  1.7602869182177234,
  1.7594335093388715,
  1.7595170180901967,
  1.759276007646068,
  1.7591737609125007,
  1.759391260702998,
  1.760218424402917,
  1.7600362671204706],
 [1.6992851561218936,
  -8.959569954271648,
  1.7050670369298637,
  1.704666147437159,
  1.7049902335582687,
  1.7049335634150566,
  1.70516553136305,
  1.7047425405295848,
  1.704666154642886,
  1.7048499625553668]]

compared to our target pre-computed scores:

>>> expected_image_frames[test_data_name]
[[1.7564115904843982,
  1.7597473682073619,
  1.7602869182174874,
  1.7594335093388682,
  1.7595170180900648,
  1.7592760076460983,
  1.759173760912185,
  1.7593912607029725,
  1.7602184244029835,
  1.7600362671204637],
 [1.699285156121919,
  1.7045012461538642,
  1.7050670369289587,
  1.704666147436396,
  1.7049902335583038,
  1.704933563419315,
  1.7051655313634067,
  1.7047425405294456,
  1.7046661546430155,
  1.7048499625703208]]

The diff comes to:

array([[-9.10382880e-15, -1.01030295e-13,  2.36033415e-13,
         3.33066907e-15,  1.31894495e-13, -3.01980663e-14,
         3.15747428e-13,  2.55351296e-14, -6.63913369e-14,
         6.88338275e-15],
       [-2.55351296e-14, -1.06640712e+01,  9.05053810e-13,
         7.62945263e-13, -3.50830476e-14, -4.25837143e-12,
        -3.56603636e-13,  1.39221967e-13, -1.29452005e-13,
        -1.49540380e-11]])

I expect what's happening here is that the silent frame of the reference signal causes a severe instability in the matrix inversion, but not so severe as to trigger an exception here:

try:
    C = np.linalg.solve(G, D).reshape(flen, nchan * nsrc, nchan, order="F")
except np.linalg.linalg.LinAlgError:
    C = np.linalg.lstsq(G, D)[0].reshape(flen, nchan * nsrc, nchan, order="F")

I put in some debug statements, and we're definitely not hitting the lstsq code path. Replacing the np.linalg.solve call by scipy.linalg.solve confirms this via warnings:

/home/bmcfee/git/mir_eval/mir_eval/separation.py:793: LinAlgWarning: Ill-conditioned matrix (rcond=1.10403e-22): result may not be accurate.
  C = scipy.linalg.solve(G, D).reshape(flen, nchan * nsrc, nchan, order="F")

I'm now testing out a fix that re-casts this LinAlgWarning as an exception so that we can fall back on lstsq. So far it seems promising; an updated push is coming shortly.
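
Roughly, the fix I'm testing looks something like this (a sketch with a hypothetical helper name, not the exact patch):

import warnings
import scipy.linalg
from scipy.linalg import LinAlgError, LinAlgWarning

def solve_or_lstsq(G, D):
    # Promote scipy's ill-conditioning warning to an error so that we actually
    # take the least-squares fallback instead of silently accepting a bad
    # solution from solve().
    with warnings.catch_warnings():
        warnings.simplefilter("error", category=LinAlgWarning)
        try:
            return scipy.linalg.solve(G, D)
        except (LinAlgError, LinAlgWarning):
            return scipy.linalg.lstsq(G, D)[0]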

@bmcfee (Collaborator) commented Apr 10, 2024

Confirmed that the revised LinAlgWarning catch-and-fallback fixes the issue in this case, at least on my machine. With any luck, this will resolve #376 and we won't have to raise tolerances further.

Implementing this did require switching from np.linalg to scipy.linalg (for its richer error handling). I don't think this should be a problem, but I'm noting it here.

@bmcfee (Collaborator) commented Apr 10, 2024

Well, bsseval is now about a million times slower; I'm guessing that's because it's now properly detecting ill-conditioning where it wasn't before. It's kind of a miracle that it worked previously, IMO.

@bmcfee added the bug label Apr 10, 2024

@bmcfee (Collaborator) commented Apr 10, 2024

Adding the 🐛 label here because I'm becoming convinced that the present implementation shouldn't be trusted. To summarize where we're at:

  • np.linalg.solve was letting ill-conditioned data through silently instead of triggering the least squares fallback.
  • Switching to scipy.linalg fixes this, and we're now much more likely to hit the lstsq code path. This is a good thing for accuracy, as the G matrix is quite likely to be rank-deficient in practice.
  • Switching to lstsq is a bad thing for efficiency, though, which is why we don't always use it.

The bsseval package works around this by explicitly regularizing the problem, which I think I advocated several zillion years ago. I'm not sure if we want to go down that road now, but I also don't see any reasonable way to preserve the current behavior.
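
For context, the regularization amounts to something like the sketch below (illustrative only; the epsilon value and where it enters museval's machinery differ in detail):

import numpy as np
import scipy.linalg

def ridge_solve(G, D, eps=1e-10):
    # Add a small ridge to the Gram matrix so the system stays well conditioned
    # and the slow lstsq fallback is never needed.
    return scipy.linalg.solve(G + eps * np.eye(G.shape[0]), D)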

@codecov-commenter commented Apr 10, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 95.81%. Comparing base (485a425) to head (5281268).


Additional details and impacted files
@@            Coverage Diff             @@
##             main     #381      +/-   ##
==========================================
+ Coverage   95.67%   95.81%   +0.13%     
==========================================
  Files          19       19              
  Lines        2936     2936              
==========================================
+ Hits         2809     2813       +4     
+ Misses        127      123       -4     
Flag        Coverage Δ
unittests   95.81% <100.00%> (+0.13%) ⬆️


@craffel (Owner) commented Apr 12, 2024

Thanks for all this sleuthing. Can we get an additional opinion from a source separation eval expert, maybe @faroit?

@bmcfee (Collaborator) commented Apr 12, 2024

As an aside, I do think we could also significantly accelerate everything here by rewriting it with proper vectorization instead of nested for loops. However, I don't have the fortitude for a yak shave of this magnitude.
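
(For the record, the kind of vectorization I mean: np.linalg.solve broadcasts over stacked systems, so a per-frame loop of many small solves can collapse into one batched call. The shapes below are made up for illustration.)

import numpy as np

rng = np.random.default_rng(0)
n_frames, n = 100, 8
G = rng.standard_normal((n_frames, n, n))  # one small system per frame
D = rng.standard_normal((n_frames, n, 1))

# What nested for-loops amount to:
C_loop = np.stack([np.linalg.solve(G[i], D[i]) for i in range(n_frames)])

# One batched solve over the leading frame dimension:
C_batch = np.linalg.solve(G, D)

assert np.allclose(C_loop, C_batch)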

@faroit commented Apr 16, 2024

@bmcfee @craffel Looking into this now. I'm not sure if it's related, but I had some similar issues with FFT precision when torchaudio made some changes a few months ago...

@bmcfee (Collaborator) commented Apr 16, 2024

I had some similar issues with FFT precision when torchaudio made some changes a few months ago

Yeah, it shouldn't be surprising. TBH I've never done a full deep dive on bsseval, but from the parts I've encountered, many of the quantities involved seem fundamentally plagued by numerical instabilities. They do seem fixable, e.g. by either regularization (ridge regression) or reconceptualizing the definition to account for rank-deficient projections (least squares), but neither of these leads to a compatibility-preserving implementation.

My casual bystander opinion is that we really shouldn't be striving to preserve compatibility here, since the current implementation is obviously a bit broken. OTOH, I'm not sure it's mir_eval's job to go about changing metrics without at least some due diligence (i.e. research and publications) evaluating the changes.

I don't follow the source separation literature closely (maybe I should?) but I do seem to recall seeing at least a couple of papers proposing revisions to bsseval, maybe from @rabitt or some of the MERL folks? Maybe there's something there we could build on to improve the current implementation?

@bmcfee (Collaborator) commented May 15, 2024

Circling back on this one today, and it's still pretty ugly. I've had my head down in the code for a little bit now, and am increasingly convinced that this whole thing needs a rewrite. (Relax: I don't want to do that in this PR.)

To recap things, here's where we're at:

  • The metrics rely on solving linear systems. Sometimes the design matrix is ill conditioned. In these cases, we had logic in place to fall back on a least squares solution (linalg.lstsq instead of linalg.solve).
  • The fall-back logic had not been working properly, which allowed bad solutions to silently propagate through using only linalg.solve.
  • Fixing this fall-back logic corrected the solutions, but we're now immensely slower.
  • Some of the regression tests still fail on certain platforms due to numerical stability issues.

TLDR: fixing things made it both slower and more prone to failure. Is this progress? I don't know. But things are at least marginally more reproducible now, and we don't have silent failures anymore.

So what should we do? I see a few options:

  • Full YOLO: Roll back this PR entirely, expand the tolerance window so that tests "pass", and cut a release. Behavior is preserved from previous releases, and is definitely still wrong.
  • Half YOLO: Let it go as is in this PR, but adjust the test tolerance to account for the intrinsic lack of stability in these metrics. Behavior is slightly more correct, incompatible with prior releases, and slower.
  • Quarter YOLO: Follow sigsep-mus-eval and add a ridge to the solve calls. If done correctly, this should obviate the need for a least squares fallback entirely, so things should always stay in the "fast" regime. Note that their implementation still has the (old, incorrect) fallback code inherited from our implementation, but I don't think it should ever execute those branches. Behavior is different (neither more nor less correct), incompatible with prior releases, and of comparable speed.
  • ( °□°) ︵ ┻━┻: Full YOLO, but deprecate this module entirely in favor of museval. We have essentially parallel code bases here, with similar problems, and if we're going to start mucking around with important changes, I think we'd be inviting trouble to do it in one place but not the other.

@craffel (Owner) commented May 15, 2024

Thanks for the summary. I'm in favor of the ridge solution if it's generally accepted by the community, but I suppose we'd need to be careful about communicating what metric exactly we have implemented here (so that people don't erroneously compare to similar-but-different ridge-free implementations). Secondarily, how did it come to pass that there are parallel codebases?

@bmcfee (Collaborator) commented May 15, 2024

I'm in favor of the ridge solution if it's generally accepted by the community, but I suppose we'd need to be careful about communicating what metric exactly we have implemented here (so that people don't erroneously compare to similar-but-different ridge-free implementations).

I totally agree in principle, but I don't think we can hope to avoid erroneous comparisons here. People are going to copypasta results tables from older papers, and there's really nothing we can do to stop them, with the possible exception of leaving the old-and-busted implementations as-is and defining new terms for the stabilized implementations. But then again, the museval code is already stabilized and uses the same terms, so 🤷

Secondarily, how did it come to pass that there are parallel codebases?

TBH, I think a good portion of this comes down to us not being good enough at staying on top of maintenance and releases. There's also the issue of adding functionality, e.g. #68 and #272, which we let slide. Forks are inevitable when these things happen.

Then there are total rewrites for speed / platform differences, e.g. https://github.com/fakufaku/fast_bss_eval/ and https://github.com/fgnt/ci_sdr.

@craffel (Owner) commented May 15, 2024

Hm, yes, maybe based on #68 (comment) it would make most sense to remove it or just call out to museval?

@bmcfee (Collaborator) commented May 15, 2024

Hm, yes, maybe based on #68 (comment) it would make most sense to remove it or just call out to museval?

Lol, I'd forgotten that we'd already had this exact conversation before, and apparently agreed that this is neither our circus nor our monkeys. 😆

What I think makes sense here is to implement a deprecation cycle, and then remove the module entirely. A shim to museval could be convenient, but I don't like it in general for two reasons:

  1. It now adds a dependency that we didn't have before
  2. museval would then have a dependent (mir_eval) that could lock them into certain APIs or numerical compatibility constraints. This would make it more difficult for museval to evolve in the future.

@faroit what do you think?
