Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Ensure lstsq can handle RHS with all sizes. #9976

Merged
merged 1 commit into from Nov 7, 2017

Conversation

mhvk
Copy link
Contributor

@mhvk mhvk commented Nov 6, 2017

This fixes a bug in the creation of workspace arrays for a call to lapack_lite.zgelsd, which led to segmentation faults when a RHS was passed in that had larger size than the size of the matrix.

fixes #9891

This fixes a bug in the creation of workspace arrays for a call
to `lapack_lite.zgelsd`, which led to segmentation faults when
a RHS was passed in that had larger size than the size of the
matrix.
rwork = zeros((lwork,), real_t)
a_real = zeros((m, n), real_t)
bstar_real = zeros((ldb, n_rhs,), real_t)
results = lapack_lite.dgelsd(m, n, n_rhs, a_real, m,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I must admit that I don't understand why this call for size information on the regular, non-complex routine was made. I suspect it was to work around a bug in zgelsd, which may well have been fixed with the update to lapack-lite.

In any case, with my changes, the routine just uses the results of the first call, which should be correct.

@eric-wieser eric-wieser closed this Nov 6, 2017
@eric-wieser eric-wieser reopened this Nov 6, 2017
@eric-wieser
Copy link
Member

Misclick...

This is actually one of the motivations behind the lapacklite upgrade. I have a vectorised lstsq in my working tree from long ago that also fixes this, but we never agreed on the shape of the output (since right now it varies based on the matrix rank, which doesn't make sense for stacked matrices)

@mhvk
Copy link
Contributor Author

mhvk commented Nov 6, 2017

It indeed looked like this could be done more easily; for now, it may be good to at least fix this obvious problem...

@eric-wieser
Copy link
Member

IIRC, the fix for this was in lapack 3.2.2. I don't remember if this is already the minimum version of lapack that numpy supports.

Relatedly, scipy.linalg.lapack.zgelsd has an explicit check for a suitable lapack version.

@mhvk
Copy link
Contributor Author

mhvk commented Nov 6, 2017

I cannot find what minimum version we are supposed to support... @charris, @rgommers: where should one look? I do note that scipy is moving to lapack >=3.3 in 1.1 (scipy/scipy#6051).

@charris
Copy link
Member

charris commented Nov 6, 2017

@mhvk I don't think we have an official minimum. I think scipy was using 3.2 for Accelerate compatibility on the Mac and I don't know the status of Accelerate for this bug. I think that we are going to drop Accelerate at some point, but that needs discussion and preparation. @rgommers, @pv Thoughts?

@charris
Copy link
Member

charris commented Nov 6, 2017

Apparently the lapack version used in accelerate can be found in the system header and was 3.2.1 back in around 2011. I think we should assume that it is at least 3.2.2 these days, although it would be nice if someone could check.

Scipy currently handles this by falling back to gelss if the driver wasn't specified. See https://github.com/scipy/scipy/blob/master/scipy/linalg/basic.py#L1211 .

@charris
Copy link
Member

charris commented Nov 7, 2017

I want to put this an. Anyone opposed?

@eric-wieser
Copy link
Member

Do we want to add something like this, to guard against older versions?

https://github.com/scipy/scipy/blob/master/scipy/linalg/basic.py#L1212-L1219

@charris
Copy link
Member

charris commented Nov 7, 2017

@eric-wieser That would probably be a good idea, although if we are going to require lapack >= 3.2.2 we could just raise an error rather than falling back on gelss.

@eric-wieser
Copy link
Member

eric-wieser commented Nov 7, 2017

Raising an error is indeed what I was intending to suggest

@eric-wieser
Copy link
Member

Here's the comment where I found the another bug that 3.2.2 makes fixing easy.

@rgommers
Copy link
Member

rgommers commented Nov 7, 2017

@mhvk I don't think we have an official minimum. I think scipy was using 3.2 for Accelerate compatibility on the Mac and I don't know the status of Accelerate for this bug. I think that we are going to drop Accelerate at some point, but that needs discussion and preparation. @rgommers, @pv Thoughts?

This all sounds right, don't have much to add. Numpy should probably just follow Scipy in the Accelerate drop (which indeed still needs prep).

@mhvk
Copy link
Contributor Author

mhvk commented Nov 7, 2017

@eric-wieser, @charris, I just noticed that the work-around in scipy you point to is for real data, not complex; for complex, nothing special is done, which suggests my fix should be OK. For real data, indeed we should eventually follow up on the comment linked above and let iwork be calculated rather than not-quite-correctly calculate it ourselves. But I think that is a separate issue.

@charris charris added this to the 1.13.4 release milestone Nov 7, 2017
@charris
Copy link
Member

charris commented Nov 7, 2017

Thanks Marten.

@charris charris merged commit a2228e9 into numpy:master Nov 7, 2017
@mhvk mhvk deleted the linalg-lstsq-complex-dimension-bug branch November 7, 2017 19:11
@eric-wieser
Copy link
Member

eric-wieser commented Nov 8, 2017

After reverting to an old version of lapack_lite (3.0.x), I now get a segfault with this commit. lapack_lite shouldn't be the problem, it's almost certainly an underlying bug in LAPACK.

Perhaps the workspace size should be initialized to -1, and then a check inserted to see if it changed - better to throw an exception so the user know the problem than just to segfault.

Either that, or we add a release note saying that lapack >= 3.2.2 is now required, and add that to the docs too.

@charris charris removed the 09 - Backport-Candidate PRs tagged should be backported label Nov 26, 2017
@charris charris removed this from the 1.13.4 release milestone Nov 26, 2017
@eric-wieser
Copy link
Member

In this comment: #10156 (comment)

Currently, thanks to Apple Accelerate framework the minimum required LAPACK version is 3.1.ish

Have we tested this on Mac?

@ilayn
Copy link
Contributor

ilayn commented Dec 12, 2017

To be concrete, the exact version is 3.2.1. And the bug fixes for zgelsd came in v3.2.2 and not sure if related but the C interfaces are introduced by MKL team in v3.3.0 http://www.netlib.org/lapack/lapack-3.3.0.html

@ghost
Copy link

ghost commented Dec 12, 2017

To be clear, is the segmentation fault confined to apple accelerate?

@eric-wieser
Copy link
Member

The segfault mentioned above was tested on windows with an old lapack_lite (not an issue in release, which bundles a newer lapack_lite). I'm speculating that it will fail on accelerate, but have no way to find out.

@mhvk
Copy link
Contributor Author

mhvk commented Dec 12, 2017

Somehow I thought we had actually gotten rid of Accelerate support...

@charris
Copy link
Member

charris commented Dec 13, 2017

Somehow I thought we had actually gotten rid of Accelerate support...

It has been discussed, maybe we should move forward on that for the next release. @rgommers Thoughts?

@ghost
Copy link

ghost commented Dec 13, 2017 via email

@ghost
Copy link

ghost commented Dec 13, 2017 via email

@eric-wieser
Copy link
Member

eric-wieser commented Dec 13, 2017

Can we focus on determining whether we need to rollback this change in 1.13.4 and 1.14.0 before we discuss whether we should drop accelerate in 1.15.0?

@ghost
Copy link

ghost commented Dec 13, 2017 via email

@eric-wieser
Copy link
Member

#10087 was, which is a backport

@ghost
Copy link

ghost commented Dec 13, 2017

That PR says that it was closed, not merged. Can you find that commit in the branch?

@ghost
Copy link

ghost commented Dec 13, 2017

The only commits that I can find are a3c489f and a2228e9, but they don't have a 1.13.x tag.

@eric-wieser
Copy link
Member

I hadn't spotted that it was closed not merged. Still, something we need to worry about for 1.14.0

@ghost
Copy link

ghost commented Dec 13, 2017

That's true. However, would reverting this PR on 1.14 re-introduce gh-9891 with the latest LAPACK? If so, I'm not in favor of reverting it.

@eric-wieser
Copy link
Member

It would, but it would be possible to do a hybrid approach by checking the workspace sizes are not unset

@mhvk
Copy link
Contributor Author

mhvk commented Dec 13, 2017

If it is an actual issue, can we drop accellerate for 1.14? We haven't actually released 1.14.0 yet.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

lstsq problem
5 participants