Fix failing steadystate tests #1694

Closed

Conversation

hodgestar
Contributor

@hodgestar hodgestar commented Oct 21, 2021

Description

We have steadystate tests that fail almost always in CI on Python 3.9 with OpenMP and MKL, and sometimes with just Python 3.9 and MKL.

The issue is currently hard to reproduce locally.

Related issues or PRs

Progress so far

  • Fixed a small issue in the steadystate tests so that pytest-repeat can run them with --count=100, in the hope of reproducing the bug locally.
  • Removed mutable default c_ops arguments for steadystate and liouvillian.
  • Fixed the reference to method in _pseudo_inverse_sparse.
  • Only set method in pseudo_inverse if one is explicitly defined. (revert)
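
For readers unfamiliar with why mutable default arguments are worth removing, here is a minimal sketch of the general Python pitfall. The function names `collect_ops_shared` and `collect_ops_fresh` are hypothetical and are not the actual qutip signatures; only the `c_ops` parameter name is borrowed from the list above.

```python
def collect_ops_shared(op, c_ops=[]):
    # Pitfall: the default list is created once, when the function is defined,
    # so every call that omits c_ops appends to the same list object.
    c_ops.append(op)
    return c_ops


def collect_ops_fresh(op, c_ops=None):
    # Safer pattern: use None as the sentinel and build a new list per call.
    c_ops = [] if c_ops is None else list(c_ops)
    c_ops.append(op)
    return c_ops


first = collect_ops_shared("a")
second = collect_ops_shared("b")   # state leaks in from the previous call
fresh_one = collect_ops_fresh("a")
fresh_two = collect_ops_fresh("b")  # each call starts from an empty list
```

State leaking between calls in this way can make test outcomes depend on execution order, which is one reason such defaults are usually cleaned up regardless of whether they cause the failure at hand.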

Changelog
TODO: Write the changelog once we understand properly what is going on.

@hodgestar
Contributor Author

@Ericgig I've started this branch specifically to tackle the strange steadystate (and other) test failures.

@coveralls

coveralls commented Oct 21, 2021

Coverage Status

Coverage remained the same at 16.832% when pulling 4b027c3 on hodgestar:fix/failing-steady-state-tests into 091574d on qutip:master.

@hodgestar
Contributor Author

I think we finally have a concrete error and it's rather mystifying to me how it can happen:

            E = spla.expm(A.toarray())
            if np.isnan(E).any():
                print("A:", A)
                print("A data:", A.indices, A.indptr, A.shape)
                print("A toarray:", A.toarray())
                print("E:", E)
>               raise RuntimeError("NaNs generated by sp_expm.")
E               RuntimeError: NaNs generated by sp_expm.

qutip/sparse.py:408: RuntimeError
----------------------------- Captured stdout call -----------------------------
A:   (0, 1)	(-0.5+0j)
  (1, 0)	(0.5+0j)
  (1, 2)	(-0.7071067811865476+0j)
  (2, 1)	(0.7071067811865476+0j)
  (2, 3)	(-0.8660254037844386+0j)
  (3, 2)	(0.8660254037844386+0j)
  (3, 4)	(-1+0j)
  (4, 3)	(1+0j)
A data: [1 0 2 1 3 2 4 3] [0 1 3 5 7 8] (5, 5)
A toarray: [[ 0.        +0.j -0.5       +0.j  0.        +0.j  0.        +0.j
   0.        +0.j]
 [ 0.5       +0.j  0.        +0.j -0.70710678+0.j  0.        +0.j
   0.        +0.j]
 [ 0.        +0.j  0.70710678+0.j  0.        +0.j -0.8660254 +0.j
   0.        +0.j]
 [ 0.        +0.j  0.        +0.j  0.8660254 +0.j  0.        +0.j
  -1.        +0.j]
 [ 0.        +0.j  0.        +0.j  0.        +0.j  1.        +0.j
   0.        +0.j]]
E: [[nan+nanj nan+nanj nan+nanj nan+nanj nan+nanj]
 [nan+nanj nan+nanj nan+nanj nan+nanj nan+nanj]
 [nan+nanj nan+nanj nan+nanj nan+nanj nan+nanj]
 [nan+nanj nan+nanj nan+nanj nan+nanj nan+nanj]
 [nan+nanj nan+nanj nan+nanj nan+nanj nan+nanj]]

See https://github.com/qutip/qutip/runs/3966808806?check_suite_focus=true#step:6:1646
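
To poke at this outside CI, the printed matrix can be rebuilt by hand. The sketch below assumes that `spla` in the traceback refers to `scipy.linalg`, and that the off-diagonal entries follow the pattern sqrt(k)/2, which matches the printed values 0.5, 0.7071..., 0.8660... and 1. The helper name `rebuild_A` is hypothetical. On an unaffected setup the result should be finite and, because `A` is real and antisymmetric, unitary.

```python
import numpy as np
import scipy.linalg


def rebuild_A(n=5):
    # Entries sqrt(k)/2 for k = 1..n-1: positive on the subdiagonal,
    # negative on the superdiagonal, as in the captured output.
    k = np.sqrt(np.arange(1, n)) / 2
    return (np.diag(k, -1) - np.diag(k, 1)).astype(complex)


A = rebuild_A()
E = scipy.linalg.expm(A)

# The same NaN check as in the traceback, plus a unitarity check.
has_nans = bool(np.isnan(E).any())
is_unitary = bool(np.allclose(E @ E.conj().T, np.eye(A.shape[0])))
print("NaNs:", has_nans, "unitary:", is_unitary)
```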

hodgestar and others added 5 commits October 21, 2021 22:13
Only 3.9 seems to fail, so migrate all tests there.
Tests sometimes pass and sometimes don't:
1) If it is to fail, I want it to always fail.
2) Are tries independent?
3) Does size matter?
Tests seem to pass or fail together.
Maybe there is a conflict with some VM configuration or CPUs, so I am storing CPU, RAM, and distribution info.
Distribution info fails, not pytest.
@hodgestar
Contributor Author

@Ericgig tracked this issue down to only occurring on numpy 1.21.X (and not 1.20.X) on CI workers with certain Intel CPUs (8171 and 8272). There are a number of changes in numpy 1.21 which could have caused this, but it might take a while to track down.

The plan from here is to make a small PR with the tiny clean-ups from this branch that seem good to have anyway, and then to create a new PR off master to try to get us back onto 1.21.X somehow (this will probably require a numpy fix, but maybe there is another workaround).
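
A rough diagnostic for flagging the combination described above might look like the sketch below. All names here (`SUSPECT_CPU_MODELS`, `numpy_minor_series`, `looks_affected`, `read_cpu_model`) are hypothetical, and the check is deliberately coarse: it treats the whole 1.21 series as suspect and matches the two CPU model numbers as plain substrings of the model name, which is an assumption about how those CPUs report themselves. Reading `/proc/cpuinfo` only works on Linux.

```python
import platform

# Assumption: substrings taken from the CPU models named in this thread.
SUSPECT_CPU_MODELS = ("8171", "8272")


def numpy_minor_series(version):
    """Return (major, minor) parsed from a version string such as '1.21.2'."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def looks_affected(numpy_version, cpu_model):
    """Coarse check: numpy 1.21.X together with one of the suspect CPU models."""
    on_suspect_series = numpy_minor_series(numpy_version) == (1, 21)
    on_suspect_cpu = any(tag in cpu_model for tag in SUSPECT_CPU_MODELS)
    return on_suspect_series and on_suspect_cpu


def read_cpu_model(path="/proc/cpuinfo"):
    """Best-effort CPU model lookup; falls back to platform.processor()."""
    try:
        with open(path) as f:
            for line in f:
                if line.lower().startswith("model name"):
                    return line.split(":", 1)[1].strip()
    except OSError:
        pass
    return platform.processor()
```

In a CI job this could be called as `looks_affected(np.__version__, read_cpu_model())` and the result logged next to the test output, so that failures can be correlated with the worker they ran on.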

@hodgestar
Contributor Author

Minimal script to reproduce the error that only uses numpy:

# On a CPU with AVX512 extensions and numpy 1.21.2:
# (only tested on Ubuntu)
# It works again on numpy 1.21.4 (and maybe 1.21.3 -- I did not check because 1.21.3 was not conda installable)

import numpy as np

L = np.diag([1+0j, 1, 1, 1])
b = np.array([1+0j, 0, 0, 0])

# commenting out the line below makes everything work, with it solve returns nans.
np.exp(0)
# breakpoint()

v = np.linalg.solve(L, b)
np.testing.assert_allclose(v, b)

@hodgestar
Contributor Author

Numpy bug report -- numpy/numpy#20356

@hodgestar
Contributor Author

Even smaller script for reproducing the issue:

import numpy as np

a = np.diag([1+0j, 1])
# as above, this call is what triggers the bad result from det
np.exp(0)
x = np.linalg.det(a)
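
The two reproduction scripts in this thread could be folded into a single self-check, sketched below. The function name `linalg_survives_exp_call` is hypothetical. It should return True on an unaffected numpy and CPU combination, and would be expected to return False wherever the NaNs described above appear.

```python
import numpy as np


def linalg_survives_exp_call():
    """Return True if det and solve give finite, correct results after np.exp."""
    a = np.diag([1 + 0j, 1])
    b = np.array([1 + 0j, 0])
    np.exp(0)  # the call that preceded the NaNs in the scripts above
    det = np.linalg.det(a)
    v = np.linalg.solve(a, b)
    finite = np.isfinite(det) and np.isfinite(v).all()
    correct = np.isclose(det, 1) and np.allclose(v, b)
    return bool(finite and correct)


print("linalg OK after np.exp:", linalg_survives_exp_call())
```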

@Ericgig
Member

Ericgig commented Nov 25, 2021

@hodgestar do we close this now, or do we wait for numpy's fix to be on conda?

@hodgestar
Contributor Author

@Ericgig I'm happy to leave this open until a new numpy is released and we can update the version of numpy used in CI tests. Probably also good to have an issue open in case users encounter this in the wild.

@hodgestar
Contributor Author

The bug fix is scheduled to be included in numpy 1.22.0 -- https://github.com/numpy/numpy/milestone/93.
