Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Some unit tests crash when using symengine #12359

Closed
iyanmv opened this issue May 7, 2024 · 15 comments · Fixed by #12392
Closed

Some unit tests crash when using symengine #12359

iyanmv opened this issue May 7, 2024 · 15 comments · Fixed by #12392
Labels
bug Something isn't working

Comments

@iyanmv
Copy link
Contributor

iyanmv commented May 7, 2024

Environment

  • Qiskit version: 1.1.0rc1 (but also in previous stable versions)
  • Python version: 3.12.3
  • Operating system: Arch Linux

What is happening?

Some unit tests fail due to symengine crashes. For example, the following five tests from test_circuit_load_from_qpy.py crash on my system:

  • test_parameter_expression
  • test_evolutiongate_param_expr_time
  • test_parameter_expression_global_phase
  • test_parameter_vector_element_in_expression
  • test_symengine_full_path

From the first test, the exact line when the the test crashes is this:

expr_bytes = obj._symbol_expr.__reduce__()[1][0]

Which produces a src/tcmalloc.cc:304] Attempt to free invalid pointer 0x6146ecca5b60.

How can we reproduce the issue?

Here is a small self-contained script that can replicate the crash described above.

from qiskit.circuit.parameter import Parameter
from qiskit.circuit.parameterexpression import ParameterExpression


theta = Parameter("theta")
phi = Parameter("phi")
sum_param = theta + phi

sum_param._symbol_expr.__reduce__()
# src/tcmalloc.cc:304] Attempt to free invalid pointer 0x645c8e3eaa30 
# [1]    987079 IOT instruction (core dumped)  python

The following running on the same system works:

import symengine

theta = symengine.Symbol("theta")
phi = symengine.Symbol("phi")
sum_param = theta + phi

sum_param.__reduce__()

What should happen?

Ideally, all unit tests would pass in a clean chroot env on Arch Linux.

Any suggestions?

No response

@iyanmv iyanmv added the bug Something isn't working label May 7, 2024
@iyanmv
Copy link
Contributor Author

iyanmv commented May 7, 2024

For some reason, if symengine is imported, there is no crash, i.e.,

import symengine
from qiskit.circuit.parameter import Parameter
from qiskit.circuit.parameterexpression import ParameterExpression


theta = Parameter("theta")
phi = Parameter("phi")
sum_param = theta + phi

sum_param._symbol_expr.__reduce__()

@jakelishman
Copy link
Member

Thanks for the report. Do you know what version of symengine you're using?

Fwiw, it would be really pretty weird if this is caused by Qiskit and not some allocator problem within Symengine - our use of it is very much inline with their public interface, and we don't go messing around in their internals or anything.

@iyanmv
Copy link
Contributor Author

iyanmv commented May 7, 2024

I'm using symengine 0.11.2, and Arch Linux builds the library with the following options:

cmake -B build -S $pkgname-$pkgver \
  -DCMAKE_INSTALL_PREFIX=/usr \
  -DBUILD_SHARED_LIBS=ON \
  -DWITH_TCMALLOC=ON \
  -DWITH_PTHREAD=ON \
  -DWITH_SYMENGINE_THREAD_SAFE=ON \
  -DWITH_ARB=ON \
  -DWITH_ECM=ON \
  -DWITH_LLVM=ON \
  -DWITH_MPFR=ON \
  -DWITH_MPC=ON \
  -DWITH_PRIMESIEVE=ON \
  -DWITH_BOOST=ON \
  -DWITH_COTIRE=OFF \
  -DWITH_SYSTEM_CEREAL=ON
cmake --build build

@jakelishman
Copy link
Member

If you installed symengine from pip or conda, you'd probably have got their pre-packaged wheels, which is the same situation we're usually expecting. If you're using it from Arch's repos in the system Python installation, then this could very well be a difference: I don't think symengine's published wheels use tcmalloc, and it could be some fault there (or perhaps some questionable behaviour that happens to work with glibc's malloc, but tcmalloc is rejecting).

The difference between pre-importing symengine and not probably isn't indicative of everything being super fixed (since qiskit does just import symengine internally) - it's more likely a spooky-action-at-a-distance memory bug.

If you have the means to try, would you be able to see if the error repeats if you drop the -DWITH_TCMALLOC=ON (or manually set it to off) while you're building the symengine your Python installation is linking against?

@iyanmv
Copy link
Contributor Author

iyanmv commented May 7, 2024

I was going to do that experiment next, yes. I will rebuild symengine and python-symengine without tcmalloc and see if that solves the issue.

I just installed qiskit quickly in a venv and when I use the symengine from PyPI I also can't replicate the issue.

@iyanmv
Copy link
Contributor Author

iyanmv commented May 10, 2024

I can confirm those tests don't fail when using symengine without tcmalloc. I will create an issue on Arch Linux.

@jakelishman
Copy link
Member

Thanks so much for looking into it in depth! I guess if the issue is somewhere in Arch / tcmalloc / symengine, we can close this issue on Qiskit? I'll close it, but feel free to re-open if there's more to discuss.

@iyanmv
Copy link
Contributor Author

iyanmv commented May 10, 2024

After solving the crashes due to tcmalloc, I found another error related to symengine (this time a numerical rounding error).

In particular is this test:

def test_parameterized_shift_frequency(self):

It works running on a virtual env, but if fails using system shared libraries. It's a numerical error happening in this line:

return float(real_expr)

Running on my system real_expr has type symengine.lib.symengine_wrapper.RealDouble and value 3140000.0. However, running float(real_expr) returns 3139999.9999999995. I think this possibility was considered here but it doesn't work as expected:

# Remove truncation error and convert the result into Python builtin type.
# Value could originally contain a rounding error, e.g. 1.00000000001
# which may occur during the parameter expression evaluation.
evaluated = np.round(operand, decimals=decimal).item()

The reason is that decimal has a value 10 so this error is not removed with the rounding. Passing decimals=9 fixes the issue for me.

Note also this relevant comment from NumPy (see Notes):

np.round uses a fast but sometimes inexact algorithm to round floating-point datatypes. For positive decimals it is equivalent to np.true_divide(np.rint(a * 10decimals), 10decimals), which has error due to the inexact representation of decimal fractions in the IEEE floating point standard [1] and errors introduced when scaling by powers of ten.
(...)
If your goal is to print such values with a fixed number of decimals, it is preferable to use numpy’s float printing routines to limit the number of printed decimals:
(...)
The float printing routines use an accurate but much more computationally demanding algorithm to compute the number of digits after the decimal point.
Alternatively, Python’s builtin round function uses a more accurate but slower algorithm for 64-bit floating point values:

Perhaps it's better to use round(operand, ndigits=decimal) and change the default value of decimal from 10 to 9? What do you think? I can make a PR if you think it's a good idea (I will check first that there are no other side effects after changing that).

@iyanmv
Copy link
Contributor Author

iyanmv commented May 10, 2024

Sorry, I guess round() is out of the table because of complex numbers.

@iyanmv
Copy link
Contributor Author

iyanmv commented May 10, 2024

Also, after thinking a bit I don't understand why there should be such a rounding error in the first place. Do you know why is that error expected from that comment in the code and why decimal has a default value of 10 instead of 14 for example (which should be fine with double precision I guess)? I also can't replicate the error using pure symengine code. I can try to compile symengine without MPFR as a wild guess... but I don't understand what is going on.

@iyanmv
Copy link
Contributor Author

iyanmv commented May 12, 2024

No difference compiling symengine with -DWITH_MPFR=OFF

@iyanmv
Copy link
Contributor Author

iyanmv commented May 12, 2024

I noticed that the ParameterExpression(f/1000) is multiplied by a float here:

frequency = self.disassemble_value(instruction.frequency) * 1e9

This causes frequency to have the value ParameterExpression(1000000.0*f) in this particular unit test instead of ParameterExpression(1000000*f) which could have be obtained by multiplying by an integer instead, i.e.,

frequency = self.disassemble_value(instruction.frequency) * 10**9

This is what ultimately leads to the numerical error that could have been avoided and has nothing to do with the double precision of the float type.

@iyanmv
Copy link
Contributor Author

iyanmv commented May 12, 2024

And here is a small self-contained example replicating the issue with pure symengine:

import symengine

f = symengine.Symbol("f")
expr = f / 1000

expr_wrong = expr * 1e9
expr_good = expr * 10**9

wrong_value = float(expr_wrong.subs("f", 3.14))
good_value = float(expr_good.subs("f", 3.14))

print(wrong_value)
# 3139999.9999999995
print(good_value)
# 3140000.0

iyanmv added a commit to iyanmv/qiskit that referenced this issue May 12, 2024
Classes in pulse_instruction.py scale frequency values to GHz by
multipliying `ParameterExpression` with float 1e9. This can lead
to numerical errors on some systems using symengine. Instead, this
scaling can be done multiplying by integer 10**9.

See: Qiskit#12359 (comment)
@jakelishman
Copy link
Member

Sorry for the lack of response here - I reopened the issue, started typing a comment, then got distracted and forgot I hadn't posted.

Thanks so much for tracking this down and providing the fix - that's a very tricksy bug to find, and thanks for filing it on symengine.

@iyanmv
Copy link
Contributor Author

iyanmv commented May 16, 2024

No problem ;)

github-merge-queue bot pushed a commit that referenced this issue May 16, 2024
* Avoid lossing precision when scaling frequencies

Classes in pulse_instruction.py scale frequency values to GHz by
multipliying `ParameterExpression` with float 1e9. This can lead
to numerical errors on some systems using symengine. Instead, this
scaling can be done multiplying by integer 10**9.

See: #12359 (comment)

* Add release note

---------

Co-authored-by: Jake Lishman <jake.lishman@ibm.com>
mergify bot pushed a commit that referenced this issue May 16, 2024
* Avoid lossing precision when scaling frequencies

Classes in pulse_instruction.py scale frequency values to GHz by
multipliying `ParameterExpression` with float 1e9. This can lead
to numerical errors on some systems using symengine. Instead, this
scaling can be done multiplying by integer 10**9.

See: #12359 (comment)

* Add release note

---------

Co-authored-by: Jake Lishman <jake.lishman@ibm.com>
(cherry picked from commit 96607f6)
github-merge-queue bot pushed a commit that referenced this issue May 16, 2024
* Avoid lossing precision when scaling frequencies

Classes in pulse_instruction.py scale frequency values to GHz by
multipliying `ParameterExpression` with float 1e9. This can lead
to numerical errors on some systems using symengine. Instead, this
scaling can be done multiplying by integer 10**9.

See: #12359 (comment)

* Add release note

---------

Co-authored-by: Jake Lishman <jake.lishman@ibm.com>
(cherry picked from commit 96607f6)

Co-authored-by: Iyán <me@iyanmv.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants