
meaningless RuntimeWarning by clang-compiled np.float32.__mul__ #9007

Closed
dimpase opened this issue Apr 27, 2017 · 53 comments

dimpase (Contributor) commented Apr 27, 2017

In Sagemath we encounter in our ticket #22799

RuntimeWarning: invalid value encountered in multiply

while multiplying a numpy.float32 number with non-numpy data; i.e. numpy ought to decline this multiplication silently (returning NotImplemented), and indeed it does if one builds with gcc, or if np.float or np.float128 is used instead of np.float32.

More precisely, one gets the warning from the Python call

type(numpy.float32('1.5')).__mul__(numpy.float32('1.5'), x)

where x is a Sagemath univariate polynomial with coefficients in Sagemath's RealField type (and only this particular type of data triggers this).
That is, potentially, such meaningless warnings can be emitted outside of Sagemath; we can reproduce it on OSX 11.12 with its stock cc (some derivative of clang 3.8), as well as on Linux with clang 4.0 and on FreeBSD 11.0 with clang 4.0 or clang 3.7.

We should be able to come up with a way to reproduce this outside of Sagemath, although we'd need some tips on where in the numpy code this __mul__ is actually implemented, to see which functions are applied to x...

We see this on numpy 1.11 and on 1.12, too.
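Lacking Sage, the failure mode can be sketched with a hypothetical stand-in class whose reflected multiply deliberately leaves the FPU invalid flag set (FakePoly and its return value are made up for illustration; on numpy versions that checked FPU flags around object loops, such as 1.11/1.12, the multiplication below also printed the RuntimeWarning, while the product itself was still correct):

```python
import numpy as np

class FakePoly:
    """Hypothetical stand-in for Sage's polynomial type."""
    def __rmul__(self, other):
        # Perform a float operation whose hardware side effect is to set
        # the FE_INVALID status flag, much as Sage's MPFR-backed code did.
        with np.errstate(invalid='ignore'):
            np.float64(np.inf) - np.float64(np.inf)
        return 'scaled-poly'

x = FakePoly()
# numpy ends up deferring to x.__rmul__ one way or another.
print(np.float32('1.5') * x)
```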

dimpase (Author) commented Apr 27, 2017

Same picture with numpy.float32('1.5').__mul__(x), as well as with __add__ and __sub__.

eric-wieser (Member):

This type of error is typical of arrays that contain nan or infinity. What does np.array(x) return?

dimpase (Author) commented Apr 27, 2017

@eric-wieser : it returns array(x, dtype=object), no warnings.

eric-wieser (Member) commented Apr 27, 2017

Does np.multiply(np.float32('1.5'), x) give the same warning?

dimpase (Author) commented Apr 27, 2017

numpy.multiply(numpy.float32('1.5'), x) gives the same warning.

eric-wieser (Member):

What about type(x).__rmul__(numpy.float32('1.5'), x)?

eric-wieser (Member):

Also, if you could run warnings.filterwarnings('error'), then you'd get a full stack trace

dimpase (Author) commented Apr 27, 2017

type(x).__rmul__(numpy.float32('1.5'), x)

TypeError: descriptor '__rmul__' requires a 'sage.structure.element.Element' object but received a 'numpy.float32'

x.__rmul__(numpy.float32('1.5')) goes through just fine.
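The TypeError is just CPython's descriptor check: for extension types (such as Sage's Element), the unbound __rmul__ slot verifies that its first argument is an instance of the defining class. Built-in float shows the same behaviour, since np.float32 is not a subclass of Python's float:

```python
import numpy as np

# Passing a np.float32 as the 'self' argument of float.__rmul__ trips
# the descriptor's type check, just like Sage's Element did above.
try:
    float.__rmul__(np.float32(1.5), 2.0)
except TypeError as exc:
    print(exc)

# With a genuine float as the first argument, the unbound call is just
# the bound call spelled out: it computes 2.0 * 1.5.
print(float.__rmul__(1.5, 2.0))
```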

eric-wieser (Member) commented Apr 27, 2017

Seems I forgot how __rmul__ works. I meant type(x).__rmul__(x, numpy.float32('1.5')), but I imagine that does the same thing as x.__rmul__, unless x is really weird.

eric-wieser (Member):

Does this also fail? np.multiply(np.array(1.5, dtype=object), x) (this time with filterwarnings, please)

dimpase (Author) commented Apr 27, 2017

type(x).__rmul__(x,numpy.float32('1.5')) goes through without a warning.

And, by the way, setting warnings.filterwarnings('error') does not give me anything interesting,

---------------------------------------------------------------------------
RuntimeWarning                            Traceback (most recent call last)
<ipython-input-50-b3ece847d318> in <module>()

dimpase (Author) commented Apr 27, 2017

sage: np.multiply(np.array(1.5, dtype=object), x)
---------------------------------------------------------------------------
RuntimeWarning                            Traceback (most recent call last)
<ipython-input-52-706823a0b5a2> in <module>()
----> 1  np.multiply(np.array(RealNumber('1.5'), dtype=object), x)

RuntimeWarning: invalid value encountered in multiply

eric-wieser (Member):

Hmm, sage did something I didn't expect there. Same behaviour with float('1.5') though, I'd guess?

eric-wieser (Member) commented Apr 27, 2017

Ok, so here's what I think is happening:

  • numpy is correctly using the object loop in the ufunc, which ends up just calling PyNumber_Multiply
  • within sage, something is setting the error flag in the FPU (bug in sage?)
  • numpy is doing its normal FPU-flag check on exit from the ufunc (bug in object loops?), and finding the error left there by sage
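The clear-then-check pattern can be observed from Python: the subtraction below is an invalid operation, the hardware sets the status flag, and numpy inspects it after the loop (here np.errstate turns the finding into an exception instead of a warning):

```python
import numpy as np

caught = None
with np.errstate(invalid='raise'):
    try:
        # inf - inf has no well-defined result: the FPU sets the
        # "invalid" status flag, which numpy checks after the loop.
        np.array([np.inf]) - np.array([np.inf])
    except FloatingPointError as exc:
        caught = exc
print('flag detected:', caught)
```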

dimpase (Author) commented Apr 27, 2017

sage: float(1.5).__mul__(x)
NotImplemented
sage: np.float(1.5).__mul__(x)
NotImplemented
sage: np.float32(1.5).__mul__(x)
/usr/home/dima/Sage/sage/src/bin/sage-ipython:1: RuntimeWarning: invalid value encountered in multiply
  #!/usr/bin/env python
1.50000000000000*x
sage: np.float64(1.5).__mul__(x)
/usr/home/dima/Sage/sage/src/bin/sage-ipython:1: RuntimeWarning: invalid value encountered in multiply
  #!/usr/bin/env python
1.50000000000000*x

eric-wieser (Member) commented Apr 27, 2017

Worth noting that np.float is float for compatibility reasons

dimpase (Author) commented Apr 27, 2017

Why doesn't np.float32(1.5).__mul__(x) return NotImplemented?

eric-wieser (Member) commented Apr 27, 2017

Because it knows that it can handle it as np.multiply with an object loop, and then tries again with float * x inside that loop. Unfortunately, the wrapper around that loop is picking up FPU flags set by sage.

If you look closely, you'll find that x.__rmul__ is still being called deeper in the stack

dimpase (Author) commented Apr 27, 2017

sage: np.float32(1.5)*x
/usr/home/dima/Sage/sage/src/bin/sage-ipython:1: RuntimeWarning: invalid value encountered in multiply
  #!/usr/bin/env python
1.50000000000000*x
sage: np.float128(1.5)*x
1.50000000000000*x
sage: np.float64(1.5)*x
/usr/home/dima/Sage/sage/src/bin/sage-ipython:1: RuntimeWarning: invalid value encountered in multiply
  #!/usr/bin/env python
1.50000000000000*x

so it looks as if np.float128 is OK, but

sage: np.float128(1.5).__mul__(x)
/usr/home/dima/Sage/sage/src/bin/sage-ipython:1: RuntimeWarning: invalid value encountered in multiply
  #!/usr/bin/env python
1.50000000000000*x

This is weird. A compiler bug, perhaps (as we never see it on gcc), but at which place?

eric-wieser (Member):

The FPU flag is, for some reason, being set on clang but not gcc within the sage code, it would seem. Numpy is to blame for making noise about it, but I highly doubt it is to blame for setting it in the first place.

Unfortunately, numpy doesn't expose any way to explicitly ask for the FPU flags - that'd be very helpful for bisecting the issue within sage.

eric-wieser (Member):

I assume this causes the same warning (needs numpy 1.12 to do so, I think)

mul = np.vectorize(x.__rmul__)
mul(float('1.5'))

dimpase (Author) commented Apr 27, 2017

not quite the same, but close:

/usr/home/dima/Sage/sage/local/lib/python2.7/site-packages/numpy/lib/function_base.py:2652: RuntimeWarning: invalid value encountered in __rmul__ (vectorized)
  outputs = ufunc(*inputs)
array(1.50000000000000*x, dtype=object)

dimpase (Author) commented Apr 27, 2017

OK, good, this seems to indicate that there is nothing specific to np.float32 or np.float64; it's a more generic numpy warning mechanism kicking in here.

njsmith (Member) commented Apr 27, 2017

I don't know that the compiler authors would consider it a bug. The way that warning works is: there are some magic status flags that the processor keeps track of, which automatically get set whenever the corresponding event occurs. Numpy clears them before starting the computation, and then checks them again at the end. So somewhere in between those points, the assembly generated by clang is doing some calculation that involves a NaN. But it's hard to track down (since the actual flag setting is done entirely in hardware), and most of the time people don't worry about how their code affects the FPU flags. (Libm implementations are also notoriously inconsistent about whether they set these flags.) And the exact results depend a lot on the exact asm being generated, so it's not surprising that you only see it in specific configurations and not others.

eric-wieser (Member) commented Apr 27, 2017

Yep, that confirms my suspicions, and provides you with a way to debug. This code

import functools
import numpy as np

def check_fpu(f):
    @functools.wraps(f)
    def wrapped(*args, **kwargs):
        # exclude every argument from broadcasting, so f is called as-is,
        # but inside np.vectorize's FPU-flag-checking wrapper
        excluded = list(range(len(args))) + list(kwargs.keys())
        fvec = np.vectorize(f, excluded=excluded)
        return fvec(*args, **kwargs)
    return wrapped

Applied to a Python function, it allows you to isolate the warnings to within that chunk of code.

njsmith (Member) commented Apr 27, 2017

I mean, it is pretty weird that it's happening at all; compilers don't usually invent and then throw away NaNs for no reason.

If you're trying to track it down, then you should probably look at the code in sage that implements multiplication for those polynomials – it's likely that the weird flag setting is happening all the time, and numpy's only involvement is to make that visible.

There's also a pretty good argument that numpy shouldn't even try to check these flags on object loops. (Or integer loops for that matter, but that's tricky because the way we report integer overflow is kinda gross and uses the fpu flags.) That's the only thing I can think of that numpy could do here.
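The integer case can be seen from Python: numpy funnels scalar integer overflow through the same seterr machinery (a sketch; the exact warning/error text varies between numpy versions):

```python
import numpy as np

overflowed = False
with np.errstate(over='raise'):
    try:
        # 32000 * 2 does not fit in int16; numpy reports this through
        # the same error-state machinery as the floating-point flags.
        np.int16(32000) * np.int16(2)
    except FloatingPointError:
        overflowed = True
print('overflow reported:', overflowed)
```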

dimpase (Author) commented Apr 27, 2017

check_fpu() as first posted had a typo in the excluded= argument (the code above has it fixed).
And, we're on python2: import functools32 as functools.

eric-wieser (Member) commented Apr 27, 2017

functools.wraps doesn't need python 3, does it?

dimpase (Author) commented Apr 27, 2017

I get an error if I use python2's functools, in the setattr() call:

AttributeError: 'method-wrapper' object has no attribute '__module__'

rkern (Member) commented Apr 27, 2017

Yeah, I'm guessing it's whatever multi-precision library implements the arithmetic for the coefficients in RealField that is setting the FPU flag. Are the underlying libraries compiled with the same compiler as numpy in each of the different circumstances? Or is only numpy being rebuilt with the different compilers?

jdemeyer (Contributor):

Yeah, I'm guessing it's whatever multi-precision library implementing the arithmetic for the coefficients in RealField

That's MPFR, for the record.

dimpase (Author) commented Apr 27, 2017

We are trying to port Sagemath to clang+gfortran (mostly on OSX and FreeBSD, platforms where clang is the primary compiler), so that building and running it on OSX is easier and faster (FreeBSD is more of a tool to get a similar environment without the hassle of OSX and Apple hardware).

All the comparisons I report here are for complete builds with clang/clang++ + gfortran as opposed to gcc/g++ + gfortran.

dimpase (Author) commented Apr 27, 2017

The wrapper seems to tell us that x.__rmul__ sets the FPU flag:

check_fpu(x.__rmul__)(np.float32('1.5'))

prints the warning, while x.__rmul__(np.float32('1.5')) does not.

eric-wieser (Member):

Indeed - my assumption was that x.__rmul__ was written in python, and that its source code can be bisected to find which bit specifically sets the flag

dimpase (Author) commented Apr 27, 2017

x.__rmul__ is in Cython, but it's still a small piece of code to investigate.

jdemeyer (Contributor):

If there were a simple way to change the warning to an error, you would get a traceback (Cython generates tracebacks for errors but not for warnings).

dimpase (Author) commented Apr 28, 2017

@jdemeyer IMHO the numpy warning is issued much later in the code path, i.e. it's the result of an explicit check of the FPU flags, not of a trap being raised.

numpy does provide an interface to change this warning into an error, but all you get is a return to the main IPython interpreter loop, without any backtrace whatsoever.

dimpase (Author) commented May 1, 2017

@jdemeyer would Cython code between sig_on() / sig_off() from cysignals throw an exception if an FPU flag is raised? Or does it depend on the flag?

jdemeyer (Contributor) commented May 2, 2017

cysignals would throw an exception if SIGFPE is raised, which may happen when an FPU flag is raised, depending on the FPU configuration. But by default it does not.

dimpase (Author) commented May 3, 2017

A similar warning, RuntimeWarning: invalid value encountered in greater, comes from np.float64(5)>e. Here e is the Sagemath constant for the base of the natural logarithm, 2.71828..., so on the way to evaluating this to True it has to be "converted" to a number (certainly, e "knows" its numerical approximation, e.n()).
This approximation is of the RealField type already mentioned above (so perhaps this warning is closely related).

Again, the question is: what does numpy do to evaluate np.float64(5)>e ?
Or equivalently, the same warning pops up from np.float64(5).__gt__(e), so one can just as well start from there.

Note that type(e) is sage.symbolic.constants_c.E; it's basically some (almost) dummy class
wrapping Sagemath's symbolic expressions (SR).

There are no warnings from np.float64(5).__gt__(e.n()) or np.float64(5)>e.n().
Essentially the same pattern (same warning/no warning) occurs if you replace e by pi (with, obviously, pi.n() == 3.1415...).
pi has type SR, i.e. sage.symbolic.expression.Expression.

eric-wieser (Member) commented May 3, 2017

The answer is the same here - numpy calls np.greater with an object loop. At the bottom level, that calls e.__lt__(5.0). But once again, it checks the FPU flags before and after, and notices that something is amiss.

Most of the ndarray arithmetic/logical operators (with the exception of - and divmod) delegate to ufuncs. When invoked with sage objects, this will call the O (object) loops for these ufuncs. These object loops will loop over the array (which in this case is 0d), run the normal python operator on the elements, but check for FPU flags when it does so.
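That dispatch can be made visible with a small recorder class standing in for e (the class is hypothetical, not Sage's type): the object loop of np.greater ends up invoking the reflected comparison on the element.

```python
import numpy as np

class RecordingE:
    """Hypothetical stand-in for sage's e: records reflected compares."""
    def __init__(self):
        self.lt_args = []

    def __lt__(self, other):
        # np.greater(5.0, obj) bottoms out here, as obj.__lt__(5.0)
        self.lt_args.append(other)
        return True

obj = RecordingE()
result = np.greater(np.float64(5.0), obj)
print(result, obj.lt_args)
```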

So once again, sage is setting these flags. Perhaps this is a sign of a bug, perhaps it's not.

I think there's a good argument here that numpy should not be checking fpu flags for these cases. @njsmith, do you think we should go ahead with removing the check for object types?

dimpase (Author) commented May 3, 2017

As a matter of fact, e.__lt__(5.0) is a symbolic expression:

sage: type(e.__lt__(np.float32(5.0)))
<type 'sage.symbolic.expression.Expression'>
sage: e.__lt__(np.float32(5.0))
e < 5.0
sage: bool(e.__lt__(np.float32(5.0)))  # this is how it's evaluated
True

and thus I really doubt that it is called in the end, for one does get True. Also, your check_fpu wrapper from above does not make it print warnings, i.e. the following just works.

sage: check_fpu(e.__lt__)(np.float32(5.0))
e < 5.0

dimpase (Author) commented May 17, 2017

I was able to pin our problem in Sagemath down to a particular C extension using fpectl Python module (which is somewhat, but not totally, broken on FreeBSD). It was actually very quick once I managed to get it installed.

IMHO fpectl is so useful that it ought to be fixed; perhaps even used in numpy instead of, or in addition to, np.seterr(), as it provides better granularity on compiled components.

njsmith (Member) commented May 17, 2017

The difference between fpectl's approach and np.seterr is:

np.seterr runs the ufunc loop, and then checks to see if any flags are set.

fpectl does some magic so that whenever an operation occurs that causes one of the flags to be set, the hardware raises an interrupt, the kernel converts this into a SIGFPE delivered to the process, and it installs a SIGFPE handler that longjmps directly out of the signal handler into the error handling code.

Some downsides of the fpectl approach are: (a) it doesn't work at all on Windows, (b) it breaks for code that internally causes one of these flags to be temporarily set and then clears it (this is legal and I expect there are libm's that do it), (c) longjmp is incredibly fragile; basically you're risking a segfault every time you do it. It certainly can't be a general solution for arbitrary user-defined ufuncs.

Given all this, I don't think numpy is going to switch.

In any case, it sounds like the original issue is solved, so closing this – feel free to open a new issue if you want to make a case for changes in seterr.

@njsmith njsmith closed this as completed May 17, 2017
eric-wieser (Member) commented May 17, 2017

In any case, it sounds like the original issue is solved

Are we sure we don't want to disable checking of the FPU flags for object loops? That would seem like a pretty sensible change to numpy.

njsmith (Member) commented May 17, 2017

@eric-wieser: oh, that's an interesting idea, yeah. Maybe it's worth opening an issue for that :-). The "right thing" is pretty complicated, though – ideally we shouldn't be special-casing the object dtype (think user dtypes), and integer loops shouldn't use it either (this may be a real optimization on some architectures where checking/clearing the FPU flags is extremely slow), but integer loops do need a way to explicitly signal integer errors, which they currently do by explicitly setting the FPU flags, ... I'm not sure this is a case where there's easy low-hanging fruit?

Or did I misunderstand, and sage has only identified the problem, and they still need a numpy change to actually fix it?

dimpase (Author) commented May 18, 2017

@njsmith: I do not understand why you say it won't work on Windows. (This would have been correct in the pre-C99 era, though.) Modern FPU-handling functions (fenv) are available as soon as your C compiler is C99-compliant. Apart from fenv, all it needs is setjmp/longjmp (again, a standard C feature).

I am also curious to hear about a libm that causes one of the FE exceptions in the course of a normal operation (unless it is classified as a bug).

njsmith (Member) commented May 18, 2017

@dimpase: You also need SIGFPE support, which is not specified in C99. (Well, C99 says that there should be a SIGFPE, but that's for divide-by-zero – it doesn't specify any way to hook it up to floating point exceptions.) That said, it looks like I misremembered, and though Windows doesn't support signals, MSVCRT emulates SIGFPE using structured exception handling, and provides the non-standard _controlfp function to enable it for particular fp exceptions, so Windows support is not actually a barrier. OTOH it doesn't matter that much, since longjmp is definitely not happening without an extremely good reason :-)

And FWIW, if a libm caused an FE exception and then cleared it again, I can't see why they would consider that a bug. I'm not sure that any such implementations exist, but it's plausible, and if they do then the way we would find out is b/c someone tells us that numpy is broken on platform X and the only fix would be to revert the change you suggested.

Can you answer the question I asked at the end of my previous comment?

dimpase (Author) commented May 18, 2017

@njsmith: if a libm (or any other user code) needs to cause an FE exception and process it, it would set up its own FE exception handler, saving the previous one and restoring it upon exiting.
So it is not a problem if the underlying code plays by the rules.

Regarding MS support for this, they ship fenv.h since Visual C(++) 2013 or so.
It is specifically meant to be useful with setjmp/longjmp (how exactly it is done under the hood should not worry one too much, I hope).

Regarding numpy's RuntimeWarning:

  • if you think that user code may play fast and loose with FP flags, then these warnings should be optional at most.
  • else, they can be kept standard, but at least a way to get to the bottom of where they come from is crucial for them to be useful; an (improved) fpectl is a quick way to achieve the latter. Using external tools (such as debuggers that let you instrument the code to check something after every instruction) is harder, slower, and more error-prone; e.g. it is not unusual for a bug to only pop up in an optimised binary, and to go away as soon as you try to find it in a well-debuggable binary.
  • in any case, it ought to be possible to turn the RuntimeWarning off.

Regarding this issue in Sage - still fixing (hopefully it's limited to some issues in MPFR only).

njsmith (Member) commented May 18, 2017

Sorry, this is going in circles and I need to move on to other things, so unless something new comes up on the fenv/SIGFPE issue this will be my last message on the topic. (I'm still interested in whether there's anything numpy needs to do for the sage bug.)

if a libm (or any other user code) needs to cause an FE exception and process it, it would set up its own FE exception handler, saving the previous one,

What you're proposing is to take an operation that normally does not cause a signal handler to fire, and configure the processor in a non-standard mode where it does cause a signal handler to fire. It's totally reasonable for code to be performing this operation and expecting that it won't trigger a signal handler at all.

Regarding MS support for this, they ship fenv.h since Visual C(++) 2013 or so.
It is specifically meant to be useful with setjmp/longjmp

I can't figure out what you're talking about here. Afaict, the standard functionality in fenv.h is only useful for implementing numpy-style functionality, and MS sticks to the standard. I don't see any functions in there that could be used with setjmp/longjmp at all.

user code can play fast and loose with FP flags, then these warnings should be optional at most.

Carefully clearing a flag set by an intermediate calculation is the exact opposite of playing fast and loose with them. Also, the warnings are optional.
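Concretely, the warning can be silenced either globally with np.seterr or in a scoped way with np.errstate; a minimal sketch:

```python
import numpy as np

# np.errstate scopes the setting; np.seterr would change it globally.
with np.errstate(invalid='ignore'):
    r = np.array([np.inf]) - np.array([np.inf])  # no RuntimeWarning emitted
print(r)
```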

a way to get to the bottom of the place they come from is crucial for them being useful; (improved) fpectl is a quick way to achieve the latter.

You're literally the first person in something like a decade to need SIGFPE to debug this kind of issue, and looking again at the sage bug comments, it looks like you didn't actually get fpectl working? It's not supposed to cause a core dump. (It looks like cysignals is overriding the fpectl code so it doesn't even run.)

If this comes up again, what you need to do is make one C call to enable SIGFPE, then use a debugger to get a stack trace. You don't need a debug build to get a stack trace; all you need to do is not strip the symbols. And hey, now we know in case this does come up again.

I understand this was really frustrating to debug, but it's not helpful to insist that other projects need to change or maintain basic infrastructure when you can't even explain clearly what this will accomplish. (I actually have no idea how you think numpy changing something here would even help you find this kind of bug faster – the whole idea of longjmp is to destroy the more precise information you say you want.)

dimpase (Author) commented May 19, 2017

Finally, it turns out that it boils down to a long-standing bug in the clang C compiler. Basically, for a certain range of double values, an assignment as in

unsigned long a;
double b;
...
feclearexcept(FE_ALL_EXCEPT);
a = b;   /* on affected clang versions, fetestexcept(FE_INVALID) is now nonzero */

raises FE_INVALID (and sometimes FE_INEXACT); b.t.w. this also affects other float types. The great MPFR people (MPFR has to copy doubles into their arbitrary-precision floats, certainly) provided a workaround, fixing this for sage.

Incidentally, a related and even more long-standing (since 2010, with a dozen closed duplicates) clang bug 8100 says that there is no hope of using clang's fenv.h to make fpectl work properly at the moment. Clang is not fully C99-compliant in this respect, and is not very upfront about it.

Not sure whether numpy wants to do anything about it; perhaps a remark in the docs that the RuntimeWarning might be merely due to a compiler bug (citing "clang bug 8100", a quite prominent example) would be helpful.

njsmith (Member) commented May 19, 2017

bug 8100 isn't relevant; that's for the C99 pragmas to disable floating point optimizations, and no mainstream compilers support those. numpy seems to (mostly) work anyway :-)

dimpase (Author) commented May 19, 2017

The spirit of bug 8100 is that clang does not care about FP operations being compiled correctly, although a language lawyer might disagree. :-)

OK, the already mentioned bug 17686 is relevant for sure.
