
We need persistent floating-point rounding mode control #1384

Open
KloudKoder opened this issue Oct 11, 2020 · 59 comments

Comments

@KloudKoder

There's some history to this. Bear with me.

Typical floating-point rounding policies are: chop, round-negative, round-positive, and round-nearest-or-even. Intel made the dumb decision of encoding this policy in the floating-point control word, which means that if you want to change it frequently, you need to manipulate the control word (FLDCW) all the time, which is expensive.

nVidia realized that this was a nightmare for interval arithmetic, which has risen in prominence for scientific computing in the past several years, mainly because it requires constant changes in rounding policy (because, usually, the lower limit goes down while the upper limit goes up after every floating-point instruction). So nVidia did things the right way: they encoded the rounding policy inside the instruction itself.

Looking at your spec, it seems like round-nearest-or-even is the only supported mode in most cases. That makes interval math effectively impossible, which will only limit the utility of WASM for scientific applications.

Please allow us to prefix or otherwise alter floating-point instructions so as to specify the rounding mode. On Intel/AMD, you can just optimize the prefix away if it's the same one as before (and no intervening branch targets exist). If the prefix is never used, then everything is backward-compatible to round-nearest-or-even, so the control word never needs to be written; this is easily checked during code validity authentication. But on nVidia, the prefix gets sucked into a bitfield in the instruction.

Please implement this. It will enable more, and more useful, applications than you might currently imagine.

@tlively
Member

tlively commented Oct 12, 2020

The best way to fit new rounding behavior into WebAssembly would be as new instructions (rather than an instruction prefix or a persistent rounding mode or something like that). Could you list the new rounding instructions you would like to see?

@KloudKoder
Author

KloudKoder commented Oct 14, 2020

@tlively Fair enough. I've never been a huge fan of prefixes, either.

To be clear, there's rounding in the sense of "round an existing float to an integer", and then rounding in the sense of "nudge the units-of-last-place (ULP) according to the infinitely precise result".

In the first case, we can get by with the usual suspects:

a. Round nearest-or-even.
b. Round toward negative infinity.
c. Round toward positive infinity.
d. Chop toward zero (truncate).

In the second case, we can apply the same rounding policies, but to the ULP instead of the integer part. Implicitly, we first of all have to know the maximum error bounds on the infinitely correct result, because if they're too big, then the ULP is still undetermined. For arithmetic operations, this is straightforward and already done right in hardware. For operations which may produce irrational results, it's first necessary to know the error bounds on the truncated portion (due to series truncation). One hopes that the CPU designers did this right. I have serious doubts about that, but it might be correct for fsqrt, which AFAIK is your only such operation. You probably can't ever support other transcendental functions in a well-defined and cross-platform way because the CPU designers probably didn't really nail the error terms. (Said functions would need to be implemented by the programmer in terms of your existing arithmetic operations. Slow, but well-defined, cross-platform, and futureproof. Lots of cores, lots of free cycles, no worries. The only other option would be to hope that all CPU vendors implement the same transcendental functions in exactly the same way. Things get really complicated if that happens to be true because they're all wrong in the same way, which is why I think the software emulation approach is a necessary evil.)

Obviously, the first case can be done after the fact because we just need to dispose of the fractional part. But in the second case, everything needs to happen in one instruction because once an instruction completes, any extra bits of precision have been forgotten. So if you want individual instructions, you need stuff like "multiply and round the ULP toward negative infinity" and "square root and truncate everything after the ULP". The good news is that, if you do it this way, then it will fit like a glove with nVidia, which is probably more relevant going forward than AMD64. The bad news is that you'll have to employ some smarts to avoid touching the control word too often on the latter. And by the way, all your existing arithmetic instructions would now be suffixed with "and round nearest-or-even".

I'll pause here. Are we on the same page?

@tlively
Member

tlively commented Oct 14, 2020

Yes, the philosophy here sounds good. We've had similar discussions over in the SIMD proposal. The next part is making sure that the proposed instructions have (reasonably) efficient lowerings on relevant architectures (at least x86/x86-64/ARM). I'm not sure how nvidia fits into the picture, though. WebAssembly doesn't typically run on GPUs. It would also be good to mention what kind of languages or applications would benefit from having these instructions in order to motivate working on them.

@KloudKoder
Author

"WebAssembly doesn't typically run on GPUs." -- It certainly does, in 2029.

But allow me to suggest a potentially better solution for the second case. It seems to me that the only practical application for ULP rounding is interval math, full stop. Therefore, you could cut down the explosion of instructions and improve performance by providing native interval support. So "divide 64-bit floats" becomes "divide 64-bit float intervals". Yes, there are lots of corner cases, but by doing it this way, you would solve all those problems once and for all, allowing higher-level languages to translate interval types directly into WASM. I am entirely sympathetic to the need to avoid CISC nonsense, but intervals are so powerful and so universal that I think it's worth being a little CISCy in this case.

The other reason to implement native intervals is that higher-level languages do not, to my knowledge, have instructions for ULP manipulation. In many languages, it wouldn't even mean anything because floats are undefined garbage anyway. But it's easy to imagine adding macros to support basic interval functions in popular languages, which then filter straight to the WASM layer. No chance for creating super subtle bugs due to misguided ULP manipulation.

If I recall, an Ariane 5 rocket blew up back in the '90s due to a floating-point error that interval methods would have prevented. And if I'm not mistaken, the safety of the Large Hadron Collider was proven with interval analysis, as well. But it's not just about safety. Computation can be exponentially accelerated, in some cases, using intervals, simply because lower precision is often acceptable, and the time savings propagate to many levels of the function tree. This facilitates much faster raytracing, for instance. Fluid dynamics stand to benefit as well. (This is all abstractly related to the Shrinking Bullseye algorithm.)

I'm intrigued by your comments regarding SIMD, but I'll leave that for another thread.

@tlively
Member

tlively commented Oct 14, 2020

It's not clear to me why interval arithmetic should be a part of the WebAssembly ISA as opposed to a software library that compiles to existing WebAssembly instructions. Do native architectures support interval arithmetic directly? If not, there's probably no benefit to building it into WebAssembly.

@KloudKoder
Author

KloudKoder commented Oct 14, 2020

These are the reasons that intervals should be supported in WASM:

  1. It's almost impossible to support them in higher-level languages due to the lack of ULP manipulation primitives. What support does exist tends to be extremely slow because the correct result can still be obtained, but only through convoluted means involving type casting.

  2. WASM is the missing link we need in order to support widespread computing-on-demand, precisely because it's well-defined down to the bit level, so it's easy to validate that outputs from untrusted nodes are identical and thus credible. Without well-defined behavior, we end up with approaches which require the programmer to do manual result comparison in a code-dependent way. It just doesn't scale.

  3. Good science and engineering needs interval math, but it's not being exploited much because it's been bolted onto higher-level languages instead of built in at the bottom. nVidia solved this problem, but there's a lack of decent IR code to access it from higher-level languages. WASM could give us that missing IR.

  4. The future of distributed computing is WASM, but only if you can make it scalable. I realize that most people think of it as a browser extension just for running faster web pages, but it could rule the distributed computing world by 2030 if you play your cards right. I can't emphasize this enough. Without intervals, we get scalable ambiguity rather than scalable clarity.

If the only acceptable solution is 4 different rounding flavors of your existing primitives, then I'll take it, but the path to fewer bugs is just to implement intervals at the bottom. This could be extended to SIMD at some point, as well.

@titzer

titzer commented Oct 14, 2020

Wasm is generally designed as a minimal abstraction over hardware, so the design pressure is to expose (standardized) versions of what is in hardware, and not build the next higher-level concept unless it is impossible not to. In this case it would seem that rounding modes would be both closer to hardware and the more minimal addition. It's my understanding that the rounding mode control design in IEEE 754 is meant for this purpose, and has other purposes too, outside of interval arithmetic.

@KloudKoder
Author

If you do it that way, then at least the compilation process would be straightforward, along with extension to SIMD. This also means that interval implementation will need to be done by porting these rounding-aware arithmetic primitives to higher-level languages, which is more of an uphill battle. But OK, what's the next step?

@bvibber

bvibber commented Oct 14, 2020

If I understand correctly, you have the same problem with high-level languages for native compilation, so I don't think that's a Wasm-specific problem that Wasm needs to solve.

@KloudKoder
Author

KloudKoder commented Oct 14, 2020

@Brion Well, as I said above, I'm happy enough with the idea of just implementing rounding-aware arithmetic primitives in WASM.

The general problem with ULP manipulation, whether you solve it with such primitives or full blown interval support, is that WASM is likely to become a method whereby distributed computing is facilitated for HLLs. So in effect, it is the hardware. You can't go directly from HLL to CPU/GPU machine code and have it run on anonymous targets all over the world. (Imagine trying to run a distributed computing network which mandates that you need nVidia XYZ GPU in order to even participate.) So you have to go to middleware first, of which the obvious (and perhaps only) choice is WASM. So if the HLL programmer wants a feature that is efficiently implementable in hardware, but WASM doesn't support it, then he's stuck having to do nonperformant emulation.

@Yaffle

Yaffle commented Nov 13, 2020

@KloudKoder , can f64.fma help with the performance of interval arithmetic libraries at least for f64?

@KloudKoder
Author

@Yaffle Can you explain your reasoning as to why multiply-add would help? For instance, how would this accelerate the process of adding 2 intervals, given that presumably it's still going to do nearest-or-even?

@Yaffle

Yaffle commented Nov 16, 2020

It helps to find the error of the addition, multiplication, or division of two f64 numbers:
sum = a + b; error = -sum + absmax(a, b) + absmin(a, b);
product = a * b; error = fma(a, b, -product);
quotient = a / b; error = -fma(quotient, b, -a);
If you know the error, you can round the value by incrementing it (for example, when error > 0 and rounding toward positive infinity) or decrementing it (for example, when error < 0 and rounding toward negative infinity) by the smallest amount.
This way, the resulting interval will be the tightest.
Although, if you don't need the tightest interval, you could just increment/decrement both ends...
Btw, there is a website with some benchmarks: http://verifiedby.me/kv/rounding/index-e.html#:~:text=Benchmark

@KloudKoder
Author

KloudKoder commented Nov 17, 2020

@Yaffle Wait a second, are you talking about fma (which I just realized isn't actually in the spec, if the search function is to be believed) or fmax/fmin?

Just starting with addition, we have:

(a, b) + (c, d) = (RoundNegative(a + c), RoundPositive(b + d))

How that would involve a multiplication, I don't presently see. And multiply-add takes 3 parameters, not 1. Clearly, I'm misinterpreting your notation.

@Yaffle

Yaffle commented Nov 17, 2020

@KloudKoder , fma is fused multiply-add; I have updated my notation.

@KloudKoder
Author

@Yaffle I think I finally understand your point. What you mean is that Intel's various FMA instructions compute:

a * b + c

to infinite precision, and then round nearest-or-even. (They don't round after just computing the product.)

This means, for one thing, that:

product = FMUL(a, b)
error = FMA(a, b, -product)

is equivalent to:

error = NearestOrEven(ab - FMUL(a, b))

which by definition is off by no more than half a ULP, so you need only nudge the result by 1 ULP in order to obtain an upper bound on the error.

This is a valid approach, but ULP nudging requires intensive overhead: you need to first treat the result like an integer, then add your ULP (as an unsigned 1), then see if you wrapped into the exponent field, or perhaps created a NaN, etc., then potentially do a rollback and update the result to infinity or whatever. There are other ways to do this, but AFAIK also involving painful overhead. And do note that I haven't demanded the tightest interval -- just a "tight enough" one for practical precision requirements. So do you have a fast way to nudge? If not, then unfortunately it's back to my original point, which is that intervals are nonperformant under the current ABI, relative to what's actually possible on real hardware.

Also, note that I've written FMA above, which alludes to the Intel instruction set. WebAssembly's fma, by contrast, doesn't seem to exist. Do you have a link? It doesn't come up in the doc search on the website.

Finally, from your link:

"addition/subtraction/multiplication/division/sqrt including the rounding mode specification to the command itself"

That's precisely what this proposal is all about. It's great to have that capability, but we need a direct way to exploit it.

I'm happy to have my pessimism disproven...

@Yaffle

Yaffle commented Nov 18, 2020

@KloudKoder ,
https://github.com/mskashi/kv/blob/2ec8bac6216d25b226c09e1740487837a4fc5fd5/kv/rdouble-nohwround.hpp#L97

  • this is another variant of the "nudge" function; it seems it only does x + ABS(x) * (2**-53+2**-105) when abs(x) > 2**-969, which covers most of the f64 values...

f64.fma is mentioned at https://github.com/WebAssembly/design/blob/master/FutureFeatures.md#:~:text=fma
but if you don't need the "tightest interval", you could just nudge both ends by 1 ulp... although, this could be bad (?).

Anyway, you are right that the overhead could be big.

@KloudKoder
Author

KloudKoder commented Nov 19, 2020

@Yaffle I now see f64.fma under "Additional floating point operators". I'm willing to assume that it will be adopted as an instruction. And definitely I wouldn't demand the tightest interval, which is unlikely to offer any practical advantages. Now, looking at your alternative nudge:

x + ABS(x) * (2**-53+2**-105)

it's clear that the point of this is to add the result (x) to itself shifted down so that its leading 1 appears in the ULP. I'm not entirely sure why we need the (2**(-105)) -- maybe to break the tie case of nearest-or-even in favor of a nonzero nudge -- but in any event, here's what we would need to do:

  1. Given f64 result x.

  2. If x is +/- 0.0, then nudge it up/down to the smallest positive/negative number and quit.

  3. Subtract 53 from the exponent field. Call the result y. (In some architectures, this might require transport between float and integer registers. Let's just assume there's no cost for that.)

  4. Use a conditional set (flag-to-integer) instruction to remember if step 3 resulted in a wrap of the exponent (usually not). This involves several integer instructions, however you do it, so far as I can see.

  5. OR the result of step 4 into a something-failed flag, f, which is extremely likely to end up as 0.

  6. Let z, the final result, be the sum of x and y. Plain nearest-or-even rounding will ensure that z is a valid nudge of x.

  7. At some future checkpoint prior to commit, verify that f is 0, and fail out if not.

This does in fact work. It also disrupts the pipeline badly, resulting in a long series of essentially serialized instructions. This could be partially overcome by weaving together a whole bunch of interval operations which can occur in parallel.

Thus it still seems that supporting instructions with builtin rounding modes is far cheaper. One could also just ignore the trend toward interval arithmetic, and thereby make WebAssembly less useful in scientific computing. That would be a shame, though.

Thanks for helping to demonstrate the scope of this problem so thoroughly. Where to go from here?

@KloudKoder
Author

What if I write the Intel/AMD support for you? Would that offer you maintainers enough motivation to port it to other platforms and make it part of the spec? We can argue over the specifics and corner cases, but there's no point in even discussing them if my proposal here isn't sufficiently compelling.

@tlively
Member

tlively commented Mar 19, 2021

@KloudKoder thanks for offering to help implement :) Unfortunately there's a lot more than implementation that goes into getting a new proposal standardized, so this is unlikely to advance without more of the community showing interest. See one of my previous comments on why nice-to-have features often don't make the cut, even if they don't seem to have any technical downsides.

@KloudKoder
Author

@tlively I read your previous comment and I understand why new opcodes are a hard sell.

In this case, that's particularly vexing because intervals have suffered this chicken-and-egg adoption problem since the day they were first proposed for use with IEEE floats. Basically there's a great temptation to avoid using them in high level languages because they're so slow, but no one wants to do the dirty work of optimizing their performance by creating intrinsic support because (surprise!) no one finds them appealing to work with (unless they're desperate for error containment). At the end of the day, making intervals efficient and trivially accessible is a matter of ensuring data security -- just in the analog instead of the digital realm. We're wasting nVidia's (and perhaps others') great contribution of interval-ready floating-point instructions.

One glimmer of hope is that there are already public specs on the expected behavior, available both as a comprehensive body of axioms and also a "diet" version for basic arithmetic. So at least some aspects of their expected behavior are already defined for us.

@tlively
Member

tlively commented Mar 22, 2021

In this case adding interval arithmetic to WebAssembly wouldn't help that chicken-and-egg problem because it would not be any faster than an interval arithmetic library compiled to WebAssembly unless there is underlying hardware support. I recommend going the easier route of writing such a library for now, and if common hardware natively supports interval arithmetic in the future, it will make sense to revisit adding it to WebAssembly as well.

@KloudKoder
Author

@tlively I get that, but it's not accurate. nVidia has hardware interval primitives, in the sense of floating-point instructions with builtin rounding modes. They aren't accessible via WebAssembly.

The Intel way involves setting a control word first, which means you need to do a bunch of high limits in series, then a bunch of low limits, in order to minimize control word manipulation. Writing a library doesn't help because this is something that needs to occur on the compiler level. For example, I could create macros in a library which do interval operations in AVX2 or X87, but unless I control the compiler, I can't get it to do those reorderings so as to avoid all the crippling control word manipulation.

You may see WebAssembly as a way to accelerate websites. While I don't disagree, I also see it as a global distributed computing platform just waiting to happen. Golem, for instance, has already enabled that in some hacky way, but it could be so much more if it ensured analog security by including this type of support. This is particularly important when attempting to ensure that results obtained via different platforms (including those not using WebAssembly) are consistent with one another, as floating-point results of complicated computations are seldom exactly equal. No intervals means this effectively can't happen and WebAssembly never reaches its true potential. So there's a lot more to this than just "please make this fast".

@tlively
Member

tlively commented Mar 23, 2021

Oh I see, I didn't realize this was already supported by Intel hardware. How widespread is the support?

@penzn

penzn commented Mar 23, 2021

It is pretty much ubiquitous on CPUs, not only on x86, it is what fegetround and fesetround library functions are for. The tricky part is that rounding mode is a "global" (per-thread) state which affects FP operations.

@KloudKoder
Author

KloudKoder commented Mar 25, 2021

@tlively

Oh I see, I didn't realize this was already supported by Intel hardware. How widespread is the support?

@penzn is correct. This type of rounding support has been present on Intel for decades. As he indicated, it's a terrible design because changing the rounding mode requires a floating-point pipeline stall. nVidia had the foresight to integrate the rounding mode directly into its instructions, so for instance you can "add and round toward positive infinity" or "multiply and chop toward zero".

What I'm advocating here -- if not comprehensive interval primitives -- is at least a set of opcodes and/or prefixes which will allow efficient access to the relevant nVidia instructions. Those opcodes and/or prefixes can then be at least somewhat efficiently mapped to Intel by grouping same-rounding-type instructions into dense sequences, so as to minimize rounding mode manipulation. However, this could be done as a performance enhancement; the first release could just change rounding modes literally as specified by the WASM code.

@penzn

penzn commented Mar 25, 2021

As he indicated, it's a terrible design because changing the rounding mode requires a floating-point pipeline stall.

I don't think I indicated that ;) It is the reality of CPU FP environment, though maybe it could have been designed better.

This type of rounding support has been present on Intel for decades.

If I understand things correctly, it exists on Arm as well, at least the same fenv functions work there. As usual, there might be discrepancies between the two on how different instruction groups interact with rounding mode.

There is a discussion about better IEEE 754-2019 support in Future Features:

To support exceptions and alternate rounding modes, one option is to define an alternate form for each of add, sub, mul, div, sqrt, and fma. These alternate forms would have extra operands for rounding mode, masked traps, and old flags, and an extra result for a new flags value.

This is similar to the original idea for "relaxed SIMD" fpenv: #1401 (comment).

A disadvantage of first-class interval support is that it would require runtime optimization on CPUs, while we don't yet have a viable GPU implementation where it could be organically supported. However, the idea in "Future Features" suffers from the same issue.

An advantage of interval math is that it can reduce instruction count (two for the price of one), but to be useful it might need to be relatively high-level. For example, a usable interval math library requires more operations than listed above, and the additional operations might not be easy to implement without them also becoming instructions. Another issue is scheduling: to do a few low or high limits in a row, parts of different interval instructions would need to be scheduled together, which requires even more optimization in the runtime or a new execution model.

@KloudKoder
Author

KloudKoder commented Mar 29, 2021

@penzn Yeah my wording wasn't good. I meant that you mentioned the control word architecture, which in my opinion is a terrible design. The bottom line is that a modest percentage increase in native Intel instruction count due to changing the rounding mode results in a multifold explosion in execution time due to the pipeline stalls. As you clearly understand, the only partial mitigation for that would be to group instructions by rounding mode to the extent possible.

As much as I'd be thrilled to see complete interval support in WASM, it's unnecessary. The nVidia building blocks would be enough for practicality. (I wonder how Apple M1 does it, for that matter.) I understand that there's no GPU support yet, but I would expect that to be in WASM's future evolution. Why do you think that interval support would need to be more high-level? Even transcendental functions (e.g. log) can be straightforwardly computed from basic interval arithmetic.

As to the Future Features proposal, it would be a great help toward efficient interval implementation. I am, however, opposed to anything that would set result flags (e.g. zero, infinity, NaN, etc.) unless that step can be bypassed somehow, such as by writing the flags result to a write-only trash register. The reason is that it adds major overhead to evaluate those flags, even on a hardware level. I suspect that modern FPU pipelines know when they can skip this step, such as when 2 floating-point instructions occur in a row, eliminating the need to set flag bits based on the results of the first one.

@penzn

penzn commented Mar 30, 2021

What do you think is the minimal set of interval operations which would be implementable and beneficial on CPU as well?

@KloudKoder
Author

KloudKoder commented Mar 31, 2021

@penzn I can hardly think of a more relevant question for this thread.

First of all, there are really 2 basic architecture choices:

  1. Look like Intel. In other words, have a "set rounding mode" instruction. ("Get rounding mode" isn't necessary because competent software should know what it already set. But no harm if people want that.)

  2. Look like nVidia. In other words, have 4 forms of each instruction: (a) round nearest-or-even (which is what we already have in WASM), (b) chop (inwards toward zero), (c) round toward negative infinity, and (d) round toward positive infinity.

I would gravitate strongly toward option #1 because (a) it will be simpler to optimize for Intel (even though theoretical performance will be identical in both cases) and (b) I don't think anyone wants a massive zoo of similar animals with slight differences in their stripes.

So assuming that we go with this option, then I would strongly caution against the use of any "set floating-point environment" (SFPE) instruction which simultaneously affects exception masking. This will burden rounding mode changes with lots of other logic. Separate these functions across different instructions. (SFPE instructions are OK for the sake of compatibility, but they should be clearly marked as nonperformant.) Critically, neither SFPE nor set-rounding-mode should accept variable inputs. (Imagine having to ask "What's the rounding mode?" prior to every instruction so you know what to execute on nVidia!)

Then the question becomes: what is the minimal set of instructions to be worth discussing here? Probably:

  1. Set round nearest-or-even.
  2. Set chop (round inward).
  3. Set round negative infinity.
  4. Set round positive infinity.
  5. Load ULP.
  6. Nudge inward.
  7. Nudge outward.
  8. Get sign bit (slightly different than "get sign").
  9. Xor sign bits, output to integer or flag (or maybe just "branch if sign bits (don't) differ").
  10. Copy sign bit to another floating-point value.

Just for sake of consistency, we might want the first 4 opcodes above to be assigned in a manner consistent with Intel's RC bits in the FPU control word.

#5 takes a floating-point (FP) value, then returns the smallest-magnitude FP value which, when added to the input with chop rounding, would change its value. (Implicitly, ULP(X) has the same sign as X.) If the input was either sign of zero, then the output shall be FLOAT_MIN (the smallest denormal) of the same sign. If the input was either sign of infinity, then the output shall be unchanged. ULP comes up all over the place in interval arithmetic, so there's a fundamental justification for this instruction. For example, it would be helpful in case one needs to multiply ULP by another FP value in order to compute a compound error term, as would occur when efficiently computing (Y log X) via (truncated) Taylor series. It's also used all the time in order to know when to stop computing such a series.

#6 Decrements the mantissa (without changing the sign). Decrements the exponent as well if the mantissa wraps. May result in a denormal value. If the input was either sign of FLOAT_MIN, then the output shall be zero of the same sign. If the input was either sign of zero or infinity, then the result shall be unchanged.

#7 Increments the mantissa (without changing the sign). Increments the exponent as well if the mantissa wraps. If the input was either sign of zero, then the result shall be FLOAT_MIN of the same sign. If the input was either sign of infinity, then the result shall be the same. If the input was either sign of FLOAT_MAX, then the result shall be infinity of the same sign.

#8 Returns the sign bit, which is important if there's a semantic difference between negative and positive zero, which would not be captured by a traditional "get sign of FP operand" instruction. Note that this deals with the sign bit, which in the aforementioned case actually differs from the (arithmetic) sign.

#9 is important for interval division. For example, the reciprocal of [2, 3] is [(1/3), (1/2)] (at least (1/3) and at most (1/2)). But the reciprocal of [-2, 3] is [(1/3), (-1/2)], meaning at most (-1/2) OR at least (1/3). (Inverted limits imply an open interval, wherein a finite segment is missing from the real line.) The reason is that [-2, 3] contains zero. Moreover, it contains points both to the left and to the right of zero, so the reciprocal approaches infinity of either sign. We need to know if the signs are opposite so we can determine whether or not to exchange the limits. Without this instruction, it's lots of branching with sign extraction just for the basic arithmetic function of division.

#10 This comes up all the time, although good examples escape me at the moment. Without this instruction, it's a lot of pipeline stalls.

I don't mind if the above instructions are undefined for NaN operands, or if they behave as though the NaN is a normal FP value. No point in accelerating broken math. Of course, any undefined operation will potentially leak information about the underlying hardware, and will defeat the possibility of result verification through comparison.

@KloudKoder
Author

Speaking of experts in interval arithmetic, this is the video that got me thinking about the importance of analog security way back in 2007. The issues raised in the video are even more relevant today because, frankly, so little progress has been made. It's time to change that. The speaker is G. Bill Walster. I'll cut right to the chase, but it's worth watching the full video:

https://www.youtube.com/watch?v=U_Cq_h1vrD0&t=680s

I would reach out to him but am not able to due to security constraints on my end. Perhaps one of you would kindly inform him of this thread. He would make an ideal reviewer.

https://www.researchgate.net/profile/G-Walster

@penzn

penzn commented Apr 6, 2021

Disregard my previous comment about denormals, I think we can fully embrace them here, since that is what Wasm does overall (also see WebAssembly/simd#2).

The operations you've listed map to standard math operations, we should probably just support them. I suspect CPU-side interval arithmetic libraries build on those today anyway.

@penzn "just one variant"? By "variant" do you mean "alternative proposed means of encoding rounding mode" or redundant supported means of the same?

I mean new instructions in addition to existing instructions in the spec. Let's say for multiplication we will need to add a new instruction (say) mul_round, taking a rounding mode parameter; then repeat that for the rest of the instructions dependent on rounding. This would be the most spec-neutral way to do it, though it is obviously redundant, as the new instructions would include support for the existing rounding mode. I personally see value in a sticky rounding mode, since that is what happens on native targets, though that may be a hard sell. Another hypothetical option is encoding an optional rounding parameter within existing instructions, but (1) it is unclear how to do that and (2) I don't think we have any precedent for that.

I think there is enough material for at least a CG discussion, to go over the operations in #1384 (comment) and decide which way of encoding rounding mode is preferable. @KloudKoder can you attend one of the meetings (see CG ones here)? You will need to create a Community Group account for that and dial in via Zoom - the meetings are open to people with interest to participate.

Speaking of experts in interval arithmetic, this is the video that got me thinking about the importance of analog security way back in 2007. The issues raised in the video are even more relevant today because, frankly, so little progress has been made. It's time to change that. The speaker is G. Bill Walster. I'll cut right to the chase, but it's worth watching the full video:

https://www.youtube.com/watch?v=U_Cq_h1vrD0&t=680s

Thank you for the link - the video is very informative! I am not sure what is W3C process for requesting external reviews, we can bring this up as well.

@KloudKoder
Author

KloudKoder commented Apr 6, 2021

@penzn OK on denormals. Horrible legacy, but let's continue with it if it already works in WASM.

Another hypothetical option is the encoding optional rounding parameter within existing instructions

How about this: Instead of creating 3 new instructions for every existing roundable instruction, we just create 3 instruction prefixes (round negative, round positive, and chop). No prefix implies nearest-or-even, so it's backward compatible. Redundant prefixes, prefixes before nonroundable instructions, attempts to define return or jump targets immediately following a prefix, or code which terminates in a solitary prefix will all get busted by the linter during compile. (I'm just wary of the current and prospective explosion in instruction definitions implied by 4 different rounding modes, especially when we consider the incessant flood of new SIMD that seems to be a reliable industry dynamic.) But this is not a red line for me. If the team wants to handle multiple instruction flavors in lieu of implementing prefixes, then so be it.

You will need to create a Community Group account

Link, please. Zoom is a giant security hole, but I'll see what I can do. I do understand the value proposition, though.

@KloudKoder
Author

KloudKoder commented Apr 6, 2021

I just realized that we also need:

  1. Nudge positive.
  2. Nudge negative.

The reason is that, when computing a Taylor series for the sake of estimating a transcendental function, one often encounters a situation wherein the error due to truncation is known to be smaller than a ULP in magnitude, but of unknown sign. In such cases, we need to subtract a ULP from the low limit and add a ULP to the high limit. (By definition, we don't need to consider rounding implications when adding or subtracting a ULP because the result is exact, although some corner cases exist, e.g. the ULP of zero is epsilon, the smallest denormal.) This requires signed nudges rather than zeroward nudges.

To those who, like me, hate adding instruction definitions just for the sake of aesthetics, I will preemptively admit that all nudges are technically unnecessary because one can simply load a ULP (instruction #5), then add it to or subtract it from some other float. But that requires explicit use of a precious register and several instructions. Furthermore, nudges are so fundamental that I believe they deserve their own instructions. It would not at all surprise me to see them in hardware as the semiconductor industry finally wakes up to the relevance of intervals in physical security. That would speed things up and also get rid of the need to waste time and a register evaluating the ULP.

Note also that nudges usually come in pairs. This means that, under the hood, there's an implicit "Load ULP" occurring over and over again. Be careful! This isn't necessarily one and the same ULP, so despite appearances, they cannot in general be elided away.

@penzn

penzn commented Apr 7, 2021

Link to join CG: https://www.w3.org/community/webassembly/

You need to add an agenda item (open a PR) to one of the CG meetings, 4/27 seems to be quite open, there are also instructions on getting an invite to the call in the meeting files: https://github.com/WebAssembly/meetings/tree/master/main/2021

Let me know if you have any questions, I'd be glad to help.

To those who, like me, hate adding instruction definitions just for the sake of aesthetics, I will preemptively admit that all nudges are technically unnecessary because one can simply load a ULP (instruction #5), then add it to or subtract it from some other float. But that requires explicit use of a precious register and several instructions. Furthermore, nudges are so fundamental that I believe they deserve their own instructions.

Separate operations allow for additional flexibility in how they can be implemented, though there are other considerations as well, for example performance gains (or lack of thereof). At any rate, this can probably be turned into a proposal first and then we can sort out exact instruction list.

@KloudKoder
Author

@penzn Thanks. I'm waiting on a reply to my CG signup request. I will post back to this issue when I have a meeting date.

@KloudKoder
Author

@penzn FYI still no reply to my request.

@penzn

penzn commented Apr 11, 2021

We might need to ask @dschuff about it.

@KloudKoder
Author

@penzn Thank you, it's in process now, pending the resolution of a technical issue with signing up. Will report back.

@KloudKoder
Author

@penzn The person that was assisting me seems to have disappeared. As of now, I finally have a W3C account, but I'm wondering how to schedule a meeting. Do I make an edit to one of the MD files here, or what?

https://github.com/WebAssembly/meetings/tree/master/main/2021

This has got to be the most awkward way of scheduling a meeting I've ever seen, but I get it: we're dealing with a standards body and have to be painfully formal.

@penzn

penzn commented Apr 16, 2021

@KloudKoder yep, that is correct - edit one of the files and submit a PR with changes. Did you join the community group? The link is https://www.w3.org/community/webassembly/join (it can also be found on the CG page). I don't remember if you get added to CG meetings automatically or still need to email the chair.

@dtig, is there space for this in April 27 meeting or would it be completely consumed by discussions on layering and linking? Also, if you have access to meeting invites, can you please take a look at @KloudKoder's situation?

@KloudKoder
Author

@penzn Looks like we're set for May 11, 2021, assuming I did this PR properly?

@penzn

penzn commented Apr 19, 2021

I think you need to update PR with how much time you are planning to take to present, or comment on it, then it will be official: WebAssembly/meetings#756 (comment)

@KloudKoder
Author

Notes for the CG meeting today (nothing etched in stone):

LOAD ULP (LULP)

Load units-of-the-last-place into an f32 or f64. (This instruction works analogously on f32 and f64.) It's important for interval operations whose error bounds are a function of the last term computed in a Taylor series. So for instance one could multiply that last term by ULP, then widen the interval by that much on each side. It's very expensive to compute with logic instructions and appears all over the place in routine interval calculations.

There are a few different cases:

  1. +-0.0 -> +-epsilon (smallest-magnitude nonzero float).

  2. +-infinity -> +-infinity. There's really no choice about this because the next logical option is NaN, which is worse.

  3. +-small -> +-epsilon where small is anything less than 1.0x2^(-125) for f32 or 1.0x2^(-1021) for f64.

  4. +-other -> +-X where X is the smallest positive float that, when added to other under chop rounding, would change its value.

LOAD HALF ULP (LHULP)

There are distinctly different roles for LULP and LHULP. LULP is used to compute final error bounds, but LHULP is used to test the current Taylor term for triviality. Once the current term becomes smaller than half a ULP, it's often known that the maximum error is either a full ULP, or the product of a full ULP and the current term. It has different corner cases due to discontinuity of exponential domains, which is why it's very expensive to compute from the programmer's level of abstraction.

  1. +-0.0 -> +-epsilon.

  2. +-infinity -> +-infinity.

  3. +-small -> +-epsilon where small is anything less than 1.0x2^(-124) for f32 or 1.0x2^(-1020) for f64.

  4. +-other -> +-(X/2) where X is the smallest positive float that, when added to other under chop rounding, would change its value.

NUDGE INWARD (NIN)

This instruction is used to expand an open interval by moving one of its bounds inward toward zero by a ULP. (Open intervals are implied by their bounds being listed in reverse. They mean "everything on the real line except for this interval".) Note that [-0.0, -0.0], [+0.0, +0.0], and [-0.0, +0.0] all mean "only zero" while [+0.0, -0.0] means "everything except zero".

  1. +-infinity -> +-infinity.

  2. +other -> max(+0.0, +other -LULP(+other)) where (-0.0)<(+0.0).

  3. -other -> min(-0.0, -other -LULP(-other)) where (-0.0)<(+0.0).

As you can see, it's quite expensive.

NUDGE OUTWARD (NOUT)

This instruction is used to expand a (closed) interval by moving one of its bounds outward from zero by a ULP.

  1. +-infinity -> +-infinity.

  2. +-other -> +-other +LULP(+-other).

NUDGE POSITIVE (NPOS)

This instruction is used to increase an interval bound by the upper bound of an error term known to be positive but less than the absolute value of a ULP. This instruction may seem trivial, but it's not, because there are fast ways to add or subtract a ULP without going to the trouble of creating a ULP in its own register.

  1. +-infinity -> +-infinity.

  2. +other -> +other +LULP(+other).

  3. -other -> -other -LULP(-other).

NUDGE NEGATIVE (NNEG)

This instruction does the opposite of NPOS.

  1. +-infinity -> +-infinity.

  2. +other -> +other -LULP(+other).

  3. -other -> -other +LULP(-other).

GET SIGN BIT (GSB)

This is equivalent to returning the arithmetic sign of a float, except that it considers (-0.0) as a negative number. This is intended to help interval code figure out which way it needs to nudge. It's debatable whether this bit should be sent to an integer register or a flag. Perhaps a branch version is as good or better.

SET SIGN BIT (SSB)

Copies the sign bit from an integer's bit zero, or perhaps the carry flag.

XOR SIGN BITS (XSB)

Xor the sign bit of one float to the sign bit of another. Useful for avoiding lots of branches during interval inversion, such as can occur during interval division.

COPY SIGN BIT (CSB)

Copy the sign bit of one float to the sign bit of another. (Is this the same as CopySign? I think we need an instruction for the sign bit, and another for the arithmetic sign.)
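For comparison, C99 already exposes both semantics: `signbit` reports the raw bit (GSB-like) and `copysign` transfers it (CSB-like), and both treat -0.0 as negative — which suggests CopySign does indeed cover the sign-bit case:

```c
#include <math.h>

/* signbit/copysign operate on the raw sign bit, so -0.0 behaves as
 * negative -- the distinction GSB/CSB are meant to preserve. */
int gsb(double x) { return signbit(x) != 0; }

/* Magnitude of `mag` carrying the raw sign bit of `from`. */
double csb(double from, double mag) { return copysign(mag, from); }
```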

ROUND NEAREST-OR-EVEN, CHOP, ROUND NEGATIVE, ROUND POSITIVE

Are these prefixes, instruction flavors, or part of a global variable describing rounding policy? Prefixes provide most backward compatibility with least future instruction expansion. Need to discuss.

@conrad-watt
Contributor

I may not be able to make the meeting today so just getting my opinions out

  • knee-jerk +1 to the "instruction flavor" option

  • prefixes would be fine but a can of worms for other parts of the language (we've resisted using them as solutions in the past)

  • having "rounding policy" global state/variable seems like it would create unnecessary design questions/issues wrt module compositionality, so would prefer one of the other two approaches

@KloudKoder
Author

For the record, the presentation of this issue to the community group is being rescheduled, tentatively for June 8. This is due to cascading technical failures on my part which prevented me from participating today, despite a valiant attempt by the hosts to assist. I know how to prevent this from occurring next time.

@KloudKoder
Author

@penzn @dschuff This was all discussed in the latest WASM CG meeting. I must say I was really impressed with the quality of the feedback and the ability of everyone to stay on topic.

The bottom line was that the group wants to see some sort of performance comparison in a practical case. I've been tasked with creating an apples-to-apples comparison in some performant language (probably C), not necessarily compiling down to WASM. In particular, the experiment is suggested to involve 2 independently optimized operating modes: (1) an implementation involving the 4 rounding modes above (via rounding mode manipulation macros) and (2) the current state of affairs with nearest-or-even rounding, using whatever hacks necessary to make it produce binary-identical results.

There are sort of 3 levels of difficulty:

  1. Easy: Evaluate transcendental functions where we know the sign and the plausible range of outputs up front based on a quick inspection of the parameter. We can therefore optimize for minimal switching among rounding modes. This often comes up in physics simulations.

  2. Medium: Products of random matrices where all the values are positive, so we don't need to inspect signs in order to know how to round. This might come up in portfolio analysis, where asset allocations and prices are both positive; or datacenter power management, where power and cost per watt are also both positive.

  3. Hard: Same matrix products but biased to have a zero mean, so that signs must be inspected in real time. This forces either constant rounding mode thrashing, or a 2-pass approach which processes positive intermediates separately from negative ones, which exacerbates memory accesses. This could come up in AI where we have both rewards and punishments.

I would say it's guaranteed that the harder the problem on this scale, the less the advantage of interval enhancements, simply because there's so much more overhead involved with sign inspection per operation. "Hard" would probably come in with marginal speedup but the others are an open question.

It's also worth noting that as the VM overhead approaches infinity, the performance ratio approaches the instruction count ratio. Perhaps I could also add fake VM overhead so we can see how this transition occurs as it worsens.

I will get to work on all this in my spare time and report back when I have a test loop to share.

@KloudKoder
Author

KloudKoder commented Jun 9, 2021

Thinking aloud: It seems to me that we could forget about ULP (and its int manipulation, which is really costly) if the following were more performant:

When adding/subtracting intervals, just multiply the upper limit by (1+delta) and the lower limit by (1-delta), where delta is the ULP of 1.0 and thus constant. Then do the opposite (upper by (1-delta), lower by (1+delta)). Sort the results (which should be possible in a single 4-way SIMD instruction). Take the most negative as the new lower limit and the most positive as the new upper limit. Or don't sort and just take min and max in successive instructions. (Then again, we might be better off using that SIMD power to do ULP construction on 4 floats at once.)

For multiplication, compute all 4 possible products, then all 8 possible delta-scalings. Sort them and do the same thing as above.

For division, it's basically the same unless either interval straddles zero, then we end up with open (as opposed to closed) intervals, and that's hell no matter what.

@KloudKoder
Author

@conrad-watt @penzn @tlively

Sorry this has taken so long. Allow me to share a bit of my recent thinking. Let:

E = epsilon (smallest positive)
S = small (previously "delta": something like (2^(-20)) for f32 or (2^(-50)) for f64)

Now you need to multiply [A, B] by [C, D] (division being even more complicated):

  1. Use SIMD to compute AC, AD, BC, and BD.

  2. Use SIMD to find the min X (most negative) and max Y of these 4 values.

  3. Use SIMD to compute (X*(1-S)), (X*(1+S)), (Y*(1-S)), and (Y*(1+S)).

  4. Use SIMD to find the min P and max Q of these 4 values.

  5. Issue product [P, Q].

As I suggested previously, the idea here is that you pick S just big enough to overcome the misdirecting "drag" induced by nearest-or-even in all cases, so you don't need any custom rounding modes at all. Genius, right?

Unfortunately... it doesn't work! Consider the case where, say, the exact value of AC is just slightly less than E. Then nearest-or-even will bump it to exactly E. You try to correct this by multiplying by (1-S), but the nearest-or-even result is still E due to quantization error. The same holds for most denormals.

So in addition to the above, one could have an ongoing tracker which remembers the smallest nonzero absolute value ever generated. If it falls below some threshold, then you know that the (1-S) trick didn't work, so you need to do a rollback and compute everything all over again using some totally different algorithm. Not. Fun.

So where does this leave us? As far as I can see, we're still in dire need of custom rounding modes because without them everything is just grotesque (ULP manipulation with exponent extraction and comparison, integer manipulation of floats while checking for wrap, checking for sufficient mantissa space to compensate for nearest-or-even drag, etc.).

In other words, I'm still failing to see how to even construct a fair test of with-and-without custom rounding modes in some language and have it be anywhere near fair to the "without" case while still being roughly optimal in both cases.

Thoughts?

@KloudKoder
Author

Forgot to mention: one tempting way out of the (1-S) problem is just to add or subtract some multiple of E from the min and max, then take min and max again in order to deal with signs properly. If I'm not mistaken, this does in fact work. However, we then get into a situation where purely positive intervals can have products that intrude on the negatives. That opens a whole other can of worms because generally such behavior would be presumed impossible, regardless of precision. It has real codepath effects when it then potentially gives rise to open intervals due to taking reciprocals (of closed intervals that span zero). We could avoid this issue, in turn, by testing limits and manually dealing with corner cases, but then I think we're back to "grotesque".

@KloudKoder
Author

I would like some feedback from stakeholders on where the performance ratio threshold should be in order to justify passing these proposed instructions into the standard. I'm asking because it's been months, and we still have no proposal from anyone, including me, that even suggests less than a 2X performance ratio. Based on the best design so far, that also appears to be a conservative lower bound.

Implementing a test case seems rather challenging because, depending on the particular strategy, it might involve invoking SIMD builtins specific to a particular compiler, and in any event, some floating-point environment magic. Doing full floating-point emulation wouldn't require any of this, but (1) it would be a huge project in itself and (2) it probably wouldn't result in a realistic comparison.

My sense at this point, having gone down numerous rabbit holes only to arrive at grossly nonperformant alternative implementations, is that these proposed instructions, and above all rounding mode flavors, offer a performance advantage far greater than has been used to justify many other instructions that WASM employs and that have a basis in hardware. In other words, do we really need to go through the motions of a major demo project just to prove what is now, in my view, obvious?

Of course, you can argue that no one would ever need interval arithmetic, but I think we're past that stage. The questions at hand are: can it be done efficiently, via whatever creative tricks, under WASM as is? If not, how much acceleration would compel the community to standardize the instructions in question? And finally, in light of that threshold, is it already obvious in the absence of an explicit demo that we will exceed it? We're not in 20% land and never were.

@cmuratori

(Apologies for being late to this party...)

It is pretty much ubiquitous on CPUs, not only on x86, it is what fegetround and fesetround library functions are for. The tricky part is that rounding mode is a "global" (per-thread) state which affects FP operations.

I wanted to point out, if this topic is still being considered, that AVX-512 introduces explicit rounding mode selection per instruction. So, when deciding what to do with WASM here, it is worth considering the possibility that typical x64 math code may slowly transition away from global rounding modes to per-instruction rounding modes, based on the adoption rate of AVX-512 over time.

The way Intel exposed this in C is by adding additional versions of all the intrinsics with a "_round" suffix, and they take an additional constant integer rounding mode enumerant.

- Casey

@KloudKoder
Author

@cmuratori Good contribution! And sorry for the delayed response. I'm honestly kind of exhausted from trying to demonstrate that water is wet.

It seems increasingly inevitable that WASM will need to include support for rounding mode flavors of all applicable floating-point instructions. AVX-512 is -- thank you -- yet another reason. My problem is that I'm unable to prove a negative, in the sense that I can't prove conclusively that there's no creative way of abusing nearest-or-even instructions in order to emulate other rounding flavors. I think we're at the point, though, where it's far more likely that not having this support will result in a future development crunch as the team struggles to catch up with real-world use cases, than that some genius will emerge from the woodwork and prove that all flavors are performance-equivalent to nearest-or-even.

@cmuratori

Well the problem for the development team is cross compatibility. The "correct" way for an instruction set to work is to specify the rounding mode per operation, but not all chips support that, so if WASM were to make that standard, what would it do on chips that don't have those instructions? It can't very well generate FP code that changes the MXCSR or equivalent every other instruction!

So I think whatever happens here it probably has to be an extension, like SIMD128. You would have to go, "does this WASM platform support per-instruction rounding modes?" and if the answer is yes, then you run that stream. If the answer is no, then you do not.

But I do think it would be smart to have that extension, because although I could be wrong, I get the sense that people realize now that global rounding modes were a bad idea :)

- Casey

@KloudKoder
Author

KloudKoder commented Oct 16, 2021

@cmuratori "what would it do on chips that don't have those instructions? It can't very well generate FP code that changes the MXCSR or equivalent every other instruction!" -- Correct, which is why I propose to (1) detect whether alternative rounding flavors are used at all (usually not) and optimize accordingly and (2) if they are used, then reorder instructions as much as possible, so as to minimize the need to touch the Rounding Control bits. At the end of the day, frankly, it's on the chip designers to suffer the consequences of their own bad architecture. Let's not encumber the smart ones in order to level the playing field for the dinosaurs.

I don't think code should ever need to ask "is this supported" because WASM has to check for illegal instructions anyway, so it might as well just emulate as needed (especially in cases like this where emulation is of trivial complexity -- i.e. just change Rounding Control -- despite carrying a hefty performance penalty). I should emphasize that version 1.0 could just change Rounding Control with impunity and not bother regrouping instructions. The point is that such accelerations could happen later if the community ever cared enough for the dinosaurs (and maybe it wouldn't, which is just as well).

It comes down to 2 things that aren't going away, namely: hardware support for native rounding flavors and the increasing employment of interval arithmetic in engineering.

To clarify my previous post, I very much doubt that anyone will find a way to emulate alternative rounding flavors by some ingenious semantic abuse of nearest-or-even plus trivial overhead. If nothing else, the overhead required to properly deal with the denormal case in an alternative rounding flavor is nontrivial, as in, it easily doubles execution time. Ultimately, I think 5X execution time is realistic, although it's hard to pin down an average in practice.

Perhaps it will soon be time to force a vote somehow.
