enable quad precision floating point component extraction #336
Conversation
LGTM
Is there a more extensive test suite I can run on the Pi, other than the normal ctest? I just added the -DBUILD_CI=ON switch and am re-building right now. This should give me some more tests.
If you are running the full BUILD_CI, that is the full regression suite across all number systems. That just completed successfully in CI, so it was regressed on Windows/MSVC, MacOS/Clang, and Ubuntu/gcc; the rest of the lib is validated. I had forgotten that I had that little struct hidden away in bitblock. bitblock was the foundation of the early posit implementation, but now all the other number systems are based on a limb-based system instead of the bit-oriented bitblock, to gain some performance. I will see if I can elevate the uint128 into its own swimlane so that it becomes more universally useful as a building block for other algorithms inside the library.
Alright, building with BUILD_CI now, will report back whether the tests pass on the Pi. As far as I can tell, the CI jobs don't actually test the new additions, since they only run when the long double is 128 bits, which is most likely not the case for these jobs.
That is correct, but your propenv should at least tell you if it hits your code path. We can add a quick test case to the posit<128,4> and posit<256,5> test suites to see if your code moves the extra bits into the posit. Having a quad precision floating-point will enable all kinds of testing for even posit<32,2> and posit<64,3>, as they need more mantissa bits than double precision provides to correctly round the reference values.
One quick idea: you can use `cfloat<128, 15, std::uint32_t, true, false, false>` to generate arbitrary bit patterns that you can load into the long double. That way you can generate a set of precise bit patterns that you could test. One good test that uses the standard universal floating-point API would be to do something like this:

```cpp
quad a{1.0l}, b{1.0l}, c{1.0l}; // quad is a cfloat<128, 15, uint32_t, true, false, false>
```

For the regression tests we would want to do the decrement that way, so that the posit regressions stay independent of cfloat. To make quick progress with this, maybe best is to add a separate conversion test for long double.
So the posit_truncate test fails apparently:
Although to me, this looks totally unrelated to my changes.
This is passing in CI:
You can trace out which test case is failing by setting a boolean in the test main:
If you set reportTestCases to true, the test will print out which case is failing.
Signed-off-by: yboettcher <39460066+yboettcher@users.noreply.github.com>
First up: I now got your example from #336 (comment) to succeed. I just messed up the ordering of source and destination for memcpy. I actually ran into some other strange behaviour when trying it out. As for the failing CI test, this is the result with reportTestCases enabled:
So, if I read this right, only one case (per function) fails, and that is NaR. Could it be that the conversion of the NaR to a float (or the subsequent floor/ceil/etc.) is handled differently on (some) ARM CPUs? You said it worked on the M1 though... I'll try to look into this. OTOH: I would assume that converting a posit into a float is not related to this PR. It seems like `to_float` delegates to `to_double`, which extracts the posit components and turns them into a double, which is then cast to a float, so there is no long double involved.
When I try to just convert a NaN into an int, I get different results on the two platforms. No posits or anything involved, just pure C++ floats and ints:

```cpp
#include <bitset>
#include <cassert>
#include <cstring>
#include <iostream>

int main() {
	// sanity check
	assert(sizeof(float) == sizeof(int));
	// nan bit pattern
	int nan_pattern = 0x7fc00000;
	// create float from pattern
	float nan_float;
	std::memcpy(&nan_float, &nan_pattern, sizeof(float));
	// use normal int conversion on the floor
	int result = int(nan_float);
	// check results
	std::cout << "float: " << nan_float << std::endl;
	std::cout << "result: " << result << std::endl;
	std::cout << "result_pattern: " << std::bitset<32>(result) << std::endl;
	return 0;
}
```

Gives me this output on x86:
and this on my raspberry pi:
Am I right in assuming that the result of converting a NaN into an integer type is just undefined and behaves differently on different platforms? What confuses me as well is that the bit pattern does not match the printed result... I found this question https://stackoverflow.com/questions/10366485/problems-casting-nan-floats-to-int, which has some interesting information:
and
The first statement would explain what I am seeing here; the second one, however, leads me to question my test code above. I am definitely not getting 0x80000000 on the x86. I did get that earlier though (when converting into long, I think). In any case, I think this is quite far from the actual topic of the PR and has nothing to do with quad precision floating point numbers.
@yboettcher I have found that the NaN and Inf patterns are different among compilers, so this trick to push bits into a float storage might conflict with how the C++ runtime you are using is interpreting the patterns.
The fact that the result is 65536 or 0 is interesting. That feels like there is an improper mask at play. The posit API has a setnar() method, so it would be interesting to see this code on your Arduino:
Hmm, I used clang 15 on both platforms, and when printing out the value it did print as "nan", so I think it might be fine...
This example prints:
@yboettcher you found a bug in the posit printing of binary format. NaR is definitely 0b1.000000... The bit encoding was correct, just not the printing of the binary format. I put in the fix in this commit: 40ef3ad
LGTM
This PR adds the extraction of floating point components for 128 bit long doubles.
Regarding the changes:

- `fraction_without_hidden_bit` is either 64 bit or 128 bit in size. To allow for this distinction, the creation of the `fraction_without_hidden_bit` variable was moved into the if expression along with the call to `extract_fp_components()`.
- A new condition was added to handle 16 byte long doubles with more than 64 mantissa bits. This uses universal's `internal::uint128` type to hold the mantissa, but works the same otherwise. One difference is that the uint128 cannot be turned into a string directly, so the output for `_trace_value_conversion` is modified slightly.
- `memcpy` is used instead of `reinterpret_cast`, because the long double could not be `reinterpret_cast`ed into the uint128. Assuming that the memory layout of uint128 is correct, this should be right.

Also, instead of masking away the upper bits, I assert that this function is only called when the long double has more than 64 mantissa bits (i.e. the only removed bits are in the upper part of the uint128) and then use two shifts to remove the upmost bits. The value used for shifting is a part I am not certain about, as the computed shift value emits a lot of compiler warnings when building on architectures where the long double has fewer mantissa bits.