enable quad precision floating point component extraction #336
Conversation
LGTM
Is there a more extensive test suite I can run on the Pi, other than the normal ctest? I just added the -DBUILD_CI=ON switch and am re-building right now. This should give me some more tests.
If you are running the full BUILD_CI, that is the full regression suite across all number systems. That just completed successfully in CI, so it was regressed on Windows/MSVC, MacOS/Clang, and Ubuntu/gcc; the rest of the lib is validated. I had forgotten that I had that little struct hidden away in bitblock. bitblock was the foundation of the early posit implementation, but now all the other number systems are based on a limb-based system instead of the bit-oriented bitblock, to gain some performance. I will see if I can elevate the uint128 into its own swimlane so that it becomes more universally useful as a building block for other algorithms inside the library.
Alright, building with BUILD_CI now, will report back whether the tests pass on the Pi. As far as I can tell, the CI jobs don't actually test the new additions, since they only run when the long double is 128 bits, which is most likely not the case for these jobs.
That is correct, but your propenv should at least tell you if it hits your code path. We can add a quick test case to the posit<128,4> and posit<256,5> test suites to see if your code moves the extra bits into the posit. Having a quad precision floating-point will enable all kinds of testing for even posit<32,2> and posit<64,3>, as they need more mantissa bits than double precision provides to correctly round the reference values.
One quick idea: you can use `cfloat<128, 15, std::uint32_t, true, false, false>` to generate arbitrary bit patterns that you can load into the long double. That way you can generate a set of precise bit patterns that you could test. One good test that uses the standard universal floating-point API would be to do something like this:

```cpp
quad a{1.0l}, b{1.0l}, c{1.0l}; // quad is a cfloat<128, 15, uint32_t, true, false, false>
```

For the regression tests we would want to do the decrement that way, so that the posit regressions stay independent of cfloat. To make quick progress with this, maybe best is to add a separate conversion test for long double.
So the posit_truncate test fails apparently:
Although to me, this looks totally unrelated to my changes.
This is passing in CI:
You can trace out which test case is failing by setting a boolean in the test main:
If you set reportTestCases to true, the test will print out which case is failing.
Signed-off-by: yboettcher <39460066+yboettcher@users.noreply.github.com>
First up: I now got your example from #336 (comment) to succeed. I just messed up the ordering of source and destination for memcpy. I actually ran into some other strange behaviour when trying it out. As for the failing CI test, this is the result with reportTestCases enabled:
So, if I read this right, only one case (per function) fails, and that is NaR. Could it be that the conversion of the NaR to a float (or the subsequent floor/ceil/etc.) is handled differently on (some) ARM CPUs? You said it worked on the M1 though... I'll try to look into this. OTOH: I would assume that converting a posit into a float is not related to this PR. It seems like `to_float` delegates to `to_double`, which extracts the posit components and turns them into a double, which is then cast to a float, so there is no long double involved.
When I try to just convert a NaN into an int, I get different results on the two platforms. No posits or anything involved, just pure C++ floats and ints:

```cpp
#include <bitset>
#include <cassert>
#include <cstring>
#include <iostream>

int main() {
	// sanity check
	assert(sizeof(float) == sizeof(int));
	// nan bit pattern
	int nan_pattern = 0x7fc00000;
	// create float from pattern
	float nan_float;
	std::memcpy(&nan_float, &nan_pattern, sizeof(float));
	// use normal int conversion on the floor
	int result = int(nan_float);
	// check results
	std::cout << "float: " << nan_float << std::endl;
	std::cout << "result: " << result << std::endl;
	std::cout << "result_pattern: " << std::bitset<32>(result) << std::endl;
	return 0;
}
```

Gives me this output on x86:
and this on my raspberry pi:
Am I right in assuming that the result of converting a NaN into an integer type is just undefined and behaves differently on different platforms? What confuses me as well is that the bit pattern does not match the printed result... I found this question https://stackoverflow.com/questions/10366485/problems-casting-nan-floats-to-int, which has some interesting information:
and
The first statement would explain what I am seeing here; the second one, however, leads me to question my test code above. I am definitely not getting 0x80000000 on the x86. I did get that earlier though (when converting into long, I think). In any case, I think this is quite far from the actual topic of the PR and has nothing to do with quad precision floating point numbers.
@yboettcher I have found that the NaN and Inf patterns are different among compilers, so this trick to push bits into a float storage might conflict with how the C++ runtime you are using is interpreting the patterns.
The fact that the result is 65536 or 0 is interesting. That feels like there is an improper mask at play. The posit API has a setnar() method, so it would be interesting to see this code on your Arduino:
Hmm, I used clang 15 on both platforms, and when printing out the value it did print as "nan", so I think it might be fine...
This example prints:
@yboettcher you found a bug in the posit printing of binary format. NaR is definitely 0b1.000000... The bit encoding was correct, just not the printing of the binary format. I put in the fix in this commit: 40ef3ad
LGTM
This PR adds the extraction of floating point components for 128 bit long doubles.
Regarding the changes:

- `fraction_without_hidden_bit` is either 64 bit or 128 bit in size. To allow for this distinction, the creation of the `fraction_without_hidden_bit` variable was moved into the if expression along with the call to `extract_fp_components()`.
- A new condition was added to handle 16 byte long doubles with more than 64 mantissa bits. This uses universal's `internal::uint128` type to hold the mantissa, but works the same otherwise. One difference is that the uint128 cannot be turned into a string directly, so the output for `_trace_value_conversion` is modified slightly.
- `memcpy` is used instead of `reinterpret_cast`, because the long double could not be `reinterpret_cast`ed into the uint128. Assuming that the memory layout of uint128 is correct, this should be right.

Also, instead of masking away the upper bits, I assert that this function is only called when the long double has more than 64 mantissa bits (i.e. the only removed bits are in the upper part of the uint128) and then use two shifts to remove the upmost bits. The value used for shifting is a part I am not certain about, as the computed shift value emits a lot of compiler warnings when building on architectures where the long double has fewer mantissa bits.