Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement new NaN behavior #22386

Merged
merged 12 commits into from
Jun 6, 2024
Merged

Conversation

rschlussel
Copy link
Contributor

@rschlussel rschlussel commented Apr 1, 2024

Description

This PR contains all the changes to overhaul the NaN operators to conform with the new definition proposed in https://github.com/prestodb/rfcs/blob/main/RFC-0001-nan-definition.md.

According to the new nan definition, Nan is larger than all other numbers and is equal to itself. This PR also changes +0 and -0 to always be considered equal/not distinct, whereas previously there was inconsistency here as well.

I recommend reviewing commit by commit as it is divided into logically distinct pieces that should be easier to review. If it would be helpful I can also split it up into smaller PRs.

Fixes the following issues:
#22040
#21936
#21877
#22679
#13807
#21065
#22716
facebookincubator/velox#9511

It should also fix should also fix #16851, though I'm not sure how to create the file to test it.

Motivation and Context

The motivation for this change is to provide consistent NaN semantics across all of our functions and operators. It is also to ensure that these semantics are consistent with velox as we move to native workers. For more details see the RFC: https://github.com/prestodb/rfcs/blob/main/RFC-0001-nan-definition.md.

Impact

nan will now be treated as greater than all other numbers for all functions and as equal to itself for all functions. This changes the behavior of many existing functions and operators. These differences include (but are not limited to) =, <, >, joins, various distincting functions like set_agg and array_distinct, array_min.

It also fixes #22040, a wrong results bug with map_top_n in the presence of nans.

This PR also changes joins and aggregations to treat +0 and -0 as equal/not distinct.

Test Plan

Added tests for all affected functions.

Contributor checklist

  • Please make sure your submission complies with our development, formatting, commit message, and attribution guidelines.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with its default value), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.

Release Notes

Please follow release notes guidelines and fill in the release notes below.

== RELEASE NOTES ==

General Changes
* Change handling of floating point numbers in Presto to consistently treat NaNs as larger than any other number and equal to itself.  It also changes the handling of positive and negative zero to always be considered equal to each other.  Read more here: https://github.com/prestodb/rfcs/blob/main/RFC-0001-nan-definition.md. The new nan behavior can be disabled by setting the configuration property ``use-new-nan-definition`` to ``false``. This configuration property is intended to be temporary to ease migration in the short term, and will be removed in a future release.
* Fix a bug where map_top_n could return wrong results if there is any NaN input
* Fix a bug with array_min/array_max where it would return NaN rather than null when there was both NaN and null input. 

Copy link
Contributor

@elharo elharo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A lot (all?) of these type comparisons look like they should be .equals for everything, not just the real and double types. Type is not an enum. I guess they're supposed to be singletons? But then why doesn't this work for real and double?

@rschlussel
Copy link
Contributor Author

It doesn't work for real and double in this PR because i added a flag to it for the nan migration, so depending on the configuration property, the flag could be true or false. (Also, while you're certainly welcome to review while it's in this state, I'm planning to clean up the commits to be logically independent once I've finished fixing the tests/adding tests and covering all the cases).

@rschlussel rschlussel force-pushed the nan-operators branch 2 times, most recently from d2b923d to 1679716 Compare May 13, 2024 15:58
@rschlussel rschlussel force-pushed the nan-operators branch 18 times, most recently from 158134f to 3a9bd84 Compare May 31, 2024 18:04
@rschlussel rschlussel force-pushed the nan-operators branch 2 times, most recently from 2a25b6b to bd3141d Compare May 31, 2024 19:09
@rschlussel rschlussel changed the title [WIP] Nan operators Implement new NaN behavior May 31, 2024
@rschlussel rschlussel marked this pull request as ready for review May 31, 2024 19:23
sdruzkin
sdruzkin previously approved these changes Jun 3, 2024
jaystarshot
jaystarshot previously approved these changes Jun 4, 2024
Copy link
Contributor

@tdcmeehan tdcmeehan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add property for new nan behavior

return useNewNanDefinition;
}

@Config("use-new-nan-definition")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we call this the new NaN definition, and by default this is false, or should we refer to the old behavior as the deprecated NaN definition, and by default that is true?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I see the later commit that enables it. I would have recommended this if it were disabled by default, but I think it's fine as it is.

Copy link
Contributor

@tdcmeehan tdcmeehan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add operator double support for new NaN...

Comment on lines 254 to 262
long aBits = doubleToLongBits(a);
long bBits = doubleToLongBits(b);
if (aBits < bBits) {
return -1;
}
if (aBits > bBits) {
return 1;
}
return 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In theory, isn't it possible for multiple NaNs to have separate values when represented as bits? Supposing this came from a datasource and not from Presto itself. If so, would this work?

Copy link
Contributor Author

@rschlussel rschlussel Jun 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doubleToLongBits coerces all the nans to the same representation. There's a different function Double.doubleToRawLongBits() that doesn't do that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a comment in the code and some tests for this.

Copy link
Contributor

@tdcmeehan tdcmeehan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add NaN definition to DoubleType ...

// a time. .equals() comparison is always used against the static DOUBLE
// instance to check if something is double type. this hack is temporary
// and will be removed when we full remove the old nan behavior
return other == DOUBLE || other == OLD_NAN_DOUBLE;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we just do an instanceof check? It wouldn't require a lengthy explanation. Likewise for float.

tdcmeehan
tdcmeehan previously approved these changes Jun 6, 2024
Copy link
Contributor

@tdcmeehan tdcmeehan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks awesome, thanks for the great work @rschlussel.

Add a boolean field to DoubleType and RealType to determine whether to
use the new nan definition.  If useNewNanDefinition is set to true in
the configuration property, then only doubles/reals with that property
set to true will be created, and if it is false, then only doubles/reals
with that property set to false will be created.  This will be used in
later commits to make decisions about how to handle nans.

Because DoubleType and RealType are now parametrized types, it is no
longer correct to use type == DOUBLE for type checking, as it is no
longer a singleton instance.  All code that was using type == DOUBLE or
type == REAL has been updated to use .equals() comparison
Add support for new nan defintion for =, <>, >, <, >=, <=, between, in,
not in.
This adds support for the new nan definition to =, <>, <, >, <=,>=,
BETWEEN, IN, NOT IN for real types.
This adds support for new nan definition for tuple domains, which are
use for hive filter pushdown.
also fixes when array has nans and nulls
@rschlussel rschlussel merged commit b673668 into prestodb:master Jun 6, 2024
57 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
6 participants