Implement new NaN behavior #22386

rschlussel · 2024-04-01T16:13:20Z

Description

This PR contains all the changes to overhaul the NaN operators to conform with the new definition proposed in https://github.com/prestodb/rfcs/blob/main/RFC-0001-nan-definition.md.

According to the new NaN definition, NaN is larger than all other numbers and is equal to itself. This PR also changes +0 and -0 to always be considered equal/not distinct; previously their handling was inconsistent as well.
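As a rough illustration of the ordering described above (not the code in this PR), a comparison with these semantics could look like the sketch below. The class and method names are hypothetical and exist only for this example.

```java
// Hypothetical helper for illustration; the names are not taken from the PR.
final class NanOrderingSketch {
    private NanOrderingSketch() {}

    // NaN sorts above every other value (including +Infinity) and equals itself;
    // +0.0 and -0.0 compare as equal.
    static int compare(double a, double b) {
        boolean aNan = Double.isNaN(a);
        boolean bNan = Double.isNaN(b);
        if (aNan || bNan) {
            // Both NaN: equal. Only a is NaN: a is larger. Only b is NaN: b is larger.
            return Boolean.compare(aNan, bNan);
        }
        // Primitive comparison already treats +0.0 and -0.0 as equal.
        if (a < b) {
            return -1;
        }
        if (a > b) {
            return 1;
        }
        return 0;
    }

    static boolean equal(double a, double b) {
        return compare(a, b) == 0;
    }

    public static void main(String[] args) {
        System.out.println(equal(Double.NaN, Double.NaN));
        System.out.println(compare(Double.NaN, Double.POSITIVE_INFINITY) > 0);
        System.out.println(equal(0.0, -0.0));
    }
}
```

Note that plain `a == b` on Java primitives would report NaN as unequal to itself, which is why the NaN case is handled before the ordinary comparison.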

I recommend reviewing commit by commit as it is divided into logically distinct pieces that should be easier to review. If it would be helpful I can also split it up into smaller PRs.

Fixes the following issues:
#22040
#21936
#21877
#22679
#13807
#21065
#22716
facebookincubator/velox#9511

It should also fix #16851, though I'm not sure how to create the file to test it.

Motivation and Context

The motivation for this change is to provide consistent NaN semantics across all of our functions and operators, and to ensure that these semantics are consistent with Velox as we move to native workers. For more details see the RFC: https://github.com/prestodb/rfcs/blob/main/RFC-0001-nan-definition.md.

Impact

NaN will now be treated as greater than all other numbers and as equal to itself in all functions. This changes the behavior of many existing functions and operators. The affected operations include (but are not limited to) =, <, >, joins, array_min, and various distinct-ing functions like set_agg and array_distinct.

It also fixes #22040, a wrong results bug with map_top_n in the presence of NaNs.

This PR also changes joins and aggregations to treat +0 and -0 as equal/not distinct.
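For +0 and -0 to be treated as not distinct in a hash-based join or aggregation, the two values need to hash identically as well as compare equal. The sketch below shows one way to get that, under the assumption that keys are hashed from their bit patterns; `ZeroAndNanHashSketch` and its methods are hypothetical names, not code from this PR.

```java
// Hypothetical sketch: canonicalize a double before hashing it as a key.
final class ZeroAndNanHashSketch {
    private ZeroAndNanHashSketch() {}

    static long canonicalBits(double value) {
        // -0.0 == 0.0 is true for primitives, so this maps both zeros to +0.0.
        if (value == 0.0) {
            value = 0.0;
        }
        // doubleToLongBits collapses every NaN to a single canonical bit pattern.
        return Double.doubleToLongBits(value);
    }

    static int hash(double value) {
        return Long.hashCode(canonicalBits(value));
    }

    public static void main(String[] args) {
        System.out.println(hash(0.0) == hash(-0.0));
        System.out.println(Double.doubleToLongBits(0.0) == Double.doubleToLongBits(-0.0));
    }
}
```

Without the zero check, the raw bit patterns of +0.0 and -0.0 differ in the sign bit, so the two values would usually land in different hash buckets.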

Test Plan

Added tests for all affected functions.

Contributor checklist

Please make sure your submission complies with our development, formatting, commit message, and attribution guidelines.
PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
Documented new properties (with its default value), SQL syntax, functions, or other functionality.
If release notes are required, they follow the release notes guidelines.
Adequate tests were added if applicable.
CI passed.

Release Notes

Please follow release notes guidelines and fill in the release notes below.

== RELEASE NOTES ==

General Changes
* Change handling of floating point numbers in Presto to consistently treat NaN as larger than any other number and equal to itself. Positive and negative zero are now always considered equal to each other. Read more here: https://github.com/prestodb/rfcs/blob/main/RFC-0001-nan-definition.md. The new NaN behavior can be disabled by setting the configuration property ``use-new-nan-definition`` to ``false``. This configuration property is intended to be temporary to ease migration in the short term, and will be removed in a future release.
* Fix a bug where map_top_n could return wrong results if there is any NaN input.
* Fix a bug with array_min/array_max where it would return NaN rather than null when the input contained both NaN and null.
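To make the array_min note concrete, here is a minimal sketch of the intended result for mixed NaN and null input. It assumes that any null element makes the result null and that NaN is ordered as the largest value; `ArrayMinSketch` is a hypothetical stand-in, not Presto's implementation.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; not the PR's implementation of array_min.
final class ArrayMinSketch {
    private ArrayMinSketch() {}

    // Ordering under the new definition: NaN is larger than every other value.
    static int compareNanLargest(double a, double b) {
        boolean aNan = Double.isNaN(a);
        boolean bNan = Double.isNaN(b);
        if (aNan || bNan) {
            return Boolean.compare(aNan, bNan);
        }
        return a < b ? -1 : (a > b ? 1 : 0);
    }

    // Returns null if any element is null (assumed rule) or if the list is empty.
    static Double arrayMin(List<Double> values) {
        Double min = null;
        for (Double value : values) {
            if (value == null) {
                return null;
            }
            if (min == null || compareNanLargest(value, min) < 0) {
                min = value;
            }
        }
        return min;
    }

    public static void main(String[] args) {
        System.out.println(arrayMin(Arrays.asList(Double.NaN, null, 1.0)));
        System.out.println(arrayMin(Arrays.asList(Double.NaN, 2.0, 1.0)));
    }
}
```

The null check runs for every element, so a null is never masked by a NaN seen earlier in the array.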

elharo

A lot (all?) of these type comparisons look like they should be .equals for everything, not just the real and double types. Type is not an enum. I guess they're supposed to be singletons? But then why doesn't this work for real and double?

rschlussel · 2024-05-07T13:36:28Z

It doesn't work for real and double in this PR because I added a flag to those types for the NaN migration, so depending on the configuration property, the flag could be true or false. (Also, while you're certainly welcome to review while it's in this state, I'm planning to clean up the commits to be logically independent once I've finished fixing and adding tests and covering all the cases.)

tdcmeehan

Add property for new nan behavior

tdcmeehan · 2024-06-04T18:42:53Z

presto-main/src/main/java/com/facebook/presto/sql/analyzer/FeaturesConfig.java

+        return useNewNanDefinition;
+    }
+
+    @Config("use-new-nan-definition")


Should we call this the new NaN definition, and by default this is false, or should we refer to the old behavior as the deprecated NaN definition, and by default that is true?

Oh, I see the later commit that enables it. I would have recommended this if it were disabled by default, but I think it's fine as it is.

tdcmeehan

Add operator double support for new NaN...

tdcmeehan · 2024-06-04T18:57:06Z

presto-common/src/main/java/com/facebook/presto/common/type/TypeUtils.java

+        long aBits = doubleToLongBits(a);
+        long bBits = doubleToLongBits(b);
+        if (aBits < bBits) {
+            return -1;
+        }
+        if (aBits > bBits) {
+            return 1;
+        }
+        return 0;


In theory, isn't it possible for multiple NaNs to have separate values when represented as bits? Supposing this came from a datasource and not from Presto itself. If so, would this work?

doubleToLongBits coerces all the NaNs to the same representation. There's a different function, Double.doubleToRawLongBits(), that doesn't do that.

I added a comment in the code and some tests for this.
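For readers who want to see the difference, the sketch below builds a NaN with a non-canonical payload and compares the two conversions. The class name and the chosen payload are illustrative only; the raw bits printed by `doubleToRawLongBits` can depend on the platform, so the example makes no claim about them.

```java
// Illustration of Double.doubleToLongBits versus Double.doubleToRawLongBits for NaN.
final class NanBitsSketch {
    private NanBitsSketch() {}

    // A NaN whose payload differs from the canonical 0x7ff8000000000000L pattern.
    static final double PAYLOAD_NAN = Double.longBitsToDouble(0x7ff8000000000001L);

    // True when both values map to the same bits after NaN canonicalization.
    static boolean sameCanonicalBits(double a, double b) {
        return Double.doubleToLongBits(a) == Double.doubleToLongBits(b);
    }

    public static void main(String[] args) {
        // Canonicalized: every NaN reports the same bit pattern.
        System.out.println(Long.toHexString(Double.doubleToLongBits(PAYLOAD_NAN)));
        // Raw: the payload may be preserved, depending on the platform.
        System.out.println(Long.toHexString(Double.doubleToRawLongBits(PAYLOAD_NAN)));
        System.out.println(sameCanonicalBits(Double.NaN, PAYLOAD_NAN));
    }
}
```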

tdcmeehan

Add NaN definition to DoubleType ...

tdcmeehan · 2024-06-04T19:03:31Z

presto-common/src/main/java/com/facebook/presto/common/type/DoubleType.java

+        // a time. .equals() comparison is always used against the static DOUBLE
+        // instance to check if something is double type. this hack is temporary
+        // and will be removed when we full remove the old nan behavior
+        return other == DOUBLE || other == OLD_NAN_DOUBLE;


Should we just do an instanceof check? It wouldn't require a lengthy explanation. Likewise for float.

tdcmeehan

This looks awesome, thanks for the great work @rschlussel.

Add a boolean field to DoubleType and RealType to determine whether to use the new NaN definition. If useNewNanDefinition is set to true in the configuration property, then only doubles/reals with that field set to true will be created; if it is false, then only doubles/reals with that field set to false will be created. Later commits use this field to decide how to handle NaNs. Because DoubleType and RealType are now parametrized types rather than singleton instances, it is no longer correct to use type == DOUBLE for type checking. All code that was using type == DOUBLE or type == REAL has been updated to use .equals() comparison.
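The sketch below shows why reference comparison stops working once a type carries a flag. `FlaggedTypeSketch` is a hypothetical stand-in for DoubleType; only the `DOUBLE` and `OLD_NAN_DOUBLE` names and the shape of `equals` follow the diff quoted in the review above, and everything else is assumed for the example.

```java
// Hypothetical stand-in for a type with a NaN-definition flag; not the PR's class.
final class FlaggedTypeSketch {
    static final FlaggedTypeSketch DOUBLE = new FlaggedTypeSketch(true);
    static final FlaggedTypeSketch OLD_NAN_DOUBLE = new FlaggedTypeSketch(false);

    private final boolean useNewNanDefinition;

    private FlaggedTypeSketch(boolean useNewNanDefinition) {
        this.useNewNanDefinition = useNewNanDefinition;
    }

    boolean usesNewNanDefinition() {
        return useNewNanDefinition;
    }

    // Both instances count as "the double type", regardless of the flag.
    @Override
    public boolean equals(Object other) {
        return other == DOUBLE || other == OLD_NAN_DOUBLE;
    }

    // Equal instances must share a hash code, so the flag is left out.
    @Override
    public int hashCode() {
        return FlaggedTypeSketch.class.hashCode();
    }

    // Type check that works for either instance, unlike type == DOUBLE.
    static boolean isDouble(Object type) {
        return DOUBLE.equals(type);
    }

    public static void main(String[] args) {
        FlaggedTypeSketch type = OLD_NAN_DOUBLE;
        System.out.println(type == DOUBLE);
        System.out.println(isDouble(type));
    }
}
```

With the flag set to false, a `type == DOUBLE` check silently fails even though the value is a double type, which is the failure mode the switch to `.equals()` avoids.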

rschlussel force-pushed the nan-operators branch 2 times, most recently from c0948db to 62da9bf Compare April 3, 2024 19:14

rschlussel mentioned this pull request May 6, 2024

Array index out of bounds error in multimap_agg for nan() keys #22679

Open

elharo reviewed May 7, 2024

rschlussel force-pushed the nan-operators branch 2 times, most recently from d2b923d to 1679716 Compare May 13, 2024 15:58

rschlussel force-pushed the nan-operators branch 18 times, most recently from 158134f to 3a9bd84 Compare May 31, 2024 18:04

rschlussel force-pushed the nan-operators branch 2 times, most recently from 2a25b6b to bd3141d Compare May 31, 2024 19:09

rschlussel changed the title ~~[WIP] Nan operators~~ Implement new NaN behavior May 31, 2024

rschlussel force-pushed the nan-operators branch from bd3141d to bfae346 Compare May 31, 2024 19:14

rschlussel marked this pull request as ready for review May 31, 2024 19:23

sdruzkin previously approved these changes Jun 3, 2024

jaystarshot previously approved these changes Jun 4, 2024

tdcmeehan reviewed Jun 4, 2024

rschlussel dismissed stale reviews from jaystarshot, sdruzkin, and steveburnett via 6879f44 June 5, 2024 13:38

rschlussel force-pushed the nan-operators branch 3 times, most recently from fbf6bbc to 623eb7f Compare June 5, 2024 14:02

tdcmeehan mentioned this pull request Jun 5, 2024

Fix min and max for inputs that include NaN values #21893

Closed

6 tasks

tdcmeehan previously approved these changes Jun 6, 2024

rschlussel added 12 commits June 6, 2024 14:04

Add property for new nan behavior

6547970

Add double operator support for new NaN definition

4c64569

Add support for the new NaN definition for =, <>, >, <, >=, <=, BETWEEN, IN, NOT IN.

Add real operator support for new nan definition

2534c53

This adds support for the new NaN definition to =, <>, <, >, <=, >=, BETWEEN, IN, NOT IN for real types.

Add support for new nan definition to tuple domains

d48dee5

This adds support for the new NaN definition for tuple domains, which are used for Hive filter pushdown.

Change equality definition for double and real types

82e51b6

Support new nan definition in dynamic filters

e8900c0

Fix array_min/max function for nans

2cd9961

Also fixes the case where the array has both NaNs and nulls.

Fix Greatest/Least for new NaN definition

170090b

Set use-new-nan-definition to true by default

720e4cc

Add tests for join and distinct with +/-0

18fb9f4

Add documentation for handling of NaNs

8388519

rschlussel dismissed tdcmeehan’s stale review via 8388519 June 6, 2024 18:07

rschlussel force-pushed the nan-operators branch from 623eb7f to 8388519 Compare June 6, 2024 18:07

tdcmeehan approved these changes Jun 6, 2024

rschlussel merged commit b673668 into prestodb:master Jun 6, 2024
57 checks passed
