
sparklyr arrange behavior is different than dplyr arrange - order null values after non-null values #3368

Open
mohanrajdtv opened this issue Aug 22, 2023 · 0 comments

sparklyr's arrange behavior differs from dplyr's arrange.

dplyr's arrange places missing (NULL) values after all non-null values (https://dplyr.tidyverse.org/reference/arrange.html#missing-values).

sparklyr's arrange does not place null values at the end, as shown in the generated query below:

index_conditions_sparklyr <- index_conditions %>%
  arrange(PersonId, RecordedDateTime, EncounterId) %>%
  group_by(PersonId) %>%
  mutate(row_number = row_number()) %>%
  filter(row_number == 1) %>%
  show_query()

SELECT *
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY PersonId ORDER BY PersonId, RecordedDateTime, EncounterId) AS row_number
  FROM table_1692663744
) q01
WHERE (row_number = 1.0)

Spark SQL supports ordering null values after non-null values via the NULLS LAST clause:
https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-orderby.html

Example:

SELECT *
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY PersonId ORDER BY RecordedDateTime NULLS LAST, EncounterId NULLS LAST) AS row_number
  FROM tbl_index_conditions
) t
WHERE row_number = 1

PySpark also exposes this via Column.asc_nulls_last: https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Column.asc_nulls_last.html

Requesting that the arrange method support ordering null values after non-null values.
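Until arrange() emits NULLS LAST, one possible workaround is to sort on an explicit is-NULL flag first: dbplyr translates is.na() into a SQL (col IS NULL) expression, and FALSE (0) orders before TRUE (1), so non-null rows come first. A minimal sketch, assuming the index_conditions table and column names from this issue (untested against a live Spark connection):

library(dplyr)

# Workaround sketch: emulate "NULLS LAST" by sorting on an is-NULL flag
# before each column. is.na() translates to (col IS NULL) in the generated
# ORDER BY, so non-null values sort ahead of nulls within each PersonId.
index_conditions_first <- index_conditions %>%
  arrange(PersonId,
          is.na(RecordedDateTime), RecordedDateTime,
          is.na(EncounterId), EncounterId) %>%
  group_by(PersonId) %>%
  mutate(row_number = row_number()) %>%
  filter(row_number == 1) %>%
  ungroup()

This only approximates the dplyr behavior and makes the generated ORDER BY longer; native NULLS LAST support in arrange() would still be preferable.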
