Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(rust, python): Return datetime for mean/median of Date column #15679

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

mcrumiller
Copy link
Contributor

@mcrumiller mcrumiller commented Apr 16, 2024

Resolves #13598. Part of overall #13599.

This treats dates and datetimes as analagous to integers/floats: the mean of ints is a float, and now the mean of dates is a datetime:

mean(1, 2) = 1.5
mean(2023-01-01, 2023-01-02) = 2023-01-01 12:00.

This is the last temporal mean/median PR, which means a lot of custom logic for the different temporal types has been collapsed/simplified. If this is accepted, we can deprecate dt.mean and dt.median, as mean/median for all temporals have now been implemented, they are now functionally equivalent to series.mean and series.median for all temporal dtypes. I'll do a followup on std/var/quantile.

This PR is almost identical to #13492, but that was pretty old/stale and cluttered, so easier to just make a new one. @ritchie46 to get streaming to work, I had to update the MeanAgg strutc with a flag to indicat the Date -> Datetime conversion. AFAIK Date is the only dtype that requires a value conversion for mean now, so I don't think we need a more general implementation, but you may think otherwise.

@github-actions github-actions bot added enhancement New feature or an improvement of an existing feature python Related to Python Polars rust Related to Rust Polars labels Apr 16, 2024
Copy link

codspeed-hq bot commented Apr 16, 2024

CodSpeed Performance Report

Merging #15679 will not alter performance

Comparing mcrumiller:date-mean-median (44bf877) with main (a9c8b59)

Summary

✅ 12 untouched benchmarks

Copy link

codecov bot commented Apr 16, 2024

Codecov Report

Attention: Patch coverage is 94.06780% with 7 lines in your changes are missing coverage. Please review.

Project coverage is 81.34%. Comparing base (a9c8b59) to head (798fda9).
Report is 2 commits behind head on main.

❗ Current head 798fda9 differs from pull request most recent head 44bf877. Consider uploading reports for the commit 44bf877 to get more accurate results

Files Patch % Lines
...src/executors/sinks/group_by/aggregates/convert.rs 81.81% 4 Missing ⚠️
...s-core/src/frame/group_by/aggregations/dispatch.rs 95.34% 2 Missing ⚠️
crates/polars-core/src/series/mod.rs 93.75% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##             main   #15679       +/-   ##
===========================================
+ Coverage   39.14%   81.34%   +42.19%     
===========================================
  Files        1364     1374       +10     
  Lines      167705   176486     +8781     
  Branches     3031     2538      -493     
===========================================
+ Hits        65646   143559    +77913     
+ Misses     101592    32448    -69144     
- Partials      467      479       +12     
Flag Coverage Δ
python ?
rust ?

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.


match self.dtype() {
Boolean => self.cast(&Float64).unwrap().agg_mean(groups),
Float32 => SeriesWrap(self.f32().unwrap().clone()).agg_mean(groups),
Float64 => SeriesWrap(self.f64().unwrap().clone()).agg_mean(groups),
dt if dt.is_numeric() => apply_method_physical_integer!(self, agg_mean, groups),
#[cfg(feature = "dtype-date")]
Date => temporal_mean(
&(self.cast(&Int64).unwrap() * (MS_IN_DAY as f64)),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we cast this after the aggregation. That should be much cheaper.

dt if dt.is_numeric() => apply_method_physical_integer!(self, agg_median, groups),
#[cfg(feature = "dtype-date")]
Date => temporal_median(
&(self.cast(&Int64).unwrap() * (MS_IN_DAY as f64)),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Idem

@@ -54,6 +54,7 @@ pub(crate) enum AggregateFunction {
SumU64(SumAgg<u64>),
SumI32(SumAgg<i32>),
SumI64(SumAgg<i64>),
MeanDate(MeanAgg<i64>),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we remove this for now. We are redesigning the streaming engine and don't want to add stuff atm.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, it all felt ugly to me anyway.

@ritchie46
Copy link
Member

Thank you @mcrumiller. I have left a few comments. I think we should remove the streaming engine for now. That's something I want to redesign.

@mcrumiller
Copy link
Contributor Author

mcrumiller commented Apr 26, 2024

@ritchie46 removing the streaming code ends up breaking date.mean in streaming, since the streaming engine uses different logic to calculate the mean, but still uses the schema logic to cast the result to a Datetime. The result is that the date's mean value gets directly cast to an i64 with no conversion, so the streaming mean always shows something like 1970-01-01 00:42:63 or some random time (i.e. interpreting the date's value as a datetime's value).

Should I try to find a workaround, or should I remove the date mean/median from the streaming tests?

@ritchie46
Copy link
Member

After thinking of it a bit. We should remove the physical implementations (we could panic there) and do all the casts on the logical side. During type-coercion we can set the proper casts. This will ensure it works on all physical engines. And we panic on the physical side to ensure we don't hit those paths

@mcrumiller
Copy link
Contributor Author

mcrumiller commented Apr 29, 2024

After thinking of it a bit. We should remove the physical implementations (we could panic there) and do all the casts on the logical side. During type-coercion we can set the proper casts. This will ensure it works on all physical engines. And we panic on the physical side to ensure we don't hit those paths

Thanks @ritchie46. I may need a little help here though. Are you referring to logical and physical dtypes (I think you are), or logical plan versus physical plan? Also, this sort of sounds like it might be better off as a separate PR, do you agree?

One of the issues I see is that different logical types have different physical implementations than their pure physical counterpart (at least, now Date does--I can see Time restricting the maximum value in sum and such for others perhaps).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or an improvement of an existing feature python Related to Python Polars rust Related to Rust Polars
Projects
None yet
Development

Successfully merging this pull request may close these issues.

mean and median for pl.Date should return a datetime
2 participants