
Issue repartitioning a time series by frequency when loaded from parquet file #10949

Open
pvaezi opened this issue Feb 24, 2024 · 5 comments

pvaezi commented Feb 24, 2024

Describe the issue:

When loading a parquet file that has a datetime index, I can't repartition based on frequency; I get the following error:

Traceback (most recent call last):
  File "/Users/.../gitrepos/dask-exp/test_dask_issue.py", line 19, in <module>
    df2 = df2.repartition(freq="1D")
  File "/Users/.../miniconda3/envs/dask/lib/python3.10/site-packages/dask_expr/_collection.py", line 1184, in repartition
    raise TypeError("Can only repartition on frequency for timeseries")
TypeError: Can only repartition on frequency for timeseries

This happens despite the dataframe loaded from the parquet file having a datetime64[ns] index dtype.

Note that a dataframe created with the time series generator can be repartitioned by frequency.

Minimal Complete Verifiable Example:

import dask
dask.config.set({'dataframe.query-planning': True})
import dask.dataframe as dd

# Generate an hourly time series for the year 2000, repartition it to
# month-end frequency, and write it to parquet.
df1 = dask.datasets.timeseries(
    start="2000",
    end="2001",
    freq="1h",
    seed=1,
)
df1 = df1.repartition(freq="1ME")
df1.to_parquet("test")

# Read the data back; the index dtype is datetime64[ns], but repartitioning
# by frequency raises the TypeError shown above.
df2 = dd.read_parquet(
    "test/*parquet",
    index="timestamp",
    columns=["x", "y"]
)
print(df2.index.dtype)
df2 = df2.repartition(freq="1D")
print(df2.compute())

Anything else we need to know?:

I'm looking to repartition data loaded from parquet for efficient time-series queries. The default partitioning results in larger-than-needed memory usage.

Environment:

  • Dask version: 2024.2.1
  • Python version: 3.10.13
  • Operating System: Mac OSX
  • Install method (conda, pip, source): pip

phofl commented Feb 24, 2024

Hi,

thanks for your report. This doesn't work because the divisions of the DataFrame are unknown after read_parquet. That means we can't efficiently repartition by frequency without scanning the whole Index. I am not against making this work more reliably in the future, though.
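
For illustration, this can be seen directly on the reproducer above (a minimal sketch, reusing the same "test" dataset): the divisions are unreported after read_parquet.

import dask.dataframe as dd

df2 = dd.read_parquet("test/*parquet", index="timestamp", columns=["x", "y"])
# Without a statistics scan the index divisions are unknown, so there is no
# boundary information for repartition(freq=...) to work with.
print(df2.known_divisions)  # False
print(df2.divisions)        # (None, None, ..., None)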

I have a question though: this doesn't work for me with the current dask.dataframe implementation either if I disable query planning. Does it work for you?


pvaezi commented Feb 24, 2024

Thanks for the response. I tried without query planning and still hit the same issue.

It does make sense that the index needs to be scanned fully to make repartitioning effective. Can you suggest workarounds? For context, I'm trying to resample the dataframe by day and aggregate; if the partitions are divided by day, my desired resampling and aggregations become much easier and less memory intensive.


phofl commented Feb 24, 2024

The most effective way depends a little on where you are reading the data from. You can set calculate_divisions=True in the read_parquet call; this will populate the divisions and enable your repartitioning.

The scan can be expensive, though; that depends on how many files are in your dataset and where your data is stored (e.g. local or remote like S3). But it will get you there.
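
Applied to the reproducer above, that looks roughly like this (a sketch reusing the same "test" path and columns):

import dask.dataframe as dd

df2 = dd.read_parquet(
    "test/*parquet",
    index="timestamp",
    columns=["x", "y"],
    # Read the parquet statistics so the index min/max of each file is known.
    calculate_divisions=True,
)
print(df2.known_divisions)  # True once the statistics have been read
df2 = df2.repartition(freq="1D")  # now allowed, since the divisions are known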


pvaezi commented Feb 29, 2024

Thanks. In general I'm looking to replicate a SQL query like the one below with Dask, and I've had trouble with the memory consumption of the Dask workers. If you can guide me on how to properly use resampling with a timestamp column, that would be great:

   SELECT
        TIME_BUCKET(INTERVAL 1 DAY, timestamp) AS timestamp,
        col1,
        col2,
        col3,
        sum(col4),
        avg(col5)
    FROM df
    GROUP BY TIME_BUCKET(INTERVAL 1 DAY, timestamp), col1, col2, col3;


phofl commented Apr 4, 2024

I think you have to calculate the divisions at some point if you want to resample by day; there isn't really a way around this, since we need that information.

I'm not sure you have to resample, though: you could use the dt accessor on your timestamp column and round it to day accuracy. Then you can do a regular groupby instead of resampling, which doesn't need to know anything about the divisions; see the sketch below.
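
A minimal sketch of that approach against the SQL above (column names taken from the query; df is assumed to be a Dask DataFrame with a timestamp column, and the exact truncation call is illustrative):

# Truncate the timestamp to day precision with the dt accessor, then run a
# plain groupby aggregation; this does not require known divisions.
df["day"] = df["timestamp"].dt.floor("1D")
result = (
    df.groupby(["day", "col1", "col2", "col3"])
    .agg({"col4": "sum", "col5": "mean"})
    .compute()
)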

Is that helpful?
