Dask dataframe's divisions and index mismatch #10460

simonykq · 2023-08-23T21:28:18Z

I understand that a Dask DataFrame's divisions act like an index over the pandas indexes: they track which index values each partition contains. But what happens if the user changes the partitions' index inside a map_partitions or groupby.apply() call?

Consider the following example:

import dask

# here dask is aligned and partitioned perfectly on `timestamp` as index
ddf = dask.datasets.timeseries(start="2021-01-01", end="2021-01-07", freq="1H")
ddf2 = ddf.map_partitions(lambda df: df.set_index("name"))

print(ddf2.known_divisions)
print(ddf2.divisions)
print(ddf2.index.compute())

Here, ddf2 still has timestamp values as its divisions, but the index of every partition has actually been replaced by the name column. What would happen if I used ddf2 to perform groupby(ddf2.index.name) or to join with other dataframes? Would a shuffle take place?

The same goes for a groupby.apply() call. If I group by the index column and then call apply(), the result should keep the original index and divisions, as per #2999. However, what if I am really naughty and change the index of each grouped dataframe inside apply()?

ddf_apply = ddf.groupby(ddf.index.name).apply(lambda df: df.reset_index())

print(ddf_apply.known_divisions)
print(ddf_apply.divisions)
print(ddf_apply.index.compute())

Here ddf_apply still seems to have the original timestamp divisions, but its index has actually been reset and the dask dataframe doesn't seem to know about it. So the same question applies: what would happen if I used ddf_apply in a merge, a join, or another groupby(ddf_apply.index.name)? Would a shuffle happen if the divisions it reports no longer reflect the actual index of the partitions' dataframes under the hood?
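To make the mismatch concrete, here is a minimal sketch of a check that compares the divisions a collection reports against the index values each partition really holds. It relies on the usual divisions convention: npartitions + 1 boundaries, each partition covering [divisions[i], divisions[i + 1]), with the last partition also including its upper boundary. Both function names, divisions_consistent and partition_index_bounds, are hypothetical helpers written for this post and are not part of dask.

```python
def divisions_consistent(divisions, partition_bounds):
    """Return True if every partition's index range fits its division interval.

    divisions        -- tuple of npartitions + 1 boundary values
    partition_bounds -- one (min_index, max_index) pair per partition,
                        or None for an empty partition
    """
    if len(divisions) != len(partition_bounds) + 1:
        return False
    last = len(partition_bounds) - 1
    for i, bounds in enumerate(partition_bounds):
        if bounds is None:  # empty partition: nothing to check
            continue
        lo, hi = bounds
        try:
            if lo < divisions[i]:
                return False
            # The upper boundary is exclusive, except for the last partition.
            if hi > divisions[i + 1] or (i != last and hi == divisions[i + 1]):
                return False
        except TypeError:
            # Values that cannot be ordered against the divisions
            # (e.g. strings vs. timestamps, or unknown divisions stored
            # as None) are reported as a mismatch.
            return False
    return True


def partition_index_bounds(ddf):
    """Hypothetical helper: (min, max) of the real index in each partition."""
    import dask  # imported here so the checker above stays dependency-free

    def bounds(df):
        return (df.index.min(), df.index.max()) if len(df) else None

    # One delayed task per partition; only the two boundary values come back.
    return list(dask.compute(*[dask.delayed(bounds)(p) for p in ddf.to_delayed()]))
```

With the examples above, the call would be divisions_consistent(ddf2.divisions, partition_index_bounds(ddf2)), and likewise for ddf_apply. As far as I can tell from the map_partitions docstring, dask assumes the index and divisions are unchanged by the mapped function, and the suggested remedy when the function does change them is to call clear_divisions() on the result so that later operations stop trusting the stale boundaries.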
