I know that a dask DataFrame's divisions act like an index over the pandas indexes: they track which index values each partition contains. But what if the user changes the partitions' index inside a map_partitions or a groupby.apply() call?
Consider the following example:
import dask
# here the dask DataFrame is partitioned on `timestamp` as its index, with known divisions
ddf = dask.datasets.timeseries(start="2021-01-01", end="2021-01-07", freq="1H")
# replace the index inside each partition; dask is not told about the change
ddf2 = ddf.map_partitions(lambda df: df.set_index("name"))
print(ddf2.known_divisions)
print(ddf2.divisions)
print(ddf2.index.compute())
Here, ddf2 still reports timestamp-based divisions, but every partition's index has actually been replaced by the name column. What would happen if I used ddf2 to perform groupby(ddf2.index.name) or to join with other dataframes? Would a shuffle take place?
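To make the concern concrete, here is a toy, pure-Python sketch of how a divisions-based lookup could route a key to a partition. It is not dask's actual code: `partition_for`, `original` and `mutated` are hypothetical names, and the partitions are modelled as plain lists of index values. It only assumes that divisions are sorted boundaries with one more entry than there are partitions, and that the last partition includes its upper bound.

```python
from bisect import bisect_right


def partition_for(divisions, key):
    """Return the partition number that `divisions` claims holds `key` (toy model)."""
    if key < divisions[0] or key > divisions[-1]:
        return None  # outside the range the divisions describe
    # The last partition includes its upper bound, so clamp to the final partition.
    return min(bisect_right(divisions, key) - 1, len(divisions) - 2)


# Three partitions: [0, 10), [10, 20), [20, 30]
divisions = (0, 10, 20, 30)

# Index values per partition while the divisions are still accurate.
original = [[0, 5, 9], [10, 12, 19], [20, 25, 30]]

# Index values after each partition's index was swapped out behind dask's back.
mutated = [[25, 3, 14], [1, 28, 7], [12, 0, 19]]

part = partition_for(divisions, 25)
print(part, 25 in original[part], 25 in mutated[part])
```

In this toy model the key 25 is still routed to the last partition, where it used to live, even though after the mutation it sits in the first one. Any operation that trusts the divisions would look in the wrong place.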
The same goes for a groupby.apply() call. If I group by the index column and then call apply(), the result should keep its original index and divisions, as per #2999. However, what if I am really naughty and change the index of each grouped dataframe inside apply()?
Here ddf_apply still seems to have the original timestamp divisions, but its index has actually been reset, and dask does not seem to know about it. So the same question applies: what would happen if I used ddf_apply in a merge, a join, or another groupby(ddf_apply.index.name)? Would a shuffle happen if the divisions dask reports no longer reflect the actual index of the underlying partitions?
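One way to phrase the question is as a consistency check between the divisions and the real index of each partition. The sketch below is again a pure-Python toy model, not dask's implementation, and `divisions_are_consistent` is a hypothetical helper. It assumes every partition except the last covers a half-open range, and that the last one is closed on both ends.

```python
def divisions_are_consistent(divisions, partition_indexes):
    """Check that every index value lies inside the range its partition's divisions claim."""
    if len(divisions) != len(partition_indexes) + 1:
        return False
    last = len(partition_indexes) - 1
    for i, index in enumerate(partition_indexes):
        lo, hi = divisions[i], divisions[i + 1]
        for value in index:
            # Only the final partition includes its upper boundary.
            in_range = lo <= value <= hi if i == last else lo <= value < hi
            if not in_range:
                return False
    return True


divisions = (0, 10, 20, 30)
untouched = [[0, 5, 9], [10, 12, 19], [20, 25, 30]]
index_replaced = [[25, 3, 14], [1, 28, 7], [12, 0, 19]]

print(divisions_are_consistent(divisions, untouched))
print(divisions_are_consistent(divisions, index_replaced))
```

As far as I know, dask does not run a check like this automatically after map_partitions or apply. If the index really has been changed, calling clear_divisions() on the result looks like the way to tell dask to stop trusting the old divisions.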