ensure drop_duplicates with keep is stable #10722

fjetter · 2023-12-19T10:59:53Z

phofl · 2023-12-19T13:31:17Z

This isn't stable, unfortunately. We can't rely on the index of our original df being sorted:

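import numpy as np
import pandas as pd
import dask.dataframe as dd
from dask.dataframe.utils import assert_eq
from distributed import Client

if __name__ == "__main__":
    client = Client()
    # The index repeats 0..29 twice, so it is neither unique nor sorted.
    pdf = pd.DataFrame(
        {"x": [1, 2, 3, 4, 5, 6] * 10, "y": np.random.randint(1, 1_000_000, (60,))},
        index=pd.Series(list(range(0, 30)) * 2),
    )
    df = dd.from_pandas(pdf, npartitions=2, sort=False)
    result_pd = pdf.drop_duplicates(subset=["x"], keep="first")
    result_dd = df.drop_duplicates(
        subset=["x"],
        keep="first",
        split_out=df.npartitions,
        shuffle="p2p",
    ).compute()
    # Compare the shuffle-based result against plain pandas.
    assert_eq(result_dd, result_pd)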
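To see why the index cannot act as the ordering key here, a pandas-only sketch may help. It assumes a deterministic `y` and uses a reversed frame as a stand-in for whatever row order a shuffle might produce; neither is part of the reproducer above.

```python
import numpy as np
import pandas as pd

# Same shape as the reproducer, but with a deterministic y so rows are identifiable.
pdf = pd.DataFrame(
    {"x": [1, 2, 3, 4, 5, 6] * 10, "y": np.arange(60)},
    index=list(range(30)) * 2,
)
expected = pdf.drop_duplicates(subset=["x"], keep="first")

# Stand-in for a shuffle: any reordering of the rows (assumption for illustration).
reordered = pdf.iloc[::-1]

# Sorting by the index cannot restore the original order, because every index
# label occurs twice and a stable sort keeps the tied rows in shuffled order.
recovered = reordered.sort_index(kind="mergesort").drop_duplicates(
    subset=["x"], keep="first"
)

print(expected["y"].tolist())
print(recovered["y"].tolist())
```

The two lists differ: the second one holds `y` values from rows in the second half of the frame, which share index labels with the rows pandas keeps.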

rjzamora · 2023-12-19T14:51:30Z

> This isn't stable unfortunately, we can't rely on the index of our original df being sorted

Agreed. We should be able to do this if we have sorted divisions. Otherwise, we would need to add a temporary "__order" column.

fjetter · 2023-12-19T15:24:29Z

Yes, you are right, of course.

> Otherwise, we would need to add a temporary "__order" column.

I thought about this for a moment but couldn't come up with something that would actually give me this. We'd need an ordering within each partition (can be done with a reset_index) and a global ordering across partitions (the partition index). Is there an easy way to assign a column with the partition index?

rjzamora · 2023-12-19T15:40:17Z

> Is there an easy way to assign a column with the partition index?

Easy-ish:

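from dask.blockwise import BlockIndex

def _add_partid(x, val):
    # val is this partition's block index, a tuple such as (0,)
    return x.assign(__partid=val[0])

# BlockIndex hands each partition its own position along the partition axis
df2 = df.map_partitions(_add_partid, BlockIndex((df.npartitions,)))
df2.head()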

   x  y  __partid
0  1  a         0
0  1  a         0
1  2  b         0
1  2  b         0
2  3  d         0
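Combining such a partition id with a row position inside each partition gives the two-level ordering discussed above. A pandas-only sketch of the idea follows; it is not what the PR implements. The `__pos` column, the `tag_partition` helper, and the use of `sample` as a stand-in for a shuffle are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

pdf = pd.DataFrame(
    {"x": [1, 2, 3, 4, 5, 6] * 10, "y": np.arange(60)},
    index=list(range(30)) * 2,
)

def tag_partition(part, partid):
    # Hypothetical helper: record the partition id and the row position in it.
    out = part.copy()
    out["__partid"] = partid
    out["__pos"] = np.arange(len(out))
    return out

# Two "partitions", mimicking npartitions=2.
parts = [tag_partition(pdf.iloc[:30], 0), tag_partition(pdf.iloc[30:], 1)]

# Stand-in for a shuffle: scramble the rows across partitions.
scrambled = pd.concat(parts).sample(frac=1, random_state=0)

# (__partid, __pos) is unique per row, so sorting on it restores the input order.
restored = scrambled.sort_values(["__partid", "__pos"])
result = restored.drop_duplicates(subset=["x"], keep="first").drop(
    columns=["__partid", "__pos"]
)

expected = pdf.drop_duplicates(subset=["x"], keep="first")
pd.testing.assert_frame_equal(result, expected)
```

Unlike the index, the pair of temporary columns identifies every row uniquely, which is why the final comparison with pandas passes regardless of how the rows were scrambled.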

fjetter · 2023-12-19T16:38:12Z

> BlockIndex

Interesting, thank you.

ensure drop_duplicates keep is stable

506e070

This was referenced Dec 19, 2023

Shuffle-based drop duplicates produces incorrect result with shuffle="p2p" #10708

Closed

Do not run gilknocker in testsuite dask/distributed#8423

Merged

test_split_adaptive_aggregate_files failing on main #10721

Open

fjetter changed the title from "ensure drop_duplicates keep is stable" to "ensure drop_duplicates with keep is stable" on Dec 19, 2023

fjetter marked this pull request as draft December 19, 2023 15:24
