Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle split_out=None deprecation warning to drop_duplicates #10612

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

charlesbluca
Copy link
Member

Looks like we didn't add handling for split_out=None with #10542, causing us to get an error like:

shuffle = None, split_out = None

    def _determine_split_out_shuffle(shuffle, split_out):
        """Determine the default shuffle behavior based on split_out"""
        if shuffle is None:
>           if split_out > 1:
E           TypeError: '>' not supported between instances of 'NoneType' and 'int'

/datasets/charlesb/micromamba/envs/dask-sql-py39/lib/python3.9/site-packages/dask/dataframe/core.py:202: TypeError

This PR adds a warning for the deprecated use and falls back to the default of split_out=1

cc @rjzamora

  • Closes #xxxx
  • Tests added / passed
  • Passes pre-commit run --all-files

Copy link
Member

@rjzamora rjzamora left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix @charlesbluca ! Sorry for missing this.

It looks like the original split_out=None deprecation warning was added to groupby.agg more than a year ago (#9491). Unfortunately, it seems like we have not been warning for other APIs (like drop_duplicates).

Questions:

  1. Should we try to find a simple way to add a temporary deprecation warning for all APIs using split_out?
  2. Should we just move ahead with the deprecation now, and start raising a clear TypeError in _determine_split_out_shuffle?
  • Related question: Is there a reason that dask-sql needs to use split_out=None for now? If so, how much time do you think is needed to address the deprecation?

@charlesbluca
Copy link
Member Author

Should we try to find a simple way to add a temporary deprecation warning for all APIs using split_out?

I think that makes sense - it seems like we have a few unified utility functions used to check and control shuffle/split_out, maybe we can move this logic to one of those functions so we don't have to deal with the burden of adding a unique check to each relevant API?

Related question: Is there a reason that dask-sql needs to use split_out=None for now? If so, how much time do you think is needed to address the deprecation?

Though I caught this in dask-sql's test suite, don't think this should require much work to address - this was just a single test asserting some behavior based on split_out that happened to use the old default. I'm generally okay with moving forward with the deprecation, as it has been a while, though maybe other maintainers have context on why we haven't moved forward yet (cc @jrbourbeau)

@charlesbluca charlesbluca marked this pull request as ready for review November 3, 2023 14:43
@github-actions github-actions bot added the needs attention It's been a while since this was pushed on. Needs attention from the owner or a maintainer. label Dec 4, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
dataframe needs attention It's been a while since this was pushed on. Needs attention from the owner or a maintainer.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants