Update compute_dict_like to get all columns #58452

Open
wants to merge 10 commits into
base: main

Conversation

undermyumbrella1
Contributor

@undermyumbrella1 undermyumbrella1 commented Apr 27, 2024

@undermyumbrella1 undermyumbrella1 marked this pull request as draft April 27, 2024 15:29
@undermyumbrella1 undermyumbrella1 force-pushed the fix/grouby_agg_dict_input_dup_columns branch from 75f593e to e8f3172 Compare April 28, 2024 14:36
@undermyumbrella1 undermyumbrella1 force-pushed the fix/grouby_agg_dict_input_dup_columns branch from b2ee3b4 to 79a8ea6 Compare April 28, 2024 15:36
@undermyumbrella1 undermyumbrella1 marked this pull request as ready for review April 28, 2024 16:26
pandas/core/apply.py (outdated)
Comment on lines 479 to 486
for key, how in func.items():
    for index in range(df.shape[1]):
        col = df.iloc[:, index]
        if col.name != key:
            continue

        series = obj._gotitem(key, ndim=1, subset=col)
        result = getattr(series, op_name)(how, **kwargs)
Member

Can you see if the following works instead:

subset = selected_obj[[key]]
subobj = obj._gotitem(key, ndim=2, subset=subset)
result = getattr(subobj, op_name)(how, **kwargs)
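For context, here is a minimal sketch of what list-based selection returns when a column label is duplicated. The frame below is hypothetical and uses only public pandas indexing; it does not exercise the internal `_gotitem` path discussed in this thread.

```python
import pandas as pd

# Hypothetical frame in which the label "b" appears twice.
df = pd.DataFrame(
    [[1, 2, 3], [1, 3, 4], [2, 4, 5]],
    columns=["a", "b", "b"],
)

# Selecting with a list keeps every column whose label matches the key,
# so the subset is a two-column DataFrame rather than a single Series.
subset = df[["b"]]
print(subset.shape)
print(list(subset.columns))

# Scalar selection also returns a DataFrame when the label is duplicated.
print(type(df["b"]).__name__)
```

This is why a 2-dimensional subset can carry all same-named columns in one object, whereas a Series-based approach has to visit them one at a time.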

Contributor Author

Calling getattr on the DataFrame subset raises an exception in this case:

df = pd.DataFrame(
    [[1, 2, 3], [1, 3, 4], [2, 4, 5]],
    columns=["a", "b", "c"],
)

gb = df.groupby("a")
result = gb.agg({"c": ["sum", "min", "max", "min"]})
print(result)

This is due to the following check in reconstruct_func, which DataFrameGroupBy uses:

if not relabeling:
    if isinstance(func, list) and len(func) > len(set(func)):
        # GH 28426 will raise error if duplicated function names are used and
        # there is no reassigned name
        raise SpecificationError(
            "Function names must be unique if there is no new column names "
            "assigned"
        )
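The check above can be reproduced with public API alone. The sketch below assumes a small frame like the one in the comment and passes the duplicated list directly to a DataFrameGroupBy selection; it only illustrates the error and is not the code path changed in this PR.

```python
import pandas as pd
from pandas.errors import SpecificationError

df = pd.DataFrame(
    [[1, 2, 3], [1, 3, 4], [2, 4, 5]],
    columns=["a", "b", "c"],
)
gb = df.groupby("a")

# gb[["c"]] is a DataFrameGroupBy, so a list containing "min" twice
# trips the duplicate-name check and raises SpecificationError.
try:
    gb[["c"]].agg(["sum", "min", "max", "min"])
    raised = False
except SpecificationError as err:
    raised = True
    print(f"SpecificationError: {err}")
```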

pandas/core/apply.py
@undermyumbrella1 undermyumbrella1 force-pushed the fix/grouby_agg_dict_input_dup_columns branch from c0ddcdd to 31b05af Compare May 2, 2024 09:15
@undermyumbrella1
Contributor Author

Thank you for the review. I have added comments as requested.

@rhshadrach
Member

Thanks for the patience here @undermyumbrella1 - I should be able to get to this by Friday.

Member

@rhshadrach rhshadrach left a comment

The logic in the for loop that I was trying to change looks good. In general, it would be better for performance if we didn't have to split a DataFrame into Series, but I don't see a way to avoid that while still flexibly handling all the cases we need to support.
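As a rough illustration of the per-column approach, the sketch below walks the columns by position and aggregates each matching Series separately. It assumes a hypothetical frame with a duplicated label and uses only public pandas calls (`iloc`, `Series.groupby`, `pd.concat`) in place of the internal `_gotitem` machinery, so it mirrors the idea rather than the actual implementation.

```python
import pandas as pd

# Hypothetical frame in which the label "b" appears twice.
df = pd.DataFrame(
    [[1, 2, 3], [1, 3, 4], [2, 4, 5]],
    columns=["a", "b", "b"],
)
func = {"b": "sum"}
group_keys = df.iloc[:, 0]  # the grouping column "a", taken by position

results = []
for key, how in func.items():
    # Iterate by position so that each same-named column is visited once.
    for index in range(df.shape[1]):
        col = df.iloc[:, index]
        if col.name != key:
            continue
        # Aggregate this single column as a Series.
        results.append(col.groupby(group_keys).agg(how))

# Both aggregated Series are named "b", so the result keeps duplicate labels.
result = pd.concat(results, axis=1)
print(result)
```

Each matching column costs one extra groupby aggregation here, which is the performance trade-off mentioned above.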

),
)
gb = df.groupby(level=0)
result = gb.agg({("level1.1", "level2.2"): "min"})
Member

Can you add one more test like this, but with a list instead, e.g. ["min", "max"]?

Successfully merging this pull request may close these issues.

BUG: groupby.agg fails when input has duplicate columns and dict input
3 participants