Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

append! fails when a column references another #3376

Open
kafisatz opened this issue Sep 4, 2023 · 13 comments
Open

append! fails when a column references another #3376

kafisatz opened this issue Sep 4, 2023 · 13 comments

Comments

@kafisatz
Copy link

kafisatz commented Sep 4, 2023

Hi

Please see the example below.
The way append! is implemented this fails.

I searched for existing issues, as I expected this has come up in the past (and maybe the behavior is intentional?

using DataFrames
df=DataFrame(a=1:10)

#works
    df.b = deepcopy(df.a)
    append!(df,deepcopy(df))

#works
    df.c = 4*df.a #it would think it is common to assing new columns like this
    append!(df,deepcopy(df))

#fails
    df.c = df.a #I guess this line does NOT make a copy of df.a whereas 4*df.a does?
    append!(df,deepcopy(df))


@bkamins
Copy link
Member

bkamins commented Sep 4, 2023

the fact that df.c = df.a does not make a copy is documented.

The question is, what would you expect/prefer:

  • throwing error (because you are trying to append! to a data frame with aliased columns)
  • performing un-aliasing and making append! work

@nalimilan - what do you think?

@bkamins bkamins added this to the 1.7 milestone Sep 4, 2023
@kafisatz
Copy link
Author

kafisatz commented Sep 4, 2023

The question is, what would you expect/prefer:

  • throwing error (because you are trying to append! to a data frame with aliased columns)
  • performing un-aliasing and making append! work

I have no preference.

@bkamins
Copy link
Member

bkamins commented Sep 4, 2023

@nalimilan - so we currently do the former (throw an error). Do you think we should change this and de-alias in append!/push! and related? (in fact it would make the code simpler, as currently we have a quite complex recovery code in case of alias)

@nalimilan
Copy link
Member

I would find it a bit weird for append! to dealias columns. For example if you did c = df.c before calling append! then it would no longer be the same object. This may can unexpected as it's completely unrelated to the goal of the append! operation.

If we wanted to fix this, it would sound more logical to have df.c = df.a make a copy when aliasing is detected. But that would be slightly breaking.

@bkamins
Copy link
Member

bkamins commented Sep 9, 2023

it would sound more logical to have df.c = df.a make a copy when aliasing is detected

This is a point to consider ideed.
I wonder if anyone could be affected by such behavior. This, in general, would mean that we probably should do de-aliasing in general when changing the columns of a data frame. The question is if we see scenarios in which someone might actively want to store aliased columns in a data frame?

@nalimilan
Copy link
Member

I imagine there can be valid use cases for this, like if you need to store columns which could have different values but happen to be equal under different names. But these sound really marginal, and potentially dangerous. We could ask on Slack.

@jariji
Copy link
Contributor

jariji commented Sep 9, 2023

the fact that df.c = df.a does not make a copy is documented.

This settles it for me. A simple rule is better than a complicated one.

@bkamins
Copy link
Member

bkamins commented Sep 9, 2023

OK. I asked on Slack. Now a comment on the design of this if we wanted de-aliasing. We would keep an IdDict of columns in a DataFrame structure. This would allow us to quickly identify if column is aliased and copy it if needed.

This settles it for me. A simple rule is better than a complicated one.

This was an initial idea - to have a simple rule. But we want with @nalimilan to doublecheck what users prefer. Also - I will probably make an update to the docs warning about aliasing even stronger than we do now if we keep current behavior.

@bkamins
Copy link
Member

bkamins commented Sep 9, 2023

Comment from Slack by: maybe we could somehow indicate aliased columns when showing data frame.

@adienes
Copy link

adienes commented Sep 11, 2023

I think it's good to stay simple/consistent. that df.a = df.b aliases can be a bit surprising, but once you know about it it's fine. I think it would be more surprising if I construct my columns in a way such that I expect them to share memory, but somehow during processing they de-alias. If this option to create a copy and de-alias is implemented, I think it should at least emit a warning to the user.

That said, the usage of append! and push! in the OP could be made to work unambiguously (without de-aliasing) if the new items to be added into the aliased columns are themselves ===

maybe we could somehow indicate aliased columns when showing data frame.

this does generally seems nice, independently of whatever is decided here

@nalimilan
Copy link
Member

I'm not sure indicating aliases when printing will help a lot since with large data frames, most columns are not visible. Could we just improve append! so that in case of errors it checks for the presence of aliases and reports that?

@bkamins
Copy link
Member

bkamins commented Sep 11, 2023

Could we just improve append! so that in case of errors it checks for the presence of aliases and reports that?

This is a good idea. I will implement this

Also for push! and related functions.

@bkamins
Copy link
Member

bkamins commented Sep 15, 2023

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants