Revisit spreading for `AsTable` output #3408

Open
pdeffebach opened this issue Dec 13, 2023 · 6 comments
@pdeffebach
Contributor

Here is a use-case where DataFrames.jl compares unfavorably to dplyr.

Basically, the best way to do inter-dependent column transformations is to use AsTable and return a NamedTuple. However, in the OP's example, they want to return a scalar and a vector simultaneously, so AsTable isn't an option without some awkward manual spreading of the scalar.

Maybe we can spread scalar outputs when an AsTable is the dest?
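The "awkward spreading" mentioned above could look roughly like the following sketch. The data frame and the column names `total` and `frac` are illustrative assumptions; the point is only that the scalar has to be repeated by hand (here with `fill`) so that every field of the NamedTuple is a vector of the group's length.

```julia
using DataFrames

# Illustrative data (assumed for this sketch).
df = DataFrame(id=repeat([1, 2], 5), val=1:10)

# Every field of the returned NamedTuple must be a vector, so the scalar sum
# is repeated manually to match the number of rows in the group.
res = combine(groupby(df, :id),
              :val => (x -> (s = sum(x);
                             (total=fill(s, length(x)), frac=x ./ s))) => AsTable)
```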

@bkamins
Member

bkamins commented Dec 14, 2023

> and return a NamedTuple.

The design is to:

  • return a NamedTuple if you do not want pseudo broadcasting;
  • return a DataFrame if you do want pseudo broadcasting.

Example:

julia> df = DataFrame(id=repeat([1, 2], 5), val=1:10)
10×2 DataFrame
 Row │ id     val
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     2      2
   3 │     1      3
   4 │     2      4
   5 │     1      5
   6 │     2      6
   7 │     1      7
   8 │     2      8
   9 │     1      9
  10 │     2     10

julia> combine(groupby(df, :id), :val => (x -> (s=sum(x); DataFrame(total=s, frac=x ./ s))) => AsTable)
10×3 DataFrame
 Row │ id     total  frac
     │ Int64  Int64  Float64
─────┼─────────────────────────
   1 │     1     25  0.04
   2 │     1     25  0.12
   3 │     1     25  0.2
   4 │     1     25  0.28
   5 │     1     25  0.36
   6 │     2     30  0.0666667
   7 │     2     30  0.133333
   8 │     2     30  0.2
   9 │     2     30  0.266667
  10 │     2     30  0.333333

The question is how to reflect this in DataFramesMeta.jl.

@pdeffebach
Contributor Author

Hmmm... I don't love the performance hit that would come with constructing a DataFrame. With the current implementation of the @astable macro-flag, I would have to decide whether it returns a DataFrame or a NamedTuple.

I wonder if it's best for DataFramesMeta.jl to do the broadcasting on its own.

However, your response is a bit confusing:

> return a NamedTuple if you do not want pseudo broadcasting;

since currently returning (a = 1, b = [4, 5, 6]) throws an error. So it's not a broadcasting behavior you can opt in to or out of.
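A small sketch of the behavior being discussed, using an assumed three-row data frame: the mixed scalar/vector NamedTuple is rejected, while the same fields passed through the `DataFrame` constructor have the scalar repeated. The error is caught generically here rather than matched against a specific exception type.

```julia
using DataFrames

# Illustrative data (assumed for this sketch).
df = DataFrame(x=1:3)

# A NamedTuple mixing a scalar and a vector is not broadcast; per the thread,
# this call throws.
threw = try
    combine(df, :x => (x -> (a=1, b=[4, 5, 6])) => AsTable)
    false
catch err
    true
end

# The DataFrame constructor does the pseudo broadcasting of the scalar `a`.
res = combine(df, :x => (x -> DataFrame(a=1, b=[4, 5, 6])) => AsTable)
```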

@bkamins
Member

bkamins commented Dec 14, 2023

> throws an error.

Yes, because NamedTuple does not do broadcasting (as opposed to DataFrame).

@bkamins
Member

bkamins commented Dec 14, 2023

Also, I think that the performance hit, although noticeable, would be a minor issue for most users. The major issue is, as you write, that @astable has to have only one meaning. Maybe @asdf or @asdataframe would be an alternative name?

@pdeffebach
Contributor Author

Are there other advantages of making a DataFrame inside fun?

I don't really want to have to introduce both @astable and @asdataframe into the docs and tutorials if the only difference is the spreading behavior. I would just as soon do some spreading inside the anonymous function instead.
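As a rough sketch of what "spreading inside the anonymous function" might look like, the helper below repeats scalar fields of a NamedTuple to the length of its vector fields. The name `spread` is hypothetical; it is not part of DataFrames.jl or DataFramesMeta.jl, and this is not how @astable is implemented.

```julia
using DataFrames

# Hypothetical helper (an assumption for illustration only): repeat every
# non-vector field so that all fields have the length of the vector fields.
function spread(nt::NamedTuple)
    lens = [length(v) for v in values(nt) if v isa AbstractVector]
    isempty(lens) && return nt          # all scalars: nothing to spread
    n = first(lens)
    all(==(n), lens) || throw(DimensionMismatch("vector fields differ in length"))
    return map(v -> v isa AbstractVector ? v : fill(v, n), nt)
end

# Illustrative data (assumed for this sketch).
df = DataFrame(id=repeat([1, 2], 5), val=1:10)

# The function can now return a scalar and a vector together.
res = combine(groupby(df, :id),
              :val => (x -> spread((total=sum(x), frac=x ./ sum(x)))) => AsTable)
```

This keeps the return value a NamedTuple of vectors, so no intermediate DataFrame is constructed inside the function.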

@bkamins
Member

bkamins commented Dec 15, 2023

> Are there other advantages of making a DataFrame inside fun?

I do not think so. The only other one would be making column names unique, but I guess that is not an issue in your case.
