column names of aggregated DataFrame with anonymous functions #1276

Closed
pfarndt opened this issue Nov 14, 2017 · 5 comments

@pfarndt

pfarndt commented Nov 14, 2017

When using an anonymous function to generate an aggregated DataFrame, the column names are not reproducible and are inconvenient for further use:

using DataFrames

d = DataFrame(g = [1,1,2,2], v=1:4)

println(names(aggregate(d, :g, x->sum(x))))
println(names(aggregate(d, :g, x->sum(x))))

This code produces:

Symbol[:g, Symbol("v_#1")]
Symbol[:g, Symbol("v_#3")]

although the last two commands are identical.

The documentation promises names like v_\lambda1 at this point. I am using Julia v0.6.0.

IMO the function _fnames in https://github.com/JuliaData/DataFrames.jl/blob/master/src/other/utils.jl should be adjusted for newer versions of Julia.

@cjprybol
Contributor

> not reproducible

I was able to reproduce this

julia> using DataFrames

julia> d = DataFrame(g = [1,1,2,2], v=1:4)
4×2 DataFrames.DataFrame
│ Row │ g │ v │
├─────┼───┼───┤
│ 1   │ 1 │ 1 │
│ 2   │ 1 │ 2 │
│ 3   │ 2 │ 3 │
│ 4   │ 2 │ 4 │

julia> println(names(aggregate(d, :g, x->sum(x))))
Symbol[:g, Symbol("v_#1")]

julia> println(names(aggregate(d, :g, x->sum(x))))
Symbol[:g, Symbol("v_#3")]

The names of the columns are the function identifiers. Here's another fresh session showing the identifiers of the anonymous functions; they match the column names and are likewise reproducible.

julia> x->sum(x)
(::#1) (generic function with 1 method)

julia> x->sum(x)
(::#3) (generic function with 1 method)

> inconvenient for further usage

The previously used lambda syntax has no relation (aside from order) to the actual functions that were used to create the data, and hence it was removed. Unfortunately, it is not yet possible to extract the original code of an anonymous function from its identifier (see JuliaLang/julia#2625 (comment)). In principle, if that is added as a language feature, the identifiers of the anonymous functions will provide both a stable identifier and a way to recover the function associated with it. Currently only the "stable identifier" part is supported, while the lambda syntax can support neither.

If you would like the columns to have specific names, you simply need to use named functions. For example, in the case you provided, the anonymous function isn't needed: you can pass the sum function directly instead of wrapping it in an anonymous function with x -> sum(x).

julia> names(aggregate(d, :g, sum))
2-element Array{Symbol,1}:
 :g
 :v_sum

Alternatively, you can retain the lambda naming by defining your functions under those names:

julia> λ1(x) = sum(x)
λ1 (generic function with 1 method)

julia> names(aggregate(d, :g, λ1))
2-element Array{Symbol,1}:
 :g
 :v_λ1

You are correct that the documentation for that section is out of date; it will be corrected after #1252 is merged. The _fnames function should have already been removed. If you'd like to contribute a PR to delete it, that would be great!

@nalimilan
Member

The presence of a # is really annoying since it makes the symbol non-standard. Maybe we should replace it with another symbol, probably an ASCII one so that it's easy to type (f?).

@pfarndt
Author

pfarndt commented Nov 14, 2017

You got the point - the # is annoying.

Changing it to something else (e.g. f) solves only half of the problem. I might call such an aggregate statement several times (with more than one anonymous function, each more complicated than just a sum) and use the resulting columns further on. Right now, since the numerical identifier changes from one call to the next, I would have to look up each column's name by calling names and computing its position (since I might want to apply the anonymous function to several columns). Therefore the "previously used lambda syntax" that "has no relation (aside from order) to the actual functions" would be very helpful: it carries all the information one needs, i.e. the column name and which of my anonymous functions was applied to it.

@nalimilan
Member

So you mean, just use f1, f2...? Why not.
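A rough sketch of what positional naming could look like, in plain Julia with no DataFrames dependency. The helper name stable_suffixes is hypothetical and is not the implementation that later landed in DataFrames. It assumes Julia 1.x, where nameof is available and the generated names of anonymous functions start with "#". Here the number is the function's position in the list, so it does not depend on how many anonymous functions were defined earlier in the session.

```julia
# Hypothetical helper (not part of DataFrames): derive a column-name suffix
# for each function passed to an aggregation.
# Named functions keep their own name; anonymous functions, whose generated
# names start with '#', get f<i>, where i is their position in the list.
function stable_suffixes(fs)
    return [startswith(String(nameof(f)), "#") ? Symbol("f", i) : nameof(f)
            for (i, f) in enumerate(fs)]
end

fs = [sum, x -> maximum(x) - minimum(x), x -> length(x)]
suffixes = stable_suffixes(fs)   # [:sum, :f2, :f3]

# Combine with a value column name the same way the thread's examples do (v_sum).
colnames = [Symbol("v_", s) for s in suffixes]
println(colnames)
```

With this scheme the same call yields the same names every time, and the index tells you which of the supplied functions produced a given column.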

@bkamins
Member

bkamins commented Dec 14, 2018

Closing as #1576 fixed this

@bkamins bkamins closed this as completed Dec 14, 2018