Sampling GroupedDataFrames (rand) #3437

Open
quachpas opened this issue Apr 17, 2024 · 5 comments
@quachpas
quachpas commented Apr 17, 2024

Hello,

Currently, we cannot sample from a GroupedDataFrame directly.

julia> using DataFrames

julia> df = DataFrame(rand(100000, 100), :auto);

julia> gdf = groupby(df, :x1);  # setup code from #3102

julia> rand(gdf)  # MethodError
Stacktrace

ERROR: MethodError: no method matching Random.Sampler(::Type{TaskLocalRNG}, ::Random.SamplerTrivial{GroupedDataFrame{DataFrame}, Any}, ::Val{1})

Closest candidates are:
  Random.Sampler(::Type{<:AbstractRNG}, ::Random.Sampler, ::Union{Val{1}, Val{Inf}})
   @ Random ~/.julia/juliaup/julia-1.10.2+0.x64.linux.gnu/share/julia/stdlib/v1.10/Random/src/Random.jl:147
  Random.Sampler(::Type{<:AbstractRNG}, ::Any, ::Union{Val{1}, Val{Inf}})
   @ Random ~/.julia/juliaup/julia-1.10.2+0.x64.linux.gnu/share/julia/stdlib/v1.10/Random/src/Random.jl:183
  Random.Sampler(::Type{<:AbstractRNG}, ::BitSet, ::Union{Val{1}, Val{Inf}})
   @ Random ~/.julia/juliaup/julia-1.10.2+0.x64.linux.gnu/share/julia/stdlib/v1.10/Random/src/generation.jl:450
  ...

Stacktrace:
 [1] Random.Sampler(T::Type{TaskLocalRNG}, sp::Random.SamplerTrivial{GroupedDataFrame{DataFrame}, Any}, r::Val{1})
   @ Random ~/.julia/juliaup/julia-1.10.2+0.x64.linux.gnu/share/julia/stdlib/v1.10/Random/src/Random.jl:147
 [2] Random.Sampler(rng::TaskLocalRNG, x::Random.SamplerTrivial{GroupedDataFrame{DataFrame}, Any}, r::Val{1})
   @ Random ~/.julia/juliaup/julia-1.10.2+0.x64.linux.gnu/share/julia/stdlib/v1.10/Random/src/Random.jl:139
 [3] rand(rng::TaskLocalRNG, X::Random.SamplerTrivial{GroupedDataFrame{DataFrame}, Any})
   @ Random ~/.julia/juliaup/julia-1.10.2+0.x64.linux.gnu/share/julia/stdlib/v1.10/Random/src/Random.jl:255
 [4] rand(rng::TaskLocalRNG, X::GroupedDataFrame{DataFrame})
   @ Random ~/.julia/juliaup/julia-1.10.2+0.x64.linux.gnu/share/julia/stdlib/v1.10/Random/src/Random.jl:255
 [5] rand(X::GroupedDataFrame{DataFrame})
   @ Random ~/.julia/juliaup/julia-1.10.2+0.x64.linux.gnu/share/julia/stdlib/v1.10/Random/src/Random.jl:260
 [6] top-level scope
   @ REPL[228]:3

One way to work around the MethodError is to sample group indices and index into the GroupedDataFrame with them:

julia> df = DataFrame(rand(100000, 100), :auto);

julia> gdf = groupby(df, :x1);

julia> indices = rand(1:length(gdf), 10^6)  # many more draws than groups (code from #3102)

julia> getindex.(Ref(gdf), indices)  # sampling works

Code: #3102

What would be needed to implement this interface? Or, is it undesirable to do so?

versioninfo and package version

julia> versioninfo()
Julia Version 1.10.2
Commit bd47eca2c8a (2024-03-01 10:14 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 8 × 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, tigerlake)
Threads: 1 default, 0 interactive, 1 GC (on 8 virtual cores)
Environment:
  JULIA_REVISE_POLL = 1
  JULIA_EDITOR = code
  JULIA_NUM_THREADS = 

(env) pkg> status DataFrames
Status `~/Project.toml`
  [a93c6f00] DataFrames v1.6.1

EDIT: reproducible on v1.7.0 (main)

@bkamins
Member

bkamins commented Apr 17, 2024

We could add it. @nalimilan, what do you think about adding:

Random.rand(rng::Random.AbstractRNG, sp::Random.SamplerTrivial{<:GroupedDataFrame}) = sp[][rand(rng, 1:length(sp[]))]

?

bkamins added this to the 1.7 milestone Apr 17, 2024
@nalimilan
Member

Duplicate of #2097. It would make sense to define rand on both GroupedDataFrame and DataFrame, as we implement shuffle for it (#3010).

For DataFrame, we could also allow specifying a number of rows to draw. That wouldn't work for GroupedDataFrame but we could print an error message with a hint about what to do. Or we could automatically create a new integer grouping column that would allow repeating a group multiple times if it's been drawn more than once. FWIW, this feature has been requested in dplyr but hasn't been implemented: tidyverse/dplyr#361, tidyverse/dplyr#6518.
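
The "new integer grouping column" idea could be sketched roughly as follows. This is only an illustration, not an existing or proposed DataFrames.jl API: the helper name `sample_groups` and the column name `:draw` are hypothetical, and the sketch assumes at least one group is drawn.

```julia
using DataFrames, Random

# Hypothetical helper (not part of DataFrames.jl): draw `n` groups with
# replacement and tag each draw with a new integer column, so that a group
# drawn more than once remains a separate group after regrouping.
function sample_groups(rng::AbstractRNG, gdf::GroupedDataFrame, n::Integer;
                       drawcol::Symbol = :draw)
    idxs = rand(rng, 1:length(gdf), n)  # group indices, duplicates allowed
    # Copy each drawn group and add the draw number as a constant column.
    parts = [insertcols!(DataFrame(gdf[i]), drawcol => k)
             for (k, i) in enumerate(idxs)]
    # Stack the copies and regroup on the draw number.
    return groupby(reduce(vcat, parts), drawcol)
end

df = DataFrame(g = repeat(1:4, inner = 3), v = 1:12)
gdf = groupby(df, :g)
sampled = sample_groups(Xoshiro(1), gdf, 6)  # 6 draws from 4 groups
```

Because the result is grouped on the draw number rather than on the original key, drawing more groups than exist forces repeats without violating the uniqueness of group ids.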

@bkamins
Member

bkamins commented Apr 18, 2024

Ah - good catch.

So, I responded positively because rand's API specifies:

Pick a random element or array of random elements

The key word is array, which means that rand(gdf, 10, 10) would return a 10x10 array of SubDataFrame.

If we also added rand for data frame then writing rand(df, 10, 10) would return a 10x10 array of DataFrameRow.

I am not sure this is useful, but it could work. This differs from shuffle: shuffle does not promise to return an array, only a permuted copy, while rand promises to return a single element or an array.

The question is if users would find it intuitive and useful?
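
To make the array semantics concrete, here is a small sketch that emulates them with functionality that already exists. It does not call `rand` on a data frame or grouped data frame (no such method exists today); it draws a matrix of indices and indexes with each one. The small example table is made up for illustration.

```julia
using DataFrames, Random

df = DataFrame(g = repeat(1:4, inner = 3), v = 1:12)
gdf = groupby(df, :g)
rng = Xoshiro(1)

# Emulates a hypothetical rand(gdf, 2, 3): one SubDataFrame per matrix cell.
groups = [gdf[i] for i in rand(rng, 1:length(gdf), 2, 3)]

# Emulates a hypothetical rand(df, 2, 3): one DataFrameRow per matrix cell.
rows = [df[i, :] for i in rand(rng, 1:nrow(df), 2, 3)]
```

Both results are ordinary 2x3 matrices whose elements are views into `df`, which is the shape the `rand` contract implies.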

@quachpas
Author

Thanks for all the answers! Sorry about the missed duplicate issue.

The question is if users would find it intuitive and useful?

AFAIK the only interface that DataFrames.jl provides for Random is shuffle and shuffle!, which both return a permuted DataFrame. Since DataFrame does not support rand either, I was probably wrong to expect GroupedDataFrame to behave like an Array.

As for usefulness, in my case, I was looking to sample groups of data (hence the groupby), and it did feel jarring that I couldn't just sample the GroupedDataFrame. I am not sure it is strictly useful, but it is certainly more straightforward than the following

N = 100
tdf = transform(df, [:x1, :x2] => ByRow(string))
keys = unique(tdf[!, :x1_x2_string])
subset(tdf, :x1_x2_string => ByRow(in(rand(keys, N)))) # DataFrame, have to drop :x1_x2_string

VS

N = 100
gdf = groupby(df, [:x1, :x2])
rand(gdf, N) # Array of GroupedDataFrame? GroupedDataFrame?

I don't think it's intuitive for rand(gdf, 10, 10) to return an array. If shuffle returns a permuted copy, I would expect rand to always return a (Grouped)DataFrame (although that sounds like a lot of work for not much).

P.S.: I did not go into the implementation of GroupedDataFrame in detail, but is there a reason why getindex(gd, idxs) does not support duplicate idxs?

@bkamins
Member

bkamins commented Apr 18, 2024

but is there a reason why getindex(gd, idxs) does not support duplicate idxs?

This is the same reason why Dict does not allow for duplicate keys. Group ids must be unique.


Adding shuffle and shuffle! to GroupedDataFrame is easy to do - we could add it if you would find it useful.
