Add nest, unnest, extract and extract! #3258

bkamins · 2022-12-28T19:42:26Z

The PR adds nest and unnest and introduces the scalar kwarg to flatten (which is needed in unnest).

flatten is ready for review.

nest and unnest require discussion of whether we like the proposed API (they work, but we may decide to change the API).

Some important decisions I propose:

nest only works on GroupedDataFrame (to avoid the complexity of specifying group order); nesting always produces a DataFrame (to keep things simple); another difficult decision is the syntax I proposed, [:x, :y] => :z, which means that columns :x and :y should be nested and stored in column :z (but I would like to confirm that we find it intuitive, as the syntax :z => [:x, :y] could also be advocated for).
unnest supports both tables (e.g. DataFrame) and rows (e.g. Tables.AbstractRow) and has two options: flatten=true, when rows of the nested columns are flattened, and flatten=false, when they are left as is (this is probably useful if we work with rows).
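For orientation, the round trip that nest and unnest (with flattening) are meant to provide can be approximated with functions that already exist in DataFrames.jl. This is only a sketch of the intended semantics, not the implementation in this PR; the names nested, parts and unnested are made up for the example.

```julia
using DataFrames

df = DataFrame(a=[1, 1, 2, 2], b=11:14, c=21:24)

# Nest: one row per group, with columns :b and :c stored as a DataFrame in :bc.
nested = combine(groupby(df, :a), AsTable([:b, :c]) => Ref∘DataFrame => :bc)

# Unnest with flattening: repeat the key for every nested row, then stack.
parts = [hcat(DataFrame(a=fill(key, nrow(sdf))), sdf)
         for (key, sdf) in zip(nested.a, nested.bc)]
unnested = reduce(vcat, parts)
```

Here unnested has the same columns and rows as df, which is the property one would expect from a nest/unnest pair.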

TODO:

add metadata
write tests
update manual

CC @nalimilan @pdeffebach @jariji

src/abstractdataframe/nest.jl

nalimilan · 2023-01-01T12:16:05Z

nest only works on GroupedDataFrame (to avoid the complexity of specifying group order); nesting always produces a DataFrame (to keep things simple); another difficult decision is the syntax I proposed, [:x, :y] => :z, which means that columns :x and :y should be nested and stored in column :z (but I would like to confirm that we find it intuitive, as the syntax :z => [:x, :y] could also be advocated for).

Agreed. [:x, :y] => :z sounds more logical as the input is on the LHS and the output on the RHS.

unnest supports both tables (e.g. DataFrame) and rows (e.g. Tables.AbstractRow) and has two options: flatten=true, when rows of the nested columns are flattened, and flatten=false, when they are left as is (this is probably useful if we work with rows).

Let's continue discussion at #3258 (comment).

src/abstractdataframe/nest.jl

nalimilan · 2022-12-30T22:15:29Z

+`cols` (default `:setequal`) and `promote` (default `true`) keyword arguments
+have the same meaning as in [`push!`](@ref).


Maybe repeat the explanation? Usually we don't require users to read other docstrings like this.

bkamins · 2023-01-08

When I started describing it I realized that anything except cols=:union and promote=true is not really useful. We can always add it later. So I opted for a simpler design for now, and the functions do not take these arguments.
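For readers who do not have the push! docstring at hand, here is a small sketch of what cols and promote do there. It only illustrates push! itself; whether nest, unnest or extract end up exposing the same keywords is exactly what is being decided in this thread.

```julia
using DataFrames

df = DataFrame(a=[1], b=[2])

# Default cols=:setequal: the row must have exactly the columns of df.
strict_failed = try
    push!(copy(df), (a=3, c=4))
    false
catch
    true
end

# cols=:union adds the new column :c and fills the gaps with missing.
push!(df, (a=3, c=4); cols=:union)

# promote defaults to true with cols=:union, so :a is widened to hold 1.5.
push!(df, (a=1.5, b=0, c=0); cols=:union)
```

After these calls df has three rows, columns :b and :c allow missing, and :a holds Float64 values.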

bkamins · 2023-01-02T07:49:34Z

I propose to discuss this PR step by step.
Let us start with nest. Here the major question is whether we need it. The main reason for asking is that Ref already acts as a nesting operator in the operation specification syntax.

Start with some data frame:

julia> df = DataFrame(a=[1, 1, 2, 2], b=11:14, c=21:24)
4×3 DataFrame
 Row │ a      b      c
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1     11     21
   2 │     1     12     22
   3 │     2     13     23
   4 │     2     14     24

First, just doing groupby already gives us nested SubDataFrames, and maybe this is enough for most users:

julia> groupby(df, :a)
GroupedDataFrame with 2 groups based on key: a
First Group (2 rows): a = 1
 Row │ a      b      c
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1     11     21
   2 │     1     12     22
⋮
Last Group (2 rows): a = 2
 Row │ a      b      c
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     2     13     23
   2 │     2     14     24

(note that such a grouped data frame also supports convenient indexing to get the nested data frames, something which is not easily available if we actually nest the data)
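As a sketch of the indexing mentioned above, a GroupedDataFrame can be indexed by group position or by key value (the variable names are only for illustration):

```julia
using DataFrames

df = DataFrame(a=[1, 1, 2, 2], b=11:14, c=21:24)
gdf = groupby(df, :a)

first_group = gdf[1]        # by group position
group_a2 = gdf[(a=2,)]      # by key value, given as a NamedTuple
same_group = gdf[(2,)]      # by key value, given as a Tuple
group_keys = keys(gdf)      # one GroupKey per group
```

Each lookup returns a SubDataFrame, i.e. a view into df rather than a copy.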

Now, a basic pattern to nest some columns as NamedTuple is:

julia> combine(groupby(df, :a), AsTable([:b, :c]) => Ref => :bc)
2×2 DataFrame
 Row │ a      bc
     │ Int64  NamedTup…
─────┼─────────────────────────────────────
   1 │     1  (b = [11, 12], c = [21, 22])
   2 │     2  (b = [13, 14], c = [23, 24])

The only thing the user needs to remember here is that the :b and :c fields of the named tuples are views, but this might be preferred in many scenarios.
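A quick way to see the view behavior, and to get independent copies when views are not wanted, might look like this (nt and owned are illustrative names):

```julia
using DataFrames

df = DataFrame(a=[1, 1, 2, 2], b=11:14, c=21:24)
res = combine(groupby(df, :a), AsTable([:b, :c]) => Ref => :bc)

nt = res.bc[1]              # NamedTuple for the group a = 1
is_view = nt.b isa SubArray # the fields are views, not copies

# Materialize the fields as plain vectors when independent data is needed.
owned = map(collect, nt)
```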

If one wants to nest data frames one can just write:

julia> combine(groupby(df, :a), AsTable([:b, :c]) => Ref∘DataFrame => :bc)
2×2 DataFrame
 Row │ a      bc
     │ Int64  DataFrame
─────┼──────────────────────
   1 │     1  2×2 DataFrame
   2 │     2  2×2 DataFrame

Now the columns are copied (as the DataFrame constructor copies data by default).
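The practical consequence of the copy can be checked with a short sketch: changing a nested data frame leaves the source untouched.

```julia
using DataFrames

df = DataFrame(a=[1, 1, 2, 2], b=11:14, c=21:24)
res = combine(groupby(df, :a), AsTable([:b, :c]) => Ref∘DataFrame => :bc)

inner = res.bc[1]   # nested DataFrame for the group a = 1
inner.b[1] = 0      # modify the nested copy only
```

After the assignment df.b[1] is still 11, because inner owns its own vectors.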

Finally note that operation specification syntax also works with row nesting (something we have not discussed but is useful):

julia> combine(groupby(df, :a), AsTable([:b, :c]) => ByRow(identity) => :bc)
4×2 DataFrame
 Row │ a      bc
     │ Int64  NamedTup…
─────┼─────────────────────────
   1 │     1  (b = 11, c = 21)
   2 │     1  (b = 12, c = 22)
   3 │     2  (b = 13, c = 23)
   4 │     2  (b = 14, c = 24)

In the implementation in the PR I used a slightly different pattern:

julia> combine(groupby(df, :a), x -> (bc = select(x, [:b, :c]),))
2×2 DataFrame
 Row │ a      bc
     │ Int64  DataFrame
─────┼──────────────────────
   1 │     1  2×2 DataFrame
   2 │     2  2×2 DataFrame

but it was only because the implementation should support all cases (and AsTable can be problematic with compilation in case of extremely wide tables).

In summary, my question is: given these considerations, do we need to add nest? Maybe it is enough to add examples to the manual showing how nesting can be done?

jariji · 2023-01-02T08:29:44Z

combine says

If target_cols is a Symbol or a string then the function is assumed to return a single column. In this case returning a data frame, a matrix, a NamedTuple, or a DataFrameRow raises an error.

e.g.

julia> combine(groupby(df, :a), AsTable([:b, :c]) => DataFrame => :bc)
ERROR: ArgumentError: a single value or vector result is required (got DataFrame)

Is the reason for this error documented somewhere? i.e. why doesn't this just work?

The error message could be a good place to document the Ref/fill trick.

I don't like having to specify the columns twice. If it's okay for :a to be in the nested dataframes too I can just use

combine(groupby(df, :a), AsTable(:) => Ref∘DataFrame => :bc)

but sometimes it's desired that :a not be duplicated there, and I'd rather not have to spell it out. I'm not currently sure how big an issue to make of this.
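One possible way to avoid spelling the columns out, sketched here with the existing valuecols function (which returns the non-grouping columns of a GroupedDataFrame), is:

```julia
using DataFrames

df = DataFrame(a=[1, 1, 2, 2], b=11:14, c=21:24)
gdf = groupby(df, :a)

# valuecols(gdf) is [:b, :c] here, so :a is not repeated in the nested frames.
res = combine(gdf, AsTable(valuecols(gdf)) => Ref∘DataFrame => :bc)
```

This still needs the grouped data frame bound to a name, so it is a workaround rather than a dedicated syntax.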

bkamins · 2023-01-02T08:57:58Z

i.e. why doesn't this just work?

The reason is safety (in other words: to allow for a non-breaking change in the future if we decide it is needed). For example, if the user writes:

AsTable([:b, :c]) => DataFrame

it would be ambiguous whether the user wants the produced data frame to be stored in one cell of a data frame or expanded (as intuitively users might expect it to be expanded, just as vectors are expanded).

This is especially relevant when trying to detect rare cases when some function can either return a table or a scalar (i.e. when the return type of the operation is not type stable).

The error message could be a good place to document the Ref/fill trick.

This is a good point. I will make a PR changing this.
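To make the Ref/fill trick concrete, here is a sketch that triggers the error and then shows the two wrappers that store the data frame in a single cell. The exact error message may differ between DataFrames.jl versions, so the example only records that an exception was thrown.

```julia
using DataFrames

df = DataFrame(a=[1, 1, 2, 2], b=11:14, c=21:24)
gdf = groupby(df, :a)

# Returning a bare DataFrame for a single target column is rejected.
err = try
    combine(gdf, AsTable([:b, :c]) => DataFrame => :bc)
    nothing
catch e
    e
end

# Wrapping the result in Ref marks it as a single value to store in one cell.
with_ref = combine(gdf, AsTable([:b, :c]) => Ref∘DataFrame => :bc)

# A 0-dimensional array created with fill is unwrapped in the same way.
with_fill = combine(gdf, AsTable([:b, :c]) => (nt -> fill(DataFrame(nt))) => :bc)
```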

nalimilan · 2023-01-02T10:49:48Z

Yeah nest isn't strictly needed. Its main advantage is that it's easier to discover, but it's not clear that nesting is really useful in DataFrames.jl thanks to the existence of GroupedDataFrame (that dplyr doesn't have).

That said, if we add unnest we should probably have nest for consistency. But the behavior with flatten=false doesn't really match what I would expect from unnest. Maybe the action of splitting a column into several ones would be better named separate or extract similar to the dplyr functions? In dplyr these only allow splitting strings, but we could make it more general and allow creating columns from any collection, including named tuples? The only peculiarity of named tuples (and dicts) is that appropriate column names can be extracted automatically.

bkamins · 2023-01-02T13:29:08Z

Maybe the action of splitting a column into several ones would be better named separate or extract similar to the dplyr functions?

This is what I had in mind. extract seems like a good name.

In dplyr these only allow splitting strings, but we could make it more general and allow creating columns from any collection, including named tuples?

This is something we do not need to add, as we already have it: AsTable as a target does exactly this.
The only limitation is that AsTable assumes a fixed schema for all rows.

What we need is a function designed to handle cases when each row potentially has a different schema.
And maybe also (this is not added, but we could add it) allowing for e.g. a missing value in a row that would be expanded to missing values.

The only peculiarity of named tuples (and dicts) is that appropriate column names can be extracted automatically.

Currently dicts would not work as they do not have a defined column order (but we could change this; but then also a change in push! et al. should be introduced for consistency)

jariji · 2023-01-02T21:47:26Z

it's not clear that nesting is really useful in DataFrames.jl thanks to the existence of GroupedDataFrame (that dplyr doesn't have).

I'm missing something, how does GroupedDataFrame substitute for df nesting?

My uninformed impression so far is that GroupedDataFrame partially substitutes for Pandas's row labels but that hierarchical column labels have no equivalent in DFjl and nested dataframes is my workaround for that missing feature.

bkamins · 2023-01-02T22:03:13Z

nested dataframes is my workaround for that missing feature.

Indeed, nested column labels are not supported and a workaround for them is needed. However, my question is why you use a data frame for this. Normally a NamedTuple of scalars would be used here, like:

julia> df = DataFrame(x=[(a="aa", b="bb"), (a="pp", b="qq")])
2×1 DataFrame
 Row │ x
     │ NamedTup…
─────┼──────────────────────
   1 │ (a = "aa", b = "bb")
   2 │ (a = "pp", b = "qq")

That is why I have said that normally I would expect flatten=false to be needed.
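For such a column of NamedTuples with a common schema, the existing AsTable target already expands the fields into columns. A minimal sketch (wide is an illustrative name):

```julia
using DataFrames

df = DataFrame(x=[(a="aa", b="bb"), (a="pp", b="qq")])

# Each NamedTuple becomes one row; its fields become the columns :a and :b.
wide = select(df, :x => ByRow(identity) => AsTable)
```

Using transform instead of select would keep the original :x column next to the expanded ones.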

I'm missing something, how does GroupedDataFrame substitute for df nesting?

What we mean is that you can easily index into a GroupedDataFrame to get a portion of the source data frame for a certain combination of key column values. Notice that it naturally combines with column nesting of scalars.

The point is that if you nest whole DataFrame you fix the row structure. While if you nest a NamedTuple of scalars you have nested columns but can groupby different columns flexibly later.

jariji · 2023-01-03T00:58:25Z

In my dataframe, each row specifies a regression model and the :df column has the data, including the regression residuals, similar to the broom vignette. What is your opinion about using this style in DFjl?

bkamins · 2023-01-03T06:55:22Z

You mean that "per group" you want to store different objects:

source data frame as one column
estimation results as another column
residuals etc. as yet another column

Then indeed nesting a data frame (or any other object) makes sense.

bkamins · 2023-01-04T18:12:23Z

@jariji - given my last comment, I now realize that one cannot create such an object by nesting. I.e. the use case where nesting indeed seems to be needed is when you add different objects sequentially (if you did it in one shot they would have to have the same number of columns). Is this indeed your case, i.e. you nest only the columns needed for estimation of the model, but the other columns, which are also nested, are only added later?


bkamins · 2023-01-05T09:24:51Z

OK, I have thought about how we should move forward. Although the functions are simple, I think we can keep them for user-friendliness. The functions are:

nest and unnest for working with tables
expand and expand! for expanding row-like sources (here the ! version makes sense to have as this is most likely what user will want to do since number of rows does not change).

For now I gave tentative implementations (to show how things internally would work).
@nalimilan To move forward I need #3245 to be merged first, and then I need to merge it into this PR.

nalimilan · 2023-01-08T20:29:47Z

In dplyr these only allow splitting strings, but we could make it more general and allow creating columns from any collection, including named tuples?

This is something we do not need to add, as we already have it: AsTable as a target does exactly this. The only limitation is that AsTable assumes a fixed schema for all rows.

What we need is a function designed to handle cases when each row potentially has a different schema. And maybe also (this is not added, but we could add it) allowing for e.g. a missing value in a row that would be expanded to missing values.

The schema isn't fixed either when splitting string columns into one or more columns: in some cases you might have no occurrence of the separator, in some cases one or more occurrences, giving one, two or more columns. That's why it seems to make sense to be able to support both strings and collections in the same function.

Why call it expand rather than separate or extract? expand is something completely different in dplyr.

Otherwise your proposal sounds good to me.

bkamins · 2023-01-09T14:39:42Z

Why call it expand rather than separate or extract?

I meant extract - fixed.

Now regarding:

The schema isn't fixed either when splitting string columns into one or more columns

This is not supported anyway, as currently there is no way to specify in the syntax that the string should be split.

What we provide is the following:

julia> df = DataFrame(x=["a,b", "c,d"])
2×1 DataFrame
 Row │ x
     │ String
─────┼────────
   1 │ a,b
   2 │ c,d

julia> select(df, :x => ByRow(x -> split(x, ',')) => AsTable)
2×2 DataFrame
 Row │ x1         x2
     │ SubStrin…  SubStrin…
─────┼──────────────────────
   1 │ a          b
   2 │ c          d

julia> select(df, :x => ByRow(x -> split(x, ',')) => [:p, :q])
2×2 DataFrame
 Row │ p          q
     │ SubStrin…  SubStrin…
─────┼──────────────────────
   1 │ a          b
   2 │ c          d

with the restriction that every string must be split into the same number of groups.

Maybe then - instead of adding extract and extract! we should change the implementation of => AsTable and => [:p, :q] syntaxes above and allow for varying number of columns to be produced by the expression (in which case we would make the cols=:union equivalent instead?).
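With the current behavior, one workaround for a varying number of pieces is to pad every split result to a common length with missing before handing it to AsTable. The sketch below assumes this padding approach; pad, pieces and width are hypothetical helpers for the example, not part of DataFrames.jl or of this PR.

```julia
using DataFrames

df = DataFrame(x=["a,b", "c", "d,e,f"])

# Find the largest number of pieces so that every row can be padded to it.
pieces = split.(df.x, ',')
width = maximum(length, pieces)

# Hypothetical helper: fill missing positions so all rows have `width` entries.
pad(p) = [i <= length(p) ? String(p[i]) : missing for i in 1:width]

wide = select(df, :x => ByRow(s -> pad(split(s, ','))) => AsTable)
```

Every row now produces the same number of entries, so the fixed-schema restriction is satisfied and the shorter rows get missing in the trailing columns.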

jariji · 2023-01-09T22:47:02Z

Related PR for string-splitting with fixed number of splitpoints JuliaLang/julia#43557

bkamins · 2023-02-05T08:17:01Z

OK - flatten is removed from this PR.
We leave it as WIP with unnest, nest, extract and extract!

bkamins · 2023-06-04T17:58:22Z

Self-note. Investigate:

:src_column => only => AsTable

pattern

ohaaga · 2024-05-14T11:22:17Z

Just learning Julia, so apologies if this is redundant, but I'd love to have nest/unnest (and convenience functions for mapping pipelined functions over columns that contain dataframes) for something like Jennifer Bryan's "row-oriented" workflow, which really helps to keep a project organized (in a real data structure, rather than with ad-hoc naming conventions, etc) when repeating multiple analyses over e.g. different geographic units.

https://github.com/jennybc/row-oriented-workflows
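For what it is worth, a row-oriented workflow of this kind can already be sketched with existing functions: nest with Ref∘DataFrame, then map functions over the nested column with ByRow. The column names :data, :b_sum and :n are only illustrative.

```julia
using DataFrames

df = DataFrame(a=[1, 1, 2, 2], b=11:14, c=21:24)
gdf = groupby(df, :a)

# One row per group, with the group's remaining columns stored in :data.
nested = combine(gdf, AsTable(valuecols(gdf)) => Ref∘DataFrame => :data)

# Apply per-group computations to each nested data frame.
results = transform(nested,
                    :data => ByRow(d -> sum(d.b)) => :b_sum,
                    :data => ByRow(nrow) => :n)
```

A fitted model or any other per-group object could be stored in a further column in the same way.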

add nest, unnest, improve flatten

c459de9

bkamins requested a review from nalimilan December 28, 2022 19:42

bkamins added the feature label Dec 28, 2022

bkamins added this to the 1.5 milestone Dec 28, 2022

This was linked to issues Dec 28, 2022

feature: cols=:union argument (or something like it) for combine with AsTable #3005

Open

unnest #3116

Open

Improve flatten (slightly breaking) #2767

Closed

bkamins mentioned this pull request Dec 28, 2022

a new method of the flatten function in DataFrames #2890

Closed

add to docs

d764467

jariji reviewed Dec 28, 2022

jariji reviewed Dec 28, 2022

jariji reviewed Dec 29, 2022

bkamins mentioned this pull request Dec 29, 2022

Lifecycle annotations #3259

Closed

nalimilan reviewed Jan 1, 2023

bkamins and others added 4 commits January 5, 2023 09:57

Apply suggestions from code review

c85a275

Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>

add extract

f2db6b1

Merge branch 'bk/nest' of https://github.com/JuliaData/DataFrames.jl …

a956dd2

…into bk/nest

initial implementation

792d355

bkamins added 3 commits January 5, 2023 20:12

change default cols to :union

7d05ac8

fix wrong function name

0e58244

remove cols and promote

4093533

change to extract

698330f

This was unlinked from issues Feb 5, 2023

feature: cols=:union argument (or something like it) for combine with AsTable #3005

Open

unnest #3116

Open

bkamins mentioned this pull request Feb 5, 2023

unnest #3116

Open

bkamins removed a link to an issue Feb 5, 2023

Improve flatten (slightly breaking) #2767

Closed

This was linked to issues Feb 5, 2023

feature: cols=:union argument (or something like it) for combine with AsTable #3005

Open

unnest #3116

Open

bkamins changed the title ~~Add nest, unnest, improve flatten~~ Add nest, unnest, extract and extract! Feb 5, 2023

bkamins marked this pull request as draft February 5, 2023 08:17

bkamins added 3 commits February 5, 2023 09:19

remove flatten from the PR

7017867

fix newlines

cca1c87

another newline fix

5c7111c

bkamins modified the milestones: 1.5, 1.6 Feb 5, 2023

bkamins modified the milestones: 1.6, 1.7 Jul 10, 2023
