Updateindex #3401

Open

leei wants to merge 5 commits into JuliaData:main from leei:updateindex

leei commented Nov 8, 2023 (edited)

This is a replacement for #3366. It defines a new keyword argument, `mergeduplicates`, that can be set to a `Function`. When there are duplicate column names and `makeunique=false`, that function combines the values of the duplicated columns into a single column instead of throwing an error.

It is implemented in stages. The first commit creates a temporary struct, `UpdateIndex`, which is used to initialize a `DataFrame` or to represent a set of column names and columns that will be merged into the resulting `DataFrame` in an `hcat` or `hcat!` operation. The following commits extend `permutedims` and the joins to resolve column clashes in the same fashion.
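To make the intended semantics concrete, here is a minimal sketch in plain Julia. It is not the PR's implementation and does not use `UpdateIndex` or any DataFrames.jl internals. It assumes the function passed as `mergeduplicates` is applied elementwise to the values of the clashing columns; `merge_columns` and the pair-based column representation are hypothetical names introduced only for illustration.

```julia
# Hypothetical helper, not part of DataFrames.jl: horizontally combine
# name => column pairs, merging same-named columns elementwise with `f`.
function merge_columns(f::Function, cols::Pair{Symbol,<:AbstractVector}...)
    order = Symbol[]                          # names in order of first appearance
    merged = Dict{Symbol,AbstractVector}()
    for (name, col) in cols
        if haskey(merged, name)
            # Assumption: `f` receives one value from each clashing column.
            merged[name] = map(f, merged[name], col)
        else
            push!(order, name)
            merged[name] = col
        end
    end
    return [name => merged[name] for name in order]
end

left  = (:id => [1, 2, 3], :x => [10, missing, 30])
right = (:x => [missing, 20, 99], :y => ["a", "b", "c"])

# `coalesce` keeps the first non-missing value of each pair of duplicates.
result = merge_columns(coalesce, left..., right...)
```

With `coalesce`, the duplicated column `:x` keeps the left value where it is present and falls back to the right value where the left one is `missing`; columns that do not clash pass through unchanged.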

leei added 5 commits

October 30, 2023 16:19


          Add new keyword mergeduplicates that defines how to combine duplicate columns when `makeunique` is false.

6f8fe10

If `mergeduplicates` is passed a function then that function is invoked on the values of all duplicate columns and its return value is assigned to that named column.


          Extend mergeduplicates to permutedims

9e5dcaa


          Extend mergeduplicates to joins

b4955bb


          Add outerjoin! method for two dataframes.

f43864c


          Add language to join docs that indicates that mergeduplicates will only be done two at a time.

c71a4af

leei mentioned this pull request

mergeduplicates keyword to handle makeunique=false #3366

Draft

Member

bkamins commented Nov 11, 2023

As I commented before, it will be very hard to review such a big PR. That is why I recommended splitting it into smaller PRs and merging them incrementally.

But I will try to comment on this PR (however, note that because it is so big, it will be hard to review it properly and be sure that all issues are caught and discussed).

As a side note, it seems that you did not base this PR on the latest state of the main branch.

bkamins reviewed

src/abstractdataframe/abstractdataframe.jl

-                  rename!(index(df), vals, makeunique=makeunique)
+                               makeunique::Bool=false, mergeduplicates::MergeDuplicates=nothing)
+                  if !makeunique && isa(mergeduplicates, Function)
+                      (new_columns, colindex) = process_updates(UpdateIndex(vals), _columns(df), mergeduplicates)

Member

bkamins Nov 11, 2023

_columns is not defined for general AbstractDataFrame.

bkamins reviewed

src/abstractdataframe/abstractdataframe.jl

-                               makeunique::Bool=false)
-                  rename!(index(df), vals, makeunique=makeunique)
+                               makeunique::Bool=false, mergeduplicates::MergeDuplicates=nothing)
+                  if !makeunique && isa(mergeduplicates, Function)

Member

bkamins Nov 11, 2023

the docstring seems not to have been updated.

Member

bkamins Nov 11, 2023

in particular, I am not clear what rename!/rename should do when mergeduplicates is passed.

bkamins reviewed

src/abstractdataframe/abstractdataframe.jl

                   # renaming columns of SubDataFrame has to clean non-note metadata in its parent
                   _drop_all_nonnote_metadata!(parent(df))
                   return df
               end
+              function rename!(idx::Index, new_index::Index)

Member

bkamins Nov 11, 2023

functions for Index should be added in other/index.jl

bkamins reviewed

src/abstractdataframe/abstractdataframe.jl

                   # renaming columns of SubDataFrame has to clean non-note metadata in its parent
                   _drop_all_nonnote_metadata!(parent(df))
                   return df
               end
+              function rename!(idx::Index, new_index::Index)
+                  splice!(idx.names, 1:length(idx.names), new_index.names)

Member

bkamins Nov 11, 2023

should we first check that idx and new_index are independent?
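One possible shape for such a check, sketched in plain Julia. `MiniIndex` and `replace_names!` are hypothetical stand-ins, not the `Index` type or the `rename!` method from the PR, and the sketch only covers the case where both indices share the very same names vector.

```julia
# Hypothetical stand-in for an index type; only the `names` field matters here.
struct MiniIndex
    names::Vector{Symbol}
end

function replace_names!(idx::MiniIndex, new_index::MiniIndex)
    # If both indices hold the same vector object, there is nothing to copy,
    # and we avoid mutating the destination while reading from it.
    idx.names === new_index.names && return idx
    # Otherwise resize the destination and copy the names over element by element.
    copyto!(resize!(idx.names, length(new_index.names)), new_index.names)
    return idx
end

a = MiniIndex([:a, :b])
b = MiniIndex([:x, :y, :z])
replace_names!(a, b)

# An index that aliases `a.names`; the call below returns early.
shared = MiniIndex(a.names)
replace_names!(a, shared)
```

The identity test `===` only detects exact sharing of the vector. Whether a stricter notion of independence is needed depends on how `Index` objects can be constructed, which this sketch does not try to answer.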

bkamins reviewed

src/abstractdataframe/abstractdataframe.jl

@@ -353,9 +365,11 @@ julia> rename(uppercase, df)
               ```
               """
               rename(df::AbstractDataFrame, vals::AbstractVector{Symbol};

Member

bkamins Nov 11, 2023

docstring update is missing

bkamins reviewed

src/abstractdataframe/abstractdataframe.jl

+              Wherever the `mergeduplicates` keyword argument is available it is either `nothing` or
+              a `Function` that will be executed to combine duplicated columns (when `makeunique=false`)
+              """
+              MergeDuplicates = Union{Nothing,Function}

Member

bkamins Nov 11, 2023

I am OK to add this definition, but then its docstring should be more precise I think.

bkamins reviewed

src/abstractdataframe/abstractdataframe.jl

+              will be combined by invoking the function with all values from those columns.
+              e.g. `mergeduplicates=coalesce` will use the first non-missing value. Since `hcat` and
+              `hcat!` are performed recursively for more than two frames, this `mergeduplicates`
+              function will only combine two columns at a time.

Member

bkamins Nov 11, 2023

it is not clear what happens if `makeunique=true` and `mergeduplicates` is a `Function`.

bkamins reviewed

src/abstractdataframe/abstractdataframe.jl

               """
                   hcat(df::AbstractDataFrame...;
-                       makeunique::Bool=false, copycols::Bool=true)
+                       makeunique::Bool=false, mergeduplicates::MergeDuplicates=nothing, copycols::Bool=true)
               Horizontally concatenate data frames.
               If `makeunique=false` (the default) column names of passed objects must be unique.

Member

bkamins Nov 11, 2023

this statement does not seem to be true after this PR.

bkamins reviewed

src/abstractdataframe/abstractdataframe.jl

+                    if it is `true` a new unique name will be generated by adding a suffix,
+                    if it is `false` an error will be thrown unless a `mergeduplicates` functiom is provided.
+                  - `mergeduplicates` : defines what to do if `name` already exists in `df` and `makeunique`
+                    is false. It should be given a Function that combines the values of all of the duplicated

Member

bkamins Nov 11, 2023

Suggested change

      
-                    is false. It should be given a Function that combines the values of all of the duplicated
+                    is false. It should be given a `Function` that combines the values of all of the duplicated

bkamins reviewed

src/abstractdataframe/abstractdataframe.jl

+                    if it is `false` an error will be thrown unless a `mergeduplicates` functiom is provided.
+                  - `mergeduplicates` : defines what to do if `name` already exists in `df` and `makeunique`
+                    is false. It should be given a Function that combines the values of all of the duplicated
+                    columns which will be passed as a varargs. The return value is used.

Member

bkamins Nov 11, 2023

it is not clear if the passed function takes elements of the columns iteratively or whole columns.

Member

bkamins Nov 11, 2023

Also it is not clear how things are processed if multiple duplicate columns are provided.
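Both ambiguities can change the result, which is why the docstring should state the convention explicitly. The following plain-Julia sketch is not taken from the PR; the column values and the helper `avg` are made up for illustration.

```julia
# Whole columns versus elements: `coalesce` behaves differently in each case.
p = [1, missing]
q = [missing, 2]

whole_columns = coalesce(p, q)    # `p` itself is not `missing`, so `p` is returned as is
elementwise   = coalesce.(p, q)   # first non-missing value in each row

# Three duplicates: all at once versus two at a time.
a = [1.0, 2.0]
b = [3.0, 4.0]
c = [5.0, 12.0]

avg(xs...) = sum(xs) / length(xs)

# The function sees one element from every duplicate column in a single call.
all_at_once = map(avg, a, b, c)

# The function sees two columns at a time, as a recursive `hcat` would do it.
pairwise = foldl((x, y) -> map(avg, x, y), [a, b, c])
```

For an associative function such as `coalesce` the pairwise and all-at-once conventions agree, but for a function like `avg` they do not: folding two at a time weights the last column more heavily than the earlier ones.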

Member

bkamins commented Nov 11, 2023

I have not finished reviewing the PR. I will try to get back to it later.
