Disambiguating and filling in headers similar to Pandas #1436

imrehg · 2024-03-13T08:42:09Z

Is your feature request related to a problem? Please describe.

In the real world it is quite common to have spreadsheets which have non-unique column headers (often the case when the sheet is intended to be used by people rather than code). There are also cases when column headers might be missing, even if there are available data in the columns.

Currently these wouldn't work when running get_all_records (e.g before passing data into a Pandas DataFrame), as the tool expects unique headers in general, and expects unique headers even when the expectations are passed in.

Describe the solution you'd like

The Pandas' read_excel tool handles these cases, and does a reasonable job with it.

if encounters a column with the same name it already had (going from left to right), a counter is attached (e.g. a column called Duplicate at the first encounter, on subsequent ones it is Duplicate.1, Duplicate.2...
if encountering a header without a value, Pandas calls it Unnamed: 0, Unnamed: 1, ... so there's always a column name that can be used (and likely renamed)

Something similar for the two types of disambiguation could be useful.

Describe alternatives you've considered

I've considered using manually rolled worksheet.col_values(...) to get the individual columns and manipulate them as needed, rather than using get_all_records, though this would be a lot of manual shuffling.

Additional context

Happy to potentially contribute.

The text was updated successfully, but these errors were encountered:

alifeee · 2024-03-13T12:52:47Z

Hi, thanks for the suggestion :)

We had a lot of feature creep with get_all_records near the end of v5, resulting in a simplification of the method. See #1367 and surrounding issues/PRs for more information.

Manually

To do this manually, you could get the titles and data with code similar to (where rename_unique() is a custom method that changes "data", "data", to "data", "data-1", for example):

header = spreadsheet.get("1:1")[0]
header_unique = rename_unique(header)
cells = spreadsheet.get("6:9")
some_records = utils.to_records(header, cells)

...as described in the v5 to v6 migration guide. See documentation for utils.to_records here.

As part of `gspread`

As I mentioned, we tried to limit the scope of get_all_records as it was becoming too complicated.

Your suggestion I would see as perhaps a new utils function (like rename_unique() above) in gspread/utils.py, a new kwarg in get_all_records here

gspread/gspread/worksheet.py

Lines 500 to 502 in 8d55a6b

    
               allow_underscores_in_numeric_literals=False, 
        
               empty2zero=False, 
        
           ) -> List[Dict[str, Union[int, float, str]]]:

        allow_underscores_in_numeric_literals=False,
        empty2zero=False,
+       rename_identical_headers=False,
    ) -> List[Dict[str, Union[int, float, str]]]:

and then the implementation somewhere like here

gspread/gspread/worksheet.py

Lines 567 to 568 in 8d55a6b

    
           keys = entire_sheet[head - 1] 
        
           values = entire_sheet[head:]

        keys = entire_sheet[head - 1]
        values = entire_sheet[head:]
+
+       if rename_identical_headers is True:
+         keys = utils.rename_unique(keys)

Opinion

Personally, I think this change is "simple enough" that I would be happy for it to be added.

I will ask @lavigne958's opinion too, on the added complexity.

Implementation

I will not implement this. As you say

Happy to potentially contribute.

I am happy to help you contribute if you desire this feature :)

Let us wait for @lavigne958's opinion and then you can read the contributing guide and we can help if you are interested in implementing this feature

good day :)

lavigne958 · 2024-03-13T22:09:53Z

Hi @imrehg thank you for this insight.

I really like the idea. It could solve our problem and offer the users some flexibility.

When I think about it, the major problem is:

when building the resulting dict, we must have unique keys.

Then from your proposal and the background of issues we faced, we don't actually need the user to change the data nor skip some columns of data too. We need to actually create a unique name if it's not the case.

What I propose as a flexible solution to this is:

we rename any duplicated header with the suffix
we name any empty header
we add a new argument which holds the default pattern for the duplicate header
- so the user can actually override the value to what they want
- it could look something like: .{} where the {} are replaced with an incremental number
we add a new argument with the default value for empty header
- again so the user can replace it with what they want
- we apply the same rule of appending the templated string for duplicate empty headers.

In the end a user can choose to have .1 etc or -1 or anything as suffix for duplicate and any name for empty header.

As mentioned by @alifeee if you need any help contributing we are happy to help 🙃

I agree as well it will be useful to put it all in a single function in utile so it could benefit everyone.

lavigne958 · 2024-03-27T11:11:59Z

Hi @imrehg let us know if you want us to make this feature or if you'd rather contribute, if so we can guide you too.

imrehg · 2024-03-28T06:41:59Z

Hey @lavigne958 @alifeee, sorry that I've missed the original repies.
Thanks for the detailed response, all this context makes sense (and I do appreciate scope creep and all that!)

Yes, I'm still happy to contribute. Your messages have quite a bit of details to start off on, and I'll go through that first, and look at making the changes in those spirit.
If you think there's anything useful that I should know, I'm happy to hear!

lavigne958 · 2024-03-28T07:04:46Z

The most useful link for you will be: contributing guide 😁

lavigne958 · 2024-05-17T21:30:37Z

Hey @lavigne958 @alifeee, sorry that I've missed the original repies. Thanks for the detailed response, all this context makes sense (and I do appreciate scope creep and all that!)

Yes, I'm still happy to contribute. Your messages have quite a bit of details to start off on, and I'll go through that first, and look at making the changes in those spirit. If you think there's anything useful that I should know, I'm happy to hear!

Hi @imrehg , just wondering if you are still interested in making the changes ?

imrehg · 2024-05-18T12:02:05Z

Hey @lavigne958 definitely still interested, just life got in the way a bit, apologies! 😰 Gonna be looking at this soon, unless I'm bottlenecking you. In that case just let me know!

lavigne958 · 2024-05-18T13:19:13Z

No problem we have other issues to solve for now, so far so good thanks 👍

alifeee added the Feature Request label Mar 13, 2024

alifeee added this to the 6.2.0 milestone May 10, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Disambiguating and filling in headers similar to Pandas #1436

Disambiguating and filling in headers similar to Pandas #1436

imrehg commented Mar 13, 2024

alifeee commented Mar 13, 2024

lavigne958 commented Mar 13, 2024

lavigne958 commented Mar 27, 2024

imrehg commented Mar 28, 2024

lavigne958 commented Mar 28, 2024

lavigne958 commented May 17, 2024

imrehg commented May 18, 2024

lavigne958 commented May 18, 2024

Disambiguating and filling in headers similar to Pandas #1436

Disambiguating and filling in headers similar to Pandas #1436

Comments

imrehg commented Mar 13, 2024

alifeee commented Mar 13, 2024

Manually

As part of gspread

Opinion

Implementation

lavigne958 commented Mar 13, 2024

lavigne958 commented Mar 27, 2024

imrehg commented Mar 28, 2024

lavigne958 commented Mar 28, 2024

lavigne958 commented May 17, 2024

imrehg commented May 18, 2024

lavigne958 commented May 18, 2024

As part of `gspread`