Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Disambiguating and filling in headers similar to Pandas #1436

Open
imrehg opened this issue Mar 13, 2024 · 8 comments
Open

Disambiguating and filling in headers similar to Pandas #1436

imrehg opened this issue Mar 13, 2024 · 8 comments
Milestone

Comments

@imrehg
Copy link
Contributor

imrehg commented Mar 13, 2024

Is your feature request related to a problem? Please describe.

In the real world it is quite common to have spreadsheets which have non-unique column headers (often the case when the sheet is intended to be used by people rather than code). There are also cases when column headers might be missing, even if there are available data in the columns.

Currently these wouldn't work when running get_all_records (e.g before passing data into a Pandas DataFrame), as the tool expects unique headers in general, and expects unique headers even when the expectations are passed in.

Describe the solution you'd like

The Pandas' read_excel tool handles these cases, and does a reasonable job with it.

  • if encounters a column with the same name it already had (going from left to right), a counter is attached (e.g. a column called Duplicate at the first encounter, on subsequent ones it is Duplicate.1, Duplicate.2...
  • if encountering a header without a value, Pandas calls it Unnamed: 0, Unnamed: 1, ... so there's always a column name that can be used (and likely renamed)

Something similar for the two types of disambiguation could be useful.

Describe alternatives you've considered

I've considered using manually rolled worksheet.col_values(...) to get the individual columns and manipulate them as needed, rather than using get_all_records, though this would be a lot of manual shuffling.

Additional context

Happy to potentially contribute.

@alifeee
Copy link
Collaborator

alifeee commented Mar 13, 2024

Hi, thanks for the suggestion :)

We had a lot of feature creep with get_all_records near the end of v5, resulting in a simplification of the method. See #1367 and surrounding issues/PRs for more information.

Manually

To do this manually, you could get the titles and data with code similar to (where rename_unique() is a custom method that changes "data", "data", to "data", "data-1", for example):

header = spreadsheet.get("1:1")[0]
header_unique = rename_unique(header)
cells = spreadsheet.get("6:9")
some_records = utils.to_records(header, cells)

...as described in the v5 to v6 migration guide. See documentation for utils.to_records here.

As part of gspread

As I mentioned, we tried to limit the scope of get_all_records as it was becoming too complicated.

Your suggestion I would see as perhaps a new utils function (like rename_unique() above) in gspread/utils.py, a new kwarg in get_all_records here

allow_underscores_in_numeric_literals=False,
empty2zero=False,
) -> List[Dict[str, Union[int, float, str]]]:

        allow_underscores_in_numeric_literals=False,
        empty2zero=False,
+       rename_identical_headers=False,
    ) -> List[Dict[str, Union[int, float, str]]]:

and then the implementation somewhere like here

keys = entire_sheet[head - 1]
values = entire_sheet[head:]

        keys = entire_sheet[head - 1]
        values = entire_sheet[head:]
+
+       if rename_identical_headers is True:
+         keys = utils.rename_unique(keys)

Opinion

Personally, I think this change is "simple enough" that I would be happy for it to be added.

I will ask @lavigne958's opinion too, on the added complexity.

Implementation

I will not implement this. As you say

Happy to potentially contribute.

I am happy to help you contribute if you desire this feature :)

Let us wait for @lavigne958's opinion and then you can read the contributing guide and we can help if you are interested in implementing this feature

good day :)

@lavigne958
Copy link
Collaborator

Hi @imrehg thank you for this insight.

I really like the idea. It could solve our problem and offer the users some flexibility.

When I think about it, the major problem is:

  • when building the resulting dict, we must have unique keys.

Then from your proposal and the background of issues we faced, we don't actually need the user to change the data nor skip some columns of data too. We need to actually create a unique name if it's not the case.

What I propose as a flexible solution to this is:

  • we rename any duplicated header with the suffix
  • we name any empty header
  • we add a new argument which holds the default pattern for the duplicate header
    • so the user can actually override the value to what they want
    • it could look something like: .{} where the {} are replaced with an incremental number
  • we add a new argument with the default value for empty header
    • again so the user can replace it with what they want
    • we apply the same rule of appending the templated string for duplicate empty headers.

In the end a user can choose to have .1 etc or -1 or anything as suffix for duplicate and any name for empty header.

As mentioned by @alifeee if you need any help contributing we are happy to help 🙃

I agree as well it will be useful to put it all in a single function in utile so it could benefit everyone.

@lavigne958
Copy link
Collaborator

Hi @imrehg let us know if you want us to make this feature or if you'd rather contribute, if so we can guide you too.

@imrehg
Copy link
Contributor Author

imrehg commented Mar 28, 2024

Hey @lavigne958 @alifeee, sorry that I've missed the original repies.
Thanks for the detailed response, all this context makes sense (and I do appreciate scope creep and all that!)

Yes, I'm still happy to contribute. Your messages have quite a bit of details to start off on, and I'll go through that first, and look at making the changes in those spirit.
If you think there's anything useful that I should know, I'm happy to hear!

@lavigne958
Copy link
Collaborator

The most useful link for you will be: contributing guide 😁

@alifeee alifeee added this to the 6.2.0 milestone May 10, 2024
@lavigne958
Copy link
Collaborator

Hey @lavigne958 @alifeee, sorry that I've missed the original repies. Thanks for the detailed response, all this context makes sense (and I do appreciate scope creep and all that!)

Yes, I'm still happy to contribute. Your messages have quite a bit of details to start off on, and I'll go through that first, and look at making the changes in those spirit. If you think there's anything useful that I should know, I'm happy to hear!

Hi @imrehg , just wondering if you are still interested in making the changes ?

@imrehg
Copy link
Contributor Author

imrehg commented May 18, 2024

Hey @lavigne958 definitely still interested, just life got in the way a bit, apologies! 😰 Gonna be looking at this soon, unless I'm bottlenecking you. In that case just let me know!

@lavigne958
Copy link
Collaborator

No problem we have other issues to solve for now, so far so good thanks 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants