Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DEPR: ExcelFile.parse #58247

Open
rhshadrach opened this issue Apr 13, 2024 · 11 comments
Open

DEPR: ExcelFile.parse #58247

rhshadrach opened this issue Apr 13, 2024 · 11 comments
Labels
Deprecate Functionality to remove in pandas IO Excel read_excel, to_excel Needs Discussion Requires discussion from core team before further action

Comments

@rhshadrach
Copy link
Member

rhshadrach commented Apr 13, 2024

It seems to me we should either fix ExcelFile.parse or deprecate it entirely, and I lean toward the latter. pandas originally started out with just ExcelFile but now has the top-level read_excel. The signatures started the same, but now read_excel has gained and modified parameters that have not been added/changed in ExcelFile.parse. For example:

  • ExcelFile.parse lacks a dtype parameter
  • ExcelFile.parse has a **kwds argument that is passed on to pandas internals with no documentation on what can be included. Invalid arguments are just ignored (e.g. BUG: xl.parse index_col ignoring skiprows #50953)

It appears to me that pd.ExcelFile(...).parse(...) offers no advantage over pd.read_excel(pd.ExcelFile(...)), and so rather than fixing parse we can deprecate it and make it internal.

Edit: I no longer think deprecating ExcelFile entirely as mentioned below is a good option. See #58247 (comment).

Another option is to deprecate ExcelFile entirely. The one thing ExcelFile still provides that isn't available elsewhere is to get the underlying book or sheet_names without reading the entire file.

df = pd.DataFrame(np.zeros((100, 100)))
with pd.ExcelWriter("test.xlsx") as writer:
    for e in range(10):
        df.to_excel(writer, sheet_name=str(e))

%timeit pd.ExcelFile("test.xlsx").sheet_names
# 14.1 ms ± 76 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit pd.read_excel("test.xlsx", sheet_name=None)
# 411 ms ± 2.07 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

One can somewhat work around this by using nrows, but it's clunky.

%timeit pd.read_excel("test.xlsx", sheet_name=None, nrows=0).keys()
# 57.3 ms ± 257 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
@rhshadrach rhshadrach added IO Excel read_excel, to_excel Deprecate Functionality to remove in pandas Needs Discussion Requires discussion from core team before further action labels Apr 13, 2024
@Aloqeely
Copy link
Member

Yes I agree we should deprecate ExcelFile.parse, no reason to fix it.

And I'm not against deprecating ExcelFile either.

@rmhowe425
Copy link
Contributor

+1 for deprecating ExcelFile and ExcelFile.parse

@samukweku
Copy link
Contributor

To facilitate working with multiple sheets from the same file, the ExcelFile class can be used to wrap the file and can be passed into read_excel There will be a performance benefit for reading multiple sheets as the file is read into memory only once.

Deprecating Excel file.parse means this feature is lost. What would be the alternative @rhshadrach ?

@asishm
Copy link
Contributor

asishm commented Apr 14, 2024

@samukweku That is not lost if only pd.ExcelFile.parse is deprecated while pd.ExcelFile remains available (which is what I would favor as well).

@rhshadrach
Copy link
Member Author

Deprecating Excel file.parse means this feature is lost. What would be the alternative @rhshadrach ?

pd.read_excel

@samukweku
Copy link
Contributor

@asishm kindly explain how it is not lost? @rhshadrach wont the pd.read_excel option be less performant for multiple sheets( since with read_excel you read it more than once)?

@rmhowe425
Copy link
Contributor

@samukweku the read_excel() function does allow you to read in multiple sheets at once by explicitly specifying the worksheets to be read in as values for the sheet_name parameter. Or you can pass None as a value to read in all sheets.

Albeit, when reading in multiple worksheets you are returned a dict of DataFrames.

image

@rhshadrach
Copy link
Member Author

@rhshadrach wont the pd.read_excel option be less performant for multiple sheets( since with read_excel you read it more than once)?

You can pass an ExcelFile instance to read_excel. It directly calls ExcelFile.parse internally.

@asishm
Copy link
Contributor

asishm commented Apr 20, 2024

As noted in the user guide, the usecase below can't be achieved without a re-read with the current api of pd.read_excel(fp-like)

The primary use-case for an ExcelFile is parsing multiple sheets with different parameters:

data = {}
# For when Sheet1's format differs from Sheet2
with pd.ExcelFile("path_to_file.xls") as xls:
    data["Sheet1"] = pd.read_excel(xls, "Sheet1", index_col=None, na_values=["NA"])
    data["Sheet2"] = pd.read_excel(xls, "Sheet2", index_col=1)

@rhshadrach
Copy link
Member Author

Thanks @asishm - that has me convinced that ExcelFile itself should stay.

@GAuravY19
Copy link

what is supposed to be done in this issue ??

Can you please clarify it properly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Deprecate Functionality to remove in pandas IO Excel read_excel, to_excel Needs Discussion Requires discussion from core team before further action
Projects
None yet
Development

No branches or pull requests

6 participants