DEPR: ExcelFile.parse #58247

rhshadrach · 2024-04-13T11:32:06Z

It seems to me we should either fix ExcelFile.parse or deprecate it entirely, and I lean toward the latter. pandas originally started out with just ExcelFile but now has the top-level read_excel. The signatures started the same, but now read_excel has gained and modified parameters that have not been added/changed in ExcelFile.parse. For example:

ExcelFile.parse lacks a dtype parameter
ExcelFile.parse has a **kwds argument that is passed on to pandas internals with no documentation on what can be included. Invalid arguments are just ignored (e.g. BUG: xl.parse index_col ignoring skiprows #50953)

It appears to me that pd.ExcelFile(...).parse(...) offers no advantage over pd.read_excel(pd.ExcelFile(...)), and so rather than fixing parse we can deprecate it and make it internal.

Edit: I no longer think deprecating ExcelFile entirely as mentioned below is a good option. See #58247 (comment).

Another option is to deprecate ExcelFile entirely. The one thing ExcelFile still provides that isn't available elsewhere is to get the underlying book or sheet_names without reading the entire file.

df = pd.DataFrame(np.zeros((100, 100)))
with pd.ExcelWriter("test.xlsx") as writer:
    for e in range(10):
        df.to_excel(writer, sheet_name=str(e))

%timeit pd.ExcelFile("test.xlsx").sheet_names
# 14.1 ms ± 76 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit pd.read_excel("test.xlsx", sheet_name=None)
# 411 ms ± 2.07 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

One can somewhat work around this by using nrows, but it's clunky.

%timeit pd.read_excel("test.xlsx", sheet_name=None, nrows=0).keys()
# 57.3 ms ± 257 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

The text was updated successfully, but these errors were encountered:

Aloqeely · 2024-04-13T16:59:34Z

Yes I agree we should deprecate ExcelFile.parse, no reason to fix it.

And I'm not against deprecating ExcelFile either.

rmhowe425 · 2024-04-14T01:08:05Z

+1 for deprecating ExcelFile and ExcelFile.parse

samukweku · 2024-04-14T20:42:01Z

To facilitate working with multiple sheets from the same file, the ExcelFile class can be used to wrap the file and can be passed into read_excel There will be a performance benefit for reading multiple sheets as the file is read into memory only once.

Deprecating Excel file.parse means this feature is lost. What would be the alternative @rhshadrach ?

asishm · 2024-04-14T22:55:50Z

@samukweku That is not lost if only pd.ExcelFile.parse is deprecated while pd.ExcelFile remains available (which is what I would favor as well).

rhshadrach · 2024-04-14T23:06:00Z

Deprecating Excel file.parse means this feature is lost. What would be the alternative @rhshadrach ?

pd.read_excel

samukweku · 2024-04-15T00:00:41Z

@asishm kindly explain how it is not lost? @rhshadrach wont the pd.read_excel option be less performant for multiple sheets( since with read_excel you read it more than once)?

rmhowe425 · 2024-04-15T01:54:49Z

@samukweku the read_excel() function does allow you to read in multiple sheets at once by explicitly specifying the worksheets to be read in as values for the sheet_name parameter. Or you can pass None as a value to read in all sheets.

Albeit, when reading in multiple worksheets you are returned a dict of DataFrames.

rhshadrach · 2024-04-15T03:04:41Z

@rhshadrach wont the pd.read_excel option be less performant for multiple sheets( since with read_excel you read it more than once)?

You can pass an ExcelFile instance to read_excel. It directly calls ExcelFile.parse internally.

asishm · 2024-04-20T10:33:01Z

As noted in the user guide, the usecase below can't be achieved without a re-read with the current api of pd.read_excel(fp-like)

The primary use-case for an ExcelFile is parsing multiple sheets with different parameters:

data = {}
# For when Sheet1's format differs from Sheet2
with pd.ExcelFile("path_to_file.xls") as xls:
    data["Sheet1"] = pd.read_excel(xls, "Sheet1", index_col=None, na_values=["NA"])
    data["Sheet2"] = pd.read_excel(xls, "Sheet2", index_col=1)

rhshadrach · 2024-04-20T10:37:29Z

Thanks @asishm - that has me convinced that ExcelFile itself should stay.

GAuravY19 · 2024-05-11T09:20:49Z

what is supposed to be done in this issue ??

Can you please clarify it properly.

rhshadrach added IO Excel read_excel, to_excel Deprecate Functionality to remove in pandas Needs Discussion Requires discussion from core team before further action labels Apr 13, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

DEPR: ExcelFile.parse #58247

DEPR: ExcelFile.parse #58247

rhshadrach commented Apr 13, 2024 •

edited

Aloqeely commented Apr 13, 2024

rmhowe425 commented Apr 14, 2024

samukweku commented Apr 14, 2024

asishm commented Apr 14, 2024

rhshadrach commented Apr 14, 2024

samukweku commented Apr 15, 2024

rmhowe425 commented Apr 15, 2024

rhshadrach commented Apr 15, 2024

asishm commented Apr 20, 2024

rhshadrach commented Apr 20, 2024

GAuravY19 commented May 11, 2024

DEPR: ExcelFile.parse #58247

DEPR: ExcelFile.parse #58247

Comments

rhshadrach commented Apr 13, 2024 • edited

Aloqeely commented Apr 13, 2024

rmhowe425 commented Apr 14, 2024

samukweku commented Apr 14, 2024

asishm commented Apr 14, 2024

rhshadrach commented Apr 14, 2024

samukweku commented Apr 15, 2024

rmhowe425 commented Apr 15, 2024

rhshadrach commented Apr 15, 2024

asishm commented Apr 20, 2024

rhshadrach commented Apr 20, 2024

GAuravY19 commented May 11, 2024

rhshadrach commented Apr 13, 2024 •

edited