BUG: read_orc does not use the provided filesystem for all operations #58746

mjperrone · 2024-05-16T22:59:52Z

Pandas version checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

# first in one terminal, start a moto standalone server with `moto_server -p 5555`
import boto3
import os
import pandas
import pyarrow.fs


def test_pandas_read_orc():
    endpoint_port = f"5555"
    endpoint_uri = f"http://localhost:{endpoint_port}/"
    region = "us-east-1"
    os.environ["AWS_ACCESS_KEY_ID"] = "fake"
    os.environ["AWS_SECRET_ACCESS_KEY"] = "fake"
    os.environ["AWS_SECURITY_TOKEN"] = "fake"
    os.environ["AWS_SESSION_TOKEN"] = "fake"

    s3_resource = boto3.resource("s3", endpoint_url=endpoint_uri, region_name=region)
    bucket_name = "mybucket"
    s3_resource.Bucket(bucket_name).create()
    s3_resource.Bucket(bucket_name).upload_file(
        "userdata1.orc",
        "userdata1.orc",
    )
    filesystem = pyarrow.fs.S3FileSystem(endpoint_override=endpoint_uri, region=region)
    print(
        filesystem.get_file_info("mybucket/userdata1.orc")
    )  # outputs <FileInfo for 'mybucket/userdata1.orc': type=FileType.File, size=119367>,
    # proving the filesystem itself contacts the moto standalone server
    
    df = pandas.read_orc("s3://mybucket/userdata1.orc", filesystem=filesystem)
    # raises botocore.exceptions.ClientError: An error occurred (403) when calling the HeadObject operation: Forbidden


test_pandas_read_orc()

Issue Description

Pandas does not respect the filesystem given to .read_orc() when getting a handle for the file. This means if you provide a mocked s3 filesystem backend, pandas will bypass that and try to contact the real s3 backend, making unit tests with a mocked s3 impossible, and potentially dangerous!

Here is a sample ORC file which I had next to the test file to upload to the mock s3 server (remove the .zip file extension as github doesn't support uploading .orc files, but this is in fact an ORC file as is) for retrieval:
userdata1.orc.zip. Note that you can reproduce this with an invalid .orc file as the error happens before reading any ORC data.

Error produced:

Traceback (most recent call last):
  File ".venv/lib/python3.10/site-packages/s3fs/core.py", line 113, in _error_wrapper
    return await func(*args, **kwargs)
  File ".venv/lib/python3.10/site-packages/aiobotocore/client.py", line 409, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (403) when calling the HeadObject operation: Forbidden

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "test_s3_pandas_min.py", line 36, in <module>
    test_pandas_read_orc()
  File "test_s3_pandas_min.py", line 23, in test_pandas_read_orc
    df = pandas.read_orc("s3://mybucket/userdata1.orc", filesystem=filesystem)
  File ".venv/lib/python3.10/site-packages/pandas/io/orc.py", line 109, in read_orc
    with get_handle(path, "rb", is_text=False) as handles:
  File ".venv/lib/python3.10/site-packages/pandas/io/common.py", line 730, in get_handle
    ioargs = _get_filepath_or_buffer(
  File ".venv/lib/python3.10/site-packages/pandas/io/common.py", line 443, in _get_filepath_or_buffer
    ).open()
  File ".venv/lib/python3.10/site-packages/fsspec/core.py", line 135, in open
    return self.__enter__()
    ...

Expected Behavior

I would expect the .read_orc function to fully use the filesystem provided instead of trying to talk to the real s3, and succeed at reading the orc file.

My initial investigation

before the .read_table call happens, it is erroring at the get_handle() call with PermissionError('Forbidden') .get_handle() is not using the custom filesystem I provided, and read_table doesn't allow passing through storage_options (even though _get_filepath_or_buffer does accept that).

Installed Versions

INSTALLED VERSIONS

commit : d9cdd2e
python : 3.10.12.final.0
python-bits : 64
OS : Darwin
OS-release : 22.6.0
Version : Darwin Kernel Version 22.6.0: Mon Feb 19 19:43:13 PST 2024; root:xnu-8796.141.3.704.6~1/RELEASE_ARM64_T6020
machine : arm64
processor : arm
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.2.2
numpy : 1.26.4
pytz : 2024.1
dateutil : 2.9.0.post0
setuptools : 68.1.2
pip : 24.0
Cython : None
pytest : 7.4.1
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.9.7
jinja2 : 3.1.3
IPython : None
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : None
bottleneck : None
dataframe-api-compat : None
fastparquet : 2024.2.0
fsspec : 2024.3.1
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 16.1.0
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : 2024.3.1
scipy : None
sqlalchemy : 2.0.28
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None
None

The text was updated successfully, but these errors were encountered:

mjperrone · 2024-05-16T23:10:50Z

tagging @mroeschke as he implemented filesystem for orc and parquet.

@jorisvandenbossche as author of initial implementation of parquet

mjperrone added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels May 16, 2024

mjperrone changed the title ~~BUG: read_orc does not~~ BUG: read_orc does not use the provided filesystem for all operations May 16, 2024

mjperrone mentioned this issue May 16, 2024

DOC: Update orc.py to say orc instead of parquet #58744

Merged

5 tasks

rhshadrach added IO Network Local or Cloud (AWS, GCS, etc.) IO Issues and removed Needs Triage Issue that has not been reviewed by a pandas team member labels May 19, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: read_orc does not use the provided filesystem for all operations #58746

BUG: read_orc does not use the provided filesystem for all operations #58746

mjperrone commented May 16, 2024 •

edited

INSTALLED VERSIONS

mjperrone commented May 16, 2024

BUG: read_orc does not use the provided filesystem for all operations #58746

BUG: read_orc does not use the provided filesystem for all operations #58746

Comments

mjperrone commented May 16, 2024 • edited

Pandas version checks

Reproducible Example

Issue Description

Expected Behavior

My initial investigation

Installed Versions

INSTALLED VERSIONS

mjperrone commented May 16, 2024

mjperrone commented May 16, 2024 •

edited