BUG: read_orc does not use the provided filesystem for all operations #58746

Open
3 tasks done
mjperrone opened this issue May 16, 2024 · 1 comment
Labels
Bug IO Network Local or Cloud (AWS, GCS, etc.) IO Issues

Comments

@mjperrone
Contributor

mjperrone commented May 16, 2024

Pandas version checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of pandas.
  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

# first in one terminal, start a moto standalone server with `moto_server -p 5555`
import boto3
import os
import pandas
import pyarrow.fs


def test_pandas_read_orc():
    endpoint_port = "5555"  # must match the port passed to moto_server
    endpoint_uri = f"http://localhost:{endpoint_port}/"
    region = "us-east-1"
    os.environ["AWS_ACCESS_KEY_ID"] = "fake"
    os.environ["AWS_SECRET_ACCESS_KEY"] = "fake"
    os.environ["AWS_SECURITY_TOKEN"] = "fake"
    os.environ["AWS_SESSION_TOKEN"] = "fake"

    s3_resource = boto3.resource("s3", endpoint_url=endpoint_uri, region_name=region)
    bucket_name = "mybucket"
    s3_resource.Bucket(bucket_name).create()
    s3_resource.Bucket(bucket_name).upload_file(
        "userdata1.orc",
        "userdata1.orc",
    )
    filesystem = pyarrow.fs.S3FileSystem(endpoint_override=endpoint_uri, region=region)
    print(
        filesystem.get_file_info("mybucket/userdata1.orc")
    )  # outputs <FileInfo for 'mybucket/userdata1.orc': type=FileType.File, size=119367>,
    # proving the filesystem itself contacts the moto standalone server
    
    df = pandas.read_orc("s3://mybucket/userdata1.orc", filesystem=filesystem)
    # raises botocore.exceptions.ClientError: An error occurred (403) when calling the HeadObject operation: Forbidden


test_pandas_read_orc()

Issue Description

pandas does not use the filesystem passed to .read_orc() when it opens a handle for the file. If you provide a mocked S3 filesystem backend, pandas bypasses it and tries to contact the real S3 backend, which makes unit tests against a mocked S3 impossible, and potentially dangerous!

Here is the sample ORC file that I kept next to the test file and uploaded to the mock S3 server: userdata1.orc.zip. It is a valid ORC file as is; remove the .zip extension, which is there only because GitHub doesn't support uploading .orc files. Note that you can also reproduce this with an invalid .orc file, since the error happens before any ORC data is read.

Error produced:

Traceback (most recent call last):
  File ".venv/lib/python3.10/site-packages/s3fs/core.py", line 113, in _error_wrapper
    return await func(*args, **kwargs)
  File ".venv/lib/python3.10/site-packages/aiobotocore/client.py", line 409, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (403) when calling the HeadObject operation: Forbidden

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "test_s3_pandas_min.py", line 36, in <module>
    test_pandas_read_orc()
  File "test_s3_pandas_min.py", line 23, in test_pandas_read_orc
    df = pandas.read_orc("s3://mybucket/userdata1.orc", filesystem=filesystem)
  File ".venv/lib/python3.10/site-packages/pandas/io/orc.py", line 109, in read_orc
    with get_handle(path, "rb", is_text=False) as handles:
  File ".venv/lib/python3.10/site-packages/pandas/io/common.py", line 730, in get_handle
    ioargs = _get_filepath_or_buffer(
  File ".venv/lib/python3.10/site-packages/pandas/io/common.py", line 443, in _get_filepath_or_buffer
    ).open()
  File ".venv/lib/python3.10/site-packages/fsspec/core.py", line 135, in open
    return self.__enter__()
    ...

Expected Behavior

I would expect .read_orc to use the provided filesystem for every operation instead of trying to talk to the real S3, and to succeed at reading the ORC file.
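The expected control flow can be sketched with stand-ins. This is a minimal illustration, not pandas code: `RecordingFileSystem`, `open_source`, and `fallback_open` are hypothetical names, and the only assumption borrowed from pyarrow is a filesystem object exposing `open_input_file(path)`.

```python
import io


class RecordingFileSystem:
    """Hypothetical in-memory filesystem that records every path it opens."""

    def __init__(self, files):
        self.files = files  # mapping of path -> bytes
        self.opened = []

    def open_input_file(self, path):
        self.opened.append(path)
        return io.BytesIO(self.files[path])


def open_source(path, filesystem=None, fallback_open=None):
    """Open `path` via `filesystem` when one is given; otherwise use the fallback."""
    if filesystem is not None:
        # A caller-supplied filesystem takes precedence, so no other
        # backend (for example the real S3) is ever contacted.
        return filesystem.open_input_file(path)
    if fallback_open is None:
        # Default fallback: a plain local binary open.
        fallback_open = lambda p: open(p, "rb")
    return fallback_open(path)
```

A unit test against a mocked backend could then pass a fallback that raises and assert it was never invoked while the file contents still come back from the supplied filesystem.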

My initial investigation

Before the .read_table call happens, it errors at the get_handle() call with PermissionError('Forbidden'). get_handle() is not using the custom filesystem I provided, and read_table doesn't allow passing through storage_options (even though _get_filepath_or_buffer does accept that).

Installed Versions

INSTALLED VERSIONS

commit : d9cdd2e
python : 3.10.12.final.0
python-bits : 64
OS : Darwin
OS-release : 22.6.0
Version : Darwin Kernel Version 22.6.0: Mon Feb 19 19:43:13 PST 2024; root:xnu-8796.141.3.704.6~1/RELEASE_ARM64_T6020
machine : arm64
processor : arm
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.2.2
numpy : 1.26.4
pytz : 2024.1
dateutil : 2.9.0.post0
setuptools : 68.1.2
pip : 24.0
Cython : None
pytest : 7.4.1
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.9.7
jinja2 : 3.1.3
IPython : None
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : None
bottleneck : None
dataframe-api-compat : None
fastparquet : 2024.2.0
fsspec : 2024.3.1
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 16.1.0
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : 2024.3.1
scipy : None
sqlalchemy : 2.0.28
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None
None

@mjperrone mjperrone added the Bug and Needs Triage (Issue that has not been reviewed by a pandas team member) labels May 16, 2024
@mjperrone mjperrone changed the title from "BUG: read_orc does not" to "BUG: read_orc does not use the provided filesystem for all operations" May 16, 2024
@mjperrone
Contributor Author

Tagging @mroeschke, as he implemented filesystem for ORC and Parquet,

and @jorisvandenbossche, as the author of the initial Parquet implementation.

@rhshadrach rhshadrach added the IO Network (Local or Cloud (AWS, GCS, etc.) IO Issues) label and removed the Needs Triage (Issue that has not been reviewed by a pandas team member) label May 19, 2024