You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
# first in one terminal, start a moto standalone server with `moto_server -p 5555`importboto3importosimportpandasimportpyarrow.fsdeftest_pandas_read_orc():
endpoint_port=f"5555"endpoint_uri=f"http://localhost:{endpoint_port}/"region="us-east-1"os.environ["AWS_ACCESS_KEY_ID"] ="fake"os.environ["AWS_SECRET_ACCESS_KEY"] ="fake"os.environ["AWS_SECURITY_TOKEN"] ="fake"os.environ["AWS_SESSION_TOKEN"] ="fake"s3_resource=boto3.resource("s3", endpoint_url=endpoint_uri, region_name=region)
bucket_name="mybucket"s3_resource.Bucket(bucket_name).create()
s3_resource.Bucket(bucket_name).upload_file(
"userdata1.orc",
"userdata1.orc",
)
filesystem=pyarrow.fs.S3FileSystem(endpoint_override=endpoint_uri, region=region)
print(
filesystem.get_file_info("mybucket/userdata1.orc")
) # outputs <FileInfo for 'mybucket/userdata1.orc': type=FileType.File, size=119367>,# proving the filesystem itself contacts the moto standalone serverdf=pandas.read_orc("s3://mybucket/userdata1.orc", filesystem=filesystem)
# raises botocore.exceptions.ClientError: An error occurred (403) when calling the HeadObject operation: Forbiddentest_pandas_read_orc()
Issue Description
Pandas does not respect the filesystem given to .read_orc() when getting a handle for the file. This means if you provide a mocked s3 filesystem backend, pandas will bypass that and try to contact the real s3 backend, making unit tests with a mocked s3 impossible, and potentially dangerous!
Here is a sample ORC file which I had next to the test file to upload to the mock s3 server (remove the .zip file extension as github doesn't support uploading .orc files, but this is in fact an ORC file as is) for retrieval: userdata1.orc.zip. Note that you can reproduce this with an invalid .orc file as the error happens before reading any ORC data.
Error produced:
Traceback (most recent call last):
File ".venv/lib/python3.10/site-packages/s3fs/core.py", line 113, in _error_wrapper
return await func(*args, **kwargs)
File ".venv/lib/python3.10/site-packages/aiobotocore/client.py", line 409, in _make_api_call
raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (403) when calling the HeadObject operation: Forbidden
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "test_s3_pandas_min.py", line 36, in <module>
test_pandas_read_orc()
File "test_s3_pandas_min.py", line 23, in test_pandas_read_orc
df = pandas.read_orc("s3://mybucket/userdata1.orc", filesystem=filesystem)
File ".venv/lib/python3.10/site-packages/pandas/io/orc.py", line 109, in read_orc
with get_handle(path, "rb", is_text=False) as handles:
File ".venv/lib/python3.10/site-packages/pandas/io/common.py", line 730, in get_handle
ioargs = _get_filepath_or_buffer(
File ".venv/lib/python3.10/site-packages/pandas/io/common.py", line 443, in _get_filepath_or_buffer
).open()
File ".venv/lib/python3.10/site-packages/fsspec/core.py", line 135, in open
return self.__enter__()
...
Expected Behavior
I would expect the .read_orc function to fully use the filesystem provided instead of trying to talk to the real s3, and succeed at reading the orc file.
My initial investigation
before the .read_table call happens, it is erroring at the get_handle() call with PermissionError('Forbidden') .get_handle() is not using the custom filesystem I provided, and read_table doesn't allow passing through storage_options (even though _get_filepath_or_buffer does accept that).
Installed Versions
INSTALLED VERSIONS
commit : d9cdd2e
python : 3.10.12.final.0
python-bits : 64
OS : Darwin
OS-release : 22.6.0
Version : Darwin Kernel Version 22.6.0: Mon Feb 19 19:43:13 PST 2024; root:xnu-8796.141.3.704.6~1/RELEASE_ARM64_T6020
machine : arm64
processor : arm
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
Pandas version checks
Reproducible Example
Issue Description
Pandas does not respect the filesystem given to
.read_orc()
when getting a handle for the file. This means if you provide a mocked s3 filesystem backend, pandas will bypass that and try to contact the real s3 backend, making unit tests with a mocked s3 impossible, and potentially dangerous!Here is a sample ORC file which I had next to the test file to upload to the mock s3 server (remove the
.zip
file extension as github doesn't support uploading.orc
files, but this is in fact an ORC file as is) for retrieval:userdata1.orc.zip. Note that you can reproduce this with an invalid
.orc
file as the error happens before reading any ORC data.Error produced:
Expected Behavior
I would expect the
.read_orc
function to fully use the filesystem provided instead of trying to talk to the real s3, and succeed at reading the orc file.My initial investigation
before the .read_table call happens, it is erroring at the get_handle() call with PermissionError('Forbidden') .get_handle() is not using the custom filesystem I provided, and read_table doesn't allow passing through storage_options (even though _get_filepath_or_buffer does accept that).
Installed Versions
INSTALLED VERSIONS
commit : d9cdd2e
python : 3.10.12.final.0
python-bits : 64
OS : Darwin
OS-release : 22.6.0
Version : Darwin Kernel Version 22.6.0: Mon Feb 19 19:43:13 PST 2024; root:xnu-8796.141.3.704.6~1/RELEASE_ARM64_T6020
machine : arm64
processor : arm
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 2.2.2
numpy : 1.26.4
pytz : 2024.1
dateutil : 2.9.0.post0
setuptools : 68.1.2
pip : 24.0
Cython : None
pytest : 7.4.1
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.9.7
jinja2 : 3.1.3
IPython : None
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : None
bottleneck : None
dataframe-api-compat : None
fastparquet : 2024.2.0
fsspec : 2024.3.1
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 16.1.0
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : 2024.3.1
scipy : None
sqlalchemy : 2.0.28
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None
None
The text was updated successfully, but these errors were encountered: