Segmentation fault when running a join between pyarrow.RecordBatchReader and pyarrow.Table
#12133
Comments
I suspect this is because the RecordBatchReader is destructive. This is a known issue and currently hard to detect, but we're working on a fix. |
This might be a problem in pyarrow:
import pyarrow as pa
tbl = pa.Table.from_pydict({"x": [11, 12], "y": ["c", "d"]})
t = pa.RecordBatchReader(tbl.to_batches())
print(t.read_all()) |
I've opened a bug report in the arrow repository here: apache/arrow#41758.
Closing this as it does not seem to be caused by DuckDB itself. |
This seems to be using the wrong syntax for creating a record batch reader. Here's the correct one:
import duckdb
import pyarrow as pa
d = duckdb.connect()
tbl = pa.Table.from_pydict({"x": [11, 12], "y": ["c", "d"]})
t = pa.RecordBatchReader.from_batches(tbl.schema, tbl.to_batches())
s = pa.Table.from_pydict({"x": [11, 12], "y": ["c", "d"]})
print(d.execute("SELECT * FROM s, t WHERE t.x = s.x").arrow())
# pyarrow.Table
# x: int64
# y: string
# x: int64
# y: string
# ----
# x: [[11,12]]
# y: [["c","d"]]
# x: [[11,12]]
# y: [["c","d"]] |
Many thanks @Mytherin! |
What happens?
DuckDB fails with a segmentation fault when running a join between pyarrow.RecordBatchReader and pyarrow.Table.
Joining pyarrow.Table and pyarrow.Table does work.
To Reproduce
OS:
Ubuntu 24.04 x64
DuckDB Version:
0.10.2
DuckDB Client:
Python 3.12
Full Name:
Roman Zeyde
Affiliation:
VAST Data
What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.
I have tested with a stable release
Did you include all relevant data sets for reproducing the issue?
Not applicable - the reproduction does not require a data set
Did you include all code required to reproduce the issue?
Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?