PanicException when slicing a LazyFrame streaming from globbed CSV #16163

Closed
2 tasks done
riley-harper opened this issue May 10, 2024 · 1 comment · Fixed by #16174
Labels
bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars

Comments

@riley-harper
Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

from pathlib import Path
import polars as pl

csv_dir = Path("./test.csv")
csv_dir.mkdir()
df = pl.DataFrame({"A": [1, 2, 3]})
df.write_csv(csv_dir / "test-1.csv")

lf = pl.scan_csv(csv_dir / "*.csv")

# Setting streaming=True causes a panic here
lf.slice(0, 4).collect(streaming=True)

The exception traceback (with RUST_BACKTRACE=1) is

thread '<unnamed>' panicked at /home/runner/work/polars/polars/crates/polars-utils/src/arena.rs:82:31:
called `Option::unwrap()` on a `None` value
stack backtrace:
   0: rust_begin_unwind
   1: core::panicking::panic_fmt
   2: core::panicking::panic
   3: core::option::unwrap_failed
   4: polars_pipe::pipeline::convert::get_sink
   5: <core::iter::adapters::GenericShunt<I,R> as core::iter::traits::iterator::Iterator>::next
   6: polars_lazy::physical_plan::streaming::convert_alp::insert_streaming_nodes
   7: polars_lazy::frame::LazyFrame::optimize_with_scratch
   8: polars_lazy::frame::LazyFrame::collect
   9: polars::lazyframe::PyLazyFrame::__pymethod_collect__
  10: pyo3::impl_::trampoline::trampoline
  11: polars::lazyframe::_::__INVENTORY::trampoline
  12: method_vectorcall_VARARGS_KEYWORDS
             at /usr/local/src/conda/python-3.12.3/Objects/descrobject.c:365:14
  13: _PyObject_VectorcallTstate
             at /usr/local/src/conda/python-3.12.3/Include/internal/pycore_call.h:92:11
  14: PyObject_Vectorcall
             at /usr/local/src/conda/python-3.12.3/Objects/call.c:325:12
  15: _PyEval_EvalFrameDefault
             at /home/conda/feedstock_root/build_artifacts/python-split_1713204800955/work/build-static/Python/bytecodes.c:2706:19
  16: PyEval_EvalCode
             at /usr/local/src/conda/python-3.12.3/Python/ceval.c:578:21
  17: run_eval_code_obj
             at /usr/local/src/conda/python-3.12.3/Python/pythonrun.c:1722
  18: run_mod
             at /usr/local/src/conda/python-3.12.3/Python/pythonrun.c:1743
  19: PyRun_InteractiveOneObjectEx
             at /usr/local/src/conda/python-3.12.3/Python/pythonrun.c:260
  20: _PyRun_InteractiveLoopObject
             at /usr/local/src/conda/python-3.12.3/Python/pythonrun.c:137
  21: _PyRun_AnyFileObject
             at /usr/local/src/conda/python-3.12.3/Python/pythonrun.c:72
  22: PyRun_AnyFileExFlags
             at /usr/local/src/conda/python-3.12.3/Python/pythonrun.c:104
  23: pymain_run_stdin
             at /usr/local/src/conda/python-3.12.3/Modules/main.c:520
  24: pymain_run_python
             at /usr/local/src/conda/python-3.12.3/Modules/main.c:632
  25: Py_RunMain
             at /usr/local/src/conda/python-3.12.3/Modules/main.c:709
  26: Py_BytesMain
             at /usr/local/src/conda/python-3.12.3/Modules/main.c:763:12
  27: __libc_start_call_main
             at ./csu/../sysdeps/nptl/libc_start_call_main.h:58:16
  28: __libc_start_main_impl
             at ./csu/../csu/libc-start.c:392:3
  29: <unknown>
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/rileyh/micromamba/envs/rileyh_test/lib/python3.12/site-packages/polars/lazyframe/frame.py", line 1816, in collect
    return wrap_df(ldf.collect(callback))
                   ^^^^^^^^^^^^^^^^^^^^^
pyo3_runtime.PanicException: called `Option::unwrap()` on a `None` value

Log output

thread '<unnamed>' panicked at /home/runner/work/polars/polars/crates/polars-utils/src/arena.rs:82:31:
called `Option::unwrap()` on a `None` value
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/rileyh/micromamba/envs/rileyh_test/lib/python3.12/site-packages/polars/lazyframe/frame.py", line 1816, in collect
    return wrap_df(ldf.collect(callback))
                   ^^^^^^^^^^^^^^^^^^^^^
pyo3_runtime.PanicException: called `Option::unwrap()` on a `None` value

Issue description

When I combine scan_csv() on a *.csv glob, LazyFrame.slice(), and collect(streaming=True), I get a PanicException. If I set streaming=False, or don't call LazyFrame.slice() before collecting, I get the expected result instead of a panic.

Expected behavior

I would expect that the result with streaming=True would be the same as with streaming=False, which is a DataFrame that looks like

shape: (3, 1)
┌─────┐
│ A   │
│ --- │
│ i64 │
╞═════╡
│ 1   │
│ 2   │
│ 3   │
└─────┘

Installed versions

--------Version info---------
Polars:               0.20.25
Index type:           UInt32
Platform:             Linux-5.15.0-105-generic-x86_64-with-glibc2.35
Python:               3.12.3 | packaged by conda-forge | (main, Apr 15 2024, 18:38:13) [GCC 12.3.0]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               <not installed>
gevent:               <not installed>
hvplot:               <not installed>
matplotlib:           <not installed>
nest_asyncio:         <not installed>
numpy:                <not installed>
openpyxl:             <not installed>
pandas:               <not installed>
pyarrow:              <not installed>
pydantic:             <not installed>
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           <not installed>
torch:                <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>
@riley-harper riley-harper added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels May 10, 2024
@cmdlineluser
Contributor

cmdlineluser commented May 10, 2024

Can reproduce.

The problem seems to be with slice_pushdown on the streaming engine; disabling it avoids the panic:

>>> lf.slice(0, 4).collect(streaming=True, slice_pushdown=False)
shape: (3, 1)
┌─────┐
│ A   │
│ --- │
│ i64 │
╞═════╡
│ 1   │
│ 2   │
│ 3   │
└─────┘

Update: It seems to be specific to the frame-level method (LazyFrame.slice); Expr.slice is ok:

>>> lf.select(pl.all().slice(0, 4)).collect(streaming=True)
shape: (3, 1)
┌─────┐
│ A   │
│ --- │
│ i64 │
╞═════╡
│ 1   │
│ 2   │
│ 3   │
└─────┘
