`with_context` yields invalid `DataFrame` dimensions for the `other` `DataFrame` #16144

tharindurr · 2024-05-10T01:48:37Z

Checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl

n_rows = 2485

cols = {
    "A": list(range(n_rows)),
    "B": list(range(n_rows)),
    "C": list(range(n_rows)),
}
pl.DataFrame(cols).write_csv("simple.csv")

ldf_1 = pl.scan_csv("simple.csv")
ldf_2 = pl.scan_csv("simple.csv").select(pl.all().name.prefix("foo_"))

x = ldf_1.with_context(ldf_2).select(pl.col("foo_A"))

print("CSV Shape: ", ldf_2.collect().shape) # This prints (2485, 3)
print("foo_A column Shape after with_context: ", x.collect().shape) # This prints (29820, 1) but should print (2485, 1)


x = ldf_1.with_context(ldf_2).select(pl.all()).collect() # This is what I want to do, but throws a ComputeError

Log output

avg line length: 13.183594
std. dev. line length: 2.0509784
initial row estimate: 2578
no. of chunks: 12 processed by: 12 threads.
CSV Shape:  (2485, 3)
avg line length: 13.183594
std. dev. line length: 2.0509784
initial row estimate: 2578
no. of chunks: 12 processed by: 12 threads.
avg line length: 13.183594
std. dev. line length: 2.0509784
initial row estimate: 2578
no. of chunks: 12 processed by: 12 threads.
foo_A column Shape after with_context:  (29820, 1)
avg line length: 13.183594
std. dev. line length: 2.0509784
initial row estimate: 2578
no. of chunks: 12 processed by: 12 threads.
avg line length: 13.183594
std. dev. line length: 2.0509784
initial row estimate: 2578
no. of chunks: 12 processed by: 12 threads.
Traceback (most recent call last):
  File "/home/rr/work/rapyuta-datakit/test_bug.py", line 21, in <module>
    x = ldf_1.with_context(ldf_2).select(pl.all()).collect() # This is what I want to do, but throws a ComputeError
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rr/work/rapyuta-datakit/env/lib/python3.12/site-packages/polars/lazyframe/frame.py", line 1816, in collect
    return wrap_df(ldf.collect(callback))
                   ^^^^^^^^^^^^^^^^^^^^^
polars.exceptions.ComputeError: Series length 189 doesn't match the DataFrame height of 2485

Issue description

Latest working version: 0.20.19

The error occurs when the LazyFrames are constructed from a file, as in the example code above.

It works fine when the LazyFrame is constructed in memory directly, as shown below.

ldf_1 = pl.LazyFrame(cols)
ldf_2 = pl.LazyFrame(cols).select(pl.all().name.prefix("foo_"))
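
A possible clue, not a confirmed diagnosis: the wrong height reported above (29820) is an exact multiple of the real height (2485), and the multiplier matches the "no. of chunks: 12" line in the log output. This might mean the `other` frame is materialized once per CSV chunk, but that is only a guess based on the numbers. The small check below uses nothing but figures copied from this report; the variable names are mine.

```python
# Figures copied from the log output in this report.
n_rows = 2485          # rows written to simple.csv
observed_rows = 29820  # height of foo_A after with_context
n_chunks = 12          # "no. of chunks: 12" in the CSV reader log

# If the observed height is a whole multiple of the real height,
# the remainder is zero and the factor tells us how many copies appeared.
factor, remainder = divmod(observed_rows, n_rows)

print(factor, remainder)   # 12 0
print(factor == n_chunks)  # True
```

The in-memory `pl.LazyFrame(cols)` variant does not go through the chunked CSV reader, which would be consistent with it returning the correct shape.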

Expected behavior

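The combined DataFrame should have the same number of rows as the inputs (2485 in this example), and a column selected from the `other` frame should keep that frame's height.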

Installed versions

--------Version info---------
Polars:               0.20.25
Index type:           UInt32
Platform:             Linux-6.0.12-76060012-generic-x86_64-with-glibc2.31
Python:               3.12.3 (main, Apr 27 2024, 19:00:26) [GCC 9.4.0]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               2024.3.1
gevent:               <not installed>
hvplot:               <not installed>
matplotlib:           <not installed>
nest_asyncio:         <not installed>
numpy:                <not installed>
openpyxl:             <not installed>
pandas:               <not installed>
pyarrow:              <not installed>
pydantic:             2.7.1
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           <not installed>
torch:                <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>

tharindurr added the labels bug (Something isn't working), needs triage (Awaiting prioritization by a maintainer), and python (Related to Python Polars) on May 10, 2024

tharindurr changed the title ~~with_context yields invalid DataFrame for the other DataFrame~~ with_context yields invalid DataFrame dimensions for the other DataFrame May 10, 2024
