with_context yields invalid DataFrame dimensions for the other DataFrame #16144

Open
2 tasks done
tharindurr opened this issue May 10, 2024 · 0 comments
Labels
bug (Something isn't working), needs triage (Awaiting prioritization by a maintainer), python (Related to Python Polars)

tharindurr commented May 10, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl

n_rows = 2485

cols = {
    "A": [i for i in range(n_rows)],
    "B": [i for i in range(n_rows)],
    "C": [i for i in range(n_rows)],
}
pl.DataFrame(cols).write_csv("simple.csv")

ldf_1 = pl.scan_csv("simple.csv")
ldf_2 = pl.scan_csv("simple.csv").select(pl.all().name.prefix("foo_"))

x = ldf_1.with_context(ldf_2).select(pl.col("foo_A"))

print("CSV Shape: ", ldf_2.collect().shape) # This prints (2485, 3)
print("foo_A column Shape after with_context: ", x.collect().shape) # This prints (29820, 1) but should print (2485, 1)


x = ldf_1.with_context(ldf_2).select(pl.all()).collect() # This is what I want to do, but throws a ComputeError

Log output

avg line length: 13.183594
std. dev. line length: 2.0509784
initial row estimate: 2578
no. of chunks: 12 processed by: 12 threads.
CSV Shape:  (2485, 3)
avg line length: 13.183594
std. dev. line length: 2.0509784
initial row estimate: 2578
no. of chunks: 12 processed by: 12 threads.
avg line length: 13.183594
std. dev. line length: 2.0509784
initial row estimate: 2578
no. of chunks: 12 processed by: 12 threads.
foo_A column Shape after with_context:  (29820, 1)
avg line length: 13.183594
std. dev. line length: 2.0509784
initial row estimate: 2578
no. of chunks: 12 processed by: 12 threads.
avg line length: 13.183594
std. dev. line length: 2.0509784
initial row estimate: 2578
no. of chunks: 12 processed by: 12 threads.
Traceback (most recent call last):
  File "/home/rr/work/rapyuta-datakit/test_bug.py", line 21, in <module>
    x = ldf_1.with_context(ldf_2).select(pl.all()).collect() # This is what I want to do, but throws a ComputeError
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rr/work/rapyuta-datakit/env/lib/python3.12/site-packages/polars/lazyframe/frame.py", line 1816, in collect
    return wrap_df(ldf.collect(callback))
                   ^^^^^^^^^^^^^^^^^^^^^
polars.exceptions.ComputeError: Series length 189 doesn't match the DataFrame height of 2485
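
A note on the numbers in the log, offered as a hint rather than a diagnosis: the inflated height of 29820 is an exact multiple of the expected 2485, and the multiplier equals the chunk count the CSV reader reports. The small check below uses only figures copied from the output above. The interpretation that rows from the context frame are repeated once per chunk is an assumption and has not been verified against the Polars source.

```python
# All three figures are copied from the log output above.
expected_height = 2485   # rows in simple.csv
observed_height = 29820  # height of foo_A after with_context
n_chunks = 12            # "no. of chunks: 12 processed by: 12 threads."

# If the observed height is a whole multiple of the expected height,
# the remainder is zero and the ratio is the duplication factor.
ratio, remainder = divmod(observed_height, expected_height)

print("duplication factor:", ratio)
print("remainder:", remainder)
print("factor equals chunk count:", ratio == n_chunks)
```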

Issue description

Latest working version: 0.20.19

The error occurs when the LazyFrames are constructed from a file with scan_csv, as in the example code.

It works fine when the LazyFrames are constructed directly in memory, as shown below.

ldf_1 = pl.LazyFrame(cols)
ldf_2 = pl.LazyFrame(cols).select(pl.all().name.prefix("foo_"))

Expected behavior

Both LazyFrames have 2485 rows, so the combined DataFrame should also have 2485 rows, and every column in it should have that same length.

Installed versions

--------Version info---------
Polars:               0.20.25
Index type:           UInt32
Platform:             Linux-6.0.12-76060012-generic-x86_64-with-glibc2.31
Python:               3.12.3 (main, Apr 27 2024, 19:00:26) [GCC 9.4.0]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               2024.3.1
gevent:               <not installed>
hvplot:               <not installed>
matplotlib:           <not installed>
nest_asyncio:         <not installed>
numpy:                <not installed>
openpyxl:             <not installed>
pandas:               <not installed>
pyarrow:              <not installed>
pydantic:             2.7.1
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           <not installed>
torch:                <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>
tharindurr added the bug, needs triage, and python labels on May 10, 2024
tharindurr changed the title from "with_context yields invalid DataFrame for the other DataFrame" to "with_context yields invalid DataFrame dimensions for the other DataFrame" on May 10, 2024