Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

write_ipc overwrites value in previously loaded dataframe #16119

Closed
2 tasks done
sezanzeb opened this issue May 8, 2024 · 6 comments · Fixed by #16208
Closed
2 tasks done

write_ipc overwrites value in previously loaded dataframe #16119

sezanzeb opened this issue May 8, 2024 · 6 comments · Fixed by #16208
Assignees
Labels
accepted Ready for implementation python Related to Python Polars

Comments

@sezanzeb
Copy link

sezanzeb commented May 8, 2024

Checks

  • I have checked that this issue has not already been reported.

(I have searched for issues mentioning write_ipc)

Reproducible example

import random

import polars as pl


def create_testdata() -> pl.DataFrame:
    return pl.DataFrame(
        [
            {
                "value": random.random(),
            }
            for _ in range(4)
        ]
    )


test_data_1 = create_testdata()
test_data_2 = create_testdata()

test_data_1.write_ipc("test.arrow")

loaded_data = pl.read_ipc("test.arrow")

print("before", loaded_data)

test_data_2.write_ipc("test.arrow")

print("after", loaded_data)

Log output

before shape: (4, 1)
┌──────────┐
│ value    │
│ ---      │
│ f64      │
╞══════════╡
│ 0.566947 │
│ 0.896406 │
│ 0.218272 │
│ 0.831894 │
└──────────┘
after shape: (4, 1)
┌──────────┐
│ value    │
│ ---      │
│ f64      │
╞══════════╡
│ 0.672392 │
│ 0.592068 │
│ 0.979936 │
│ 0.93201  │
└──────────┘

Issue description

A previously loaded dataframe from a given file is changed, once the original file is overwritten

Expected behavior

For the loaded dataframe to be immutable

Installed versions

--------Version info---------
Polars:               0.20.25
Index type:           UInt32
Platform:             Linux-6.5.0-28-generic-x86_64-with-glibc2.38
Python:               3.11.6 (main, Oct  8 2023, 05:06:43) [GCC 13.2.0]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               <not installed>
gevent:               <not installed>
hvplot:               <not installed>
matplotlib:           <not installed>
nest_asyncio:         <not installed>
numpy:                1.26.4
openpyxl:             <not installed>
pandas:               2.1.4
pyarrow:              14.0.2
pydantic:             2.7.1
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           <not installed>
torch:                <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>
@sezanzeb sezanzeb added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels May 8, 2024
@sezanzeb
Copy link
Author

sezanzeb commented May 8, 2024

Using

loaded_data = pl.read_ipc("test.arrow", memory_map=False)

works

@sezanzeb
Copy link
Author

sezanzeb commented May 8, 2024

With this modification, the process can be crashed, by printing the dataframe after write_ipc:

import random

import polars as pl


def create_testdata(size: int) -> pl.DataFrame:
    return pl.DataFrame(
        [
            {
                "value": random.random(),
            }
            for _ in range(size)
        ]
    )


create_testdata(500).write_ipc("test.arrow")

loaded_data = pl.read_ipc("test.arrow")

print("before", loaded_data)

create_testdata(4).write_ipc("test.arrow")

print("after", loaded_data)

output:

before shape: (500, 1)
┌──────────┐
│ value    │
│ ---      │
│ f64      │
╞══════════╡
│ 0.570629 │
│ 0.713849 │
│ 0.295737 │
│ 0.998848 │
│ 0.865854 │
│ …        │
│ 0.564431 │
│ 0.48356  │
│ 0.284105 │
│ 0.011276 │
│ 0.597003 │
└──────────┘
after 
Process finished with exit code 135 (interrupted by signal 7:SIGBUS)

And if you reduce it from 500 to 20, you get all sorts of funny values in the output:

after shape: (20, 1)
┌─────────────┐
│ value       │
│ ---         │
│ f64         │
╞═════════════╡
│ 0.701536    │
│ 0.379791    │
│ 0.629545    │
│ 0.51221     │
│ 0.0         │
│ …           │
│ 3.0557e-312 │
│ 1.3581e-312 │
│ 0.0         │
│ NaN         │
│ 1.1126e-308 │
└─────────────┘

@ritchie46
Copy link
Member

You shouldn't write where you read. We should see if we can keep the file handle around to give you a proper error.

@ritchie46
Copy link
Member

Even if we hold a file handle, you can open a file in write mode, so if you want to write to the same file you should turn off memory mapping. I think we should add some docs about this.

@ritchie46 ritchie46 added invalid A bug report that is not actually a bug and removed bug Something isn't working invalid A bug report that is not actually a bug needs triage Awaiting prioritization by a maintainer labels May 13, 2024
@ritchie46
Copy link
Member

It is a consequence of memory mapping. Don't memory map if you want to write to the same file.

@sezanzeb
Copy link
Author

sezanzeb commented May 13, 2024

It is a consequence of memory mapping. Don't memory map if you want to write to the same file.

Yeah, I figured that much.

I think we should add some docs about this.

Sounds good

We should see if we can keep the file handle around to give you a proper error.

Even better

You shouldn't write where you read.

If one is handling cached data in an event-based application there could be easily a race-condition, in which the cache is updated after the cache has been read by another service.

So either some file-locking using semaphores is engineered in a cache-service that handles the file, or memory map should be turned off. But I don't feel like "not writing where you read" can be a general rule for things.

Anyhow, thanks for looking into this!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
accepted Ready for implementation python Related to Python Polars
Projects
Status: Done
Development

Successfully merging a pull request may close this issue.

3 participants