BUG: cannot use `np.save` and `allow_pickle=True` with data larger than 4 GB #26224
Comments
Just call `dump` with the protocol you need.
That line is in the numpy source, not my code. I have no way (that I know of) to change it without editing the source code, which is undesirable.
It is just the default; all you need to do to override it is specify it in the call.
Thank you @charris for your time, but I'm sorry, I don't understand. The full stack trace is:

```
File "/data2/khood/GitHub/MLAudio/convertDataToNumpy.py", line 40, in <module>
    np.save('neuroTrain.npy', neuroTrain)
File "/home/khood/anaconda3/envs/mlaudio/lib/python3.12/site-packages/numpy/lib/npyio.py", line 546, in save
    format.write_array(fid, arr, allow_pickle=allow_pickle,
File "/home/khood/anaconda3/envs/mlaudio/lib/python3.12/site-packages/numpy/lib/format.py", line 719, in write_array
    pickle.dump(array, fp, protocol=3, **pickle_kwargs)
OverflowError: serializing a bytes object larger than 4 GiB requires pickle protocol 4 or higher
```

If there is a way to tell numpy to use the newer pickle version, please let me know!
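(For anyone hitting this before a fix lands, the snippet below is a hypothetical workaround, not a numpy API: it writes the same NPY header that `numpy.lib.format.write_array` would, then pickles the payload itself with protocol 4, mirroring the object-array branch of `write_array`.)

```python
import pickle

import numpy as np
from numpy.lib import format as npy_format


def save_object_array(path, arr, protocol=4):
    # Hypothetical workaround: write the standard NPY header, then pickle
    # the payload ourselves with a newer protocol than the hardcoded 3.
    arr = np.asarray(arr)
    with open(path, "wb") as fp:
        # write_array_header_1_0 emits the NPY magic string plus header dict.
        npy_format.write_array_header_1_0(
            fp, npy_format.header_data_from_array_1_0(arr)
        )
        pickle.dump(arr, fp, protocol=protocol)
```

Files written this way load with a plain `np.load(path, allow_pickle=True)`, since the NPY reader accepts any pickle protocol the running Python can decode.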
Python functions can have default values for arguments, which is what you are seeing here: `protocol=3` is just the default in the signature.
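To illustrate the point (a toy sketch, not numpy's actual code): a default like `protocol=3` only applies when the caller doesn't pass the argument.

```python
import io
import pickle


def save(obj, fp, protocol=3):
    # protocol=3 is merely the default; any call site can override it.
    pickle.dump(obj, fp, protocol=protocol)


buf = io.BytesIO()
save([1, 2, 3], buf, protocol=4)  # override the default per call
```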
I appreciate your advice. Just to clarify, are you suggesting that I avoid using the numpy save method and instead opt for directly using pickle to dump the data? My concern is that using pickle might create a file in a different format than npy. Since downstream processes are expecting npy format, can you confirm if the formats are identical?
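One way to check this yourself, using only public APIs: an `.npy` file starts with the NPY magic string before the pickled payload, so it is not byte-identical to a raw pickle, even though the payload of an object array is pickled either way.

```python
import io
import pickle

import numpy as np

arr = np.empty(1, dtype=object)
arr[0] = {"a": 1}

npy_buf = io.BytesIO()
np.save(npy_buf, arr, allow_pickle=True)  # NPY header + pickled payload

pkl_buf = io.BytesIO()
pickle.dump(arr, pkl_buf, protocol=4)     # bare pickle, no NPY header

# The NPY magic string distinguishes the two formats.
print(npy_buf.getvalue()[:6])  # b'\x93NUMPY'
```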
I suppose it does look like `np.save` is just wrapping `pickle.dump`.
np.save works for normal files:
That is an 8GB file. Are you trying to save an object array?
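For comparison, a plain numeric array takes a non-pickle code path entirely (a scaled-down sketch; the demonstration above used an 8 GB file):

```python
import io

import numpy as np

arr = np.zeros((100, 100), dtype=np.float64)  # plain numeric, not object dtype

buf = io.BytesIO()
np.save(buf, arr)    # no pickle involved for numeric dtypes

buf.seek(0)
back = np.load(buf)  # allow_pickle not needed either
```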
Also, are you running on Windows?
And what numpy version?
Linux (Ubuntu, I believe; I'm afk atm but I'll check soon). numpy '1.26.4'. Yes, it has objects because it's multiple dimensions [10000, 7001, 201], and each dimension would be an ndarray, I believe.
Why would each dimension be an ndarray? How did you make the array? What does its `dtype` show?
And what is the exact call you make to `np.save`?
@charris Please do look at the code he's linking. We explicitly call `pickle.dump` with `protocol=3`.
@rkern At this point, the question is why is it an object array. |
`NAME="Ubuntu"`
Because a multidimensional ndarray is essentially an ndarray of ndarrays. In a prior discussion here, @charris mentioned that numpy was considering switching to pickle 4 as the default protocol. However, since that comment was made in April 2021 and it's now 2024, it's possible that the line hard-coding pickle 3 was never updated, resulting in a bug. While it would be helpful to allow users to override the protocol version, I get that ensuring predictability is important.

Finally, to add some context: I, like the original poster, am also working with neural networks, which can make it hard to produce nice small snippets of reproducible code. I was able to work around this by making sure all elements are floats. I don't know if you want to leave this open to address the fixed protocol version, so I'll let you close it if you want.

Thank you for your help! I really appreciate your time, and having you kinda walk me through it did make me realize there were objects in the array that probably should not have been there.
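The "make sure all elements are floats" workaround can be sketched like this (hypothetical shapes): if every element of an object array is a sub-array of the same shape, stacking them yields a plain float array that `np.save` writes without pickle at all.

```python
import numpy as np

# Hypothetical object array built from same-shaped sub-arrays.
obj_arr = np.empty(3, dtype=object)
for i in range(3):
    obj_arr[i] = np.zeros((4, 5))

assert obj_arr.dtype == object  # this is what triggers pickling in np.save

# Collapse into one contiguous float array: no object dtype, no pickle.
dense = np.stack(obj_arr)
```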
The user has an object array. They want to serialize it in NPY format. This is a thing that we support. However, they happen to have a (standard, builtin) object within that object array that pickle refuses to serialize with the pickle version that we hardcode. That's a problem for us to solve. I'm glad that this user found a better way to organize their arrays that avoided the problem, but it still exists. |
Presumably we need a size check to see if we'll hit this corner case and in that case choose a different protocol. Unless of course it's safe to just use the newer protocol always, but I suspect protocol 3 is hardcoded because of backward compatibility concerns. |
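A minimal sketch of that size check (a hypothetical helper, not numpy code): keep the old protocol for small payloads and switch to 4 only when the pickled bytes would cross the 4 GiB framing limit.

```python
def choose_pickle_protocol(nbytes, default=3):
    # Protocol 3 cannot frame a single bytes object of 4 GiB or more;
    # protocol 4 (readable since Python 3.4) lifts that limit.
    return max(default, 4) if nbytes >= 2**32 else default
```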
If I were you guys, I'd just change it to a default that can be overridden. You're already accepting kwargs for the dump; I'm not even really sure why the protocol was excluded from it.
Or just let pickle decide the default. According to this, the current default is 4. I think 4 is backwards compatible with 3.
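This is easy to confirm against the standard library:

```python
import pickle

# Since Python 3.8, pickle.DEFAULT_PROTOCOL is 4 (it was 3 before that).
print(pickle.DEFAULT_PROTOCOL)

# "Backwards compatible" here means any interpreter new enough to write
# protocol 4 can also read protocols 0-3.
data3 = pickle.dumps(b"x", protocol=3)
data4 = pickle.dumps(b"x", protocol=4)
assert pickle.loads(data3) == pickle.loads(data4) == b"x"
```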
I think it would be fine to bump the protocol to 4. It has been the default since Python 3.8, and we now (NumPy 2.1) only support Python 3.10 and up.
Still getting the `OverflowError` with `np.save` and `allow_pickle=True` and data larger than 4 GB. I think this is incorrect: the protocol is still set to 3. Or am I missing something?
numpy/numpy/lib/format.py
Line 744 in e59c074
Originally posted by @khood5 in #18784 (comment)