Loading large STATA file throws "OSError: [Errno 22] Invalid argument" #10641
Comments
I don't know if this should go upstream, or whether pandas should just chunk reads to work around it (if it's even a memory issue, which I'm now doubting). If it's the latter, I'm not sure how one would determine a reasonable chunk size. For posterity, my workaround was to load the file in python2, save it as HDF, then load the HDF file in python3; no clue why this works.
try reading with a
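The comment above is truncated; presumably it refers to reading the file in chunks. Recent pandas versions expose a `chunksize` argument on `read_stata` that yields DataFrames a piece at a time instead of issuing one huge read. A minimal sketch, with a placeholder path and an arbitrary chunk size:

```python
import pandas as pd

def read_stata_chunked(path, chunksize=100_000):
    """Read a Stata file in row chunks and reassemble one DataFrame.

    chunksize is rows per chunk, not bytes; tune it to taste.
    """
    reader = pd.read_stata(path, chunksize=chunksize)
    chunks = [chunk for chunk in reader]
    return pd.concat(chunks, ignore_index=True)
```

Each individual chunk read is small, so no single OS-level `read()` call has to pull in gigabytes at once.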
@jreback well, yeah, of course that works, but this does seem like a regression in that it works with py2 but not py3 for some reason. A 3GB file should have no problem fitting on a machine with 16GB of RAM, even with all the Python data-type wrappers; besides, the fact that it fails immediately after you run it, not after a long time loading, tells me something else is going on.
Bah, I managed to reproduce this on its own:

```python
In [5]: import io

In [8]: f = io.open("/..../output2008_2013.dta", "rb")

In [9]: f.read()
---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<ipython-input-9-bacd0e0f09a3> in <module>()
----> 1 f.read()

OSError: [Errno 22] Invalid argument
```

So this for sure is not on pandas anymore, but until upstream fixes this, py3 users on OS X are stuck, looks like. Also it seems torch is working around this, for example: torch/DEPRECEATED-torch7-distro@40e6593. And here is a related Python bug. So I leave it to your discretion.
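For anyone hitting this before a fix lands upstream: the same idea as the torch workaround linked above can be sketched in Python, capping each individual `read()` call and reassembling the result. The 1 GiB chunk size here is an arbitrary choice, not anything from the original thread:

```python
import io

CHUNK = 1 << 30  # 1 GiB per syscall; small enough to stay clear of the limit

def read_all(path, chunk_size=CHUNK):
    """Read an entire file in bounded chunks instead of one giant read()."""
    parts = []
    with io.open(path, "rb") as f:
        while True:
            block = f.read(chunk_size)
            if not block:  # empty bytes means EOF
                break
            parts.append(block)
    return b"".join(parts)
```

Because each `read()` request stays well under the problematic size, the kernel never sees the oversized argument that triggers `EINVAL`.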
thanks
I just faced this bug [3 years after opening] when reading a 3.7GB file into pandas.
The Python ticket seems to be stalled in review: https://bugs.python.org/issue24658. It's probably a fair question whether this will ever get fixed in a reasonable amount of time. Open issue cross-referenced on numpy: numpy/numpy#3858
Hello, this error happens when loading a large (~3.3GB) Stata 13 file:

where `read_len` is `3525947880` and `self.path_or_buf` is `<_io.BufferedReader name='/Users/makmana/ciddata/Subnationals/Atlas/Colombia/beta/output2008_2013.dta'>`.

At that point, the question in my head was "well, what *is* a reasonable read_len?" So I binary-searched until I converged on a value around 721000000, but then I quit a bunch of other applications and somehow it started working again! This makes me think it has to do with available memory, maybe.
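One hedged observation on the numbers reported above: the failing `read_len` is larger than the signed 32-bit maximum, which is exactly the boundary the macOS large-`read()` bug discussed later in this thread trips over, and it would also explain why the failure is immediate rather than memory-related:

```python
# The failing request size from the report vs. the signed 32-bit limit
read_len = 3525947880
INT_MAX = 2**31 - 1  # 2147483647

print(read_len > INT_MAX)  # True: a single read() of this size exceeds the limit
```

By contrast, the binary-searched value around 721000000 sits comfortably below `INT_MAX`, so a single read of that size would not hit the same boundary.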
Another funny thing is that this happens on `Python 3.4.2 (default, Oct 19 2014, 17:52:17) [GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.51)] on darwin`, but when I do the same read_stata on `Python 2.7.9 (default, Jan 7 2015, 11:50:42) [GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.56)] on darwin`, it doesn't choke. Both are using numpy 1.9.2 and pandas 0.16.2.

A final insight is that it fails very quickly (quicker than it could possibly have loaded the file) with that read_len. With smaller read_len values, say dividing by 2, it waits for a long time and then fails.
This issue and this one might be related.
I can't really share the file because of data confidentiality, but I'd be happy to dig in and figure out what's going wrong if someone has ideas and pointers.