Loading large STATA file throws "OSError: [Errno 22] Invalid argument" #10641

Closed
makmanalp opened this issue Jul 20, 2015 · 7 comments
Labels
IO Stata read_stata, to_stata

Comments

@makmanalp
Contributor

Hello, this error happens when loading a large (~3.3GB) Stata 13 file:

/Users/makmana/colombia/colombia/datasets.py in <lambda>()
    111
    112 industry4digit_department = {
--> 113     "read_function": lambda: pd.read_stata("/Users/makmana/ciddata/Subnationals/Atlas/Colombia/beta/output2008_2013.dta"),
    114     "field_mapping": pila_to_atlas,
    115     "classification_fields": {

/Users/makmana/colombia/env/lib/python3.4/site-packages/pandas/io/stata.py in read_stata(filepath_or_buffer, convert_dates, convert_categoricals, encoding, index, convert_missing, preserve_dtypes, columns, order_categoricals, chunksize, iterator)
    160         return reader
    161
--> 162     return reader.read()
    163
    164 _date_formats = ["%tc", "%tC", "%td", "%d", "%tw", "%tm", "%tq", "%th", "%ty"]

/Users/makmana/colombia/env/lib/python3.4/site-packages/pandas/io/stata.py in read(self, nrows, convert_dates, convert_categoricals, index, convert_missing, preserve_dtypes, columns, order_categoricals)
   1349         self.path_or_buf.seek(self.data_location + offset)
   1350         read_lines = min(nrows, self.nobs - self._lines_read)
-> 1351         data = np.frombuffer(self.path_or_buf.read(read_len), dtype=dtype,
   1352                              count=read_lines)
   1353         self._lines_read += read_lines

OSError: [Errno 22] Invalid argument
> /Users/makmana/colombia/env/lib/python3.4/site-packages/pandas/io/stata.py(1351)read()
   1350         read_lines = min(nrows, self.nobs - self._lines_read)
-> 1351         data = np.frombuffer(self.path_or_buf.read(read_len), dtype=dtype,
   1352                              count=read_lines)

where read_len is 3525947880 and self.path_or_buf is <_io.BufferedReader name='/Users/makmana/ciddata/Subnationals/Atlas/Colombia/beta/output2008_2013.dta'>.

At that point, the question in my head was "well, what /is/ a reasonable read_len"? So I binary-searched until I converged on a value of around 721000000, but then I quit a bunch of other applications and somehow it started working again! This makes me think it may have to do with available memory.

Another funny thing is that this happens on Python 3.4.2 (default, Oct 19 2014, 17:52:17) [GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.51)] on darwin, but when I do the same read_stata on Python 2.7.9 (default, Jan 7 2015, 11:50:42) [GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.56)] on darwin, it doesn't choke. Both are using numpy 1.9.2 and pandas 0.16.2.

A final insight is that it fails very quickly (quicker than it could have possibly loaded the file) with that read_len. With smaller read_len values, say dividing by 2, it waits for a long time and then fails.

This issue and this one perhaps might be related.

I can't really share the file because of data confidentiality but I'd be happy to dig through to figure out what might be going on wrong if someone has ideas and pointers.

@makmanalp
Contributor Author

I don't know if this should go upstream or if pandas should just chunk the read to work around it (if it's even a memory issue, which I'm now doubting). If it's the latter, I'm not sure how one would determine a reasonable chunk size. For posterity, my workaround was to load the file in Python 2, save it as HDF, then load the HDF file in Python 3. No clue why this works.

@jreback
Contributor

jreback commented Jul 20, 2015

try reading with a chunksize when reading a large file
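
A minimal sketch of that suggestion, assuming a recent pandas in which `pd.read_stata(..., chunksize=N)` returns a reader that can be used as a context manager and iterated. The helper name `read_stata_in_chunks` and the default of 100,000 rows are hypothetical, not something from this thread. The traceback shows one read per chunk of rows, so smaller chunks should mean smaller individual read requests.

```python
import pandas as pd


def read_stata_in_chunks(path, chunksize=100_000, **kwargs):
    """Read a .dta file `chunksize` rows at a time and return one DataFrame.

    Hypothetical helper: extra keyword arguments are passed to pd.read_stata.
    """
    # With chunksize set, read_stata returns an iterator of DataFrames
    # instead of loading every row with a single large read.
    with pd.read_stata(path, chunksize=chunksize, **kwargs) as reader:
        chunks = list(reader)
    # Stitch the pieces together and rebuild a clean 0..n-1 index.
    return pd.concat(chunks, ignore_index=True)
```

The result still has to fit in memory as a whole; only the size of each read changes. Files with categorical columns may need `convert_categoricals=False`, since categories can differ between chunks.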

@jreback added the IO Stata (read_stata, to_stata) label Jul 20, 2015

@makmanalp
Contributor Author

@jreback well, yeah, of course that works, but this does seem like a regression in that it works with py2 but not py3 for some reason. A 3GB file should have no problem fitting in a computer with 16GB RAM, even with all the Python data type wrappers. Besides, the fact that it fails immediately after you run it, and not after a long time loading, tells me something else is going on.

@makmanalp
Contributor Author

Bah, I managed to reproduce this on its own:

In [5]: import io

In [8]: f = io.open("/..../output2008_2013.dta", "rb")

In [9]: f.read()
---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<ipython-input-9-bacd0e0f09a3> in <module>()
----> 1 f.read()

OSError: [Errno 22] Invalid argument

So this for sure is not on pandas anymore, but until upstream fixes this, py3 users on OSX are stuck, looks like. Also seems like torch is working around this, for example:

torch/DEPRECEATED-torch7-distro@40e6593

and more here and here.

And here is a related python bug.

So I leave it to your discretion.
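
For anyone hitting the plain `f.read()` failure reproduced above, here is a sketch of the general workaround: issue several smaller reads instead of one huge one. It assumes the failure is triggered by the size of a single read request; the thread does not pin down the exact threshold, so the 1 GiB cap (`MAX_READ_BYTES`) and the helper name `read_all_capped` are hypothetical choices.

```python
# Assumed per-call cap (hypothetical): 1 GiB, chosen to stay well below 2 GiB.
MAX_READ_BYTES = 1 << 30


def read_all_capped(fileobj, total=None, max_read=MAX_READ_BYTES):
    """Read `total` bytes (or up to EOF if None) without any single
    read() call asking for more than `max_read` bytes."""
    parts = []
    remaining = total
    while remaining is None or remaining > 0:
        want = max_read if remaining is None else min(max_read, remaining)
        chunk = fileobj.read(want)
        if not chunk:  # EOF reached before `total` bytes were available
            break
        parts.append(chunk)
        if remaining is not None:
            remaining -= len(chunk)
    # Joining briefly holds both the pieces and the result in memory.
    return b"".join(parts)
```

Usage would look like `data = read_all_capped(open(path, "rb"))`, where `path` is whatever large file triggers the error.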

@jreback
Contributor

jreback commented Jul 20, 2015

thanks

@jreback closed this as completed Jul 20, 2015

@eXcuvator

I just faced this bug [3 years after opening] when reading a 3.7GB file into pandas 0.23.4 and python 3.7.0. Is the only workaround really to use chunk sizes?

Traceback (most recent call last):
  File "/Applications/PyCharm.app/Contents/helpers/pydev/pydevd.py", line 1664, in <module>
    main()
  File "/Applications/PyCharm.app/Contents/helpers/pydev/pydevd.py", line 1658, in main
    globals = debugger.run(setup['file'], None, None, is_module)
  File "/Applications/PyCharm.app/Contents/helpers/pydev/pydevd.py", line 1068, in run
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "/Applications/PyCharm.app/Contents/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "/Users/foo/Repositories/occupations/code/motivation_occ.py", line 221, in <module>
    df = getData()
  File "/Users/foo/Repositories/occupations/code/motivation_occ.py", line 32, in getData
    df = pd.read_stata('../data/stata/motivation_occ.dta', convert_categoricals=False)
  File "/Users/foo/anaconda/envs/myenv3/lib/python3.7/site-packages/pandas/util/_decorators.py", line 178, in wrapper
    return func(*args, **kwargs)
  File "/Users/foo/anaconda/envs/myenv3/lib/python3.7/site-packages/pandas/io/stata.py", line 191, in read_stata
    data = reader.read()
  File "/Users/foo/anaconda/envs/myenv3/lib/python3.7/site-packages/pandas/util/_decorators.py", line 178, in wrapper
    return func(*args, **kwargs)
  File "/Users/foo/anaconda/envs/myenv3/lib/python3.7/site-packages/pandas/io/stata.py", line 1529, in read
    data = np.frombuffer(self.path_or_buf.read(read_len), dtype=dtype,
OSError: [Errno 22] Invalid argument

@makmanalp
Contributor Author

The Python ticket seems to be stalled in review: https://bugs.python.org/issue24658. It's a fair question whether this will ever get fixed in a reasonable amount of time. Open issue x-ref on numpy: numpy/numpy#3858


3 participants