Loading large STATA file throws "OSError: [Errno 22] Invalid argument" #10641

makmanalp · 2015-07-20T19:41:55Z

Hello, this error happens when loading a large (~3.3 GB) Stata 13 file:

/Users/makmana/colombia/colombia/datasets.py in <lambda>()
    111
    112 industry4digit_department = {
--> 113     "read_function": lambda: pd.read_stata("/Users/makmana/ciddata/Subnationals/Atlas/Colombia/beta/output2008_2013.dta"),
    114     "field_mapping": pila_to_atlas,
    115     "classification_fields": {

/Users/makmana/colombia/env/lib/python3.4/site-packages/pandas/io/stata.py in read_stata(filepath_or_buffer, convert_dates, convert_categoricals, encoding, index, convert_missing, preserve_dtypes, columns, order_categoricals, chunksize, iterator)
    160         return reader
    161
--> 162     return reader.read()
    163
    164 _date_formats = ["%tc", "%tC", "%td", "%d", "%tw", "%tm", "%tq", "%th", "%ty"]

/Users/makmana/colombia/env/lib/python3.4/site-packages/pandas/io/stata.py in read(self, nrows, convert_dates, convert_categoricals, index, convert_missing, preserve_dtypes, columns, order_categoricals)
   1349         self.path_or_buf.seek(self.data_location + offset)
   1350         read_lines = min(nrows, self.nobs - self._lines_read)
-> 1351         data = np.frombuffer(self.path_or_buf.read(read_len), dtype=dtype,
   1352                              count=read_lines)
   1353         self._lines_read += read_lines

OSError: [Errno 22] Invalid argument
> /Users/makmana/colombia/env/lib/python3.4/site-packages/pandas/io/stata.py(1351)read()
   1350         read_lines = min(nrows, self.nobs - self._lines_read)
-> 1351         data = np.frombuffer(self.path_or_buf.read(read_len), dtype=dtype,
   1352                              count=read_lines)

where read_len is 3525947880 and self.path_or_buf is <_io.BufferedReader name='/Users/makmana/ciddata/Subnationals/Atlas/Colombia/beta/output2008_2013.dta'>.

At that point, the question in my head was "well, what /is/ a reasonable read_len?" So I binary-searched until I converged on a value around 721000000. But then I quit a bunch of other applications and somehow it started working again! This makes me think this has to do with available memory, maybe.

Another funny thing is that this happens on Python 3.4.2 (default, Oct 19 2014, 17:52:17) [GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.51)] on darwin, but when I do the same read_stata on Python 2.7.9 (default, Jan 7 2015, 11:50:42) [GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.56)] on darwin, it doesn't choke. Both are using numpy 1.9.2 and pandas 0.16.2.

A final insight is that it fails very quickly (quicker than it could have possibly loaded the file) with that read_len. With smaller read_len values, say dividing by 2, it waits for a long time and then fails.

This issue and this one might be related.

I can't really share the file because of data confidentiality, but I'd be happy to dig in and figure out what might be going wrong if someone has ideas and pointers.


makmanalp · 2015-07-20T20:00:40Z

I don't know whether this should go upstream or whether pandas should just chunk its reads to work around it (if it's even a memory issue, which I'm now doubting). If it's the latter, I'm not sure how one would determine a reasonable chunk size. For posterity, my workaround was to load the file in python2, save it as hdf, then load the hdf file in python3 - no clue why this works.

jreback · 2015-07-20T21:19:35Z

try reading with a chunksize when reading a large file
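
For anyone landing here later, a minimal sketch of that suggestion. The helper name read_stata_chunked and the default of 100000 rows per chunk are illustrative assumptions, not values from this thread, and the `with` form assumes a reasonably recent pandas in which the reader returned by read_stata works as a context manager.

```python
def read_stata_chunked(path, chunksize=100000, **kwargs):
    """Read a .dta file in row chunks and return a single DataFrame.

    Hypothetical helper: the name and the default chunk size are assumptions.
    Extra keyword arguments are passed through to pd.read_stata.
    """
    # Imported here so the helper can be defined even where pandas is missing.
    import pandas as pd

    # With chunksize set, read_stata returns a reader that yields DataFrames
    # of at most `chunksize` rows each, so no single file read is huge.
    with pd.read_stata(path, chunksize=chunksize, **kwargs) as reader:
        chunks = list(reader)

    # Stitch the pieces back together with a fresh 0..n-1 index.
    return pd.concat(chunks, ignore_index=True)
```

Usage would be something like df = read_stata_chunked("output2008_2013.dta"). The whole table still has to fit in memory at the end; only the size of each individual read shrinks.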

makmanalp · 2015-07-20T21:27:47Z

@jreback well, yeah, of course that works, but this does seem like a regression in that it works with py2 but not py3 for some reason. A 3 GB file should have no problem fitting in a computer with 16 GB of RAM, even with all the Python data type wrappers. Besides, the fact that it fails immediately after you run it, and not after a long time loading, tells me something else is going on.

makmanalp · 2015-07-20T21:39:42Z

Bah, I managed to reproduce this on its own:

In [5]: import io

In [8]: f = io.open("/..../output2008_2013.dta", "rb")

In [9]: f.read()
---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<ipython-input-9-bacd0e0f09a3> in <module>()
----> 1 f.read()

OSError: [Errno 22] Invalid argument

So this for sure is not on pandas anymore, but until upstream fixes this, py3 users on OSX are stuck, looks like. Also seems like torch is working around this, for example:

torch/DEPRECEATED-torch7-distro@40e6593

and more here and here.

And here is a related python bug.

So I leave it to your discretion.
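
For anyone stuck on this in the meantime, here is a minimal sketch of reading the file in several smaller requests instead of one big one. It assumes the failure is triggered by a single oversized read request (as the bare f.read() above and the read_len of 3525947880 suggest) and that 1 GiB requests are safe; that threshold is a guess, not something confirmed in this thread, so it is left as a parameter. The names read_in_chunks and MAX_CHUNK are made up for illustration.

```python
import io

# Assumption: 1 GiB per request is small enough to avoid the error.
# This value is arbitrary and not confirmed anywhere in this thread.
MAX_CHUNK = 1 << 30


def read_in_chunks(fileobj, nbytes=-1, max_chunk=MAX_CHUNK):
    """Read nbytes from a binary file object (to EOF if nbytes is negative),
    never asking for more than max_chunk bytes in a single read() call."""
    parts = []
    remaining = nbytes
    while remaining != 0:
        # Cap each request at max_chunk bytes.
        want = max_chunk if remaining < 0 else min(remaining, max_chunk)
        piece = fileobj.read(want)
        if not piece:  # an empty result means end of file
            break
        parts.append(piece)
        if remaining > 0:
            remaining -= len(piece)
    return b"".join(parts)


# Small self-check on an in-memory buffer; the tiny max_chunk forces several reads.
demo = io.BytesIO(b"abcdefghij")
assert read_in_chunks(demo, max_chunk=3) == b"abcdefghij"
```

On a real file this would be called on a handle opened with mode "rb". The result could presumably be wrapped in io.BytesIO and handed to pd.read_stata, which accepts file-like objects, though I have not tested that route here. Note that the final join briefly holds two copies of the data in memory.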

jreback · 2015-07-20T21:54:30Z

thanks

eXcuvator · 2018-09-05T08:52:08Z

I just faced this bug [3 years after opening] when reading a 3.7GB file into pandas 0.23.4 and python 3.7.0. Is the only workaround really to use chunk sizes?

Traceback (most recent call last):
  File "/Applications/PyCharm.app/Contents/helpers/pydev/pydevd.py", line 1664, in <module>
    main()
  File "/Applications/PyCharm.app/Contents/helpers/pydev/pydevd.py", line 1658, in main
    globals = debugger.run(setup['file'], None, None, is_module)
  File "/Applications/PyCharm.app/Contents/helpers/pydev/pydevd.py", line 1068, in run
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "/Applications/PyCharm.app/Contents/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "/Users/foo/Repositories/occupations/code/motivation_occ.py", line 221, in <module>
    df = getData()
  File "/Users/foo/Repositories/occupations/code/motivation_occ.py", line 32, in getData
    df = pd.read_stata('../data/stata/motivation_occ.dta', convert_categoricals=False)
  File "/Users/foo/anaconda/envs/myenv3/lib/python3.7/site-packages/pandas/util/_decorators.py", line 178, in wrapper
    return func(*args, **kwargs)
  File "/Users/foo/anaconda/envs/myenv3/lib/python3.7/site-packages/pandas/io/stata.py", line 191, in read_stata
    data = reader.read()
  File "/Users/foo/anaconda/envs/myenv3/lib/python3.7/site-packages/pandas/util/_decorators.py", line 178, in wrapper
    return func(*args, **kwargs)
  File "/Users/foo/anaconda/envs/myenv3/lib/python3.7/site-packages/pandas/io/stata.py", line 1529, in read
    data = np.frombuffer(self.path_or_buf.read(read_len), dtype=dtype,
OSError: [Errno 22] Invalid argument

makmanalp · 2018-09-05T16:27:45Z

The python ticket seems to be stalled in review: https://bugs.python.org/issue24658 - it's probably a fair question as to whether this will ever get fixed in a reasonable amount of time. Open issue x-ref on numpy: numpy/numpy#3858

jreback added the IO Stata read_stata, to_stata label Jul 20, 2015

jreback closed this as completed Jul 20, 2015
