
read_csv C-engine CParserError: Error tokenizing data #11166

Closed · joshlk opened this issue Sep 22, 2015 · 19 comments

Labels: Bug · IO CSV (read_csv, to_csv) · Needs Info (clarification about behavior needed to assess issue)

Comments

@joshlk commented Sep 22, 2015

Hi,

I have encountered a dataset that the C-engine read_csv has problems with. I am unsure of the exact issue, but I have narrowed it down to a single row, which I have pickled and uploaded to Dropbox. If you obtain the pickle, try the following:

import pandas as pd

df = pd.read_pickle('faulty_row.pkl')
df.to_csv('faulty_row.csv', encoding='utf8', index=False)
pd.read_csv('faulty_row.csv', encoding='utf8')

I get the following exception:

CParserError: Error tokenizing data. C error: Buffer overflow caught - possible malformed input file.

If you try to read the CSV using the Python engine, no exception is thrown:

pd.read_csv('faulty_row.csv', encoding='utf8', engine='python')

This suggests that the issue is with read_csv and not to_csv. The versions I am using are:

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.10.final.0
python-bits: 64
OS: Linux
OS-release: 3.19.0-28-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8

pandas: 0.16.2
nose: 1.3.7
Cython: 0.22.1
numpy: 1.9.2
scipy: 0.15.1
IPython: 3.2.1
patsy: 0.3.0
tables: 3.2.0
numexpr: 2.4.3
matplotlib: 1.4.3
openpyxl: 1.8.5
xlrd: 0.9.3
xlwt: 1.0.0
xlsxwriter: 0.7.3
lxml: 3.4.4
bs4: 4.3.2
@chris-b1 (Contributor) commented:

Your second-to-last line includes an '\r' break. I think it's a bug, but one workaround is to open the file in universal-newline mode:

pd.read_csv(open('test.csv','rU'), encoding='utf-8', engine='c')
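Note that the 'U' open mode is deprecated in Python 3 (universal newlines are already the default in text mode, and 'U' was removed in Python 3.11). A minimal sketch of an equivalent workaround that normalizes stray '\r' characters up front — the file names here are placeholders:

import pandas as pd

# Rewrite '\r\n' and bare '\r' line endings as '\n' before parsing.
with open('test.csv', 'rb') as f:
    raw = f.read().replace(b'\r\n', b'\n').replace(b'\r', b'\n')
with open('test_clean.csv', 'wb') as f:
    f.write(raw)

df = pd.read_csv('test_clean.csv', encoding='utf-8', engine='c')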

@jreback added the IO CSV label Sep 24, 2015
@jelmelk commented Feb 21, 2016

I'm encountering this error as well. Using the method suggested by @chris-b1 causes the following error:

Traceback (most recent call last):
  File "C:/Users/je/Desktop/Python/comparison.py", line 30, in <module>
    encoding='utf-8', engine='c')
  File "C:\Program Files\Python 3.5\lib\site-packages\pandas\io\parsers.py", line 498, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "C:\Program Files\Python 3.5\lib\site-packages\pandas\io\parsers.py", line 275, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "C:\Program Files\Python 3.5\lib\site-packages\pandas\io\parsers.py", line 590, in __init__
    self._make_engine(self.engine)
  File "C:\Program Files\Python 3.5\lib\site-packages\pandas\io\parsers.py", line 731, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "C:\Program Files\Python 3.5\lib\site-packages\pandas\io\parsers.py", line 1103, in __init__
    self._reader = _parser.TextReader(src, **kwds)
  File "pandas\parser.pyx", line 515, in pandas.parser.TextReader.__cinit__ (pandas\parser.c:4948)
  File "pandas\parser.pyx", line 705, in pandas.parser.TextReader._get_header (pandas\parser.c:7386)
  File "pandas\parser.pyx", line 829, in pandas.parser.TextReader._tokenize_rows (pandas\parser.c:8838)
  File "pandas\parser.pyx", line 1833, in pandas.parser.raise_parser_error (pandas\parser.c:22649)
pandas.parser.CParserError: Error tokenizing data. C error: Calling read(nbytes) on source failed. Try engine='python'.

@shaynekang commented:

+1

@jreback added this to the Next Major Release milestone Mar 21, 2016
@alfonsomhc (Contributor) commented:

I have also found this issue when reading a large CSV file with the default engine. If I use engine='python', it works fine.

@justinjdickow commented:

I missed @alfonsomhc's answer because it just looked like a comment.

You need:

df = pd.read_csv('test.csv', engine='python')

@Vozf commented Sep 29, 2018

I had the same issue when trying to read a folder instead of a CSV file.

@dgrahn commented Oct 31, 2018

Has anyone investigated this issue? It's killing performance when using read_csv in a Keras generator.

@WillAyd added the Needs Info label Oct 31, 2018
@WillAyd (Member) commented Oct 31, 2018

The original data provided is no longer available, so the issue is not reproducible. Closing as it's not clear what the issue is, but @dgrahn, or anyone else: if you can provide a reproducible example, we can reopen.

@WillAyd closed this as completed Oct 31, 2018
@dgrahn commented Nov 5, 2018

@WillAyd Let me know if you need additional info.

Since GitHub doesn't accept CSVs, I changed the extension to .txt.
Here's the code which will trigger the exception:

import pandas

for chunk in pandas.read_csv('debug.csv', chunksize=1000, names=range(2504)):
    pass

Here's the file: debug.txt

Here's the exception from Windows 10, using Anaconda.

Python 3.6.5 |Anaconda, Inc.| (default, Mar 29 2018, 13:32:41) [MSC v.1900 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas
>>> for chunk in pandas.read_csv('debug.csv', chunksize=1000, names=range(2504)): pass
...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "D:\programs\anaconda3\lib\site-packages\pandas\io\parsers.py", line 1007, in __next__
    return self.get_chunk()
  File "D:\programs\anaconda3\lib\site-packages\pandas\io\parsers.py", line 1070, in get_chunk
    return self.read(nrows=size)
  File "D:\programs\anaconda3\lib\site-packages\pandas\io\parsers.py", line 1036, in read
    ret = self._engine.read(nrows)
  File "D:\programs\anaconda3\lib\site-packages\pandas\io\parsers.py", line 1848, in read
    data = self._reader.read(nrows)
  File "pandas\_libs\parsers.pyx", line 876, in pandas._libs.parsers.TextReader.read
  File "pandas\_libs\parsers.pyx", line 903, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas\_libs\parsers.pyx", line 945, in pandas._libs.parsers.TextReader._read_rows
  File "pandas\_libs\parsers.pyx", line 932, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas\_libs\parsers.pyx", line 2112, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Buffer overflow caught - possible malformed input file.

And the same on Red Hat.

$ python3
Python 3.6.6 (default, Aug 13 2018, 18:24:23)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-28)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas
>>> for chunk in pandas.read_csv('debug.csv', chunksize=1000, names=range(2504)): pass
...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python3.6/site-packages/pandas/io/parsers.py", line 1007, in __next__
    return self.get_chunk()
  File "/usr/lib64/python3.6/site-packages/pandas/io/parsers.py", line 1070, in get_chunk
    return self.read(nrows=size)
  File "/usr/lib64/python3.6/site-packages/pandas/io/parsers.py", line 1036, in read
    ret = self._engine.read(nrows)
  File "/usr/lib64/python3.6/site-packages/pandas/io/parsers.py", line 1848, in read
    data = self._reader.read(nrows)
  File "pandas/_libs/parsers.pyx", line 876, in pandas._libs.parsers.TextReader.read
  File "pandas/_libs/parsers.pyx", line 903, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas/_libs/parsers.pyx", line 945, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 932, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas/_libs/parsers.pyx", line 2112, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Buffer overflow caught - possible malformed input file.

@joshlk (Author) commented Nov 5, 2018

@dgrahn I have downloaded debug.txt, and I get the following when running pd.read_csv('debug.txt', header=None) on a Mac:

ParserError: Error tokenizing data. C error: Expected 204 fields in line 3, saw 2504

This is different from the "Buffer overflow caught" error originally described.

I have inspected the debug.txt file: the first two lines have 204 columns, but the 3rd line has 2504 columns. This would make the file unparsable and explains why an error is thrown.

Is this expected? GitHub could be doing some implicit conversion between newline types ("\r\n" and "\n") in the background that is messing up the uploaded example.
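For reference, a minimal sketch of how one can check the field count on each line of the attachment:

import csv

# Print the number of fields the tokenizer would see on each line.
with open('debug.txt', newline='') as f:
    for i, row in enumerate(csv.reader(f), start=1):
        print(i, len(row))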

@dgrahn commented Nov 5, 2018

@joshlk Did you use the names=range(2504) option as described in the comment above?

@joshlk (Author) commented Nov 5, 2018

@dgrahn good point.

OK, I can now reproduce the error with pandas.read_csv('debug.csv', chunksize=1000, names=range(2504)).

It's worth noting that pandas.read_csv('debug.csv', names=range(2504)) works fine, so this is probably unrelated to the original bug, even though it produces the same symptom.
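For reference, a minimal repro contrasting the two calls on the attached file:

import pandas

# Works: whole-file read with explicit column names.
df = pandas.read_csv('debug.csv', names=range(2504))

# Fails with "Buffer overflow caught" in the C tokenizer:
for chunk in pandas.read_csv('debug.csv', chunksize=1000, names=range(2504)):
    pass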

@dgrahn commented Nov 5, 2018

@joshlk I could open a separate issue if that would be preferred.

@egenc commented Jun 17, 2020

pd.read_csv(open('test.csv','rU'), encoding='utf-8', engine='python')

This solved my problem.

@dheeman00 commented:

engine='python'

I tried this approach and was able to load large data files. But when I checked the dimensions of the dataframe, I saw that the number of rows had increased. What could be the logical reason for that?

@Pegayus commented Dec 14, 2020

@dheeman00: I am facing the same problem as you with changing sizes. I have a dataframe of shape (100K, 21), and after using engine='python' I get a dataframe of shape (100034, 21) (without engine='python', I get the same error as the OP). After comparing the two, I figured out the problem is with one of my columns, which contains a text field, some entries with unknown characters; some of these entries are broken into two different rows (the second row, holding the continuation of the text, has all other columns set to "nan").
If you know your data well, playing with delimiters and maybe running a data-cleaning pass before saving as CSV would be helpful. In my case the data was too messy and too big (it was a subset of a bigger CSV file), so I switched to Spark for the data cleaning.
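A minimal sketch of one way to spot such spill-over rows after loading — the threshold and file name are illustrative assumptions, not from the thread:

import pandas as pd

df = pd.read_csv('data.csv', engine='python')  # placeholder path

# Rows where almost every column is NaN are likely continuations of a
# broken text field rather than real records.
spill = df[df.isna().mean(axis=1) > 0.9]
print(len(spill), 'suspected continuation rows')
df = df.drop(spill.index)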

@dheeman00 commented:

@Pegayus: Yes, you are right. Some of the "nan"-valued columns break down into multiple columns. I performed the following to resolve the issue: pd.read_csv(file_name, sep=',', usecols=columns_name, engine='python'). I specified the columns individually, and it worked for me.
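A runnable sketch of the usecols approach described above — file_name and columns_name are placeholders, not values from the thread:

import pandas as pd

file_name = 'data.csv'                      # placeholder path
columns_name = ['col_a', 'col_b', 'col_c']  # placeholder column names

# Only the listed columns are parsed, which can sidestep malformed
# extra fields that appear in other positions.
df = pd.read_csv(file_name, sep=',', usecols=columns_name, engine='python')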

@PavniGairola commented:

I am unable to load my data file. I have tried the following:

1) netflix_df = pd.read_csv('/Users/pavnigairola/Desktop/netflix_titles.csv', encoding='utf8', engine='python')

Error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xcc in position 572: invalid continuation byte

2) netflix_df = pd.read_csv('/Users/pavnigairola/Desktop/netflix_titles.csv')

Error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x89 in position 0: invalid start byte


3) netflix_df = pd.read_csv(open('netflix_titles.csv','rU'), encoding='utf-8', engine='python')

Error:
[Errno 2] No such file or directory: 'netflix_titles.csv'

Please suggest a fix.
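A minimal sketch (not from the thread) of one way to probe these decode errors — a 0x89 byte at position 0 can indicate a non-text file (0x89 is, for example, the first byte of a PNG signature), and 'latin-1' is shown only because it accepts any byte value:

import pandas as pd

# First, inspect the leading bytes to see what the file actually contains.
with open('/Users/pavnigairola/Desktop/netflix_titles.csv', 'rb') as f:
    print(f.read(16))

# If the content is text but not UTF-8, latin-1 will decode any byte
# sequence (possibly as mojibake, but the file will load).
netflix_df = pd.read_csv('/Users/pavnigairola/Desktop/netflix_titles.csv',
                         encoding='latin-1', engine='python')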
