
read_csv C-engine CParserError: Error tokenizing data #11166

Closed · joshlk opened this issue Sep 22, 2015 · 19 comments

Labels: Bug · IO CSV (read_csv, to_csv) · Needs Info (clarification about behavior needed to assess issue)

Comments

@joshlk commented Sep 22, 2015

Hi,

I have encountered a dataset that the C-engine read_csv has problems with. I am unsure of the exact issue, but I have narrowed it down to a single row, which I have pickled and uploaded to Dropbox. If you obtain the pickle, try the following:

import pandas as pd

df = pd.read_pickle('faulty_row.pkl')
df.to_csv('faulty_row.csv', encoding='utf8', index=False)
pd.read_csv('faulty_row.csv', encoding='utf8')

I get the following exception:

CParserError: Error tokenizing data. C error: Buffer overflow caught - possible malformed input file.

If you try to read the CSV using the Python engine, no exception is thrown:

pd.read_csv('faulty_row.csv', encoding='utf8', engine='python')

This suggests that the issue is with read_csv and not to_csv. The versions I am using are:

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.10.final.0
python-bits: 64
OS: Linux
OS-release: 3.19.0-28-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8

pandas: 0.16.2
nose: 1.3.7
Cython: 0.22.1
numpy: 1.9.2
scipy: 0.15.1
IPython: 3.2.1
patsy: 0.3.0
tables: 3.2.0
numexpr: 2.4.3
matplotlib: 1.4.3
openpyxl: 1.8.5
xlrd: 0.9.3
xlwt: 1.0.0
xlsxwriter: 0.7.3
lxml: 3.4.4
bs4: 4.3.2
@chris-b1 (Contributor) commented:

Your second-to-last line includes an '\r' break. I think it's a bug, but one workaround is to open the file in universal-newline mode:

pd.read_csv(open('test.csv','rU'), encoding='utf-8', engine='c')
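Note that the 'U' open mode is deprecated in Python 3 (universal newlines are already the default in text mode, and 'U' was removed in Python 3.11). A minimal sketch of an equivalent workaround that normalizes stray '\r' characters up front — the file names here are placeholders:

import pandas as pd

# Rewrite '\r\n' and bare '\r' line endings as '\n' before parsing.
with open('test.csv', 'rb') as f:
    raw = f.read().replace(b'\r\n', b'\n').replace(b'\r', b'\n')
with open('test_clean.csv', 'wb') as f:
    f.write(raw)

df = pd.read_csv('test_clean.csv', encoding='utf-8', engine='c')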

@jreback added the IO CSV label Sep 24, 2015
@jelmelk commented Feb 21, 2016

I'm encountering this error as well. Using the method suggested by @chris-b1 causes the following error:

Traceback (most recent call last):
  File "C:/Users/je/Desktop/Python/comparison.py", line 30, in <module>
    encoding='utf-8', engine='c')
  File "C:\Program Files\Python 3.5\lib\site-packages\pandas\io\parsers.py", line 498, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "C:\Program Files\Python 3.5\lib\site-packages\pandas\io\parsers.py", line 275, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "C:\Program Files\Python 3.5\lib\site-packages\pandas\io\parsers.py", line 590, in __init__
    self._make_engine(self.engine)
  File "C:\Program Files\Python 3.5\lib\site-packages\pandas\io\parsers.py", line 731, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "C:\Program Files\Python 3.5\lib\site-packages\pandas\io\parsers.py", line 1103, in __init__
    self._reader = _parser.TextReader(src, **kwds)
  File "pandas\parser.pyx", line 515, in pandas.parser.TextReader.__cinit__ (pandas\parser.c:4948)
  File "pandas\parser.pyx", line 705, in pandas.parser.TextReader._get_header (pandas\parser.c:7386)
  File "pandas\parser.pyx", line 829, in pandas.parser.TextReader._tokenize_rows (pandas\parser.c:8838)
  File "pandas\parser.pyx", line 1833, in pandas.parser.raise_parser_error (pandas\parser.c:22649)
pandas.parser.CParserError: Error tokenizing data. C error: Calling read(nbytes) on source failed. Try engine='python'.

@shaynekang commented:

+1

@jreback added this to the Next Major Release milestone Mar 21, 2016
@alfonsomhc (Contributor) commented:

I have also found this issue when reading a large CSV file with the default engine. If I use engine='python', it works fine.

@justinjdickow commented:

I missed @alfonsomhc's answer because it just looked like a comment.

You need:

df = pd.read_csv('test.csv', engine='python')

@Vozf commented Sep 29, 2018

I had the same issue when trying to read a folder instead of a CSV file.

@dgrahn commented Oct 31, 2018

Has anyone investigated this issue? It's killing performance when using read_csv in a Keras generator.

@WillAyd added the Needs Info label Oct 31, 2018
@WillAyd (Member) commented Oct 31, 2018

The original data provided is no longer available, so the issue is not reproducible. Closing as it's not clear what the issue is, but @dgrahn, or anyone else: if you can provide a reproducible example, we can reopen.

@WillAyd closed this as completed Oct 31, 2018
@dgrahn commented Nov 5, 2018

@WillAyd Let me know if you need additional info.

Since GitHub doesn't accept CSVs, I changed the extension to .txt.
Here's the code which will trigger the exception:

import pandas

for chunk in pandas.read_csv('debug.csv', chunksize=1000, names=range(2504)):
    pass

Here's the file: debug.txt

Here's the exception from Windows 10, using Anaconda.

Python 3.6.5 |Anaconda, Inc.| (default, Mar 29 2018, 13:32:41) [MSC v.1900 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas
>>> for chunk in pandas.read_csv('debug.csv', chunksize=1000, names=range(2504)): pass
...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "D:\programs\anaconda3\lib\site-packages\pandas\io\parsers.py", line 1007, in __next__
    return self.get_chunk()
  File "D:\programs\anaconda3\lib\site-packages\pandas\io\parsers.py", line 1070, in get_chunk
    return self.read(nrows=size)
  File "D:\programs\anaconda3\lib\site-packages\pandas\io\parsers.py", line 1036, in read
    ret = self._engine.read(nrows)
  File "D:\programs\anaconda3\lib\site-packages\pandas\io\parsers.py", line 1848, in read
    data = self._reader.read(nrows)
  File "pandas\_libs\parsers.pyx", line 876, in pandas._libs.parsers.TextReader.read
  File "pandas\_libs\parsers.pyx", line 903, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas\_libs\parsers.pyx", line 945, in pandas._libs.parsers.TextReader._read_rows
  File "pandas\_libs\parsers.pyx", line 932, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas\_libs\parsers.pyx", line 2112, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Buffer overflow caught - possible malformed input file.

And the same on Red Hat.

$ python3
Python 3.6.6 (default, Aug 13 2018, 18:24:23)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-28)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas
>>> for chunk in pandas.read_csv('debug.csv', chunksize=1000, names=range(2504)): pass
...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python3.6/site-packages/pandas/io/parsers.py", line 1007, in __next__
    return self.get_chunk()
  File "/usr/lib64/python3.6/site-packages/pandas/io/parsers.py", line 1070, in get_chunk
    return self.read(nrows=size)
  File "/usr/lib64/python3.6/site-packages/pandas/io/parsers.py", line 1036, in read
    ret = self._engine.read(nrows)
  File "/usr/lib64/python3.6/site-packages/pandas/io/parsers.py", line 1848, in read
    data = self._reader.read(nrows)
  File "pandas/_libs/parsers.pyx", line 876, in pandas._libs.parsers.TextReader.read
  File "pandas/_libs/parsers.pyx", line 903, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas/_libs/parsers.pyx", line 945, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 932, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas/_libs/parsers.pyx", line 2112, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Buffer overflow caught - possible malformed input file.

@joshlk (Author) commented Nov 5, 2018

@dgrahn I have downloaded debug.txt, and I get the following when running pd.read_csv('debug.txt', header=None) on a Mac:

ParserError: Error tokenizing data. C error: Expected 204 fields in line 3, saw 2504

This is different from the "Buffer overflow caught" error originally described.

I have inspected the debug.txt file: the first two lines have 204 columns, but the 3rd line has 2504 columns. This would make the file unparsable and explains why an error is thrown.

Is this expected? GitHub could be doing some implicit conversion between newline types ("\r\n" and "\n") in the background that is messing up the uploaded example.
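For reference, a minimal sketch of how one can check the field count on each line of the attachment:

import csv

# Print the number of fields the tokenizer would see on each line.
with open('debug.txt', newline='') as f:
    for i, row in enumerate(csv.reader(f), start=1):
        print(i, len(row))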

@dgrahn commented Nov 5, 2018

@joshlk Did you use the names=range(2504) option as described in the comment above?

@joshlk (Author) commented Nov 5, 2018

@dgrahn good point.

OK, I can now reproduce the error with pandas.read_csv('debug.csv', chunksize=1000, names=range(2504)).

It's worth noting that pandas.read_csv('debug.csv', names=range(2504)) works fine, so this is probably unrelated to the original bug, even though it produces the same symptom.
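For reference, a minimal repro contrasting the two calls on the attached file:

import pandas

# Works: whole-file read with explicit column names.
df = pandas.read_csv('debug.csv', names=range(2504))

# Fails with "Buffer overflow caught" in the C tokenizer:
for chunk in pandas.read_csv('debug.csv', chunksize=1000, names=range(2504)):
    pass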

@dgrahn commented Nov 5, 2018

@joshlk I could open a separate issue if that would be preferred.

@egenc commented Jun 17, 2020

pd.read_csv(open('test.csv','rU'), encoding='utf-8', engine='python')

This solved my problem.

@dheeman00 commented:

engine='python'

I tried this approach and was able to load large data files. But when I checked the dimensions of the dataframe, I saw that the number of rows had increased. What could be the logical reason for that?

@Pegayus commented Dec 14, 2020

@dheeman00: I am facing the same problem as you with changing sizes. I have a dataframe of shape (100K, 21), and after using engine='python' I get a dataframe of shape (100034, 21) (without engine='python', I get the same error as the OP). After comparing the two, I figured out the problem is with one of my columns, which contains a text field, some entries with unknown characters; some of these entries are broken into two different rows (the second row, holding the continuation of the text, has all other columns set to "nan").
If you know your data well, playing with delimiters and maybe running a data-cleaning pass before saving as CSV would be helpful. In my case the data was too messy and too big (it was a subset of a bigger CSV file), so I switched to Spark for the data cleaning.
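A minimal sketch of one way to spot such spill-over rows after loading — the threshold and file name are illustrative assumptions, not from the thread:

import pandas as pd

df = pd.read_csv('data.csv', engine='python')  # placeholder path

# Rows where almost every column is NaN are likely continuations of a
# broken text field rather than real records.
spill = df[df.isna().mean(axis=1) > 0.9]
print(len(spill), 'suspected continuation rows')
df = df.drop(spill.index)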

@dheeman00 commented:

@Pegayus: Yes, you are right. Some of the "nan"-valued columns break down into multiple columns. I performed the following to resolve the issue: pd.read_csv(file_name, sep=',', usecols=columns_name, engine='python'). I specified the columns individually, and it worked for me.
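A runnable sketch of the usecols approach described above — file_name and columns_name are placeholders, not values from the thread:

import pandas as pd

file_name = 'data.csv'                      # placeholder path
columns_name = ['col_a', 'col_b', 'col_c']  # placeholder column names

# Only the listed columns are parsed, which can sidestep malformed
# extra fields that appear in other positions.
df = pd.read_csv(file_name, sep=',', usecols=columns_name, engine='python')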

@PavniGairola commented:

I am unable to load my data file. I have tried the following:

1) netflix_df = pd.read_csv('/Users/pavnigairola/Desktop/netflix_titles.csv', encoding='utf8', engine='python')

Error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xcc in position 572: invalid continuation byte

2) netflix_df = pd.read_csv('/Users/pavnigairola/Desktop/netflix_titles.csv')

Error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x89 in position 0: invalid start byte


3) netflix_df = pd.read_csv(open('netflix_titles.csv','rU'), encoding='utf-8', engine='python')

Error:
[Errno 2] No such file or directory: 'netflix_titles.csv'

Please suggest a fix.
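A minimal sketch (not from the thread) of one way to probe these decode errors — a 0x89 byte at position 0 can indicate a non-text file (0x89 is, for example, the first byte of a PNG signature), and 'latin-1' is shown only because it accepts any byte value:

import pandas as pd

# First, inspect the leading bytes to see what the file actually contains.
with open('/Users/pavnigairola/Desktop/netflix_titles.csv', 'rb') as f:
    print(f.read(16))

# If the content is text but not UTF-8, latin-1 will decode any byte
# sequence (possibly as mojibake, but the file will load).
netflix_df = pd.read_csv('/Users/pavnigairola/Desktop/netflix_titles.csv',
                         encoding='latin-1', engine='python')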
