Closed
Description
I am having trouble with read_csv (pandas 0.17.0) when trying to read a 380+ MB CSV file. The file starts with 54 fields, but some lines have 53 fields instead of 54. Running the code below gives me the following error:
from datetime import datetime

import pandas as pd

# The seven date/time columns are joined with spaces before being parsed.
parser = lambda x: datetime.strptime(x, '%y %m %d %H %M %S %f')
df = pd.read_csv(filename,
                 names=['YR', 'MO', 'DAY', 'HR', 'MIN', 'SEC', 'HUND',
                        'ERROR', 'RECTYPE', 'LANE', 'SPEED', 'CLASS',
                        'LENGTH', 'GVW', 'ESAL', 'W1', 'S1', 'W2', 'S2',
                        'W3', 'S3', 'W4', 'S4', 'W5', 'S5', 'W6', 'S6',
                        'W7', 'S7', 'W8', 'S8', 'W9', 'S9', 'W10', 'S10',
                        'W11', 'S11', 'W12', 'S12', 'W13', 'S13', 'W14'],
                 usecols=range(0, 42),  # only the first 42 columns are needed
                 parse_dates={"TIMESTAMP": [0, 1, 2, 3, 4, 5, 6]},
                 date_parser=parser,
                 header=None)
Error:
CParserError: Error tokenizing data. C error: Expected 53 fields in line 1605634, saw 54
If I pass the error_bad_lines=False keyword, problematic lines are displayed similar to the example below:
Skipping line 1683401: expected 53 fields, saw 54
However, I then get the following error (and the DataFrame does not get loaded):
CParserError: Too many columns specified: expected 54 and found 53
If I pass the engine='python' keyword, I do not get any errors, but it takes a really long time to parse the data. Please note that 53 and 54 are switched in the error messages depending on whether error_bad_lines=False is used or not.
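To check the field counts independently of pandas, something like the following sketch could be used. It assumes the file is comma-delimited, and the helper names `count_field_widths` and `first_row_with_width` are hypothetical, made up for illustration rather than taken from pandas. The sample data is invented so the snippet runs on its own.

```python
import csv
import io
from collections import Counter


def count_field_widths(lines, delimiter=","):
    """Return a Counter mapping number-of-fields -> number of rows."""
    reader = csv.reader(lines, delimiter=delimiter)
    return Counter(len(row) for row in reader)


def first_row_with_width(lines, width, delimiter=","):
    """Return the 1-based row number of the first row with `width` fields, or None."""
    reader = csv.reader(lines, delimiter=delimiter)
    for row_number, row in enumerate(reader, start=1):
        if len(row) == width:
            return row_number
    return None


# Tiny in-memory stand-in for the real file (made-up data, not from the issue).
sample = "a,b,c\n1,2,3\n4,5\n6,7,8\n"
widths = count_field_widths(io.StringIO(sample))
short_row = first_row_with_width(io.StringIO(sample), 2)
print(widths, short_row)
```

For the real file, the same helpers could be called on an open file handle, e.g. `with open(filename, newline="") as f: count_field_widths(f)`. Row numbers match line numbers only if no quoted field spans multiple lines.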