csv-parser does not parse a big CSV file correctly; after ~95K rows it begins merging all rows into a single JSON. #207

Open
iliaivanov2016 opened this issue Nov 30, 2021 · 2 comments

Comments

@iliaivanov2016

  • Operating System: CentOS 7
  • Node Version: 16.13.0
  • NPM Version: 8.1.4
  • csv-parser Version: v3.0.0

Expected Behavior

167K rows parsed

Actual Behavior

Only ~95K rows parsed

How Do We Reproduce?

https://edbq.xyz/test/Freight3.csv

@danneu

danneu commented Mar 4, 2022

I'm seeing something like this too with the authors dump on https://openlibrary.org/developers/dumps.

Replacing csv-parser with csv-stream, with no changes to the data or options, fixes the issue (a rough sketch of that swap is below).
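For reference, a minimal sketch of that swap, assuming csv-stream's createStream options (delimiter, columns) and reusing the file name and column list from the reproduction code further down; treat it as untested:

const fs = require('fs')
const zlib = require('zlib')

// csv-stream keys each emitted row object by the names in `columns`
const csvStream = require('csv-stream').createStream({
    delimiter: '\t',
    columns: ['type', 'key', 'revision', 'last_modified', 'json'],
})

fs.createReadStream('ol_dump_authors_latest.txt.gz')
    .pipe(zlib.createGunzip())
    .pipe(csvStream)
    .on('data', (row) => {
        // row.type, row.key, row.json, etc.
    })
    .on('error', (err) => console.error(err))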

However, I don't think it's failing after N rows. Rather, there seems to be a bug with quote/end-of-line detection: the parser will produce a row that contains hundreds of concatenated rows in the final column, go back to parsing rows correctly, then emit another long concatenated row, and so on, back and forth.

This code will demonstrate the issue on https://openlibrary.org/data/ol_dump_authors_latest.txt.gz (0.4GB):

// pipeline() chains the streams and returns the last one, so `pipe` emits parsed rows
const pipe = require('stream').pipeline(
    require('fs').createReadStream('ol_dump_authors_latest.txt.gz'),
    require('zlib').createGunzip(),
    require('csv-parser')({
        headers: ['type', 'key', 'revision', 'last_modified', 'json'],
        separator: '\t',
    }),
    (err) => err ? console.error(err) : console.log('done')
)

let seen = 0

pipe.on('data', (row) => {
    seen++
    // detect long row
    if (row.json.length > 10000) {
        console.log(seen, row)
    }
})

This code will reveal many problem rows that accidentally concatenate following rows into the final column.

[40430] {
  type: '/type/author',
  key: '/authors/OL5247858A',
  revision: '1',
  last_modified: '2008-09-28T05:16:27.104438',
  json: '{"name": "Kommunisticheskaya partiya Armenii. S\\"ezd", "last_modified": {"type": "/type/datetime", "value": "2008-09-28 05:16:27.104438"}, "key": "/a/OL5247858A", "type": {"key": "/type/author"}, "id": 26329826, "revision": 1}\n' +
    '/type/author\t/authors/OL5247929A\t1\t2008-09-28T05:17:19.811748\t{"name": "Archibald Gray", "personal_name": "Archibald Gray", "last_modified": {"type": "/type/datetime", "value": "2008-09-28 05:17:19.811748"}, "key": "/a/OL5247929A", "type": {"key": "/type/author"}, "id": 26330110, "revision": 1}\n' +
    '/type/author\t/authors/OL5248963A\t1\t2008-09-28T05:39:41.512087\t{"name": "GREAT BRITAIN.  ROYAL COMMISSION ON LABOUR IN INDIA", "last_modified": {"type": "/type/datetime", "value": "2008-09-28 05:39:41.512087"}, "key": "/a/OL5248963A", "type": {"key": "/type/author"}, "id": 26336569, "revision": 1}\n' +
  '/type/au'... 710973 more characters

I notice that it happens on any row that has an escaped quote \" like in the example above. It looks like the parser starts concatenating rows when it sees the first \" and only finishes at the next row that contains a \".

Perhaps { escape: '\\' } just needs to be passed to the parser, but I would have thought that the default of escape: '"' would handle backslash escapes between quotes.
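For what it's worth, a sketch of that experiment (escape and quote are documented csv-parser options; whether '\\' actually fixes this dump is untested here):

const parser = require('csv-parser')({
    headers: ['type', 'key', 'revision', 'last_modified', 'json'],
    separator: '\t',
    escape: '\\', // treat backslash as the escape character instead of the default '"'
    // quote: '"', // the default; quote detection is what appears to misfire above
})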

@mjpowersjr

I also hit this bug, somewhere around line 2.7M in the following data set:

https://ridb.recreation.gov/downloads/reservations2022.zip

Switching to papaparse worked on the same file.
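A minimal sketch of that switch, assuming the CSV has been extracted from the zip first (the file name below is hypothetical) and using Papa Parse's Node duplex stream (Papa.NODE_STREAM_INPUT):

const fs = require('fs')
const Papa = require('papaparse')

// Papa.parse(Papa.NODE_STREAM_INPUT, ...) returns a duplex stream:
// write raw CSV in, read parsed row objects out
const parseStream = Papa.parse(Papa.NODE_STREAM_INPUT, { header: true })

fs.createReadStream('reservations2022.csv') // hypothetical name of the extracted file
    .pipe(parseStream)
    .on('data', (row) => {
        // each `row` is an object keyed by the CSV header line
    })
    .on('end', () => console.log('done'))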
