csv-parser does not parse a big CSV file correctly; after ~95K rows it begins merging all rows into a single JSON. #207

Open
iliaivanov2016 opened this issue Nov 30, 2021 · 2 comments

Comments

@iliaivanov2016

  • Operating System: CentOS 7
  • Node Version: 16.13.0
  • NPM Version: 8.1.4
  • csv-parser Version: v3.0.0

Expected Behavior

167K rows parsed

Actual Behavior

Only ~95K rows parsed

How Do We Reproduce?

https://edbq.xyz/test/Freight3.csv

@danneu

danneu commented Mar 4, 2022

I'm seeing something like this too with the authors dump on https://openlibrary.org/developers/dumps.

Replacing csv-parser with csv-stream, with no changes to the data or options, fixes the issue (a rough sketch of that swap is below).
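For reference, a minimal sketch of that swap, assuming csv-stream's createStream options (delimiter, columns) and reusing the file name and column list from the reproduction code further down; treat it as untested:

const fs = require('fs')
const zlib = require('zlib')

// csv-stream keys each emitted row object by the names in `columns`
const csvStream = require('csv-stream').createStream({
    delimiter: '\t',
    columns: ['type', 'key', 'revision', 'last_modified', 'json'],
})

fs.createReadStream('ol_dump_authors_latest.txt.gz')
    .pipe(zlib.createGunzip())
    .pipe(csvStream)
    .on('data', (row) => {
        // row.type, row.key, row.json, etc.
    })
    .on('error', (err) => console.error(err))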

However, I don't think it's failing after N rows. Rather, there seems to be a bug with quote/end-of-line detection: the parser will produce a row that contains hundreds of concatenated rows in the final column, go back to parsing rows correctly, then emit another long concatenated row, and so on, back and forth.

This code will demonstrate the issue on https://openlibrary.org/data/ol_dump_authors_latest.txt.gz (0.4GB):

// pipeline() chains the streams and returns the last one, so `pipe` emits parsed rows
const pipe = require('stream').pipeline(
    require('fs').createReadStream('ol_dump_authors_latest.txt.gz'),
    require('zlib').createGunzip(),
    require('csv-parser')({
        headers: ['type', 'key', 'revision', 'last_modified', 'json'],
        separator: '\t',
    }),
    (err) => err ? console.error(err) : console.log('done')
)

let seen = 0

pipe.on('data', (row) => {
    seen++
    // detect long row
    if (row.json.length > 10000) {
        console.log(seen, row)
    }
})

This code will reveal many problem rows that accidentally concatenate following rows into the final column.

[40430] {
  type: '/type/author',
  key: '/authors/OL5247858A',
  revision: '1',
  last_modified: '2008-09-28T05:16:27.104438',
  json: '{"name": "Kommunisticheskaya partiya Armenii. S\\"ezd", "last_modified": {"type": "/type/datetime", "value": "2008-09-28 05:16:27.104438"}, "key": "/a/OL5247858A", "type": {"key": "/type/author"}, "id": 26329826, "revision": 1}\n' +
    '/type/author\t/authors/OL5247929A\t1\t2008-09-28T05:17:19.811748\t{"name": "Archibald Gray", "personal_name": "Archibald Gray", "last_modified": {"type": "/type/datetime", "value": "2008-09-28 05:17:19.811748"}, "key": "/a/OL5247929A", "type": {"key": "/type/author"}, "id": 26330110, "revision": 1}\n' +
    '/type/author\t/authors/OL5248963A\t1\t2008-09-28T05:39:41.512087\t{"name": "GREAT BRITAIN.  ROYAL COMMISSION ON LABOUR IN INDIA", "last_modified": {"type": "/type/datetime", "value": "2008-09-28 05:39:41.512087"}, "key": "/a/OL5248963A", "type": {"key": "/type/author"}, "id": 26336569, "revision": 1}\n' +
  '/type/au'... 710973 more characters

I notice that it happens on any row that has an escaped quote \" like in the example above. It looks like the parser starts concatenating rows when it sees the first \" and only finishes at the next row that contains a \".

Perhaps { escape: '\\' } just needs to be passed to the parser, but I would have thought that the default of escape: '"' would handle backslash escapes between quotes.
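For what it's worth, a sketch of that experiment (escape and quote are documented csv-parser options; whether '\\' actually fixes this dump is untested here):

const parser = require('csv-parser')({
    headers: ['type', 'key', 'revision', 'last_modified', 'json'],
    separator: '\t',
    escape: '\\', // treat backslash as the escape character instead of the default '"'
    // quote: '"', // the default; quote detection is what appears to misfire above
})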

@mjpowersjr

I also hit this bug, somewhere around line 2.7M in the following data set:

https://ridb.recreation.gov/downloads/reservations2022.zip

Switching to papaparse worked on the same file.
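A minimal sketch of that switch, assuming the CSV has been extracted from the zip first (the file name below is hypothetical) and using Papa Parse's Node duplex stream (Papa.NODE_STREAM_INPUT):

const fs = require('fs')
const Papa = require('papaparse')

// Papa.parse(Papa.NODE_STREAM_INPUT, ...) returns a duplex stream:
// write raw CSV in, read parsed row objects out
const parseStream = Papa.parse(Papa.NODE_STREAM_INPUT, { header: true })

fs.createReadStream('reservations2022.csv') // hypothetical name of the extracted file
    .pipe(parseStream)
    .on('data', (row) => {
        // each `row` is an object keyed by the CSV header line
    })
    .on('end', () => console.log('done'))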
