Mixed eol characters in the same csv file is not handled #439

Open
zxlin opened this issue Feb 18, 2022 · 4 comments


zxlin commented Feb 18, 2022

This may be a non-standard CSV format, but if a CSV file uses a carriage return as the EOL character in the first row and then, say, newline characters for the remaining rows, the lib parses all remaining lines as one giant row. The output is then an array containing a single object with a massive number of keys (the example file below yields field7507057 as the last key in the object).

An example of this kind of file is a data file from the US Department of Education: https://nces.ed.gov/surveys/pss/zip/pss1920_pu_csv.zip

This may be outside the scope of this lib to handle, but I wanted to bring it to your attention.

Repro steps:

Download and unzip the example file

$ csvtojson pss1920_pu_csv > pss.json

See the results:

$ tail -c 100 pss.json
1372549","field7507055":"0","field7507056":"2.94117647058824","field7507057":"5.48387096774194"}

]
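
One way to confirm that a file really mixes line endings before parsing it is to count each kind. The following is a minimal sketch, not part of csvtojson; `countLineEndings` is a hypothetical helper name, and it assumes the text (or a sample of it) is already in memory as a string.

```javascript
// Hypothetical helper (not part of csvtojson): counts each line-ending style.
function countLineEndings (text) {
  const counts = { crlf: 0, cr: 0, lf: 0 }
  // '\r\n' is listed first so a CRLF pair is not counted as a CR plus an LF.
  for (const match of text.match(/\r\n|\r|\n/g) || []) {
    if (match === '\r\n') counts.crlf++
    else if (match === '\r') counts.cr++
    else counts.lf++
  }
  return counts
}

// A header row ending in a lone CR, followed by rows ending in LF.
console.log(countLineEndings('a,b\r1,2\n3,4\n'))
```

If more than one counter is non-zero, the file mixes line endings and probably needs preprocessing.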

jfoclpf commented Mar 21, 2022

IMHO this lib does not have to support every kind of incorrectly formatted CSV file; there are users already complaining about the size of the lib.

You have to preprocess the file before feeding it into this module. In Node this is quite easy, and you can use a stream reader with this lib.


zxlin commented Mar 21, 2022

I don't necessarily disagree. As mentioned, this may be out of scope for this lib. The outcome could be a code fix, simple documentation describing how the newline character is auto-detected and used, or no action at all. I just wanted to bring it up with the maintainers here in the event that this case was not considered.


jfoclpf commented Mar 21, 2022

Maybe I was not clear enough: this module provides a fromStream method, which you can use to preprocess the file.

Untested code, something like this:

const fs = require('fs')
const { Transform } = require('stream')
const csv = require('csvtojson')

// Transform stream that preprocesses each chunk before csvtojson reads it
const trans = new Transform({
  transform (chunk, encoding, callback) {
    // process chunk, for example chunk.toString().toUpperCase()
    const processedChunk = chunk.toString().toUpperCase()
    callback(null, processedChunk)
  }
})

csv()
  .fromStream(fs.createReadStream('/path/to/file', { encoding: 'utf-8' }).pipe(trans))
  .subscribe(
    (json) => {
      console.log(json)
    },
    (err) => {
      throw err
    },
    () => {
      console.log('success')
    }
  )
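
For the mixed line-ending file described in this issue, the transform body could normalize line endings rather than upper-case the text. The following is a minimal sketch under stated assumptions, not code from the thread: `createEolNormalizer` is a hypothetical helper name, the input is assumed to be UTF-8, and every `\r\n` or lone `\r` is rewritten to `\n`. It holds back a trailing `\r` so that a `\r\n` pair split across two chunks is not turned into two line breaks.

```javascript
const { Transform } = require('stream')
const { StringDecoder } = require('string_decoder')

// Hypothetical helper (not part of csvtojson): rewrites '\r\n' and lone '\r' to '\n'.
function createEolNormalizer () {
  // StringDecoder keeps multi-byte UTF-8 characters intact across chunk boundaries.
  const decoder = new StringDecoder('utf8')
  let pendingCR = false // true when the previous chunk ended with '\r'

  const normalize = (text) => {
    if (pendingCR) {
      text = '\r' + text
      pendingCR = false
    }
    if (text.endsWith('\r')) {
      // Hold back a trailing '\r': it may be the first half of a '\r\n' pair.
      pendingCR = true
      text = text.slice(0, -1)
    }
    return text.replace(/\r\n?/g, '\n')
  }

  return new Transform({
    transform (chunk, encoding, callback) {
      // With default options, written strings arrive here as Buffers.
      callback(null, normalize(decoder.write(chunk)))
    },
    flush (callback) {
      // Emit whatever is left, including a held-back final '\r' as '\n'.
      const rest = normalize(decoder.end())
      callback(null, rest + (pendingCR ? '\n' : ''))
    }
  })
}
```

To use it with the snippet above, pipe the read stream through `createEolNormalizer()` instead of `trans` before passing it to `fromStream`.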


zxlin commented Mar 21, 2022

Thanks, that's exactly what I used to pre-process the file actually.
