Add "column headers detection function" when column headers row is unknown #294

ev45ive · 2020-08-01T18:07:19Z

Is your feature request related to a problem? Please describe.
When the column headers differ between documents and the header row sits on a different line in each one (each document has a different "intro"), I have to parse once to find the header and then parse again starting from that line.
Also, "from" vs "from_line" becomes confusing when the header is not the first row.

Describe the solution you'd like
I think that, aside from a static row number, there should be an option to supply a function that returns true or an array of headers (column names) when it detects the column header row. Parsing would then continue from that row, with columns (object fields) named after the headers.
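To make the request concrete, here is a minimal sketch of the behaviour being asked for. It is not an existing csv-parse option: `recordsWithDetectedHeader` and the `detectHeader` callback are hypothetical names, and the input rows are assumed to be already split into arrays of strings.

```javascript
// Hypothetical helper, NOT part of csv-parse: applies a header-detection
// callback to rows that have already been split into arrays of strings.
function recordsWithDetectedHeader(rows, detectHeader) {
  let columns = null;
  const records = [];
  for (const row of rows) {
    if (columns === null) {
      const result = detectHeader(row);
      if (result === true) {
        columns = row; // "auto": use the detected row itself as column names
      } else if (Array.isArray(result)) {
        columns = result; // use the column names supplied by the callback
      }
      continue; // rows up to and including the header produce no record
    }
    const record = {};
    columns.forEach((name, i) => { record[name] = row[i]; });
    records.push(record);
  }
  return records;
}

// Illustrative input: one "intro" row, then the header, then data.
const rows = [
  ['line to skip 1'],
  ['a', 'b', 'c'],
  ['1', '2', '3'],
  ['5', '6', '7'],
];
const records = recordsWithDetectedHeader(
  rows,
  (row) => row.includes('a') && row.includes('b'),
);
console.log(records);
```

The callback returns true here, so the detected row supplies the field names; returning an array instead would let the caller rename the columns.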

Describe alternatives you've considered
I've tried a lot of workarounds, such as using on_record with a global isHeader boolean and then columns to convert the rows to columns manually. All of them felt like reinventing what the library already does well (when the header is always in the same place).

The second workaround was to parse once to locate the header and then start over, parsing from that line. That seems to work, but the code is much more complicated and I cannot just stream the file and do it in one pass.

Let me know how much of a problem it would be to add header detection and, if you are accepting PRs, what conditions / requirements a PR would have to meet to be accepted without much back and forth. Maybe I could contribute. Whatever would work :-)

Aside from that, great job and awesome library, guys! It helped me a lot with a LOT of huge and nasty CSV files. :-)


wdavidw · 2020-08-04T20:46:10Z

If I understood correctly, the columns option can already be defined as an array, see those 3 tests. It is also documented. Does this answer your need, or did I read your request too fast? Please provide a small sample of what you want, to help me understand.

ev45ive · 2020-08-05T16:41:13Z

Nope, I know about that one and unfortunately, the columns array doesn't help.

The problem is when the header position (its row number) is not known.

There is no way to "detect" the header row and THEN return an Array (or true for auto), as with the option you suggested.

ev45ive · 2020-08-05T16:42:40Z

Each document I have has a different start/header row position.

The column labels also change. Each time, I have to find the header first and then start parsing again.

wdavidw · 2020-08-05T19:12:45Z

Then please illustrate your requirements with a minimal sample.

d-mon- · 2021-02-26T06:56:37Z

First of all, thank you for this awesome library!

We encountered the same issue on our side too.

As @ev45ive described, we need to skip from 1 to 8 rows before reaching the header position.
In addition, the column order may differ from one file to another: it can be a,b,c; a,b,c,d; or a,c,b.

Currently the only solution we could find was the same as described above:

1. parse the file once to find the header position
2. parse it again starting from the header position found previously.

Because we are managing large files, we must ensure that only a small section of the file is processed during the first operation (with from_line, to_line).
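The bounded first pass could look roughly like the sketch below. It is library-free and only an illustration: `findHeaderLine`, `isHeader` and `maxLines` are hypothetical names, and fields are split naively on commas (no quoted fields). The returned 1-based line number is the kind of value one would then hand to from_line for the second pass.

```javascript
// Hypothetical helper: returns the 1-based line number of the header row,
// or -1 if no header is found within the first `maxLines` lines.
function findHeaderLine(text, isHeader, maxLines) {
  let start = 0;
  for (let lineNumber = 1; lineNumber <= maxLines && start <= text.length; lineNumber++) {
    let end = text.indexOf('\n', start);
    if (end === -1) end = text.length; // last line without trailing newline
    const line = text.slice(start, end).replace(/\r$/, ''); // tolerate CRLF
    // Naive split: assumes no quoted fields containing commas.
    if (isHeader(line.split(','))) return lineNumber;
    start = end + 1;
  }
  return -1;
}

// Assumed rule for this example: the header is the first row containing
// the field names a, b and c, in any order.
const hasAbc = (fields) => ['a', 'b', 'c'].every((name) => fields.includes(name));

const sample = 'line to skip 1\nline to skip 2\na,c,b\n1,3,2\n5,7,6\n';
const headerLine = findHeaderLine(sample, hasAbc, 10);
console.log(headerLine); // 3
```

In practice only the first chunk of the file would be read into `text`, so the cost of this pass does not grow with the file size.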

Here's a small illustration of the files we may have to deal with for the same parser:

line to skip 1
a,b,c
1,2,3
5,6,7

line to skip 1
line to skip 2
a,c,b
1,3,2
5,7,6

line to skip 1
line to skip 2
a,b,c,d
1,2,3,4
5,6,7,8
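For comparison, here is a sketch of what a single pass over these three files could produce if header detection were available. It does not use csv-parse: `recordsAfterHeader` and `hasAbc` are hypothetical names, the detection rule (a row containing a, b and c) is an assumption, and fields are split naively on commas.

```javascript
// Hypothetical one-pass sketch: a generator that consumes lines lazily,
// drops everything before the header, then yields objects keyed by header.
function* recordsAfterHeader(lines, isHeader) {
  let columns = null;
  for (const line of lines) {
    const fields = line.split(','); // naive split, no quoted fields
    if (columns === null) {
      if (isHeader(fields)) columns = fields; // header found, start emitting
      continue;
    }
    yield Object.fromEntries(columns.map((name, i) => [name, fields[i]]));
  }
}

// Assumed rule: the header is the first row containing a, b and c.
const hasAbc = (fields) => ['a', 'b', 'c'].every((name) => fields.includes(name));

// The three sample files above, as arrays of lines.
const files = [
  ['line to skip 1', 'a,b,c', '1,2,3', '5,6,7'],
  ['line to skip 1', 'line to skip 2', 'a,c,b', '1,3,2', '5,7,6'],
  ['line to skip 1', 'line to skip 2', 'a,b,c,d', '1,2,3,4', '5,6,7,8'],
];
const parsed = files.map((lines) => [...recordsAfterHeader(lines, hasAbc)]);
console.log(parsed);
```

Because records are keyed by the detected header, the first two files yield records with the same a, b and c values despite the different column order, and the third simply gains a d field.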

ev45ive added the enhancement label Aug 1, 2020
