This repository has been archived by the owner on Jun 28, 2021. It is now read-only.

Add "column headers detection function" when column headers row is unknown #294

Open
ev45ive opened this issue Aug 1, 2020 · 5 comments

Comments


ev45ive commented Aug 1, 2020

Is your feature request related to a problem? Please describe.
The column headers differ between documents, and so does the row they appear on (each document has a different "intro").
I have to parse each file once to find the header and then parse it again starting from that line.
Also, "from" vs "from_line" becomes confusing when the header is not the first row.

Describe the solution you'd like
I think that, aside from a static row number, there should be an option to pass a function that returns true or an array of headers (column names) when it detects the header row. Parsing would then continue from that row on, with columns (object fields) named after the headers.
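To make the request concrete, here is a minimal sketch of the behaviour described above. It assumes the rows have already been split into arrays of strings (as any CSV parser would produce without named columns). The names `recordsAfterHeader` and `detectHeader` are hypothetical and are not part of the library's API.

```javascript
// Hypothetical helper, not part of csv-parse.
// `rows` is an array of rows, each row an array of strings.
// `detectHeader(row)` returns true (use this row as the header),
// an array of column names, or a falsy value (not the header yet).
function recordsAfterHeader(rows, detectHeader) {
  let columns = null;
  const records = [];
  for (const row of rows) {
    if (columns === null) {
      // Still in the "intro": ask the callback whether this row is the header.
      const result = detectHeader(row);
      if (result === true) columns = row;
      else if (Array.isArray(result)) columns = result;
      continue; // intro rows and the header row itself are never emitted
    }
    // Past the header: map each value to its column name.
    const record = {};
    columns.forEach((name, i) => {
      record[name] = row[i];
    });
    records.push(record);
  }
  return records;
}

// Example: the header is the first row containing both "a" and "b".
const rows = [
  ["line to skip 1"],
  ["line to skip 2"],
  ["a", "c", "b"],
  ["1", "3", "2"],
  ["5", "7", "6"],
];
const records = recordsAfterHeader(
  rows,
  (row) => row.includes("a") && row.includes("b")
);
console.log(records);
```

Because the callback is consulted only until the header is found, the whole file can be handled in a single pass.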

Describe alternatives you've considered
I've tried a lot of workarounds, such as using on_record with a global isHeader boolean and then building the columns manually. All of them felt like reinventing what the library already does well (when the header is always in the same place).

The second workaround was to parse once to locate the header and then start over, parsing from that line. That seems to work, but the code is much more complicated and I cannot just stream the file and do it in one pass.

Let me know how much of a problem it would be to add header detection and, if you are accepting PRs, what the conditions / requirements are for one to be accepted without much back and forth. Maybe I could contribute. Whatever would work :-)

Aside from that, great job and awesome library, guys! It helped me a lot with a LOT of huge and nasty CSV files. :-)

Member

wdavidw commented Aug 4, 2020

If I understood correctly, the columns option can already be defined as an array, see those 3 tests. It is also documented. Does this answer your need, or did I read your request too fast? Please provide a small sample of what you would like, to ease my understanding.

Author

ev45ive commented Aug 5, 2020

Nope, I know about that one and unfortunately, the columns array doesn't help.

The problem arises when the header position (its row number) is not known.

There is no way to detect the header row and THEN return an array (or true for auto), as with the option you suggested.

Author

ev45ive commented Aug 5, 2020

Each document I have has a different start/header row position.

Column labels also change. Each time, I have to find the header first and then start parsing again.

Member

wdavidw commented Aug 5, 2020

Then please illustrate your requirements with a minimal sample.


d-mon- commented Feb 26, 2021

First of all, thank you for this awesome library!

We encountered the same issue on our side too.

As @ev45ive described, we need to skip from 1 to 8 rows before reaching the header position.
In addition, the column order may differ from one file to another: it can be "a,b,c", "a,b,c,d" or "a,c,b".

Currently, the only solution we could find is the one described above:

  • parse the file once to find the header position
  • and parse it again starting from the header position found previously.

Because we handle large files, we must ensure that only a small section of the file is parsed during the first pass (with from_line, to_line).

Here's a small illustration of the files we may have to deal with for the same parser:

File 1:

line to skip 1
a,b,c
1,2,3
5,6,7

File 2:

line to skip 1
line to skip 2
a,c,b
1,3,2
5,7,6

File 3:

line to skip 1
line to skip 2
a,b,c,d
1,2,3,4
5,6,7,8
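The first pass of the two-pass workaround can be sketched as below, using the three sample files above. `findHeaderLine` is a hypothetical helper, not a library function. It assumes the header contains no quoted fields with embedded commas and that the expected column names ("a", "b", "c") are known in advance.

```javascript
// Hypothetical helper: scan at most `maxLines` lines of raw text and return
// the 1-based line number of the first line that names every expected column.
function findHeaderLine(text, expected, maxLines = 10) {
  const lines = text.split(/\r?\n/).slice(0, maxLines);
  for (let i = 0; i < lines.length; i++) {
    // Naive split: assumes no quoted fields containing commas in the header.
    const cells = lines[i].split(",").map((cell) => cell.trim());
    if (expected.every((name) => cells.includes(name))) return i + 1;
  }
  return -1; // no header found within the scanned prefix
}

const samples = [
  "line to skip 1\na,b,c\n1,2,3\n5,6,7\n",
  "line to skip 1\nline to skip 2\na,c,b\n1,3,2\n5,7,6\n",
  "line to skip 1\nline to skip 2\na,b,c,d\n1,2,3,4\n5,6,7,8\n",
];
const headerLines = samples.map((s) => findHeaderLine(s, ["a", "b", "c"]));
console.log(headerLines); // [ 2, 3, 3 ]
```

The returned line number could then presumably be passed as from_line for the second pass, with columns set to true so that the detected row supplies the field names. Only the first few lines are inspected here, which keeps the first pass cheap on large files.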
