Using commented header line (flat file readers) #20378

Closed
Socob opened this issue Mar 16, 2018 · 4 comments
Labels
IO CSV (read_csv, to_csv); Needs Info (clarification about behavior needed to assess issue)

@Socob

Socob commented Mar 16, 2018

Currently, it is not possible to both ignore comments and use a commented header when reading a CSV file. From the documentation for the header argument of read_table etc.:

Note that this parameter ignores commented lines and empty lines if skip_blank_lines=True, so header=0 denotes the first line of data rather than the first line of the file.

It would be great if one could specify that header should not skip commented lines so that a header can be used even if it happens to contain the comment character.

Other people requesting this:

@TomAugspurger
Contributor

Can you show an actual example? The two you linked to sound different from "Using a commented header line."

TomAugspurger added the IO CSV (read_csv, to_csv) and Needs Info (clarification about behavior needed to assess issue) labels Mar 16, 2018
@Socob
Author

Socob commented Mar 16, 2018

Sure, just take the example from the Stack Overflow link:

import pandas as pd
from io import StringIO
s = '#one two three\n1 2 3'
pd.read_csv(StringIO(s), delim_whitespace=True, comment='#')

Output:

Empty DataFrame
Columns: [1 2 3]
Index: []

Desired: Instead of the second line (1 2 3), the first line with the comment (#one two three) should be used as the header. The second line should be interpreted as data.

@TomAugspurger
Contributor

Thanks.

FWIW, I think that

f = StringIO(s)
header = f.readline().rstrip().strip("#").split(" ")  # use csv to make more robust
df = pd.read_csv(f, names=header, delim_whitespace=True)  # data is whitespace-separated, as in the example

is pretty clear.
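
For reference, a self-contained sketch of this workaround applied to the example string from the issue. It is only an illustration: it assumes the header is the very first line of the file and that the data is whitespace-separated, so it passes `sep=r"\s+"` and splits the header on any whitespace.

```python
from io import StringIO

import pandas as pd

s = "#one two three\n1 2 3"

f = StringIO(s)
# Read the commented header line by hand and drop the leading '#'.
header = f.readline().strip().lstrip("#").split()
# The file object is now positioned at the first data line, so read_csv
# only sees the data; later comment lines are still skipped via comment='#'.
df = pd.read_csv(f, names=header, sep=r"\s+", comment="#")
print(df)
```

Because the header line is consumed before `read_csv` is called, the existing `comment` handling does not need to change.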

How would your proposal interact with the other keywords that deal with position, like header, skiprows, etc?

Would this require a new keyword to preserve backwards compatibility? As you've written it, it's backwards incompatible, and we're hesitant to add new keywords to the already long read_csv signature, especially when the workaround is relatively straightforward.

@Socob
Author

Socob commented Mar 16, 2018

Of course, I don’t intend to break backwards compatibility. To cover all cases, a new keyword would probably be necessary, yes. The workaround is not so straightforward if the header is not the very first line, but I suppose any cases where that’s necessary are pretty obscure.
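
To illustrate the case where the commented header is not the first line, here is a rough sketch. The helper name `read_commented_header` and its `header_line` parameter are hypothetical, not part of pandas; the sketch also assumes whitespace-separated columns and that the caller knows which line of the file holds the header.

```python
from io import StringIO

import pandas as pd


def read_commented_header(text, header_line, comment="#", **kwargs):
    """Hypothetical helper (not a pandas API): use the commented line at
    0-based file line `header_line` as column names and parse everything
    after it as data."""
    lines = text.splitlines()
    # Strip the comment character from the header line and split into names.
    names = lines[header_line].lstrip(comment).split()
    # Only the lines after the header are handed to read_csv.
    body = "\n".join(lines[header_line + 1:])
    return pd.read_csv(StringIO(body), names=names, comment=comment, **kwargs)


s = "# produced by some tool\n#one two three\n1 2 3\n# trailing note\n4 5 6"
df = read_commented_header(s, header_line=1, sep=r"\s+")
print(df)
```

This needs noticeably more bookkeeping than the first-line case, which is the point made above about the workaround being less straightforward.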

If the reluctance to add new keywords outweighs the benefit of simplifying this use case, I’d be willing to close this.

Socob closed this as completed Mar 20, 2018