Support multiple and/or multi-character and/or regex comment_char in read_csv() #10583

Closed
Wainberg opened this issue Aug 18, 2023 · 5 comments
Labels
accepted Ready for implementation enhancement New feature or an improvement of an existing feature

Comments

@Wainberg
Contributor

Wainberg commented Aug 18, 2023

Problem description

"Multiple" comment_char means that several characters, e.g. ('#', '%'), each start a comment. "Multi-character" means that a sequence such as // starts a comment, like in C++.

In particular, it would be very nice to support comment_char='##' for VCF files, one of the most common file formats in computational biology. In VCF files, the first few lines are metadata starting with a ## (and should be excluded), but the header line starts with a single #, so comment_char='#' would erroneously exclude the header.
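To make the VCF case concrete, here is a minimal sketch in plain Python (no polars involved). The sample content and the helper `drop_comment_lines` are hypothetical and only illustrate the difference between treating `#` and `##` as the comment prefix; this is not how any CSV parser implements it.

```python
# Hypothetical, heavily trimmed VCF content for illustration only.
SAMPLE_VCF = (
    "##fileformat=VCFv4.2\n"
    "##source=example\n"
    "#CHROM\tPOS\tID\tREF\tALT\n"
    "chr1\t100\trs1\tA\tG\n"
)


def drop_comment_lines(text: str, prefix: str) -> list[str]:
    """Return the lines of `text` that do not start with `prefix`."""
    return [line for line in text.splitlines() if not line.startswith(prefix)]


# A single '#' also matches the header line, so the column names are lost.
single = drop_comment_lines(SAMPLE_VCF, "#")

# A two-character '##' prefix removes only the metadata and keeps the header.
double = drop_comment_lines(SAMPLE_VCF, "##")
```

With the single-character prefix only the data row survives, while the `##` prefix keeps both the `#CHROM` header and the data row.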

Multi-character comments were requested in pandas, but the feature request (which was originally about multiple rather than multi-character comments) was closed for being "difficult" to implement. I'm sure it would be no problem for the polars team though :)

@Wainberg Wainberg added the enhancement New feature or an improvement of an existing feature label Aug 18, 2023
@Wainberg Wainberg changed the title Support multiple comment_char and/or multi-character comment_char in read_csv() Support multiple and/or multi-character and/or regex comment_char in read_csv() Aug 18, 2023
@stinodego
Member

For your specific use case, I would recommend setting skip_rows to skip the metadata lines. If you don't know in advance how many lines there are, you could write some util to determine this.
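A util of that kind could look roughly like the sketch below. The function name `count_metadata_lines` and the default `##` prefix are assumptions made for illustration; it uses only the standard library and simply counts the consecutive metadata lines at the top of the input.

```python
import io
from typing import TextIO


def count_metadata_lines(handle: TextIO, prefix: str = "##") -> int:
    """Count the consecutive lines at the start of `handle` that begin with `prefix`."""
    count = 0
    for line in handle:
        if not line.startswith(prefix):
            break  # first non-metadata line reached (the '#CHROM' header in a VCF)
        count += 1
    return count


# In-memory stand-in for a file; with a real file, pass an open text handle instead.
n_skip = count_metadata_lines(io.StringIO("##a\n##b\n#CHROM\tPOS\nchr1\t100\n"))
```

The resulting count is what would be passed as skip_rows to read_csv(), at the cost of reading the start of the file twice.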

Supporting multiple chars / non-ASCII chars would be nice (for separator / quote_char / comment_char / eol_char), but definitely not simple. If someone manages to implement this elegantly, I wouldn't mind a PR.

@ritchie46
Member

@stinodego I have refused to add this on many occasions, and I really don't think we should add it. It would have a very large performance impact, which I don't think is worth it. I want the CSV parser to be performant and as close to a formal CSV format as possible. Multi-character and, worse, regex delimiters would have a very negative performance impact.

@ritchie46
Member

I would only accept multi-char comments, as this can be implemented cheaply.

@Wainberg
Contributor Author

Honestly, I'd be inclined to agree - regex doesn't seem that much more useful and could have a large performance impact unless it's implemented as an entirely separate code path. @ritchie46 would you be inclined to accept #12519 for the multi-char comments?

@Wainberg
Contributor Author

Closing as completed via #12519, thanks everyone!


3 participants