Support extracting end-of-line comments? #27

Open
nickrobinson251 opened this issue Oct 5, 2021 · 4 comments
@nickrobinson251

nickrobinson251 commented Oct 5, 2021

Suppose we have a line like:

111,'STBC ',161.00,1, 0.00, 0.00,227, 1,1.09814, -8.327, 1 /* [STBC 1 ] */

Can we support a user who wants to extract the end-of-line comment "[STBC 1 ]", or even just "STBC 1"?
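To make the request concrete, here is a minimal sketch of the desired result using only Base Julia string functions. The helper names `eol_comment` and `comment_label` are hypothetical and not part of the package, and this works on a `String` rather than on the raw bytes the parser actually sees.

```julia
# Hypothetical helper (not part of the package): return the text between
# `/*` and `*/` at the end of a line, or `missing` if there is no comment.
function eol_comment(line::AbstractString)
    m = match(r"/\*\s*(.*?)\s*\*/\s*$", line)
    return m === nothing ? missing : String(m.captures[1])
end

# Hypothetical helper: drop the surrounding brackets and collapse runs of
# whitespace, turning "[STBC   1   ]" into "STBC 1".
function comment_label(comment::AbstractString)
    inner = strip(comment, ['[', ']', ' '])
    return join(split(inner), " ")
end

line = "111,'STBC      ',161.00,1,    0.00,    0.00,227,   1,1.09814,  -8.327,  1 /* [STBC   1   ] */ "
c = eol_comment(line)      # the bracketed comment text
label = comment_label(c)   # the comment with brackets and extra spaces removed
```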

nickrobinson251 added the "improvement" label on Oct 5, 2021
nickrobinson251 added the "new feature" label and removed the "improvement" label on Oct 26, 2021
@nickrobinson251

this may require reverting back to the tactic we were using before #28

@nickrobinson251

@raphaelsaavedra discovered these actually come in two ways (from different sources):

  1. the "trailing characters" case, given above, which i presumed was some kind of end-of-line comment (with /* as some kind of comment marker)
 111,'STBC      ',161.00,1,    0.00,    0.00,227,   1,1.09814,  -8.327,  1 /* [STBC   1   ] */ 

and

  2. the "extra column" case (note the final , separator):

 111,'STBC      ',161.00,1,    0.00,    0.00,227,   1,1.09814,  -8.327,  1, /* [STBC   1   ] */ 

And we may need to support both.

Fortunately, i think we can support both.

  • we probably need to add to all Records an extra Union{Missing,String} column (maybe this String type could be String31 or something? or even be detected as part of parsing e.g. the smallest string type possible, like CSV.jl does)
  • we might want the ability to opt-in to parsing them (i.e. returning them as part of the parsed data) e.g. a comments=true keyword... which we could either default to false or default to "true if present" and then do some auto-detecting on whether or not there are comments present (e.g. by checking the first line of the Buses data).
  • For the "trailing characters" case (a comment with no , before it), we need to handle hitting an invalid delimiter...
  • For the "extra column" case, i guess we'd need to follow all the current last columns by a _parse_maybemissing call
    • potentially this could depend on a comments::Bool keyword
  • All of this is slightly complicated further when records are not a single line (e.g. Transformers, Multi-Terminal DC lines, etc)
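As a rough sketch of the first two bullets, the extra column and the opt-in keyword might look something like this. Everything here is hypothetical: `BusRow` is a cut-down stand-in for a real record type, and `has_comments` / `resolve_comments` are illustrative names, with `comments=nothing` standing in for the "true if present" default.

```julia
# Hypothetical, cut-down record type with a trailing optional comment column.
struct BusRow
    i::Int
    name::String
    comment::Union{Missing,String}
end

# Hypothetical auto-detection: look for a comment marker on the first data line.
has_comments(first_line::AbstractString) = occursin("/*", first_line)

# `comments=nothing` means "true if present"; an explicit Bool overrides detection.
function resolve_comments(first_line::AbstractString; comments::Union{Nothing,Bool}=nothing)
    return comments === nothing ? has_comments(first_line) : comments
end

first_line = "111,'STBC      ',161.00,1 /* [STBC   1   ] */"
detected = resolve_comments(first_line)                  # auto-detect from the data
forced_off = resolve_comments(first_line; comments=false) # user opts out
row = BusRow(111, "STBC", missing)                       # no comment available
```

A fixed-width string type (as suggested in the first bullet) could replace `String` in the `Union`; that choice is left open here.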

@raphaelsaavedra

Just guessing out here since I know very little about how this package is structured, but wouldn't it be a good idea to make it so that both cases can be addressed in the same way? e.g. by doing something like "if we see there's a comment at the end of the line, split it out to a new column", which makes case 1 become the same as case 2.
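One way to picture this suggestion is as a pre-processing step on each line. The sketch below uses only Base Julia; `normalize_comment` is a hypothetical name, and it operates on a `String` copy of the line, which is not how the package reads the file.

```julia
# Hypothetical pre-processing step: if a `/* ... */` comment follows the last
# value without a separating comma, insert one so the comment becomes its own
# column (case 1 becomes case 2). Lines without a comment are returned as-is.
function normalize_comment(line::AbstractString)
    r = findfirst("/*", line)
    r === nothing && return String(line)
    head = rstrip(line[1:prevind(line, first(r))])
    endswith(head, ",") && return String(line)   # already the "extra column" case
    return string(head, ", ", line[first(r):end])
end

case1 = " 111,'STBC      ',161.00,1,    0.00,    0.00,227,   1,1.09814,  -8.327,  1 /* [STBC   1   ] */ "
case2 = " 111,'STBC      ',161.00,1,    0.00,    0.00,227,   1,1.09814,  -8.327,  1, /* [STBC   1   ] */ "
normalized = normalize_comment(case1)
```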

@nickrobinson251

nickrobinson251 commented Feb 22, 2022

the difference is in what Parsers.jl sees

basically how parsing works is that the file is a big vector of bytes (a Vector{UInt8}) and we go "byte by byte" through it (well, Parsers.jl does).

We tell Parsers.jl:

  • (i) how to split the file up into bytes which go together (e.g. that "bytes which go together" are separated by the delimiter , i.e. 0x2c) which is done via the Parsers.Options, and
  • (ii) what type those bytes should be parsed into (e.g. [0x31, 0x32, 0x33] should be an Int64) which is given by the field type for that column (i.e. we hardcode the column-types in dedicated structs, e.g. Loads, then pass this info to Parsers.jl)

then Parsers.xparse does the heavy lifting (here's the main parsing code, which is all just "use xparse and handle what it gives us", e.g. check it worked and store the returned value)
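Steps (i) and (ii) can be imitated by hand with Base Julia, which may help show what is being delegated. This is only an illustration of the idea and does not call Parsers.jl or reflect how it is implemented; the variable names are made up for the example.

```julia
# The "file" as a vector of bytes.
bytes = Vector{UInt8}("123,456")

# Step (i): the delimiter tells us which bytes go together.
delim = UInt8(',')                      # 0x2c
pos = findfirst(==(delim), bytes)       # index of the first delimiter
field = bytes[1:pos-1]                  # the bytes before it: 0x31, 0x32, 0x33

# Step (ii): the column type tells us what to parse those bytes into.
# `String(v)` takes ownership of `v`, so pass a copy to keep `field` intact.
value = parse(Int, String(copy(field)))
```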

Anyway, all of this is to say that Parsers will see the two cases differently. In the first case, 1 /* [STBC 1 ] */ won't be split correctly into "bytes which go together" unless we tell it how to (i.e. if we say "',' is the delimiter between bytes which go together" then this won't be split up as we need it to be in step (i)). In contrast, 1, /* [STBC 1 ] */ will be split up fine with the current code... but we'd still need to add an extra String column to the structs for step (ii).
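The difference shows up even with a naive comma split in Base Julia. This sketch uses `split` and `tryparse` on strings purely for illustration; it is an assumption-free stand-in for the idea, not what Parsers.jl does internally.

```julia
# Tail end of the two kinds of line, split on the ',' delimiter only.
case1 = "-8.327,  1 /* [STBC   1   ] */"    # "trailing characters"
case2 = "-8.327,  1, /* [STBC   1   ] */"   # "extra column"

fields1 = strip.(split(case1, ','))   # 2 fields: the integer and comment are fused
fields2 = strip.(split(case2, ','))   # 3 fields: the comment is its own column

# In case 1 the last field is not a valid integer; in case 2 the second field is.
parsed1 = tryparse(Int, fields1[end])
parsed2 = tryparse(Int, fields2[2])
```

In the second case the remaining work is only to give `fields2[3]` somewhere to go, which is the extra String column mentioned above.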
