Parsing a Field with internal quotes fails #100

Open
Daenyth opened this issue Apr 10, 2019 · 4 comments

@Daenyth
Contributor

Daenyth commented Apr 10, 2019

It's very common for CSV files in practice to have quotes inside of a field, like this:

h1,h2
f1 "with quote" inside,f2

I'd expect the row to be parsed as follows, because the quotes aren't escaping a separator (, for CSV or \t for TSV):

Row(
  Field("""f1 "with quote" inside"""),
  Field("f2")
)

A test:

  "parser" should "parse inner quotes" in {
    import _root_.cats.data.NonEmptyList
    import _root_.io.chrisdavenport.cormorant.{CSV, parser}
    val s = """f1 "with quote" inside,f2"""
    val r = parser.parseRow(s)
    r shouldBe Right(
      CSV.Row(NonEmptyList.of(CSV.Field("""f1 "with quote" inside"""),
                              CSV.Field("f2"))))
  }

Failure:

Right(Row(NonEmptyList(Field(f1 )))) was not equal to
Right(Row(NonEmptyList(Field(f1 "with quote" inside), Field(f2))))

I expect this behavior exists so that " can be used to escape separators inside fields, but I think that case could still be covered by treating a quote as an escape only when it sits at the edge of the field.
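
To make the "edge of the field" idea concrete, here is a minimal sketch in plain Scala. It is not cormorant's parser and does not use its API: `LenientRow` and `split` are hypothetical names invented for illustration. The assumption is that a quote opens an escaped field only when it is the first character of that field, and any other quote is kept as literal text.

```scala
// Hypothetical helper, not part of cormorant: splits a single line into fields.
object LenientRow {
  def split(line: String, sep: Char = ','): List[String] = {
    val fields = List.newBuilder[String]
    val cur = new StringBuilder
    var atFieldStart = true // true until the first character of a field is consumed
    var inQuoted = false    // true while inside a field that opened with a quote
    var i = 0
    while (i < line.length) {
      val c = line.charAt(i)
      if (inQuoted) {
        if (c != '"') cur.append(c) // separators are literal inside a quoted field
        else if (i + 1 < line.length && line.charAt(i + 1) == '"') {
          cur.append('"') // a doubled quote inside a quoted field is one literal quote
          i += 1
        } else inQuoted = false // closing quote
      } else if (c == sep) {
        fields += cur.toString // separator ends the current field
        cur.clear()
        atFieldStart = true
      } else if (c == '"' && atFieldStart) {
        inQuoted = true // a quote at the edge of the field opens an escaped field
        atFieldStart = false
      } else {
        cur.append(c) // includes quotes that appear mid-field
        atFieldStart = false
      }
      i += 1
    }
    fields += cur.toString
    fields.result()
  }
}
```

Under these assumptions, `LenientRow.split("""f1 "with quote" inside,f2""")` keeps the inner quotes in the first field, while a field that starts with a quote can still contain a separator. Note that this deliberately accepts input the RFC grammar rejects.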

@ChristopherDavenport
Collaborator

ChristopherDavenport commented Apr 10, 2019

You cannot have a quote inside without quotes on the outside per the RFC.

field = (escaped / non-escaped)
escaped = DQUOTE *(TEXTDATA / COMMA / CR / LF / 2DQUOTE) DQUOTE
non-escaped = *TEXTDATA
-- Baseline
COMMA = %x2C
CR = %x0D ;as per section 6.1 of RFC 2234 [2]
DQUOTE =  %x22 ;as per section 6.1 of RFC 2234 [2]
LF = %x0A ;as per section 6.1 of RFC 2234 [2]
CRLF = CR LF ;as per section 6.1 of RFC 2234 [2]
TEXTDATA =  %x20-21 / %x23-2B / %x2D-7E
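
As a rough illustration of why the example field is rejected, the grammar above can be transcribed into a small validator. This is a sketch in plain Scala, not cormorant's implementation; `Rfc4180Field` and its methods are hypothetical names. The key point is that DQUOTE (%x22) is excluded from TEXTDATA, so a non-escaped field can never contain a quote.

```scala
// Hypothetical validator that mirrors the ABNF rules for a single field.
object Rfc4180Field {
  // TEXTDATA = %x20-21 / %x23-2B / %x2D-7E (no DQUOTE %x22, no COMMA %x2C)
  def isTextData(c: Char): Boolean =
    (c >= 0x20 && c <= 0x21) || (c >= 0x23 && c <= 0x2B) || (c >= 0x2D && c <= 0x7E)

  // non-escaped = *TEXTDATA
  def isNonEscaped(s: String): Boolean = s.forall(isTextData)

  // escaped = DQUOTE *(TEXTDATA / COMMA / CR / LF / 2DQUOTE) DQUOTE
  def isEscaped(s: String): Boolean =
    if (s.length < 2 || s.head != '"' || s.last != '"') false
    else {
      val inner = s.substring(1, s.length - 1)
      // Drop every 2DQUOTE pair; a quote that survives is unpaired and invalid.
      val rest = inner.replace("\"\"", "")
      rest.forall(c => isTextData(c) || c == ',' || c == '\r' || c == '\n')
    }

  // field = (escaped / non-escaped)
  def isField(s: String): Boolean = isEscaped(s) || isNonEscaped(s)
}
```

With this transcription, `f1 "with quote" inside` is not a valid field, whereas the fully quoted form `"f1 ""with quote"" inside"` is.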

@Daenyth
Contributor Author

Daenyth commented Apr 10, 2019

Which RFC though? I looked at 2234 and it seems to be about ABNF, not anything CSV-related.

@ChristopherDavenport
Collaborator

https://tools.ietf.org/html/rfc4180 - Section 2

@Daenyth
Contributor Author

Daenyth commented Apr 11, 2019

I have some ideas for refactoring how the parsers are defined, to make code reuse easier and let people define their own custom parsing rules. I'll see if I can put together a proof-of-concept repo for that at some point.
