Issues when parsing CSV files that extensively use text qualifiers #60

Open
mvlakh opened this issue Apr 26, 2021 · 6 comments

Comments

@mvlakh
Contributor

mvlakh commented Apr 26, 2021

Hi,

I am using Camel to process CSV files and, as I understand it, Camel uses Flatpack to parse CSV content. It looks like Flatpack has several defects that prevent it from parsing CSV files correctly when they make heavy use of text qualifiers. There are several edge cases that the library does not handle properly:

  • if there is a multiline string like this one, the library handles it incorrectly:
Bob,Smith,bsmiht@test.com,"This is a long fragment of text
that should be processed as a single field", 1988, 111-222-33,"another field with new line character
 that should be considered as a field of the same data row"

It looks like the library tries to consume it as separate CSV rows rather than as one row.

  • if a string starts with or contains escaped text qualifier characters that are part of the string value, the library tries to consume the one string as several separate cells:
Bob, Smith,"""Test"" , 2, Some string, still string, also part of the string.",11111111
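To make the expected behaviour for the second case concrete, here is a minimal sketch of a qualifier-aware field splitter. It is not Flatpack's code; the class and method names (`QualifiedCsvSplitter`, `splitRecord`) are hypothetical, and it assumes the common CSV convention that a doubled qualifier inside a qualified field stands for one literal qualifier character.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Flatpack: splits one logical CSV record into fields.
final class QualifiedCsvSplitter {

    static List<String> splitRecord(String record, char delimiter, char qualifier) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQualified = false;

        for (int i = 0; i < record.length(); i++) {
            char c = record.charAt(i);
            if (inQualified) {
                if (c == qualifier) {
                    boolean doubled = i + 1 < record.length()
                            && record.charAt(i + 1) == qualifier;
                    if (doubled) {
                        // Assumption: "" inside a qualified field is one literal qualifier.
                        current.append(qualifier);
                        i++;
                    } else {
                        // A single qualifier closes the qualified section.
                        inQualified = false;
                    }
                } else {
                    // Delimiters and newlines are ordinary characters here.
                    current.append(c);
                }
            } else if (c == qualifier && current.length() == 0) {
                // A qualifier only opens a qualified section at the start of a field.
                inQualified = true;
            } else if (c == delimiter) {
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());
        return fields;
    }

    public static void main(String[] args) {
        String row = "Bob, Smith,\"\"\"Test\"\" , 2, Some string, still string,"
                + " also part of the string.\",11111111";
        for (String field : splitRecord(row, ',', '"')) {
            System.out.println("[" + field + "]");
        }
    }
}
```

With the sample row above, the whole quoted text should come back as a single third field rather than being cut at each embedded comma. The sketch does not trim whitespace and treats a qualifier that appears after other characters in an unqualified field as a literal character; a real parser may choose differently.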
@benoitx
Contributor

benoitx commented Apr 26, 2021 via email

@mvlakh
Contributor Author

mvlakh commented Apr 26, 2021

Hello Benoit,

That space was a typo; it was not present in my case, so I updated my previous comment.
I prepared a pull request with 2 unit tests that replicate the issues and one suggested fix.

Thank you,
Mykhailo

benoitx added a commit that referenced this issue Apr 26, 2021
Unit tests that replicate the issue #60
@benoitx
Contributor

benoitx commented May 26, 2021 via email

@mvlakh
Contributor Author

mvlakh commented May 27, 2021

Hello Benoit,

Yes, it looks like you misunderstood me a bit. My pull request contained 2 tests that reproduce 2 issues, and a fix for only one of these issues. So it is expected that the other test, testCsvDocumentWithMultilineString, still fails.

As a workaround for the multiline string issue, which does not have a fix as of now, I implemented my own version of the fetchNextRecord function and injected it via inheritance. I do not know if my alternative implementation is completely correct, but it solved the issues I had.
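To illustrate the general idea behind such a record-fetching function (this is not the attached class and not Flatpack's fetchNextRecord), here is a minimal sketch that keeps reading physical lines until the text qualifiers are balanced. The names `LogicalRecordReader` and `readLogicalRecord` are hypothetical, and the sketch assumes qualifiers appear only as field delimiters or as doubled escapes, so an odd count means a qualified field is still open.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical helper, not part of Flatpack: joins physical lines into one logical record.
final class LogicalRecordReader {

    /** Returns the next logical record, or null at end of input. */
    static String readLogicalRecord(BufferedReader reader, char qualifier) {
        try {
            String line = reader.readLine();
            if (line == null) {
                return null;
            }
            StringBuilder record = new StringBuilder(line);
            boolean open = isOpenAfter(line, qualifier, false);
            while (open) {
                String next = reader.readLine();
                if (next == null) {
                    break; // unterminated qualifier: return what was read so far
                }
                // readLine() strips the line break, so restore it inside the field.
                record.append('\n').append(next);
                open = isOpenAfter(next, qualifier, open);
            }
            return record.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Each qualifier toggles the state; a doubled (escaped) qualifier toggles twice,
    // which leaves the state unchanged.
    private static boolean isOpenAfter(String line, char qualifier, boolean open) {
        for (int i = 0; i < line.length(); i++) {
            if (line.charAt(i) == qualifier) {
                open = !open;
            }
        }
        return open;
    }

    public static void main(String[] args) {
        String input = "a,\"first line\nsecond line\",b\nc,d\n";
        BufferedReader reader = new BufferedReader(new StringReader(input));
        String record;
        while ((record = readLogicalRecord(reader, '"')) != null) {
            System.out.println("record: " + record.replace("\n", "\\n"));
        }
    }
}
```

The record returned by this sketch still contains its qualifiers, so splitting it into fields is a separate step. Original line endings such as \r\n are normalised to \n inside a joined record, which may or may not be acceptable for a given use case.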

I am attaching the class here for you to take a look. If this code makes sense, you can try using it instead of the existing implementation to see how well it works for all the cases.

I do not want to create a pull request because this is an experimental change and I do not know if you can consume it as a whole; most likely you will need to completely revisit this implementation.

Thank you,
Mykhailo

DelimiterParser.txt

@benoitx
Contributor

benoitx commented May 27, 2021 via email

@mvlakh
Contributor Author

mvlakh commented Jun 1, 2021

Hi,

Sorry for the delay, I had a very busy week.

I made a mistake in my test, and this is the reason it does not work for you. I have created a pull request that contains a fix for that test and my custom fetchNextRecord implementation, which makes the test pass. Please take a look at it and see if it helps clarify everything.

The alternative fetch function implementation seems to work fine; at least it solves my issues and I have had no problems with it. But I cannot be sure that it covers all the cases.

Thank you,
Mykhailo

benoitx added a commit that referenced this issue Jun 1, 2021
Fixed unit test that replicates the issue qualifiers #60 and a fix suggestion for it and similar issues
2 participants