Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mlr --otsv does not handle broken quotes correctly compared to --ocsv #1533

Open
SpikyClip opened this issue Mar 22, 2024 · 6 comments
Open

Comments

@SpikyClip
Copy link

SpikyClip commented Mar 22, 2024

version:mlr 6.12.0

I have a TSV file that contains uneven double quotes " that often results in downstream tools expecting RFC compliant TSVs dropping rows silently (e.g. R read_tsv). This is a problem as it may not be immediately obvious that rows are getting dropped. Here is a simulated dataset (broken.tsv.txt):

col1	col2	col3
1	abc	tsvb
2	"efg	csv
3	hij	fdfd
4	"klm"	sss
5	oh"	sss

It is expected that the following would make the TSV RFC compliant:

$ mlr --itsvlite --otsv cat broken.tsv.txt
col1    col2    col3
1       abc     tsvb
2       "efg    csv
3       hij     fdfd
4       "klm"   sss
5       oh"     sss

However it does not. Though --ocsv works:

$ mlr --itsvlite --ocsv cat broken.tsv.txt
col1,col2,col3
1,abc,tsvb
2,"""efg",csv
3,hij,fdfd
4,"""klm""",sss
5,"oh""",sss

I tried the same with the same error in csv format:

$ mlr --icsvlite --ocsv cat broken.csv
col1,col2,col3
1,abc,tsvb
2,"""efg",csv
3,hij,fdfd
4,"""klm""",sss
5,"oh""",sss

So this definitely feels like a bug. I was under the impression that --tsv is essentially --csv with tab as the field separator, so it should follow all the standard RFC compliance rules around double quotes. I tried to see if I could trick it by converting to csv then to tsv but it didnt work:

$ mlr --itsvlite --ocsv cat broken.tsv.txt | mlr --icsv --otsv cat
col1    col2    col3
1       abc     tsvb
2       "efg    csv
3       hij     fdfd
4       "klm"   sss
5       oh"     sss

I think this could be fixed by transferring the --csv behaviour around stray double quotes such that it also works for --tsv. --tsvlite could remain unaffected in case this behaviour is not desired.

broken.csv
broken.tsv.txt

@aborruso
Copy link
Contributor

Hi @SpikyClip is there an RFC for TSV? If yes, could you share the URL?

I think there is only the one for CSV, in fact maybe the way to have a TSV that behaves like a CSV is the solution.

I'm not sure I understand your goal, but I'll try. If you run

mlr --icsvlite --ocsv --ragged --fs "\t" cat broken.tsv.txt

you get

col1	col2	col3
1	abc	tsvb
2	"""efg"	csv
3	hij	fdfd
4	"""klm"""	sss
5	"oh"""	sss

@SpikyClip
Copy link
Author

SpikyClip commented Mar 22, 2024

I get your point that RFC4180 formally applies to CSV files and not necessarily TSV files. But from my experience the two are essentially treated the same except that the delimiter is different.

To expand on my use case, a common package used in my field is readr from R tidyverse which has a function read_tsv that has a default arg quote = "\"" that would result in the silent dropping (or concatenation) of rows, and to downstream users of a potentially broken TSV file they may not notice it easily (especially in a large file with thousands of rows and hundreds of columns). Sure its an easy fix on the reader-side (quote = "") but it requires some due diligence, I'd rather just avoid the issue.

Besides, its unusual considering --tsvlite is an option - if they behave the same it doesn't really match the same semantic pattern compared to --csvlite and --csv.

Thanks for the --fs "\t" solution, will try that out (is --ragged necessary though?).

@aborruso
Copy link
Contributor

In my experience RFC4180 is rightly and usually taken into account for CSVs. Not for TSV.
And so it is for Miller as well, so the way to solve your use case is --fs"\t", csvlite and csv.

--ragged I dont think its necessary.

@johnkerl
Copy link
Owner

I get your point that RFC4180 formally applies to CSV files and not necessarily TSV files. But from my experience the two are essentially treated the same except that the delimiter is different.

@SpikyClip there's the rub! I had agreed with your statement here, and this is what Miller once did, but as of #923 this is no longer the case. Formerly, Miller's TSV was RFC4180 CSV with commas replaced by tabs. Now, while TSV does not have an RFC, as of #923 I follow the behavior as described at https://miller.readthedocs.io/en/6.12.0/file-formats/#csvtsvasvusvetc.

Namely: embedded newlines and tabs are encoded as \n and \t.

@SpikyClip
Copy link
Author

Thanks John, I came across the issue to add --tsvlite but not on #923. So IANA TSV is the standard. Admittedly I was a bit confused about the decoding/encoding description in the docs but it makes sense now if --tsv has no other ways of protecting against those characters.

I have no issues with this behaviour, though I think it would be a good idea to mention in the docs that --tsv does not support double quoting. It's a minor thing to add and makes it clear, and considering 3 issues have been raised on this, and that there are parsers out there that are based off the (non-existent) RFC4180 TSV compliance.

TSV (tab-separated values): FS is tab and RS is newline (or carriage return + linefeed for Windows). On input, if fields have \r, \n, \t, or \, those are decoded as carriage return, newline, tab, and backslash, respectively. On output, the reverse is done -- for example, if a field has an embedded newline, that newline is replaced by \n. TSV does not support RFC-4180-style double-quoting.

For my purposes I am simply going to sed those " into ' as they don't contribute any useful information in the TSV. That way I can be sure it will not cause issues with RFC4180 CSV and IANA TSV parsers.

Am happy to close this issue if you are.

@johnkerl
Copy link
Owner

Thanks @SpikyClip !! And thanks for reminding me of the name "IANA" for the (pseudo-)spec which I ended up following on #923.

Let's leave this issue open as a doc issue. I'll incorporate your suggestions above, and I'll also use the term "IANA" for clarity, etc.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants