Always make the last separators mandatory #2

Open
cipriancraciun opened this issue May 5, 2022 · 3 comments

@cipriancraciun

First of all, given there is no clear specification, I interpret the current USV as described in issue #1.

Thus, my suggestion is to make the unit / record / group / file separators mandatory at the end of each such block.

The reasons are:

  • many parsers would most likely be lenient and just ignore an RS that is immediately followed by a GS or FS; (this plays into the next points;)
  • in security there is the rule of "canonical representation"; thus, especially if one were to sign a USV file, there should be a single canonical representation (enforced by the parser), so that an attacker can't just fill the file with separators that a lenient parser would ignore;
  • at the moment the empty string is a valid USV; how should it be interpreted? as a single file, with a single group, with a single record, with an empty field (this would result from my interpretation described in issue #1)? as a single file, with a single group, but with no records? perhaps as an empty list of no files?
  • also, given that separators are not mandatory, any file (that is valid UTF-8) that doesn't contain separators is a valid USV file with a single file/group/record/unit;
  • detecting a truncated file -- at the moment a single value-1<US>value-2 is a valid USV; however, it might also be the prefix of a longer file that contained more records but was truncated; making the last separators mandatory makes the truncation detectable; (granted, the stream might get truncated exactly at an <FS> boundary and not be detected, but given that most USV files would contain only one file, that would be an acceptable trade-off;)

And, if those are not convincing enough, here is a practical reason: it's simpler to write the formatter, because one can just print the last separator without checking if this was indeed the last item in its block:

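import sys

US, RS, GS, FS = "\x1f", "\x1e", "\x1d", "\x1c"

def write_usv(files, out=sys.stdout):
    # Every unit, record, group and file is followed by its separator,
    # so there is no "is this the last item?" check anywhere.
    for f in files:
        for g in f.groups:
            for r in g.records:
                for u in r.units:
                    out.write(u.value)
                    out.write(US)
                out.write(RS)
            out.write(GS)
        out.write(FS)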

(I'll leave it to others to think about the implementation where the last separator is not mandatory.) :)
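The truncation and canonical-form arguments above can be sketched from the reading side as well. This is only an illustration of the proposal, not anything USV currently specifies: the names `split_terminated` and `parse_strict` are hypothetical, and the sketch assumes every unit, record, group and file must end with its own separator.

```python
US, RS, GS, FS = "\x1f", "\x1e", "\x1d", "\x1c"


def split_terminated(text, terminator, what):
    """Split text into blocks that each END with terminator (hypothetical helper)."""
    if text == "":
        return []  # no blocks at all at this level
    if not text.endswith(terminator):
        # Anything after the last terminator means the input was cut short
        # (or was never written in the terminated form).
        raise ValueError(f"truncated input: last {what} is not terminated")
    return text[:-1].split(terminator)


def parse_strict(text):
    """Return files -> groups -> records -> units as nested lists."""
    return [
        [
            [
                split_terminated(record, US, "unit")
                for record in split_terminated(group, RS, "record")
            ]
            for group in split_terminated(file, GS, "group")
        ]
        for file in split_terminated(text, FS, "file")
    ]
```

Under these assumptions the empty string reads as "no files", a lone FS reads as "one file with no groups", and `value-1<US>value-2` on its own is rejected instead of being silently accepted as a complete document.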

@joelparkerhenderson
Member

Yes, your writeup is excellent. In practice, I see two additional issues that are related to your points. What I'd like to do is keep this issue open and use it for discussion, because I'm 100% aiming to standardize and write a BNF or similar, and this repo is helping to find corner cases and to gather advice.

  1. CSV and TSV files often end with a newline, which makes these formats easier to edit with a typical line-oriented editor, and also easier to commit to repositories that require every text file to have a final newline, or that use line-oriented merge tools that flag a missing final newline. In practice, a USV format user will often encounter a final newline to deal with, or delete, because of line-oriented Unix tools. An open question is how much of a pain point it would be to enforce zero trailing newline in a typical developer's editor.

  2. The primary use cases that I've seen so far in the past few years I've been working with USV are for units and records (a.k.a. columns and rows), not groups and files (a.k.a. tables and schemas). So I believe it's highly desirable in practice to have USV work with the unit separator and the record separator, without any group separator or file separator. An open question is the tradeoff between developer ergonomics in a typical editor versus a simpler parser.

What are your thoughts about these?

@cipriancraciun
Author

Your point (1) (about newlines) I think is more related to issue #3. (I'll reply to it there.)

> The primary use cases that I've seen so far in the past few years I've been working with USV are for units and records (a.k.a. columns and rows), not groups and files (a.k.a. tables and schemas). So I believe it's highly desirable in practice to have USV work with the unit separator and the record separator, without any group separator or file separator. An open question is the tradeoff between developer ergonomics in a typical editor versus a simpler parser.

I also believe that any "tabular format" will deal in 99.999% of the cases only with one table per file. Thus, in terms of specifications, there are two major choices to be made (which are most of the time conflicting):

  • should one give special treatment to the "default" use-case? in USV's case, should the group and file separator be optional? (this makes the life of the user easier;)
  • should the format be as simple as possible? in USV's case it means that the formatter and parser be as simple as possible, thus the group and file separator be mandatory;

That being said, I think in the case of USV these are the most sensible choices:

  • don't support groups and files at all;
  • make the separators mandatory as this issue initially suggested;
  • go back to the drawing board and see if there isn't a better way to support groups / files;

I'll try to tackle the third choice a bit (i.e. going back to the drawing board with groups / files).

My assumption is that groups (and files) were meant to support multiple tables in the same spreadsheet, and multiple spreadsheets respectively.

However, currently USV misses one important feature of these, namely how to identify which group / file is which? I.e. table / spreadsheets titles.

So perhaps one could rework how groups / files work by introducing some missing features, and perhaps by dropping the symmetry with units / records.

For example (and this is not something I've thoroughly thought about) how about this new syntax:

USV := file + | group + | records
file := FS <file name> US <file description> RS ( ( group ) * | records )
group := GS <group name> US <group description> RS records
records := ( record ( RS record ) * ) ?
record := ( unit ( US unit ) * ) ?

Namely, files and groups are introduced by FS / GS, meanwhile records / units are joined (or in my #2 proposal terminated) by RS / US. Moreover a USV can contain either multiple files, multiple groups, or just records in an unnamed file; then a file can contain multiple groups, or just records in an unnamed group. The US and RS are reused by files and groups to denote the name and description.

It's not as nice as the initial specification, but it does support (without ambiguity) the case of just records, just groups, files with just records, files with groups.
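To make the first grammar above more concrete, here is a rough Python sketch of a reader for it. It is only one possible reading: the function names and the dictionary layout are hypothetical, an empty `records` body is taken to mean "no records", and an empty record is read as a single empty unit (the grammar itself leaves both of those choices open).

```python
US, RS, GS, FS = "\x1f", "\x1e", "\x1d", "\x1c"


def parse_records(text):
    # records := ( record ( RS record ) * ) ?
    if GS in text or FS in text:
        raise ValueError("GS / FS may only introduce a group / file")
    if text == "":
        return []  # assumption: empty body means zero records
    return [record.split(US) for record in text.split(RS)]


def parse_header(text, what):
    # "<name> US <description> RS <body>"
    header, rs, body = text.partition(RS)
    name, us, description = header.partition(US)
    if not rs or not us:
        raise ValueError(f"{what} header must be <name> US <description> RS")
    return name, description, body


def parse_group(text):
    # group := GS <group name> US <group description> RS records
    name, description, body = parse_header(text, "group")
    return {"name": name, "description": description,
            "records": parse_records(body)}


def unnamed_group(text):
    return {"name": None, "description": None, "records": parse_records(text)}


def parse_file(text):
    # file := FS <file name> US <file description> RS ( ( group ) * | records )
    name, description, body = parse_header(text, "file")
    if body.startswith(GS):
        groups = [parse_group(chunk) for chunk in body.split(GS)[1:]]
    else:
        groups = [unnamed_group(body)]
    return {"name": name, "description": description, "groups": groups}


def parse_usv(text):
    # USV := file + | group + | records
    if text.startswith(FS):
        return [parse_file(chunk) for chunk in text.split(FS)[1:]]
    if text.startswith(GS):
        groups = [parse_group(chunk) for chunk in text.split(GS)[1:]]
    else:
        groups = [unnamed_group(text)]
    return [{"name": None, "description": None, "groups": groups}]
```

Because a document is classified by its first character, the "just records", "just groups" and "files" cases never overlap; a GS or FS showing up in the middle of plain records is rejected rather than guessed at.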

Also this second proposal does suffer from the same truncation issue as described in #2, thus perhaps a group terminator and file terminator might be useful, as in:

file := FS <file name> US <file description> RS ( ( group ) * | records ) FS
group := GS <group name> US <group description> RS records GS

I.e. two adjacent files would be joined by FS FS as would two adjacent groups by GS GS.

@joelparkerhenderson
Member

Lots of info below... I'm hoping I'm responding to each of your points because I very much appreciate your insights.

> in security there is the rule of "canonical representation",
> thus especially if one were to sign a USV file, there should
> be a canonical representation (enforced by the parser)

100% agree.

> many parsers would most likely be lenient and just ignore
> an RS that is immediately followed by a GS or FS

This must be a hard error i.e. the entire parse must be invalid.

TODO: add this to the docs.

> at the moment the empty string is a valid USV; how should it be interpreted?

This must have a spec.

The complement also must have a spec e.g. given a blank spreadsheet, what must the USV export be?

TODO: spec this.

> given that separators are not mandatory, any file (that
> is valid UTF-8) that doesn't contain separators is
> a valid USV file with a single file/group/record/unit;

You're correct this is an issue.

How do these issues interact with similar data exchange formats?

  • CSV/TSV/ASV use the middle infix only. Is a blank file, or file with just "foo", valid CSV?

  • HTML/XML/JSON structures use a leading prefix (e.g. open tag or open curly) and different trailing suffix (e.g. close curly). Is a blank file, or file with just "foo", valid HTML?

I believe you're homing in on a tension between these options:

  • leading prefix i.e. print separator then content

  • trailing suffix i.e. print content then separator

  • middle infix i.e. print content then separator then content

  • some combination of the above
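As a small illustration of the prefix / suffix / infix options listed above, here is a sketch at the unit level only. The function names are made up for this example, and nothing here is specified by USV.

```python
US = "\x1f"


def infix(units):
    # content, separator, content (the CSV / TSV / ASV style)
    return US.join(units)


def suffix(units):
    # content then separator, i.e. the separator acts as a terminator
    return "".join(unit + US for unit in units)


def prefix(units):
    # separator then content, i.e. the separator acts as an introducer
    return "".join(US + unit for unit in units)
```

With `infix`, "no units" and "one empty unit" both serialize to the empty string, which is exactly the blank-file ambiguity discussed in this issue; with `suffix` or `prefix` the two cases produce different output.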

> detecting a truncated file

How about delegating this to a checksum that's out of scope of USV?

Detecting unexpected file truncation, or other kinds of unexpected corruption, is a big scope increase (IMHO) for a simple format.
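A minimal sketch of what "delegating to a checksum" could look like, assuming the digest travels separately from the USV payload (for example in a sidecar file or a protocol header). The helper names are hypothetical; `hashlib.sha256` is from the Python standard library.

```python
import hashlib


def digest(payload: bytes) -> str:
    # Computed by the writer over the exact bytes of the USV payload.
    return hashlib.sha256(payload).hexdigest()


def verify(payload: bytes, expected: str) -> bool:
    # Computed by the reader; any truncation or corruption changes the digest.
    return hashlib.sha256(payload).hexdigest() == expected
```

The format itself stays untouched, and truncation at any byte offset is caught, including a cut that happens to land on a separator boundary.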

> at the moment a single value-1<US>value-2 is a valid USV

Yes, and real-world cases have come up somewhat often where the content is solely units, never records.

In practice, the big ones so far have involved logging:

  • trace logging, where each log atom can span many lines, such as printing a stack trace.

  • keyboard logging, where each log atom is one user keypress, and could be a tab, or comma, or return, etc.

Worth mentioning: the real-world cases somewhat often use different dimensions, meaning each record has a different number of units. In other words, the data isn't an X,Y grid. A typical example is walking file systems, where directories (which are treated as USV records) can have different numbers of entries (which are treated as USV units).
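For instance, a ragged directory listing could round-trip like this. The listing data and the function names are invented for illustration, and the sketch uses plain in-between separators with units and records only.

```python
US, RS = "\x1f", "\x1e"


def format_records(records):
    # In-between separators: US between units, RS between records.
    return RS.join(US.join(units) for units in records)


def parse_records(text):
    return [record.split(US) for record in text.split(RS)]


# Hypothetical walk result: one record per directory, one unit per entry.
listing = [
    ["README.md", "LICENSE"],
    ["main.py"],
    ["a.txt", "b.txt", "c.txt"],
]
```

Nothing in the parser assumes a fixed column count, so each record keeps its own length after the round trip.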

> That being said, I think in the case of USV these are the most sensible choices:
>
>   1. don't support groups and files at all;
>   2. make the separators mandatory as this issue initially suggested;
>   3. go back to the drawing board and see if there isn't a better way to support groups / files;

I agree with your choices.

1 is not viable because groups are a must-have in practice, in order to be able to export a typical database set of schemas, or a typical Excel spreadsheet set of folios. The real-world use case is import/export of all the data, which is then slurped into another system that knows enough about the data structure. For import/export where the other system doesn't know enough about the data, we use a typical Postgres database dump (including metadata, table layouts, etc.), or a zip file of Excel .xls files (including metadata, macros, etc.).

2 I want to think more about this

3 Likewise

> here is a practical reason: it's simpler to write the formatter, because one can just print the last separator without checking if this was indeed the last item in its block:

I would describe that style of loop as using content "terminators" or "trailing separators", rather than content "splitters" a.k.a. "in-between separators".

This feels akin to C style null terminated strings.

My intuition is there are large advantages to this approach, such as for streaming data-- a stream source can output a unit and its terminator, without needing to be aware of whether there's a next unit coming. What would you do to trigger the start-of-file or start-of-group or start-of-record or start-of-unit?

OTOH, it's a totally different approach than CSV, TSV, ASV, all of which use in-between separators.
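The streaming advantage mentioned above can be sketched as a pair of generators. This assumes the terminator style (every unit ends with US, every record ends with RS) and units that never contain a separator character; `emit` and `iter_records` are hypothetical names.

```python
US, RS = "\x1f", "\x1e"


def emit(records):
    """Writer side: yield output chunks without ever looking ahead."""
    for record in records:
        for unit in record:
            yield unit + US  # no need to know whether another unit follows
        yield RS


def iter_records(chunks):
    """Reader side: yield each record as soon as its RS terminator arrives."""
    buffer = ""
    units = []
    for chunk in chunks:
        for ch in chunk:
            if ch == US:
                units.append(buffer)
                buffer = ""
            elif ch == RS:
                if buffer:
                    raise ValueError("unit not terminated before RS")
                yield units
                units = []
            else:
                buffer += ch
    if buffer or units:
        raise ValueError("stream ended in the middle of a record")
```

The reader can hand a record downstream the moment RS arrives, regardless of how the bytes were chunked in transit, and a stream that stops mid-record is reported instead of being treated as complete.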

> USV misses one important feature of these, namely how to
> identify which group / file is which? I.e. table / spreadsheets titles.

Yes. In practice this hasn't been an issue because the reader and writer both pre-agree on the overall data structure. In other words, USV hasn't yet aimed to reconstitute table names, nor even table column headers. For example, USV doesn't specify that a record's first row is the column names. Whenever we've needed to reconstitute the data structure, we've switched from USV to more-powerful formats (e.g. Postgres dump, Excel zips, etc. as above).


2 participants