
Improve tsv-pretty lookahead logic [tsv-pretty mistake in column formatting.] #264

Open
Jemi opened this issue Feb 18, 2020 · 8 comments

Comments

@Jemi

Jemi commented Feb 18, 2020

I'm using tsv-utils from the Arch Linux AUR, trying to format some word-frequency data from the New General Service List dataset. tsv-utils makes at least two errors that I'm able to see when I run this command line:

tsv-select -f 1,7 NGSL+1.01+with+SFI.tsv | tsv-pretty | less

Adding -s 5 to tsv-pretty works around this problem. The TSV file was converted from the file NGSL+1.01+with+SFI.xlsx, available from this site: http://www.newgeneralservicelist.org/

I'll attach the original xlsx file, NGSL+1.01+with+SFI.xlsx, because GitHub won't let me post the TSV file.

@Jemi
Author

Jemi commented Feb 18, 2020

I forgot to mention - the mistakes I found were at 'misrepresentation' and 'interoperability'.

@jondegenhardt
Contributor

Thanks for the bug report! I'm wondering if the default lookahead is insufficient for this file. By default, tsv-pretty reads the first 1000 lines and bases column boundaries on those; when wider columns are found later, it adjusts. I grabbed the file from the site; it has 31,242 lines.

Try running the command with a lookahead that includes the whole file, e.g.:

tsv-select -f 1,7 NGSL+1.01+with+SFI.tsv | tsv-pretty -l 40000 | less

Let me know if that fixes the problem you're seeing. (It looks reasonable on my machine.) If it doesn't, we'll take a more detailed look.
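For reference, the buffer-then-adjust behavior described above can be sketched roughly as follows. This is a minimal Python sketch, not tsv-pretty's actual D implementation; the function and parameter names are hypothetical:

```python
# Hypothetical sketch of line-based lookahead: buffer the first `lookahead`
# lines, compute column widths from them, then widen columns on the fly
# when later rows exceed the initial estimate.
def pretty_print(lines, lookahead=1000):
    buffered, widths, out = [], [], []
    streaming = False

    def note_widths(fields):
        # Track the widest value seen so far in each column.
        for i, f in enumerate(fields):
            if i >= len(widths):
                widths.append(len(f))
            else:
                widths[i] = max(widths[i], len(f))

    def emit(fields):
        # Rows wider than the current estimate degrade alignment only locally.
        out.append("  ".join(f.ljust(w) for f, w in zip(fields, widths)).rstrip())

    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if not streaming and len(buffered) < lookahead:
            buffered.append(fields)
            note_widths(fields)
        else:
            if not streaming:  # lookahead window full: flush it, then stream
                for b in buffered:
                    emit(b)
                buffered = []
                streaming = True
            note_widths(fields)
            emit(fields)
    for b in buffered:  # input shorter than the lookahead window
        emit(b)
    return "\n".join(out)
```

With a lookahead covering the whole input, all rows align to the true maximum width; with a smaller window, rows seen after the window are widened as needed.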

@Jemi
Author

Jemi commented Feb 18, 2020

With that setting it looks like the issue is fixed. However, I would note that 1000 lines is pretty conservative, particularly since TSV files are liable to be huge. The lookahead would probably be fast on modern machines even if it defaulted to the first megabyte of data.

Also, since a line can be any length, and what you're trying to conserve is file loading time (or at least trying to keep loading times from being arbitrarily long), the lookahead should probably be based on size in bytes rather than lines. 512k or even 1,024k would be a reasonable default. This particular file is 1.7M, and both less and nvim were able to load it in under a second on a 10-year-old i3 laptop, albeit with a modern SSD. But even with an IDE HDD I wouldn't expect it to take very long.

@jondegenhardt
Contributor

Yes, I agree. And I appreciate your feedback on this issue.

The reasoning behind the 1000-line default was mainly to handle cases where data is read from standard input and lines might trickle in slowly, depending on the source. There's also a secondary issue: the lookahead buffer needs to be kept in memory.

However, a better default might be a relatively large buffer, letting the user select a small one in the small number of cases where immediate output is preferred. More sophisticated approaches are possible as well: the implementation could track the time and rate of input arrival and the amount of memory consumed (addressing your point about differences in line size), and make on-the-fly choices.

@Jemi
Author

Jemi commented Feb 18, 2020

You could make the default 1024k and print a message to stderr if the input is too slow to reach the screen in reasonable time. Something like:

"Slow input stream. Consider adjusting one of the following options: [-l][etc...]."

and similarly, when the input file is larger than 1024k, a message could be printed to stderr like:

"Input file larger than default (1024k) lookahead. Please check the formatting of output columns and consider adjusting options [-l][etc...] if necessary."

@jondegenhardt
Contributor

Marking as an enhancement request. Improvement opportunities are discussed in the issue.

@jondegenhardt jondegenhardt changed the title tsv-pretty mistake in column formatting. Improve tsv-pretty lookahead logic [tsv-pretty mistake in column formatting.] Feb 20, 2020
@jondegenhardt
Contributor

jondegenhardt commented Feb 22, 2020

A consideration I forgot to mention: in the interactive use case it is often advantageous to align based on what the user is most likely to see and to degrade gracefully in exceptional cases. This is different from the case where consistent alignment over a large data set is desired.

Consider an interactive use like:

$ <command> | tsv-pretty | less

In many cases the user will scroll through a few screenfuls and then stop. Aligning based on the widest value in data the user doesn't reach is not beneficial, as it unnecessarily increases the whitespace gaps between columns and total line width.

Using a smaller lookahead buffer is helpful in a pragmatic way, in that it aligns based on the widest column values the user is most likely to see. tsv-pretty goes to some effort to degrade gracefully when wider fields are encountered.

The smaller lookahead buffer is no panacea, though. It's fairly common to have columns with outlier values. Imagine a field where nearly all values are less than 10 characters, but one or two are 20 to 30 characters. If one of these occurs early in the data stream, it will still produce undesirably large whitespace gaps in the vast majority of screenfuls, where all the values are less than 10 characters.

For this reason I have thought about adding some form of analysis of field width distribution to identify outliers and select the common case field width more optimally.
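For example, one simple form of that analysis is to align to a high percentile of the observed field widths rather than the maximum, so a few outliers don't inflate the whole column. A hypothetical Python sketch, not anything tsv-pretty currently does:

```python
# Hypothetical outlier-tolerant width selection: pick a high percentile of
# the observed field widths instead of the maximum, so one or two unusually
# long values don't widen the whole column.
def column_width(widths, percentile=0.95):
    ordered = sorted(widths)
    idx = min(len(ordered) - 1, int(percentile * len(ordered)))
    return ordered[idx]
```

The rare rows wider than the chosen width would then fall back on the existing graceful-degradation behavior rather than dictate the layout.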

However, the above approaches are not desirable for use cases where alignment based on the widest value in the data is preferred. Such cases are relatively easy to satisfy, since scanning an entire data set, or at least a very large portion of it, is not especially hard. For these cases, tsv-pretty could make it easier for the user to communicate the goal. The current mechanisms, -l|--lookahead and -m|--max-text-width, place an additional burden on the user to find and set the right parameters for their use case.

@Jemi
Author

Jemi commented Feb 23, 2020

I hadn't really considered the less use case much, as I was only using less to check my work. There, messages on stderr will be especially helpful, since they won't be mixed with stdout.

But in the general case, we can assume the user's TSV will end up in several different kinds of pipeline. The worst case: a user calls tsv-pretty to do some work with a file, runs it through several pipelines, believes they have a consistently formatted file to their specifications, and then finds out weeks, months, or years later that the formatting was never consistent, putting whatever work was based on that belief in question. In this use case the bug is both potentially severe and hard to see.

Also, for the interactive use case, an even better idea might be a dedicated tsv-utils pager. The advantage is that it could do the formatting automatically and dynamically in the manner you've described (in both directions, expanding and contracting whitespace as needed). The user could mark the first line (or an arbitrary line) as a header line, and it could follow them around dynamically in the file (I believe HTML/CSS calls this position: fixed). The pager could allow "sideways" paging with a configurable amount of horizontal context, as well as a sensible default. In general, a dedicated pager for TSV files can do things a standard pager cannot, and can also do things a static pipeline formatting utility cannot.
