Releases: eBay/tsv-utils

v1.1.19: Minor updates

18 Mar 21:27

NOTE: Unfortunately, the pre-built binaries for v1.1.19 and earlier releases have been lost. Please use the pre-built binaries from the latest release. There is nothing wrong with the old binaries; if you downloaded one earlier, you can continue to use it.

Changes in v1.1.19:

  • tsv-uniq - New options for printing only repeated lines: --r|repeated, --a|at-least N.
  • tsv-pretty - New option for verbatim output of an initial set of lines: --a|preamble N.
  • makefile help - Bug fix in the output.
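
The "repeated lines" behavior added to tsv-uniq can be illustrated with a short Python sketch. This is not the tool's D implementation: the function name `repeated_lines` is hypothetical, the whole line is used as the key, and emitting a line at the moment its count reaches the threshold is an assumption about output order rather than documented behavior.

```python
from collections import Counter

def repeated_lines(lines, at_least=2):
    """Yield each line once, when its occurrence count reaches `at_least`.

    Assumption: the whole line is the key. The real tool can also key on
    selected fields; that is not modeled here.
    """
    counts = Counter()
    for line in lines:
        counts[line] += 1
        if counts[line] == at_least:
            yield line

data = ["a", "b", "a", "c", "b", "a"]
print(list(repeated_lines(data)))              # lines seen at least twice
print(list(repeated_lines(data, at_least=3)))  # lines seen at least three times
```

With the default threshold of two this corresponds to the idea behind --repeated; a larger threshold corresponds to --at-least N.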

v1.1.18: Minor updates

25 Feb 17:31

Changes in v1.1.18:

  • tsv-uniq - Added a --m|max option to output up to a maximum number of duplicate lines. The default is one.
  • tsv-sample - Added PGO support. Small gains, up to 5% depending on sampling method.
  • Better unit test diagnostic output on "command line" tests. This simplifies tracking down errors when tests are run on a system like TravisCI. In the past it was necessary to run the test locally to see what failed.
  • Bash completion - Fix a tsv-filter option.
  • Doc updates - Added a pair of sections to the Tips and Tricks doc. One describing TSV and CSV differences, another giving examples of using dos2unix and iconv to deal with encoding and newline issues.

v1.1.17: Output Buffering

26 Jan 07:23

NOTE: Pre-built binaries for this release are no longer available. Please use binaries from the latest release.

Changes in v1.1.17:

Most of the tools were switched to use output buffering. This is a performance enhancement that works by buffering small writes into larger blocks before writing to the final output destination, usually stdout. The amount of benefit depends on the tool and the nature of the file being processed. Narrow files (short lines) see the most benefit, and in some cases run 50% faster. More typical gains are 5-20%.

Output buffering logic is in the BufferedOutputRange struct found in common/src/tsvutil.d. The resulting source code in each tool turns out to be quite readable.
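
The idea behind output buffering can be sketched in a few lines of Python. This is only an illustration of the technique, not a port of BufferedOutputRange: the class names `BufferedLineWriter` and `CountingSink` and the 64-character flush threshold are assumptions made for the example.

```python
import io

class BufferedLineWriter:
    """Collect small writes and pass them to the sink in larger blocks."""

    def __init__(self, sink, flush_size=4096):
        self.sink = sink              # final destination, e.g. sys.stdout
        self.flush_size = flush_size  # hypothetical threshold, in characters
        self.parts = []
        self.pending = 0

    def write(self, text):
        self.parts.append(text)
        self.pending += len(text)
        if self.pending >= self.flush_size:
            self.flush()

    def flush(self):
        if self.parts:
            self.sink.write("".join(self.parts))
            self.parts.clear()
            self.pending = 0

    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        self.flush()  # do not lose the final partial block

class CountingSink(io.StringIO):
    """In-memory sink that counts how many writes actually reach it."""

    def __init__(self):
        super().__init__()
        self.calls = 0

    def write(self, text):
        self.calls += 1
        return super().write(text)

sink = CountingSink()
with BufferedLineWriter(sink, flush_size=64) as out:
    for i in range(100):
        out.write(f"{i}\tx\n")  # 100 short lines, as in a narrow file

print(sink.calls)  # far fewer than 100 writes reach the sink
```

Short lines benefit most because each one would otherwise cost a separate write to the destination.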

v1.1.16: Profile guided optimization; New sampling methods

14 Jan 16:55

NOTE: Pre-built binaries for this release are no longer available. Please use binaries from the latest release.

Changes in v1.1.16:

The main changes in this release are the use of Profile Guided Optimization (PGO) and the addition of new sampling methods in tsv-sample.

Profile Guided Optimization - This is a follow-on to the Link Time Optimization work done in v1.1.15. It is based on LDC compiler support for LTO and PGO, including the ability to operate on the application code and the D standard libraries (druntime, phobos) together.

Profile Guided Optimization uses data collected from instrumented builds to better optimize executables. The tsv utilities build process has been updated to generate and use instrumentation for several of the tools. LTO and PGO builds are enabled by options passed to make. The pre-built binaries available from the GitHub releases page are built with LTO and PGO, but both must be enabled explicitly when building from source. See Building with Link Time Optimization and Profile Guided Optimization for details.

PGO results in material performance gains (10% or more) on csv2tsv and tsv-summarize, and smaller gains (2-5%) on several other tools. Considering LTO (v1.1.15) and PGO (v1.1.16) combined, performance gains on five of six measured benchmarks ranged from 8-45% on Linux, and 6-57% on MacOS. Three of the benchmarks saw gains greater than 25% on both platforms.

New sampling methods - Two sampling methods have been added to tsv-sample. One is a simple stream sampling mode that selects a random portion of an input stream based on a sampling rate. The other is a form of sampling known as "distinct" sampling, which selects a random portion of records based on a key in the data. For example, if records contain an IP address, distinct sampling can take all records from 1% of the unique IP addresses. See the tsv-sample reference for details.
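
Distinct sampling can be sketched by hashing the key and keeping a record only when the hash falls below the sampling rate, so that every record sharing a key gets the same decision. The Python below is a sketch under assumptions: the function names, the `seed` parameter, and the use of SHA-256 are illustrative choices, not a description of how tsv-sample is implemented.

```python
import hashlib

def keep_key(key, rate, seed="0"):
    """Decide deterministically whether records with this key are kept."""
    digest = hashlib.sha256((seed + "\x00" + key).encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # integer in [0, 2**64)
    return bucket < rate * 2**64

def distinct_sample(rows, key_index, rate, seed="0"):
    """Yield all rows whose key is selected; other keys are dropped entirely."""
    for row in rows:
        if keep_key(row[key_index], rate, seed):
            yield row

# Hypothetical data: 500 records spread over 50 IP addresses.
rows = [(f"10.0.0.{i % 50}", str(i)) for i in range(500)]
sample = list(distinct_sample(rows, key_index=0, rate=0.2))
kept_ips = {row[0] for row in sample}
print(len(kept_ips), "of 50 addresses kept,", len(sample), "records")
```

Because the decision depends only on the key, no per-key state is needed and the input can be processed as a stream.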

Other changes

  • tsv-summarize bug fix: incorrect headers on two operations.
  • Windows line ending detection when running on Unix platforms (Issue #96)
  • tsv-select performance improvement: Avoid unnecessary memory allocation from std.array.join. A 5% performance improvement and less memory allocation.

v1.1.15: Link Time Optimization

10 Nov 07:18

NOTE: Pre-built binaries for this release are no longer available. Please use binaries from the latest release.

Changes in v1.1.15:

This release uses new link-time optimization (LTO) available starting with the LDC 1.5 compiler release. This improves the performance of most of the tools, typically by about 10% over the previous release, and significantly more in some cases. Benchmarks can be found in this slide deck from the Silicon Valley D Meetup, Dec 14, 2017.

Previous releases used Thin LTO on OS X builds. LTO was not used on Linux builds. In the OS X case, LTO was used on the tsv utilities code, but not the code from the D libraries, phobos and druntime.

The LDC 1.5 release supports LTO on both Linux and OS X out of the box, and includes support for building phobos and druntime with LTO.

This release of the tsv utilities adds support for the new LTO capabilities to the makefiles. It is not enabled by default, but can be turned on with make arguments. The prebuilt binaries have been built with LTO turned on. For more information, see Building With LTO.

v1.1.15-beta3: Link Time Optimization

05 Nov 04:11
Pre-release

NOTE: Pre-built binaries for this release are no longer available. Please use binaries from the latest release.

Changes in v1.1.15-beta3:
This release uses new link-time optimization (LTO) available starting with the LDC 1.5 compiler release.

v1.1.14 - Documentation updates

19 Oct 14:28

NOTE: Pre-built binaries for this release are no longer available. Please use binaries from the latest release.

Changes in v1.1.14:

No functional changes; documentation updates only.

v1.1.13 - New tool: tsv-pretty

23 Sep 21:47

NOTE: Pre-built binaries for this release are no longer available. Please use binaries from the latest release.

Changes in v1.1.13: New tool, tsv-pretty.

tsv-pretty prints TSV data in an aligned fashion for command-line readability. Headers are detected automatically and numeric values are aligned. An example, first without formatting:

$ cat sample.tsv
Color   Count   Ht      Wt
Brown   106     202.2   1.5
Canary Yellow   7       106     0.761
Chartreuse      1139    77.02   6.22
Fluorescent Orange      422     1141.7  7.921
Grey    19      140.3   1.03

Now with tsv-pretty, using header underlining and float formatting:

$ tsv-pretty -u -f sample.tsv
Color               Count       Ht     Wt
-----               -----       --     --
Brown                 106   202.20  1.500
Canary Yellow           7   106.00  0.761
Chartreuse           1139    77.02  6.220
Fluorescent Orange    422  1141.70  7.921
Grey                   19   140.30  1.030
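
The basic alignment idea can be sketched in Python: measure each column's width, left-align text, and right-align columns whose data values are all numeric. This is a simplified illustration, not tsv-pretty's algorithm. The function names are hypothetical, the first row is assumed to be a header rather than detected, and the underlining (-u) and float formatting (-f) shown above are not reproduced.

```python
def is_number(text):
    try:
        float(text)
        return True
    except ValueError:
        return False

def align(rows, header=True):
    """Return rows as padded text lines; numeric columns are right-aligned."""
    ncols = max(len(r) for r in rows)
    rows = [list(r) + [""] * (ncols - len(r)) for r in rows]
    body = rows[1:] if header else rows
    widths = [max(len(r[c]) for r in rows) for c in range(ncols)]
    numeric = [bool(body) and all(is_number(r[c]) for r in body)
               for c in range(ncols)]
    lines = []
    for i, r in enumerate(rows):
        is_header = header and i == 0
        cells = [
            r[c].rjust(widths[c]) if numeric[c] and not is_header
            else r[c].ljust(widths[c])
            for c in range(ncols)
        ]
        lines.append("  ".join(cells).rstrip())
    return lines

# The sample data from the example above, already split on tabs.
rows = [
    ["Color", "Count", "Ht", "Wt"],
    ["Brown", "106", "202.2", "1.5"],
    ["Canary Yellow", "7", "106", "0.761"],
    ["Chartreuse", "1139", "77.02", "6.22"],
    ["Fluorescent Orange", "422", "1141.7", "7.921"],
    ["Grey", "19", "140.3", "1.03"],
]
lines = align(rows)
print("\n".join(lines))
```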

v1.1.12: Link Time Optimization on OS X builds

21 Jun 13:48

NOTE: Pre-built binaries for this release are no longer available. Please use binaries from the latest release.

Changes in v1.1.12:

Turn on Link Time Optimization (LTO) when using the LDC compiler on OS X. This produces faster executables. The difference is especially notable for the csv2tsv tool, which runs about 20% faster. LTO is used in the pre-built OS X binaries and will be used on OS X source code builds (git clone, dub fetch) when building with the LDC compiler.

OS X directly supports LTO with the system linker provided by Xcode (Clang / LLVM). LTO can also be used on Linux, but at present it requires installing and building special linker support. This complicates the build process, which is why this toolset does not use it on Linux. For more information on LDC's LTO support see http://johanengelen.github.io/ldc/2016/11/10/Link-Time-Optimization-LDC.html.

v1.1.11 - Field ranges

08 May 00:04

NOTE: Pre-built binaries for this release are no longer available. Please use binaries from the latest release.

Changes in v1.1.11:

The main feature is support for field ranges. Field ranges can be used anywhere a list of fields can be entered. A field range is a pair of field numbers separated by a hyphen. Ranges in reverse order are supported as well. Single field numbers and field ranges can be used together. Some examples:

$ tsv-select --fields 1,2,17-33,10-7  data.tsv
$ tsv-summarize --group-by 3-5 --median 7-17
$ tsv-uniq --fields 7-10 data.tsv

There are also some improvements to error message text.
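
The field-list syntax described in this release can be expanded into individual field numbers with a small parser. The Python below is a sketch of that expansion; the function name `parse_field_list` is hypothetical, and its error handling is not meant to mirror the tools' own validation or messages.

```python
from typing import List

def parse_field_list(spec: str) -> List[int]:
    """Expand a list such as '1,2,17-20,10-7' into individual field numbers."""
    fields = []
    for item in spec.split(","):
        first, sep, last = item.partition("-")
        start = int(first)                 # raises ValueError on malformed input
        end = int(last) if sep else start  # a single number is a one-field range
        step = 1 if end >= start else -1   # reverse ranges count down
        fields.extend(range(start, end + step, step))
    return fields

print(parse_field_list("1,2,17-20,10-7"))
# [1, 2, 17, 18, 19, 20, 10, 9, 8, 7]
```

Order is preserved, so a reverse range such as 10-7 yields the fields in descending order.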