Releases: eBay/tsv-utils

v2.2.1

13 Jun 02:36
v2.2.1
7c6a3dd
Bump version for new release.

v2.2.0 Release: Line buffering; New tsv-filter features (--count, --label)

14 Mar 19:00
v2.2.0
eea97ee

To download and unpack prebuilt binaries:

$ # Linux
$ curl -L https://github.com/eBay/tsv-utils/releases/download/v2.2.0/tsv-utils-v2.2.0_linux-x86_64_ldc2.tar.gz | tar xz

$ # MacOS
$ curl -L https://github.com/eBay/tsv-utils/releases/download/v2.2.0/tsv-utils-v2.2.0_osx-x86_64_ldc2.tar.gz | tar xz

Installation instructions are in the ReleasePackageReadme.txt file in the release package.

To be notified of new releases:

GitHub supports notification of new releases. Click the "Watch" button on the repository page and select "Releases Only".

Release 2.2.0 Changes:

  • tsv-filter: New feature, count matches rather than filtering (--c|count). This option causes the number of matching lines to be printed rather than the individual matching lines.
  • tsv-filter: New feature, marking records rather than filtering (--label). This option causes every record to be marked with an indication of whether it satisfied the test. Marking is done by appending a new field with an indicator value. See PR #338 for details.
  • New option: Line buffering, available in most tools (--line-buffered). This option causes each line to be read and written as soon as it is available, overriding the default buffering behavior. This is useful when reading from slow input streams. See PR #336 for details.
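
The effect of --count and --label can be sketched with awk stand-ins. This is only an illustration of the described behavior, not how tsv-filter is implemented. The function names count_gt and label_gt are hypothetical, and the 1/0 indicator values are an assumption (PR #338 describes the actual markers).

```shell
# Hypothetical awk stand-ins; tsv-filter itself is not required to run them.

# Roughly what a numeric "greater than" test with --count reports:
# the number of matching lines instead of the lines themselves.
# $1 = field number, $2 = threshold; TSV is read from standard input.
count_gt() {
    awk -F'\t' -v f="$1" -v v="$2" '$f + 0 > v + 0 { n++ } END { print n + 0 }'
}

# Roughly what --label does: every line is kept and a new last field
# records whether the test passed (assumed here: 1 = pass, 0 = fail).
label_gt() {
    awk -F'\t' -v OFS='\t' -v f="$1" -v v="$2" '{ print $0, ($f + 0 > v + 0 ? 1 : 0) }'
}
```

For example, `printf 'a\t50\nb\t150\n' | count_gt 2 100` prints 1, while `label_gt 2 100` on the same input emits both lines with a trailing 0 or 1 field.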

Other Changes

  • Prebuilt binaries have been updated to use LDC compiler version ldc-1.24.0.
  • Changes to the LDC build parameters to better support Archlinux and other platforms. See PR #329.

v2.1.2 Minor Release

11 Oct 02:50
v2.1.2
37bb806

To download and unpack prebuilt binaries:

$ # Linux
$ curl -L https://github.com/eBay/tsv-utils/releases/download/v2.1.2/tsv-utils-v2.1.2_linux-x86_64_ldc2.tar.gz | tar xz

$ # MacOS
$ curl -L https://github.com/eBay/tsv-utils/releases/download/v2.1.2/tsv-utils-v2.1.2_osx-x86_64_ldc2.tar.gz | tar xz

Installation instructions are in the ReleasePackageReadme.txt file in the release package.

Release 2.1.2 Changes

  • Small performance improvement in several tools by switching from File.write to File.rawWrite. See PR #316.
  • Stopped using the LDC option -disable-fp-elim. The option is no longer available starting with LDC 1.24.0 (the next version), so this change is required. See PR #316.

Prebuilt binaries have been built using the latest LDC compiler (ldc-1.23.0).

v2.1.1 Minor Release

14 Sep 01:13
v2.1.1
0c6154c

To download and unpack prebuilt binaries:

$ # Linux
$ curl -L https://github.com/eBay/tsv-utils/releases/download/v2.1.1/tsv-utils-v2.1.1_linux-x86_64_ldc2.tar.gz | tar xz

$ # MacOS
$ curl -L https://github.com/eBay/tsv-utils/releases/download/v2.1.1/tsv-utils-v2.1.1_osx-x86_64_ldc2.tar.gz | tar xz

Installation instructions are in the ReleasePackageReadme.txt file in the release package.

Release 2.1.1 Changes

  • Improved csv2tsv buffer utilization. Output is now written to standard output more frequently, which improves the performance of subsequent tasks in a pipeline (better parallelization). csv2tsv itself sees minor performance benefits. See PR #305.
  • Code change to support an upcoming D language change (minor). A tagged release with this change is needed to support tsv-utils use in the D Language project tester. See PR #306.

Prebuilt binaries have been built using the latest LDC compiler (ldc-1.23.0).

v2.1.0 Release: csv2tsv updates

08 Sep 09:05
v2.1.0
f1a81d6

To download and unpack prebuilt binaries:

$ # Linux
$ curl -L https://github.com/eBay/tsv-utils/releases/download/v2.1.0/tsv-utils-v2.1.0_linux-x86_64_ldc2.tar.gz | tar xz

$ # MacOS
$ curl -L https://github.com/eBay/tsv-utils/releases/download/v2.1.0/tsv-utils-v2.1.0_osx-x86_64_ldc2.tar.gz | tar xz

Installation instructions are in the ReleasePackageReadme.txt file in the release package.

Release 2.1.0 Changes: csv2tsv

  • Performance improvements: csv2tsv is significantly faster as a result of switching to a buffer-based conversion algorithm. The 2.1.0 version runs 40-60% faster than the 2.0.0 version in tests on macOS, depending on the type of file. See PR #301 for details.
  • UTF-8 Byte Order Marks (BOMs) found in CSV input files are discarded when producing TSV output. See PR #302 for details.
  • TAB and Newline replacement strings can now be specified separately. Previously, only one replacement string was allowed for both newline and TAB characters in the CSV data. Now different replacements can be provided. This uses the new command line arguments --r|tab-replacement and --n|newline-replacement. See PR #303 for details.

Other Changes

  • Prebuilt binaries have been updated to use the latest LDC compiler (ldc-1.23.0).

v2.0.0 Release: Named Fields

11 Jul 04:11
v2.0.0
4295ba5

To download and unpack prebuilt binaries:

$ # Linux
$ curl -L https://github.com/eBay/tsv-utils/releases/download/v2.0.0/tsv-utils-v2.0.0_linux-x86_64_ldc2.tar.gz | tar xz

$ # MacOS
$ curl -L https://github.com/eBay/tsv-utils/releases/download/v2.0.0/tsv-utils-v2.0.0_osx-x86_64_ldc2.tar.gz | tar xz

Installation instructions are in the ReleasePackageReadme.txt file in the release package.

Release 2.0.0 Changes: Named Field Support

Release 2.0.0 adds named field support to all tools in the tsv-utils toolkit. This is a significant usability improvement.

Named fields can be used with any file or data stream that has a header line. They are enabled by the --H|header option. Field numbers can still be used, just as in prior versions of the toolkit. Glob-style wildcards are supported, and escapes can be used to specify field names containing special characters.

Details are available in the Field Syntax section of the Tools Reference manual.

Examples - Assume a file with the header fields:

 1    test_name
 2    run
 3    elapsed_time
 4    user_time
 5    system_time
 6    max_memory

Commands like the following can be used:

$ # Select individual fields, like 'cut'
$ tsv-select data.tsv -H -f user_time            # Field  4
$ tsv-select data.tsv -H -f test_name,user_time  # Fields 1,4
$ tsv-select data.tsv -H -f '*_time'             # Fields 3,4,5

$ # Filter lines using numeric comparisons against individual fields
$ tsv-filter data.tsv -H --lt elapsed_time:100
$ tsv-filter data.tsv -H --gt elapsed_time:100 --lt system_time:20

$ # Statistical summaries
$ tsv-summarize data.tsv -H --median elapsed_time
$ tsv-summarize data.tsv -H --median '*_time'
$ tsv-summarize data.tsv -H --group-by test_name --median '*_time'

$ # Uniq'ing on a field
$ tsv-uniq data.tsv -H -f test_name

$ # Joins - Assume another file 'test_info.tsv' with 'test_name' and
$ # 'expected_time' fields. A join can be performed using column names.
$ tsv-join -H -f test_info.tsv data.tsv --key-fields test_name --append-fields expected_time

See the reference docs or online help for details on specific tools. There is also documentation in the Tools Overview section of the main project README file.
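
How a field name or wildcard maps to field numbers can be sketched in plain shell. This is a rough illustration only, not the toolkit's own resolution logic: the helper name field_numbers is hypothetical, and it ignores the escape syntax described in the Field Syntax section.

```shell
# Hypothetical helper: print the 1-based numbers of the header fields
# whose names match a glob pattern.
# $1 = field name or glob such as '*_time', $2 = TSV file with a header line
field_numbers() {
    head -n 1 "$2" | tr '\t' '\n' | {
        n=0
        while IFS= read -r name; do
            n=$((n + 1))
            # The unquoted $1 is used as a shell pattern, so '*' acts as a wildcard.
            case $name in
                $1) printf '%s\n' "$n" ;;
            esac
        done
    }
}
```

With the six header fields listed above, `field_numbers '*_time' data.tsv` prints 3, 4 and 5 on separate lines, matching the field numbers noted in the tsv-select examples.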

Named field support addresses enhancement request #25. It was implemented via PRs #284 through #300.

Other Changes

  • Prebuilt binaries have been updated to use the latest LDC compiler (ldc-1.22.0).

v1.6.1 Minor Release

19 Apr 20:07
v1.6.1
68c6ff2

To download and unpack prebuilt binaries:

$ # Linux
$ curl -L https://github.com/eBay/tsv-utils/releases/download/v1.6.1/tsv-utils-v1.6.1_linux-x86_64_ldc2.tar.gz | tar xz

$ # MacOS
$ curl -L https://github.com/eBay/tsv-utils/releases/download/v1.6.1/tsv-utils-v1.6.1_osx-x86_64_ldc2.tar.gz | tar xz

Installation instructions are in the ReleasePackageReadme.txt file in the release package.

Release 1.6.1 Changes:

  • Performance improvement to tsv-split --lines-per-file functionality (PR #280).
  • Bug fix: Detect command line entered field ranges ending with field zero (PR #279).
  • Bug fix: @safe attribution changes to enable Windows compilation of bufferedByLine (Issue #282, PR #283).

v1.6.0 Release

28 Mar 04:29
v1.6.0
f9c0ef7

To download and unpack prebuilt binaries:

$ # Linux
$ curl -L https://github.com/eBay/tsv-utils/releases/download/v1.6.0/tsv-utils-v1.6.0_linux-x86_64_ldc2.tar.gz | tar xz

$ # MacOS
$ curl -L https://github.com/eBay/tsv-utils/releases/download/v1.6.0/tsv-utils-v1.6.0_osx-x86_64_ldc2.tar.gz | tar xz

Installation instructions are in the ReleasePackageReadme.txt file in the release package.

Release 1.6.0 Changes:

  • Prebuilt binaries have been updated to use the latest LDC compiler (1.20.1).

  • tsv-select: New feature, the ability to exclude fields (PR #267).

    Fields to exclude are specified with the --e|exclude option. Examples:

    $ # Drop the first field, keep everything else.
    $ # Equivalent to `cut -f 2- file.tsv`
    $ tsv-select --exclude 1 file.tsv
    
    $ # Drop fields 3-10, keep everything else
    $ tsv-select --exclude 3-10 file.tsv
    

    See the tsv-select reference for more information.

  • New tool: tsv-split (PR #270)

    tsv-split is used to split one or more input files into multiple output files. There are three modes of operation:

    • Fixed number of lines per file (--l|lines-per-file NUM): Each input block of NUM lines is written to a new file. This is similar to the Unix split utility.

    • Random assignment (--n|num-files NUM): Each input line is written to a randomly selected output file. Random selection is from NUM files.

    • Random assignment by key (--n|num-files NUM, --k|key-fields FIELDS): Input lines are written to output files using fields as a key. Each unique key is randomly assigned to one of NUM output files. All lines with the same key are written to the same file.

    Examples:

    $ # Split a file into files of 10,000 lines each.
    $ tsv-split data.txt --lines-per-file 10000 --dir split_files
    
    $ # Split a file into 1000 files with lines randomly assigned.
    $ tsv-split data.txt --num-files 1000 --dir split_files
    
    $ # Randomly assign lines to 1000 files using field 3 as a key.
    $ tsv-split data.tsv --num-files 1000 --key-fields 3 --dir split_files
    

    See the tsv-split reference for more information.
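
The "random assignment by key" mode can be sketched with awk to show the guarantee it provides: a key gets a random file the first time it is seen, and every later line with that key follows it. This is a stand-in under stated assumptions, not tsv-split's implementation. The function name split_by_key and the part_N.tsv file names are hypothetical, and the sketch keeps every output file open, so it only suits a small number of files.

```shell
# Hypothetical awk stand-in for random assignment by key.
# $1 = number of output files, $2 = key field number, $3 = output directory
# TSV is read from standard input.
split_by_key() {
    mkdir -p "$3" &&
    awk -F'\t' -v n="$1" -v k="$2" -v dir="$3" '
        BEGIN { srand() }
        {
            key = $k
            # First sighting of a key: pick one of the n files at random.
            if (!(key in bucket)) bucket[key] = int(rand() * n)
            # Every line with the same key goes to the same file.
            print > (dir "/part_" bucket[key] ".tsv")
        }'
}
```

A call such as `split_by_key 3 1 split_files < data.tsv` spreads the lines over at most three files while keeping all lines that share a field 1 value together.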

v1.5.0 Release

16 Feb 06:34
v1.5.0
31a318e

To download and unpack prebuilt binaries:

$ # Linux
$ curl -L https://github.com/eBay/tsv-utils/releases/download/v1.5.0/tsv-utils-v1.5.0_linux-x86_64_ldc2.tar.gz | tar xz

$ # MacOS
$ curl -L https://github.com/eBay/tsv-utils/releases/download/v1.5.0/tsv-utils-v1.5.0_osx-x86_64_ldc2.tar.gz | tar xz

Installation instructions are in the ReleasePackageReadme.txt file in the release package.

Release 1.5.0 Changes:

  • Prebuilt binaries have been updated to use the latest LDC compiler (1.20.0).

  • tsv-filter: Field list support (PR #259).

    Field lists provide a compact way to specify multiple fields for a command. Most tsv-utils tools already support field lists; now tsv-filter does as well. Examples:

    $ # Select lines where fields 1-10 are not empty.
    $ tsv-filter --not-empty 1-10 data.tsv
    
    $ # Select lines where fields 1-5 and 17 are less than 100
    $ tsv-filter --lt 1-5,17:100 data.tsv
    
  • tsv-filter: New field length tests based on either characters or bytes (PR #258).

    The new operators allow filtering on field length. Field length can be measured in either characters or bytes. (Characters can occupy multiple bytes in UTF-8.) Examples:

    $ # Keep only lines where field 3 is less than 50 characters
    $ tsv-filter --char-len-lt 3:50 data.tsv
    
    $ # Find lines where field 5 is more than 20 bytes
    $ tsv-filter --byte-len-gt 5:20 data.tsv
    

    Character length tests have names of the form: --char-len-[eq|ne|lt|le|gt|ge]. Byte length tests have names of the form: --byte-len-[eq|ne|lt|le|gt|ge].

  • tsv-filter: Improved error messages when invalid regular expressions are used.

    The error message printed by tsv-filter now includes the error text provided by the D regular expression engine. This is helpful when trying to debug complex regular expressions. Examples:

    $ # Old error message (tsv-filter 1.4.4)
    $ tsv-filter --regex 4:'abc(d|e' data.tsv
    [tsv-filter] Error processing command line arguments: Invalid values in option: '--regex 4:abc(d|e'. Expected: '--regex <field>:<val>' where <field> is a number and <val> is a regular expression.
    
    $ # New error message (tsv-filter 1.5.0)
    $ tsv-filter --regex 4:'abc(d|e' data.tsv
    [tsv-filter] Error processing command line arguments: Invalid regular expression: '--regex 4:abc(d|e'. no matching ')'
    Pattern with error: `abc(d|e` <--HERE-- ``
       Expected: '--regex <field>:<val>' or '--regex <field-list>:<val>' where <val> is a regular expression.
    

    The formatting of the message can be improved and is likely to be updated in the future.

  • tsv-uniq: Performance improvements (PRs #234, #235).

    Better memory management and other changes improved tsv-uniq performance by 5-35% depending on the operation.

  • tsv-sample: Performance improvements reading large data blocks from standard input (PR #238).

    Sampling and shuffling operations requiring that all data be read into memory were unnecessarily slow when large amounts of data were read from standard input. Performance issues were noticed with data sizes larger than 10 GB. This is now fixed.

  • Sample bash scripts included in release package (PR #254).

    Sample versions of the tsv-sort and tsv-sort-fast scripts described on the Tips and Tricks page are now included in the repository and in prebuilt binary packages.
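
The field list form of a tsv-filter test, such as `--lt 1-5,17:100` above, can be sketched as an awk stand-in that expands the list and applies the test to each field. This is an illustration only: the function name filter_lt is hypothetical, the sketch assumes a line passes only when every listed field passes, and unlike tsv-filter it silently treats non-numeric values as 0.

```shell
# Hypothetical awk stand-in for a "less than" test over a field list.
# $1 = field list such as "1-5,17", $2 = numeric threshold
# TSV is read from standard input.
filter_lt() {
    awk -F'\t' -v list="$1" -v limit="$2" '
        BEGIN {
            # Expand "1-5,17" into the individual field numbers 1 2 3 4 5 17.
            n = split(list, parts, ",")
            for (i = 1; i <= n; i++) {
                m = split(parts[i], r, "-")
                lo = r[1]; hi = (m > 1 ? r[2] : r[1])
                for (f = lo + 0; f <= hi + 0; f++) fields[++count] = f
            }
        }
        {
            # Drop the line as soon as one listed field fails the test.
            for (j = 1; j <= count; j++)
                if (!($(fields[j]) + 0 < limit + 0)) next
            print
        }'
}
```

For instance, `filter_lt 1-5,17 100 < data.tsv` mirrors the second field list example above.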

v1.4.4 Minor Release

23 Sep 17:53
v1.4.4
5182339

Changes:

  • New tsv-sample option --i|inorder

    This option preserves input order when using simple or weighted random sampling. These sampling modes are engaged when a sample size is selected via the --n|num NUM option. Documentation was updated to better reflect the distinction between shuffling the full data set and random sampling, which selects a subset of lines. (PR #226)

  • tsv-summarize --min and --max operators changed to preserve original input string

    The prior behavior of the operators was to read the values to a double, then use numeric formatting to print the recorded double. In some cases this would cause the original input to change, especially if it was a long format number, for example, 16 digits long. (PR #220)

    The prior behavior makes sense for calculations like mean and median, but not for min and max. In particular, preserving the original values allows them to be joined with or compared to the original data.

  • Prebuilt binaries have been updated to use the latest LDC compiler (1.17.0).
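
The difference between the old and new --min behavior can be sketched with two awk stand-ins. These are illustrations only, not tsv-summarize code. The function names are hypothetical, and awk's default number formatting stands in for whatever numeric formatting the tool applied.

```shell
# Hypothetical awk stand-ins; each reads one value per line from standard input.

# Old style: keep the minimum as a number, then format the number for output.
min_numeric() {
    awk 'NR == 1 || $1 + 0 < m { m = $1 + 0 } END { print m }'
}

# New style: compare numerically, but remember and print the original string.
min_preserved() {
    awk 'NR == 1 || $1 + 0 < m { m = $1 + 0; s = $1 } END { print s }'
}
```

Given the values 0.2000, 0.10 and 0.30, min_numeric prints 0.1 while min_preserved prints 0.10, which still matches the text in the original data and can therefore be joined against it.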

To download and unpack the prebuilt binaries:

$ # Linux
$ curl -L https://github.com/eBay/tsv-utils/releases/download/v1.4.4/tsv-utils-v1.4.4_linux-x86_64_ldc2.tar.gz | tar xz

$ # MacOS
$ curl -L https://github.com/eBay/tsv-utils/releases/download/v1.4.4/tsv-utils-v1.4.4_osx-x86_64_ldc2.tar.gz | tar xz