
Releases: epruesse/SINA

Minor fix (build issue w/o TBB Malloc)

13 Dec 20:06

See #98. No changes if TBB Malloc is present.

Minor fix (rounding error in classifier)

19 Aug 00:47
  • Fixes #93 where an LCA-Quorum of 0.8 on 10 results allowed only one outlier, rather than the expected 2.
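
A rounding error of this kind can arise when the number of allowed outliers is derived from a floating-point product and then truncated. The sketch below is a hypothetical Python illustration of that arithmetic, not SINA's actual code (SINA is written in C++); both function names and the tolerance value are assumptions made for the example.

```python
import math

def max_outliers_naive(quorum: float, n_results: int) -> int:
    # 1 - 0.8 is stored as 0.19999999999999996, so the product lands just
    # below 2 and truncation drops one allowed outlier.
    return int((1 - quorum) * n_results)

def max_outliers_robust(quorum: float, n_results: int) -> int:
    # Count the results that must agree, subtracting a small tolerance
    # before rounding up, then derive the outliers from that count.
    required = math.ceil(quorum * n_results - 1e-9)
    return n_results - required

print(max_outliers_naive(0.8, 10))   # 1
print(max_outliers_robust(0.8, 10))  # 2
```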

Improved CSV output

31 Jul 23:50

The old --meta-fmt CSV option has been deprecated in favor of having multiple output modules active. To get CSV output as well as the aligned sequences, you can now write -o aligned.fasta.gz -o aligned.csv. The fields that are written to CSV, FASTA or ARB output types can be configured with the --field (-f) parameter. SINA can now also show a list of all fields available in a reference ARB database using --arb-list-fields FILENAME.
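
When SINA is launched from a script, the multiple-output form can be assembled as an argument list with one -o per output file. The helper below is a hypothetical sketch: the function name and the file names query.fasta and reference.arb are placeholders, and the -i (query) and -r (reference database) flags are assumptions not stated in this release note.

```python
import shlex

def sina_command(query, reference, outputs):
    # -i and -r are assumed flags for the query file and reference database.
    cmd = ["sina", "-i", query, "-r", reference]
    for path in outputs:
        # One -o per output module; the output type follows the file name.
        cmd += ["-o", path]
    return cmd

cmd = sina_command("query.fasta", "reference.arb",
                   ["aligned.fasta.gz", "aligned.csv"])
print(shlex.join(cmd))
```

The list can be passed to subprocess.run as is, which avoids shell quoting issues with unusual file names.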

Changelog:

  • allow multiple output types at once
  • add dedicated CSV/TSV output (#10)
  • fix loading reference database from running ARB (#76)
  • report errors when sequence can't be read from ARB (#73)
  • add --arb-list-fields listing fields available in ARB
    database

Minor fixes

09 Mar 03:27
  • All progress bars are now silenced when the output is redirected to a file or pipe.
  • Progress bars no longer overwrite some of the previous output (i.e. the cursor is no longer moved up too often).

Speedups: Internal Kmer Search Now Default

26 Apr 01:22

With 1.6.0, the new, very fast internal search engine has become the default. The --search module has been parallelized and performance has been tweaked in many other places.

Here are some numbers:

| Input | Reference | Settings | 1.6.0 | 1.5.0 | Speedup |
|-------|-----------|----------|-------|-------|---------|
| V4 | SILVA NR | align | 282/s | 22/s | 12.8 |
| V4 | SILVA NR | align & classify | 185/s | 3/s | 61.7 |
| V4 | SILVA NR | turn & align & classify | 120/s | 3/s | 40 |
| full | SILVA NR | align | 42/s | 3/s | 14 |
| full | SILVA NR | align & classify | 35/s | 0.65/s | 58.3 |
| full | SILVA NR | turn & align & classify | 33/s | 0.6/s | 55 |
| V4 | test (38k) | align | 312/s | 225/s | 1.4 |
| V4 | test (38k) | align & classify | 265/s | 25/s | 10.6 |
| V4 | test (38k) | turn & align & classify | 260/s | 25/s | 10.4 |
| full | test (38k) | align | 58/s | 45/s | 1.3 |
| full | test (38k) | align & classify | 51/s | 9.6/s | 5.3 |
| full | test (38k) | turn & align & classify | 51/s | 6/s | 8.5 |

(Numbers from a Ryzen 1700 with 32GB and 16 threads)
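
The speedup column is the ratio of the two throughput figures (sequences per second in 1.6.0 divided by sequences per second in 1.5.0). A minimal sketch of that arithmetic for three of the "align" rows, with a hypothetical helper name:

```python
def speedup(new_per_s: float, old_per_s: float) -> float:
    # Ratio of throughputs, rounded to one decimal place.
    return round(new_per_s / old_per_s, 1)

print(speedup(282, 22))   # V4, SILVA NR, align    -> 12.8
print(speedup(42, 3))     # full, SILVA NR, align  -> 14.0
print(speedup(312, 225))  # V4, test (38k), align  -> 1.4
```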

Prerelease: speedups!!!

25 Mar 17:21
Pre-release

It's finally done. Please give it a spin.

With 1.6.0, the new internal search engine is becoming the default. The --search module has been parallelized and performance has been tweaked in many other places.

Towards an Internal Kmer Search Engine

07 Feb 02:59

Internal Kmer Search Update

With this release, the internal kmer search is nearing completion. The kmer index is now persisted to disk, computed in parallel, and uses a presence/absence optimization to reduce its total size and search time. It's many times faster than the original PT server based search. (You still need to use --num-pts, though, to make it use multiple threads.) Tweaks to the way SINA interacts with ARB and caches sequences internally have reduced the memory usage of the kmer search during both indexing and use, so that the current SILVA Ref NR can be processed on a 16GB machine.
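
The general idea of a kmer index can be sketched in a few lines. The Python below is an illustration only and not SINA's implementation: it reads "presence/absence" as recording only whether a kmer occurs in a reference (not where or how often), which is an assumption about what the optimization means, and all function names are hypothetical.

```python
from collections import defaultdict

def kmers(seq: str, k: int) -> set:
    # Distinct kmers only: presence/absence, no positions or counts.
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def build_index(references: dict, k: int) -> dict:
    # Map each kmer to the set of reference names that contain it.
    index = defaultdict(set)
    for name, seq in references.items():
        for kmer in kmers(seq, k):
            index[kmer].add(name)
    return index

def search(index: dict, query: str, k: int, n: int) -> list:
    # Rank references by the number of distinct kmers shared with the query.
    shared = defaultdict(int)
    for kmer in kmers(query, k):
        for name in index.get(kmer, ()):
            shared[name] += 1
    ranked = sorted(shared, key=lambda name: (-shared[name], name))
    return ranked[:n]

refs = {"a": "ACGTACGT", "b": "TTTTTTTT", "c": "ACGTTTTT"}
index = build_index(refs, k=4)
print(search(index, "ACGTAC", k=4, n=2))  # ['a', 'c']
```

Building the index once and reusing it for every query is what makes persisting it to disk worthwhile.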

Documentation Update

The documentation is now up to date with the current features. A man page is distributed with SINA and available via man sina from conda environments. Text-file versions are shipped in share/doc/sina, and a pretty HTML version rendered by Sphinx is available at https://sina.readthedocs.io.

Evaluation Options Reinstated

The options --show-dist and --fs-msc-max have been reinstated to allow evaluating the accuracy of SINA. New unit tests are in place to verify that the accuracy doesn't accidentally drop. These will help make the switch to the internal kmer search without risking significant changes to the overall accuracy.

Changelog

  • update documentation (#20)
  • reinstate --show-dist
  • reinstate --fs-msc-max
  • add choice "exact" to --search-iupac
  • change default for --search-kmer-len to match --fs-kmer-len
  • parallelize launch of background PT servers
  • lower memory usage:
    • avoid redundant sequence caching by libARBDB
    • use compact aligned base (50% on internal sequence cache)
  • improve internal kmer search performance
    • add caching of kmer index on disk
    • parallelize kmer index construction
    • add presence/absence optimization
  • fix field align_ident_slv added for 100% matches even when
    not enabled
  • fix crash on overhang past alignment edge
  • fix libARBDB writing to stdout, clobbering sequence output
  • fix out-of-bounds access on iterator in NAST implementation
  • remove dependency on boost serialization library
  • build release binaries with GCC 7 and C++11 ABI
  • add integration tests watching for accuracy regressions (#25)

Full Changelog on ReadTheDocs

Parallel SINA

09 Nov 19:06

Parallel SINA is here!

Use --num-pts N to specify the number of PT servers you would like working in parallel. The rest of SINA will adapt dynamically to the available resources (if you must, adjust it with --threads).

Please remember that the PT server is rather memory hungry. If you set --num-pts too high, you will run out of memory and SINA will crash.

Other Improvements:

Add search result to output:

Using --add-relatives N you can now ask SINA to add the search result sequences to the sequence output file. If you have --search enabled, it will use the N best results from the alignment-based homology search. Otherwise, it will use the N sequences with the highest relative number of kmers shared with each query. Each reference sequence will be added only once.
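
The "added only once" behavior amounts to de-duplicating the top hits across all queries. A minimal sketch in Python, assuming each query already has a ranked list of reference names (the function name and data layout are hypothetical, not SINA's internals):

```python
def collect_relatives(hits_per_query: list, n: int) -> list:
    # Take the n best hits of every query, keeping each reference only once
    # and preserving the order in which it was first seen.
    seen = set()
    output = []
    for hits in hits_per_query:
        for ref in hits[:n]:
            if ref not in seen:
                seen.add(ref)
                output.append(ref)
    return output

print(collect_relatives([["r1", "r2", "r3"], ["r2", "r4"]], n=2))
# ['r1', 'r2', 'r4']
```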

Input / Output:

SINA will now read and write gzipped FASTA files transparently. You can also use - as input/output file name to pipe sequences through SINA.
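
Scripts that post-process SINA output may want the same convenience. The sketch below shows one way to open a FASTA file transparently in Python; detecting gzip by its two magic bytes and the helper name are choices made for this example, and how SINA itself detects compression is not stated here.

```python
import gzip
import sys

def open_sequences(path: str):
    # "-" means standard input, mirroring the pipe convention.
    if path == "-":
        return sys.stdin
    # Gzip streams start with the magic bytes 0x1f 0x8b.
    with open(path, "rb") as handle:
        magic = handle.read(2)
    if magic == b"\x1f\x8b":
        return gzip.open(path, "rt")
    return open(path, "rt")
```

Checking the magic bytes rather than the file extension means a compressed file with an unexpected name is still read correctly.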

Logging

SINA now has an actual logging facility. You can change its verbosity with -q and -v (repeat to decrease or increase it further). The log file specified with --log-file will always be verbose (but not include debug messages).
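
Repeatable -q and -v flags are commonly implemented as counters that shift a log level up or down. The Python sketch below illustrates that pattern; the level names, the default of INFO, and the function name are assumptions for the example and do not describe SINA's actual levels.

```python
import argparse
import logging

def log_level(argv: list) -> int:
    parser = argparse.ArgumentParser()
    # action="count" lets -v and -q be repeated, e.g. -vv or -q -q.
    parser.add_argument("-v", action="count", default=0)
    parser.add_argument("-q", action="count", default=0)
    args = parser.parse_args(argv)
    levels = [logging.ERROR, logging.WARNING, logging.INFO, logging.DEBUG]
    # Start at INFO (assumed default) and clamp to the available range.
    index = 2 + args.v - args.q
    return levels[max(0, min(index, len(levels) - 1))]

print(logging.getLevelName(log_level(["-q"])))   # WARNING
print(logging.getLevelName(log_level(["-vv"])))  # DEBUG
```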

Parallel SINA - Preview

30 Oct 03:29
Pre-release

Parallelization makes a whole new class of bugs possible. If this breaks, stalls, crashes or otherwise misbehaves, please create an issue!

  • process sequences in parallel (#17, #31)
  • add support for gzipped read/write (#29)
  • add support for "-" to read/write using pipes
  • remove internal pipeline in favor of TBB
  • add option --add-relatives; adds ref sequences to output (#19)
  • add logging with variable verbosity (#14)
  • be smart about locating arb_pt_server binary (#30)

Maintenance Release

20 Sep 02:09
  • report number of references discarded due to configured constraints
  • fix crash (regression) if no acceptable references found for a query
  • fix --search causing a program option error (#28)
  • fix race condition in terminating PT server