[WIP] Prototype: count k-mers in banded mode in a single pass #1818

standage · 2017-11-16T04:41:15Z

This PR is a Python prototype of the idea described in #1800. In brief, the idea is to count k-mers in the input reads and populate N counttables (each with ≈1/N of the k-mers) in a single pass.

This prototype implements a k-mer buffer that, when "full", writes to disk and flushes/resets. Once all the input reads are processed, the temp files are read back from disk 1-by-1 to populate a Counttable for each band.

The prototype works as expected, but the performance is painfully slow on large data sets. It would need a Cython or C++ implementation with threading support to be viable methinks. I would welcome any thoughts. cc @camillescott @betatim @ctb @luizirber

Is it mergeable?
make test Did it pass the tests?
make clean diff-cover If it introduces new functionality in
scripts/ is it tested?
make format diff_pylint_report cppcheck doc pydocstyle Is it well
formatted?
Did it change the command-line interface? Only backwards-compatible
additions are allowed without a major version increment. Changing file
formats also requires a major version number increment.
For substantial changes or changes to the command-line interface, is it
documented in CHANGELOG.md? See keepachangelog
for more details.
Was a spellchecker run on the source code and documentation after
changes were made?
Do the changes respect streaming IO? (Are they
tested for streaming IO?)

standage · 2017-11-16T05:03:55Z

curl -L https://osf.io/f5trh/download?version=1 -o reads.fq.gz
sandbox/count-band-single-pass.py \
    --ksize 25 --num-bands 4 --buffersize 1e7 --memory 1e7 \
    --outfmt "bandtest{}.ct" \
    reads.fq.gz

The procedure above will result in the same output as the Python commands below, which do 4 passes over the reads.

for i in range(4):
    counts = khmer.Counttable(25, 1e7 / 4, 4)
    counts.consume_seqfile_banding('reads.fq.gz', 4, i)
    of = 'bandtest{}.ct'.format(i)
    counts.save(of)

betatim · 2017-11-28T08:56:49Z

For my education: why split into bands but then create N counttables? I understand "let's make N bands because we have far too much data anyway" but that doesn't fit with then making N bands. Tim is confused.com

standage · 2017-11-28T17:28:49Z

We are investigating some use cases where we compute only on a single band, but there is at least one use case (motivated by kevlar and variant discovery) where we want to compute on each band independently. In this case, the banding provides us a way of randomly sharding the k-mers into N disjoint sets, each of which only requires 1/N of the memory required to store the entire data set.

This PR is looking for an efficient way to compute counts for all bands in a single first pass, rather than N passes.

Prototype: count k-mers in banded mode in a single pass

f4348f0

standage added the Work In Progress label Nov 16, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[WIP] Prototype: count k-mers in banded mode in a single pass #1818

[WIP] Prototype: count k-mers in banded mode in a single pass #1818

standage commented Nov 16, 2017

standage commented Nov 16, 2017 •

edited

betatim commented Nov 28, 2017

standage commented Nov 28, 2017

[WIP] Prototype: count k-mers in banded mode in a single pass #1818

Are you sure you want to change the base?

[WIP] Prototype: count k-mers in banded mode in a single pass #1818

Conversation

standage commented Nov 16, 2017

standage commented Nov 16, 2017 • edited

betatim commented Nov 28, 2017

standage commented Nov 28, 2017

standage commented Nov 16, 2017 •

edited