add krakenuniq summary kmer filter #998

tomkinsc · 2019-12-03T15:47:45Z

To address broadinstitute/viral-classify#1, this adds a new command, krakenuniq_report_filter, to metagenomics.py:

usage: metagenomics.py subcommand krakenuniq_report_filter [-h]
                                                           [--fieldToFilterOn {num_reads,uniq_kmers}]
                                                           [--fieldToAdjust {num_reads,uniq_kmers} [{num_reads,uniq_kmers} ...]]
                                                           [--keepAboveN KEEP_THRESHOLD]
                                                           [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                                           [--version]
                                                           [--tmp_dir TMP_DIR]
                                                           [--tmp_dirKeep]
                                                           summary_file_in
                                                           summary_file_out

Filter a krakenuniq report by field to include rows above some threshold,
where contributions to the value from subordinate levels are first removed
from the value.

positional arguments:
  summary_file_in       Input KrakenUNiq-format summary text file with tab-
                        delimited fields and indented taxonomic levels.
  summary_file_out      Output KrakenUNiq-format summary text file with tab-
                        delimited fields and indented taxonomic levels.

optional arguments:
  -h, --help            show this help message and exit
  --fieldToFilterOn {num_reads,uniq_kmers}
                        The field to filter on (default: uniq_kmers).
  --fieldToAdjust {num_reads,uniq_kmers} [{num_reads,uniq_kmers} ...]
                        The field to adjust along with the --fieldToFilterOn
                        (default: ['num_reads']).
  --keepAboveN KEEP_THRESHOLD
                        Only taxa with values above this will be kept. Higher
                        taxonomic ranks will have their values reduced by the
                        values of removed sub-ranks (default: 100)
  --loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}
                        Verboseness of output. [default: INFO]
  --version, -V         show program's version number and exit
  --tmp_dir TMP_DIR     Base directory for temp files. [default:
                        /var/folders/bx/90px0g1n1v122slzjnyvrsk5gn42vm/T/]
  --tmp_dirKeep         Keep the tmp_dir if an exception occurs while running.
                        Default is to delete all temp files at the end, even
                        if there's a failure. (default: False)

This command walks the report tree depth-first. Starting from the lowest rows, any row whose value in the field given by --fieldToFilterOn (default: uniq_kmers) is below the --keepAboveN threshold (default: 100) has that value zeroed out. On the assumption that values at higher taxonomic levels are cumulative and include contributions from the zeroed-out rows, the value at each higher level is reduced by the amount removed beneath it, and the subtraction is propagated up the tree to the root node. Because the traversal is depth-first, each higher level is re-evaluated after this subtraction to check whether it still meets the threshold.
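To make the traversal concrete, here is a minimal sketch of the subtract-and-re-evaluate pass on a toy tree. It is not the code in metagenomics.py: the `Node` class and `prune` function are hypothetical names, and treating a value exactly equal to the threshold as kept is an assumption.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """One report row; `value` is cumulative (it includes all descendants)."""
    name: str
    value: int
    children: List["Node"] = field(default_factory=list)


def prune(node: Node, threshold: int) -> int:
    """Post-order pass: return the total amount removed from this subtree."""
    # Visit children first so their removals are known before this node is judged.
    removed = sum(prune(child, threshold) for child in node.children)
    node.value -= removed
    # Re-evaluate this node against the threshold after the subtraction.
    # Assumption: a value equal to the threshold is kept.
    if node.value < threshold:
        removed += node.value
        node.value = 0
    return removed


leaf_a = Node("leaf_a", 150)
leaf_b = Node("leaf_b", 40)
genus = Node("genus", 200, [leaf_a, leaf_b])  # 10 assigned directly to the genus
root = Node("root", 200, [genus])
prune(root, threshold=100)
print(root.value, genus.value, leaf_a.value, leaf_b.value)  # prints: 160 160 150 0
```

Returning the removed amount from each call is one way to carry the subtraction up to the root without a second pass over the tree.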

The hierarchy of rows is read based on the indentation of the taxName column since many rows do not have a formal taxonomic rank assigned (i.e. their rank is "no rank").
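As an illustration of reading the hierarchy from indentation alone, the sketch below maps each row to the index of its parent using a stack of open ancestors. The function name and the sample names are hypothetical, and the sketch only assumes that a child is indented more deeply than its parent, not any particular indent width.

```python
from typing import List, Optional, Tuple


def parents_from_indentation(names: List[str]) -> List[Optional[int]]:
    """For each indented name, return the index of its parent row (None for roots)."""
    parents: List[Optional[int]] = []
    stack: List[Tuple[int, int]] = []  # (indent width, row index) of open ancestors
    for i, raw in enumerate(names):
        indent = len(raw) - len(raw.lstrip(" "))
        # Close every ancestor that is indented at least as deeply as this row.
        while stack and stack[-1][0] >= indent:
            stack.pop()
        parents.append(stack[-1][1] if stack else None)
        stack.append((indent, i))
    return parents


tax_names = [
    "root",
    "  Viruses",
    "    Filoviridae",
    "      Ebolavirus",
    "    Flaviviridae",
    "  Bacteria",
]
print(parents_from_indentation(tax_names))  # prints: [None, 0, 1, 2, 1, 0]
```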

Secondarily, the fields specified by --fieldToAdjust (default: num_reads) are adjusted in the same way: for each row that fails the --keepAboveN threshold on --fieldToFilterOn, its values in these fields are subtracted as well, with propagation up the tree.
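A rough sketch of how one field can drive the decision while several fields are adjusted might look like the following. The nested-dict row layout and the `prune_fields` name are assumptions made for illustration, not the structures used in the PR.

```python
from typing import Dict, List


def prune_fields(row: dict, filter_on: str, adjust: List[str], threshold: int) -> Dict[str, int]:
    """Post-order pass over nested dict rows; returns the per-field amounts removed."""
    fields = [filter_on] + [f for f in adjust if f != filter_on]
    removed = {f: 0 for f in fields}
    for child in row.get("children", []):
        for f, amount in prune_fields(child, filter_on, adjust, threshold).items():
            removed[f] += amount
    for f in fields:
        row[f] -= removed[f]
    # Only the filter field decides whether the row is dropped...
    if row[filter_on] < threshold:
        # ...but every tracked field is zeroed and subtracted from the ancestors.
        for f in fields:
            removed[f] += row[f]
            row[f] = 0
    return removed


tree = {
    "name": "genus", "uniq_kmers": 500, "num_reads": 60,
    "children": [
        {"name": "sp_a", "uniq_kmers": 450, "num_reads": 50},
        {"name": "sp_b", "uniq_kmers": 50, "num_reads": 10},
    ],
}
removed = prune_fields(tree, "uniq_kmers", ["num_reads"], threshold=100)
print(removed, tree["uniq_kmers"], tree["num_reads"])
```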

After these counts have been adjusted across the entire tree, the part-of-whole percentages for the rows are recomputed to reflect the new read counts. The resulting tree is then written out to a new KrakenUniq-format report, filtered to include only those rows meeting the initial threshold criterion.
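The final step could be sketched as below. The row dicts, the `finalize` name, and the `pct` key are hypothetical, and using the adjusted root read count as the denominator is an assumption, since the description does not say which total the percentages are taken against.

```python
from typing import Dict, List


def finalize(rows: List[Dict], total_reads: int, threshold: int) -> List[Dict]:
    """Recompute each row's percentage, then keep rows that still meet the threshold."""
    for row in rows:
        row["pct"] = 100.0 * row["num_reads"] / total_reads if total_reads else 0.0
    return [row for row in rows if row["uniq_kmers"] >= threshold]


# Counts as they might look after the subtraction pass (made-up numbers).
rows = [
    {"taxName": "root", "num_reads": 200, "uniq_kmers": 900},
    {"taxName": "  genus_a", "num_reads": 150, "uniq_kmers": 700},
    {"taxName": "    species_b", "num_reads": 0, "uniq_kmers": 0},
    {"taxName": "  genus_c", "num_reads": 50, "uniq_kmers": 200},
]
# Assumption: the denominator is the adjusted read count of the root row.
kept = finalize(rows, total_reads=rows[0]["num_reads"], threshold=100)
for row in kept:
    print(f'{row["pct"]:.2f}\t{row["num_reads"]}\t{row["taxName"]}')
```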

…ort_filter update read counts and percentages in metagenomics.py::krakenuniq_report_filter to reflect nodes removed based on filtering

dpark01 · 2019-12-03T19:36:35Z

Fun, you did it!

I think the whole time I was thinking of this little task, I always figured that the only way I'd really understand how it worked and whether it was working properly was to stare at how it handled a small suite of unit tests on really simple/stripped down inputs demonstrating the various edge cases. Something like:

a report that emerges unchanged after filtering (because its values are large enough)
a report that gets a few taxonomic leaves trimmed off (but no higher nodes)
a report that gets a higher node pruned in the first pass (before re-summing)
a report that gets a higher node pruned off after re-summing (due to pruned leaves)

Not sure if that's quite the right set of test cases so feel free to rethink it. But can you add unit tests?

Also: likely for a separate PR, but it'd be nice to add an optional param that takes a specific exclusion list of taxids (which would always remove those nodes, including any lower-ranking nodes beneath them, and then recompute/sum upwards).
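One possible shape for that exclusion idea, purely as a sketch: drop each excluded node together with its subtree and subtract its cumulative count from every ancestor. The function name, the nested-dict layout, and the taxids 111 and 222 are made up for illustration; nothing here exists in the PR.

```python
from typing import Set


def exclude_taxids(node: dict, excluded: Set[int]) -> int:
    """Drop excluded children (with their subtrees); return reads removed below `node`."""
    removed = 0
    kept = []
    for child in node.get("children", []):
        if child["taxid"] in excluded:
            # The cumulative count already covers the child's whole subtree.
            removed += child["num_reads"]
        else:
            removed += exclude_taxids(child, excluded)
            kept.append(child)
    node["children"] = kept
    node["num_reads"] -= removed
    return removed


report = {
    "taxid": 1, "num_reads": 300, "children": [
        {"taxid": 10239, "num_reads": 200, "children": [
            {"taxid": 111, "num_reads": 120},  # hypothetical taxid
            {"taxid": 222, "num_reads": 80},   # hypothetical taxid
        ]},
        {"taxid": 2, "num_reads": 100},
    ],
}
removed_reads = exclude_taxids(report, {222})
print(removed_reads, report["num_reads"], report["children"][0]["num_reads"])  # prints: 80 220 120
```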

tomkinsc · 2019-12-03T19:39:19Z

Yup, I'll add unit tests—just wanted to get this open to avoid duplication of effort in case this was on anyone else's agenda.

add new test type: assertApproxEqualValuesInDelimitedFiles(self, file_one, file_two, dialect="tsv", numeric_rel_tol=1e-5, header_lines_to_skip=0

add test class TestKrakenUniqSummaryFilter, with first test case: test_unchanged_report

dpark01 · 2019-12-09T14:11:46Z

test/unit/test_metagenomics.py

+        self._test_report("should_have_leaves_trimmed.txt")
+
+    #def test_higher_node_trimmed_after_resumming(self):
+    #    self._test_report("should_have_higher_node_trimmed_after_resumming.txt")


anything wrong with these tests at the moment?

Nothing wrong--they're just not implemented yet (have to make the synthetic input/output).

dpark01 · 2019-12-09T14:12:05Z

test/__init__.py

@@ -111,6 +114,47 @@ def inputs(self, *fnames):
        '''Return the full filenames for files in the test input directory for this test class'''
        return [self.input(fname) for fname in fnames]

+    def assertApproxEqualValuesInDelimitedFiles(self, file_one, file_two, dialect="tsv", numeric_rel_tol=1e-5, header_lines_to_skip=0):


oh man -- this test function will definitely come in handy

extend assertApproxEqualValuesInDelimitedFiles() test to use dicts in the case where a header line of field names is available so value comparisons can be made independent of column order. This is only used if use_first_processed_line_for_fieldnames=True. is_number() moved to util.misc; zip_dicts() added to util.misc
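For readers who want a feel for the header-keyed comparison, here is a simplified sketch that works on in-memory text instead of files. It is not the helper added in test/__init__.py: `delimited_approx_equal` is a hypothetical name, and the local `is_number` is only a stand-in for the one the commit moves to util.misc.

```python
import csv
import io
import math


def is_number(text) -> bool:
    """Local stand-in: True if the value parses as a float."""
    try:
        float(text)
        return True
    except (TypeError, ValueError):
        return False


def delimited_approx_equal(text_one: str, text_two: str,
                           delimiter: str = "\t", rel_tol: float = 1e-5) -> bool:
    """Compare two delimited tables row by row, keyed on the header field names."""
    rows_one = list(csv.DictReader(io.StringIO(text_one), delimiter=delimiter))
    rows_two = list(csv.DictReader(io.StringIO(text_two), delimiter=delimiter))
    if len(rows_one) != len(rows_two):
        return False
    for a, b in zip(rows_one, rows_two):
        if a.keys() != b.keys():
            return False
        for key in a:
            if is_number(a[key]) and is_number(b[key]):
                # Numeric cells only need to agree within the relative tolerance.
                if not math.isclose(float(a[key]), float(b[key]), rel_tol=rel_tol):
                    return False
            elif a[key] != b[key]:
                return False
    return True


one = "taxID\tnum_reads\tpct\n9606\t100\t50.0\n"
two = "pct\ttaxID\tnum_reads\n50.00001\t9606\t100\n"  # same data, different column order
print(delimited_approx_equal(one, two))  # prints: True
```

Keying each row by field name is what makes the comparison independent of column order, at the cost of requiring a header line in both inputs.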

yesimon · 2020-04-23T19:05:22Z

Don't mind if I pull this into viral-classify?

dpark01 · 2020-04-23T19:06:29Z

@yesimon please do -- but on a separate branch/pr from the refactor one (which will take longer to vet)

tomkinsc · 2020-11-17T06:02:14Z

Looks like this never got moved over to viral-classify; shall we?

tomkinsc added 5 commits December 3, 2019 00:47

initial commit of metagenomics.py::krakenuniq_report_filter

bc5abd7

update read counts and percentages in metagenomics.py::krakenuniq_rep…

63fc075

…ort_filter update read counts and percentages in metagenomics.py::krakenuniq_report_filter to reflect nodes removed based on filtering

parser_krakenuniq_report_filter help typo fix

4a4072c

cruft removal

da7f6ea

consistent util.fil.open_or_gzopen calls

82dac27

tomkinsc added 5 commits December 3, 2019 14:39

Merge branch 'master' into ct-krakenuniq-summary-kmer-filter

c7a5270

add new test type: assertApproxEqualValuesInDelimitedFiles

f153b0e

add new test type: assertApproxEqualValuesInDelimitedFiles(self, file_one, file_two, dialect="tsv", numeric_rel_tol=1e-5, header_lines_to_skip=0

pad each level with single space since we aren't collapsing doubles

1f3377e

add test class TestKrakenUniqSummaryFilter

f1bbad2

add test class TestKrakenUniqSummaryFilter, with first test case: test_unchanged_report

add test TestKrakenUniqSummaryFilter::test_leaves_trimmed

3742438

dpark01 reviewed Dec 9, 2019

tomkinsc added 4 commits December 9, 2019 10:12

Merge branch 'master' into ct-krakenuniq-summary-kmer-filter

5e8238c

regex raw string

bb7fb48

Merge branch 'master' into ct-krakenuniq-summary-kmer-filter

016ff03
