Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add krakenuniq summary kmer filter #998

Open
wants to merge 14 commits into
base: master
Choose a base branch
from

Conversation

tomkinsc
Copy link
Member

@tomkinsc tomkinsc commented Dec 3, 2019

To address broadinstitute/viral-classify#1, this adds a new command, krakenuniq_report_filter, to metagenomics.py:

usage: metagenomics.py subcommand krakenuniq_report_filter [-h]
                                                           [--fieldToFilterOn {num_reads,uniq_kmers}]
                                                           [--fieldToAdjust {num_reads,uniq_kmers} [{num_reads,uniq_kmers} ...]]
                                                           [--keepAboveN KEEP_THRESHOLD]
                                                           [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                                           [--version]
                                                           [--tmp_dir TMP_DIR]
                                                           [--tmp_dirKeep]
                                                           summary_file_in
                                                           summary_file_out

Filter a krakenuniq report by field to include rows above some threshold,
where contributions to the value from subordinate levels are first removed
from the value.

positional arguments:
  summary_file_in       Input KrakenUNiq-format summary text file with tab-
                        delimited fields and indented taxonomic levels.
  summary_file_out      Output KrakenUNiq-format summary text file with tab-
                        delimited fields and indented taxonomic levels.

optional arguments:
  -h, --help            show this help message and exit
  --fieldToFilterOn {num_reads,uniq_kmers}
                        The field to filter on (default: uniq_kmers).
  --fieldToAdjust {num_reads,uniq_kmers} [{num_reads,uniq_kmers} ...]
                        The field to adjust along with the --fieldToFilterOn
                        (default: ['num_reads']).
  --keepAboveN KEEP_THRESHOLD
                        Only taxa with values above this will be kept. Higher
                        taxonomic ranks will have their values reduced by the
                        values of removed sub-ranks (default: 100)
  --loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}
                        Verboseness of output. [default: INFO]
  --version, -V         show program's version number and exit
  --tmp_dir TMP_DIR     Base directory for temp files. [default:
                        /var/folders/bx/90px0g1n1v122slzjnyvrsk5gn42vm/T/]
  --tmp_dirKeep         Keep the tmp_dir if an exception occurs while running.
                        Default is to delete all temp files at the end, even
                        if there's a failure. (default: False)

The behavior of this command is such that using a depth-first traversal, the lowest rows in the report have the value in the field specified by --fieldToFilterOn (default: uniq_kmers) zeroed out if their value is below the threshold given by --keepAboveN (default: 100). Under the assumption that higher taxonomic levels have cumulative values including contributions from the zeroed-out rows, the values of the selected field in higher levels are reduced by the amount contained within lower-levels that were below the specified threshold (the subtraction is propagated up the tree to the root node). Since the traversal is depth-first, the higher levels are eventually re-evaluated to see if they no longer meet the threshold after being subjected to subtraction of their lower levels.

The hierarchy of rows is read based on the indentation of the taxName column since many rows do not have a formal taxonomic rank assigned (i.e. their rank is "no rank").

Secondarily, the fields specified by --fieldToAdjust (default: num_reads) are similarly adjusted using the conditional threshold established by --keepAboveN and --fieldToFilterOn: their values are subtracted similarly, with propagation up the tree.

After adjustment to these counts across the entire tree, the part-of-whole percentages for the rows are reflected to reflect the new read counts. The resulting tree is then written out to a new KrakenUniq-format report, filtered to include only those rows meeting the initial threshold criterion.

@dpark01
Copy link
Member

dpark01 commented Dec 3, 2019

Fun, you did it!

I think the whole time I was thinking of this little task, I always figured that the only way I'd really understand how it worked and whether it was working properly was to stare at how it handled a small suite of unit tests on really simple/stripped down inputs demonstrating the various edge cases. Something like:

  1. a report that emerges unchanged after filtering (because its values are large enough)
  2. a report that gets a few taxonomic leaves trimmed off (but no higher nodes)
  3. a report that gets a higher node pruned in the first pass (before re-summing)
  4. a report that gets a higher node pruned off after re-summing (due to pruned leaves)

Not sure if that's quite the right set of test cases so feel free to rethink it. But can you add unit tests?

Also: likely for a separate PR, but it'd be nice to add an optional param to provide a specific exclusion list of taxids (which would always remove those nodes including any lower ranking nodes beneath it.. and then recompute/sum upwards).

@tomkinsc
Copy link
Member Author

tomkinsc commented Dec 3, 2019

Yup, I'll add unit tests—just wanted to get this open to avoid duplication of effort in case this was on anyone else's agenda.

add new test type: assertApproxEqualValuesInDelimitedFiles(self, file_one, file_two, dialect="tsv", numeric_rel_tol=1e-5, header_lines_to_skip=0
add test class TestKrakenUniqSummaryFilter, with first test case: test_unchanged_report
self._test_report("should_have_leaves_trimmed.txt")

#def test_higher_node_trimmed_after_resumming(self):
# self._test_report("should_have_higher_node_trimmed_after_resumming.txt")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

anything wrong with these tests at the moment?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nothing wrong--they're just not implemented yet (have to make the synthetic input/output).

test/__init__.py Outdated
@@ -111,6 +114,47 @@ def inputs(self, *fnames):
'''Return the full filenames for files in the test input directory for this test class'''
return [self.input(fname) for fname in fnames]

def assertApproxEqualValuesInDelimitedFiles(self, file_one, file_two, dialect="tsv", numeric_rel_tol=1e-5, header_lines_to_skip=0):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh man -- this test function will definitely come in handy

extend assertApproxEqualValuesInDelimitedFiles() test to use dicts in the case where a header line of field names is available so value comparisons can be made independent of column order. This is only used if use_first_processed_line_for_fieldnames=True. is_number() moved to util.misc; zip_dicts() added to util.misc
@yesimon
Copy link
Contributor

yesimon commented Apr 23, 2020

Don't mind if I pull this into viral-classify?

@dpark01
Copy link
Member

dpark01 commented Apr 23, 2020

@yesimon please do -- but on a separate branch/pr from the refactor one (which will take longer to vet)

@tomkinsc
Copy link
Member Author

Looks like this never got moved over to viral-classify; shall we?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants