Build and classification parameters suggestions based on data type (aDNA) #288

ksavhughes · 2024-02-20T16:30:21Z

Hi again, another thing that I would absolutely love your input on is build and classification parameters based on the type of data I work with, which is ancient DNA, so the reads are short (average length around 70 bp; shortest are around 30 bp) and damaged.

If it would help, here is some more information about what we'd be using ganon for:
We are trying to use read classification tools for multiple aDNA projects - so keep in mind, everything is short and damaged. We’re thinking of building a custom database that has target species (along with human and Unive_Core) in the hopes of being able to use it to identify samples to the species level for bones that are morphologically hard to distinguish. We also want to build a much larger database with ALL the RefSeq genomes (limited to 3 assemblies per species) - we're probably going to split this database into categories since ganon can classify reads to multiple dbs. We were hoping to use this RefSeq db for sediment and coprolite projects - some examples of what we are trying to find include environment/host microbial communities, host diet, and for some of the sediment stuff we want to see if we can identify animal community too.

Since the reads are short, it would be nice to have shorter kmer lengths to maximize the number of possible kmers per read to match to reference kmers. But I can also see how the complexity/uniqueness of the kmers is decreased as you shorten them, possibly making spurious matches more prevalent. I’m trying to find that balance with the build, do you have a suggestion for kmer-size, minimizer-size, and possibly false positive rate for a db build for classifying ancient DNA?

On the classification side, I’m struggling with finding balance with the cutoff and filter options for similar reasons. Since ancient reads are short, it limits the possible number of kmers per read, meaning the possible kmers shared between a read and reference is also limited. So, finding a balance where I have strict enough parameters where spurious matches are filtered out but also where short reads can be classified and allow for potential errors in matching due to ancient damage. If you can at least point me in the right direction, that would be great! Maybe suggest a range for cutoff and filter that I can test out with some ancient data?

And for reads with multiple matches, would you use LCA or the EM algorithm? Is there a reason you’d use one over the other? Also, it seems like the EM and LCA algorithms to solve multiple-matches are not necessarily mutually exclusive? Is there a way to use them in a hierarchical way or apply one then the other? Do you recommend doing that? Anything else you can think of that I should consider based on my data type would be much appreciated!

pirovc · 2024-02-21T10:26:40Z

Nice project! I used ganon once for ancient DNA some years ago, so I'm aware of the difficulties with this type of data.

The smaller the k-mer size, the higher the chances of spurious matches. You could start by trying to reduce the window size (e.g. 23), since it works as a compression for the reference data at a cost of sensitivity. Your databases will get bigger but you gonna get more accurate results. You can go all the way and use window size = k-mer size, but keep in mind that will require lots of memory for larger reference sets like the RefSeq. You can try to lower the k-mer size to 17, 15, but I think anything below that you get too much noise. The false positive rate used by default is already very conservative, but the same, the lower you go, the more resources you need.

Regarding thresholds, it's very hard to give a specific advice because it depends on your reference, reads, parameters, etc. I would try to get a feeling for it analyzing the outcomes in a range from low values (e.g. --rel-cutoff 0.1) to high (0.9). I guess the --rel-filter complicates a bit this process, so I would keep that fixed at either 0 or 1 and work with --rel-cutoff first.

I would use the EM algorithm, it gives better results based on benchmarks in almost every scenario. With EM you get your reads classified at the max. possible taxonomic level of your database, while the LCA will be more conservative and report results everywhere in the taxonomy tree. LCA and EM are not necessarily mutually exclusive, you can have both:

with ganon classify use --multiple-matches lca --output-all. This will give you the LCA output in the .rep and .tre files. If you also need to have the classification of each read based on the LCA, use --output-one to get the .one.
to get the EM from the results above, run ganon reassign using the .all generated. This will create another .rep and .one based on EM algorithm. To get the .tre report, will you need to run ganon report in this output.

If this a long project involving lots of data, I think it's worth to create a aDNA simulated dataset similar to your expected data with some ground truth to evaluate the effect of parameter changes. Once you have that, you could use Metabench to execute ganon in a range of parameters and evaluate the outcome with the dashboard (like this used in the ganon2 pre-print).

Another thing that may be helpful for aDNA is gapped k-mers. Although planned, this is not yet a function in ganon.

ksavhughes · 2024-03-04T00:32:24Z

Hi again, thank you for both your responses! They've been a huge help!

If you've used ganon for ancient dna before, can you tell me what build and classification parameters you used for it? I'll still do my own testing of course (I've already have some simulated data I can play with), but it would be nice to include what you used in our tests!

I have a couple more questions after you've given me all this very helpful information. Honestly, thank you again for your thorough responses! It has helped me understand ganon in a much deeper way.

Do read kmers have to be an exact match of a reference minimizer to be considered a match? I know Kraken2 is exact matches, but then when the ExpectationMaximization algorithm is discussed in the paper, the terms "errors" and
"mismatches" sort of confuse me. It seems to me when you say error it means that if there are 19 unique kmers in a read and 17 of those match to one reference, in that instance this read-to-reference match has 2 errors, because 2 of its unique kmers do not match exactly to the reference's kmers. Is that right? Or are errors referring to mismatches between bases within a read kmer and reference kmer match?

Also, after the cutoff and filter thresholds are applied and there is only one match left, that means this is reported as a unique match and not a read with multiple matches. So that read would not undergo LCA or EM filtering?

pirovc · 2024-03-11T14:04:10Z

Hi again, thank you for both your responses! They've been a huge help!

If you've used ganon for ancient dna before, can you tell me what build and classification parameters you used for it? I'll still do my own testing of course (I've already have some simulated data I can play with), but it would be nice to include what you used in our tests!

The parameters in ganon back then were different and I don't have the details anymore (it was very experimental), but I did treat reads before with https://github.com/grenaud/leeHom and remember getting a good amount of reads classified.

I have a couple more questions after you've given me all this very helpful information. Honestly, thank you again for your thorough responses! It has helped me understand ganon in a much deeper way.

Do read kmers have to be an exact match of a reference minimizer to be considered a match? I know Kraken2 is exact matches, but then when the ExpectationMaximization algorithm is discussed in the paper, the terms "errors" and "mismatches" sort of confuse me. It seems to me when you say error it means that if there are 19 unique kmers in a read and 17 of those match to one reference, in that instance this read-to-reference match has 2 errors, because 2 of its unique kmers do not match exactly to the reference's kmers. Is that right? Or are errors referring to mismatches between bases within a read kmer and reference kmer match?

A single kmer match exists if the same exact kmer happens in the read and a reference. The number of (exact) kmer shared among a read and reference(s) is what controls the "match" between them (with the cutoff and filter).
In the first version of ganon we used the concept of errors based on the k-mer counting lemma. In the current version, ganon does not use this concept. The matches are just controlled by the cutoff and filter, which are relative thresholds. In your example, 17 out of 19 kmers would be around 89% of matching kmers, meaning that any --rel-cutoff value between 0 and 0.89 would consider this a match. With --rel-cutoff 0.9 this read would be not classified. The Expectation Maximization is a post-processing step that re-assigns reads with multiple matches and has no importance on the matching process.

Also, after the cutoff and filter thresholds are applied and there is only one match left, that means this is reported as a unique match and not a read with multiple matches. So that read would not undergo LCA or EM filtering?

Yes. If a read has just one match after the cutoff and filter it will be considered unique. A read with a unique match will not change after LCA or EM since it's already assigned to a unique target.

pirovc added the question Further information is requested label Feb 21, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Build and classification parameters suggestions based on data type (aDNA) #288

Build and classification parameters suggestions based on data type (aDNA) #288

ksavhughes commented Feb 20, 2024

pirovc commented Feb 21, 2024

ksavhughes commented Mar 4, 2024

pirovc commented Mar 11, 2024

Build and classification parameters suggestions based on data type (aDNA) #288

Build and classification parameters suggestions based on data type (aDNA) #288

Comments

ksavhughes commented Feb 20, 2024

pirovc commented Feb 21, 2024

ksavhughes commented Mar 4, 2024

pirovc commented Mar 11, 2024