fix: Improved parameter estimation for (H)PHMM (#371)
### Description

This PR updates the (H)PHMM parameter estimation routine:

- It counts transitions a little differently than before:
  1. Conceptually expand CIGAR strings (i.e. `3M2D1I2M` → `1M1M1M1D1D1I1M1M`).
  2. Iterate over consecutive pairs, e.g. `(1M, 1M)`, and look at rseq and qseq at the corresponding positions.
  3. Classify each pair as one of the 82 transitions, e.g. `(1M, 1M)` together with `((T, T), (G, G))` corresponds to a `match T → match G` transition.
- It only looks at `N` records from the SAM/BAM/CRAM file, where `N` is calculated as a multinomial sample size estimate as described in [https://www.jstor.org/stable/2683352](https://www.jstor.org/stable/2683352). When the number of alignments in the file is known (which is the case for indexed BAM files but not for indexed SAM/CRAM), a finite population correction version of the estimate can be used instead.
- Currently, it iterates through the whole file with a fixed step size. However, this isn't any faster than simply looking at every record, so one could do any combination of the following instead:
  1. Only use the *first* `N` records (possibly a biased subsample, but definitely quicker).
  2. For CRAM, set low-level htslib reading options to skip fields that aren't used for estimation purposes (potentially quicker, but only for CRAM?).
  3. Look at samtools subsample, or simply use that as a preprocessing step.
  4. Make better use of indices for random access.

### QC
<!-- Make sure that you can tick the boxes below. -->

* [ ] The PR contains a test case for the changes or the changes are already covered by an existing test case.
* [ ] The documentation at https://github.com/varlociraptor/varlociraptor.github.io is updated in a separate PR to reflect the changes or this is not necessary (e.g. if the change modifies neither the calling grammar nor the behavior or functionalities of Varlociraptor).

---------

Co-authored-by: Johannes Köster <johannes.koester@tu-dortmund.de>
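
For illustration, the transition-counting steps from the description (expand the CIGAR string, then walk consecutive pairs together with the reference and query bases) can be sketched in Python. This is a minimal sketch, not the code from this PR: the function names `expand_cigar`, `aligned_columns` and `consecutive_pairs` are hypothetical, only the `M`, `I` and `D` operations are handled, and the final mapping of each pair onto one of the 82 transitions is left out because the description does not enumerate them.

```python
import re

# One CIGAR token: a run length followed by an operation character.
CIGAR_TOKEN = re.compile(r"(\d+)([MIDNSHP=X])")


def expand_cigar(cigar):
    """Expand a run-length CIGAR into one op per position,
    e.g. '3M2D' -> ['M', 'M', 'M', 'D', 'D']."""
    ops = []
    for length, op in CIGAR_TOKEN.findall(cigar):
        ops.extend(op * int(length))
    return ops


def aligned_columns(cigar, rseq, qseq):
    """Yield (op, ref_base, query_base) for each expanded op.

    '-' marks the side that has no base (assumption: only M, I and D
    occur; other operations raise an error in this sketch).
    """
    r = q = 0  # current positions in rseq and qseq
    for op in expand_cigar(cigar):
        if op == "M":
            yield op, rseq[r], qseq[q]
            r += 1
            q += 1
        elif op == "D":  # base present in the reference only
            yield op, rseq[r], "-"
            r += 1
        elif op == "I":  # base present in the query only
            yield op, "-", qseq[q]
            q += 1
        else:
            raise ValueError(f"operation {op!r} not handled in this sketch")


def consecutive_pairs(cigar, rseq, qseq):
    """Return all consecutive pairs of aligned columns."""
    cols = list(aligned_columns(cigar, rseq, qseq))
    return list(zip(cols, cols[1:]))


# Made-up sequences that fit the CIGAR from the description.
pairs = consecutive_pairs("3M2D1I2M", "ACGTTGA", "ACGCGA")
```

Each element of `pairs` holds two neighbouring columns, so a classifier for the actual transition set would only need to inspect the two operations and the four bases in one element.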