Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GroupReadsByUmi may fail when marking duplicates including secondary/supplementary reads #964

Draft
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

nh13
Copy link
Member

@nh13 nh13 commented Jan 30, 2024

There's an open issue in hts-specs about how we want to handle getting the primary alignment information when looking at a secondary or supplementary read: samtools/hts-specs#755

This PR adds the read primary "rp" tag to store the primary alignment for end of the current secondary/supplementary alignment, in the same format as the "SA" tag. The mate's primary alignment is stored in the "mp" tag. Both are currently lowercase as they are not reserved tags.

I have tested that ZipperBams will now add these, that SortBam will correctly sort in template-coordinate, and finally that GroupReadsByUmi passes. I added tests for GroupReadsByUmi and SamOrder.

Also, in my hands, secondary and supplementary records will never be output by GroupReadsByUmi as currently only primary alignments are output.

Copy link

codecov bot commented Jan 30, 2024

Codecov Report

Attention: 2 lines in your changes are missing coverage. Please review.

Comparison is base (371db03) 95.62% compared to head (1b5753c) 95.64%.

Files Patch % Lines
src/main/scala/com/fulcrumgenomics/bam/Bams.scala 90.00% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #964      +/-   ##
==========================================
+ Coverage   95.62%   95.64%   +0.01%     
==========================================
  Files         126      126              
  Lines        7360     7392      +32     
  Branches      495      531      +36     
==========================================
+ Hits         7038     7070      +32     
  Misses        322      322              
Flag Coverage Δ
unittests 95.64% <95.91%> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@@ -41,6 +42,36 @@ import htsjdk.samtools.util.{CloserUtil, CoordMath, Murmur3, SequenceUtil}
import java.io.Closeable
import scala.math.{max, min}



case class Supplementary(refName: String, start: Int, positiveStrand: Boolean, cigar: Cigar, mapq: Int, nm: Int) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

scaladocs needed later


case class Supplementary(refName: String, start: Int, positiveStrand: Boolean, cigar: Cigar, mapq: Int, nm: Int) {
def negativeStrand: Boolean = !positiveStrand
def refIndex(header: SAMFileHeader): Int = header.getSequence(refName).getSequenceIndex
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I may invert this, and store refIndex instead, since when we have the SAM header when want refName (in toString)


def end: Int = start + cigar.lengthOnTarget - 1
def unclippedStart: Int = {
SAMUtils.getUnclippedStart(start, cigar.toHtsjdkCigar)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we probably could just compute these directly without having to route to htsjdk



def apply(sa: String): Supplementary = {
val parts = sa.split(",")
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably good to check we get 6 parts

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. Is validation of the type/value of each part also necessary?

for (primary <- r1; nonPrimary <- r2NonPrimary) {
SamPairUtil.setMateInformationOnSupplementalAlignment(nonPrimary.asSam, primary.asSam, true)
nonPrimary(SAMTag.MQ.name()) = primary.mapq
nonPrimary("mp") = Supplementary.toString(primary)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO: store these tag definitions somewhere else

Comment on lines +34 to +35
import scala.reflect.runtime.universe.Template

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
import scala.reflect.runtime.universe.Template

}
else {
TemplateCoordinateKey(mateChrom, readChrom, matePos, readPos, mateNeg, readNeg, lib, mid, rec.name, true)
// For non-secondary/non-supplementary alignments, use the info in the record. For secondary and supplementary
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

todo: how can we simplify these two branches, since they're very similar

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you additionally set mp on the primary alignments, not just the supplementaries, (and also take my suggestion to define an apply for SamRecord 🙂 ) you could do the following:

val primary = if (!rec.secondary && !rec.supplementary) Supplementary(rec) else Supplementary(rec[String]("rp"))
val mate = Supplementary(rec[String]("mp"))

// Just the second branch, using the info from `Supplementary` instead of `SamRecord`
...

Comment on lines +204 to +205
val primary = Supplementary(rec[String]("rp"))
val mate = Supplementary(rec[String]("mp"))
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Todo, better error message or fallback

…nd supplementary reads

Secondary and supplementary reads must use the coordinates of the
primary alignments within the template, otherwise they will not
guaranteed to be next the primary alignments in the file.  Therefore,
we've added the "rp" and "mp" tags to store the SA-tag equivalent
information for the primary alignment.  This keeps information about the
primary alignments with the secondary and supplementary alignments.
def negativeStrand: Boolean = !positiveStrand
def refIndex(header: SAMFileHeader): Int = header.getSequence(refName).getSequenceIndex

def end: Int = start + cigar.lengthOnTarget - 1
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question Is end inclusive or exclusive?

(And maybe add scaladoc to clarify)

I recently added an equivalent property to fgpyo and made it exclusive; if this is the same I think we don't want to subtract 1

https://github.com/fulcrumgenomics/fgpyo/blob/8738a1de868fc6c76a59ad68b29b4c537e660b97/fgpyo/sam/__init__.py#L557

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fgbio is generally 1-based inclusive, but I don't think we say that anywhere.

Comment on lines +61 to +69
object Supplementary {
/** Returns a formatted alignment as per the SA tag: `(rname ,pos ,strand ,CIGAR ,mapQ ,NM ;)+` */
def toString(rec: SamRecord): String = {
val strand = if (rec.positiveStrand) '+' else '-'
f"${rec.refName},${rec.start},${strand},${rec.cigar},${rec.mapq},${rec.getOrElse(SAMTag.NM.name(),0)}"
}


def apply(sa: String): Supplementary = {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I would prefer to have two apply methods, one for SamRecord and one for String, and a class toString method that converts an instance of Supplementary to String



def apply(sa: String): Supplementary = {
val parts = sa.split(",")
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. Is validation of the type/value of each part also necessary?

@@ -107,11 +138,21 @@ case class Template(r1: Option[SamRecord],

/** Fixes mate information and sets mate cigar on all primary and supplementary (but not secondary) records. */
def fixMateInfo(): Unit = {
for (primary <- r1; supp <- r2Supplementals) {
SamPairUtil.setMateInformationOnSupplementalAlignment(supp.asSam, primary.asSam, true)
// Set all mate info on BOTH secondary and supplementary records, not just supplementary records. We also need to
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment on line 139 should be updated (or removed) to reflect this


def apply(sa: String): Supplementary = {
val parts = sa.split(",")
Supplementary(parts(0), parts(1).toInt, parts(2) == "+", Cigar(parts(3)), parts(4).toInt, parts(5).toInt)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Supplementary(parts(0), parts(1).toInt, parts(2) == "+", Cigar(parts(3)), parts(4).toInt, parts(5).toInt)
Supplementary(parts(0), parts(1).toInt - 1, parts(2) == "+", Cigar(parts(3)), parts(4).toInt, parts(5).toInt)

suggestion We need to subtract 1 if start is zero-based.

Without scaladoc I'm not sure, but I'm assuming it's zero based; and the SA tag is 1-based.

Pos is a 1-based coordinate.

https://samtools.github.io/hts-specs/SAMtags.pdf

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto: fgbio is generally 1-based inclusive

Comment on lines +145 to +155
for (primary <- r1; nonPrimary <- r2NonPrimary) {
SamPairUtil.setMateInformationOnSupplementalAlignment(nonPrimary.asSam, primary.asSam, true)
nonPrimary(SAMTag.MQ.name()) = primary.mapq
nonPrimary("mp") = Supplementary.toString(primary)
r2.foreach(r => nonPrimary("rp") = Supplementary.toString(r))
}
for (primary <- r2; supp <- r1Supplementals) {
SamPairUtil.setMateInformationOnSupplementalAlignment(supp.asSam, primary.asSam, true)
for (primary <- r2; nonPrimary <- r1NonPrimary) {
SamPairUtil.setMateInformationOnSupplementalAlignment(nonPrimary.asSam, primary.asSam, true)
nonPrimary(SAMTag.MQ.name()) = primary.mapq
nonPrimary("mp") = Supplementary.toString(primary)
r1.foreach(r => nonPrimary("rp") = Supplementary.toString(r))
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question Would you find it more legible to extract these for loops into a helper so we don't repeat it twice?

}
else {
TemplateCoordinateKey(mateChrom, readChrom, matePos, readPos, mateNeg, readNeg, lib, mid, rec.name, true)
// For non-secondary/non-supplementary alignments, use the info in the record. For secondary and supplementary
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you additionally set mp on the primary alignments, not just the supplementaries, (and also take my suggestion to define an apply for SamRecord 🙂 ) you could do the following:

val primary = if (!rec.secondary && !rec.supplementary) Supplementary(rec) else Supplementary(rec[String]("rp"))
val mate = Supplementary(rec[String]("mp"))

// Just the second branch, using the info from `Supplementary` instead of `SamRecord`
...

@@ -719,7 +719,7 @@ class GroupReadsByUmi

// Then output the records in the right order (assigned tag, read name, r1, r2)
templatesByMi.keys.toSeq.sortBy(id => (id.length, id)).foreach(tag => {
templatesByMi(tag).sortBy(t => t.name).flatMap(t => t.primaryReads).foreach(rec => {
templatesByMi(tag).sortBy(t => t.name).flatMap(t => t.allReads).foreach(rec => {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question Where are --include-supplementary and --include-secondary taken into consideration?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the initial filter of the BAM file (see the _includeSecondaryReads variabel)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants