GroupReadsByUmi may fail when marking duplicates including secondary/supplementary reads #964

nh13 · 2024-01-30T01:15:59Z

There's an open issue in hts-specs about how we want to handle getting the primary alignment information when looking at a secondary or supplementary read: samtools/hts-specs#755

This PR adds the read primary "rp" tag, which stores the primary alignment for the same read end as the current secondary/supplementary alignment, in the same format as the "SA" tag. The mate's primary alignment is stored in the "mp" tag. Both tags are currently lowercase because they are not reserved tags.
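
For readers unfamiliar with the layout, here is a minimal sketch of the SA-style string that "rp" and "mp" reuse. `Aln` and `saString` are hypothetical names for illustration only, not fgbio's API, and the CIGAR is kept as a plain string to stay self-contained.

```scala
object SaFormatSketch {
  /** Hypothetical stand-in for the fields carried by one SA-style entry. */
  final case class Aln(refName: String, start: Int, positiveStrand: Boolean, cigar: String, mapq: Int, nm: Int)

  /** Formats one alignment as `rname,pos,strand,CIGAR,mapQ,NM`; pos is 1-based per the SAM tags spec. */
  def saString(a: Aln): String = {
    val strand = if (a.positiveStrand) "+" else "-"
    s"${a.refName},${a.start},$strand,${a.cigar},${a.mapq},${a.nm}"
  }
}
```

For example, a forward-strand primary at chr1:100 with CIGAR 50M, MAPQ 60 and NM 0 would be stored as `chr1,100,+,50M,60,0`.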

I have tested that ZipperBams will now add these, that SortBam will correctly sort in template-coordinate, and finally that GroupReadsByUmi passes. I added tests for GroupReadsByUmi and SamOrder.

Also, in my hands, secondary and supplementary records will never be output by GroupReadsByUmi as currently only primary alignments are output.

codecov · 2024-01-30T01:17:50Z

Codecov Report

Attention: 2 lines in your changes are missing coverage. Please review.

Comparison is base (371db03) 95.62% compared to head (1b5753c) 95.64%.

Files	Patch %	Lines
src/main/scala/com/fulcrumgenomics/bam/Bams.scala	90.00%	2 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #964      +/-   ##
==========================================
+ Coverage   95.62%   95.64%   +0.01%     
==========================================
  Files         126      126              
  Lines        7360     7392      +32     
  Branches      495      531      +36     
==========================================
+ Hits         7038     7070      +32     
  Misses        322      322

Flag	Coverage Δ
unittests	`95.64% <95.91%> (+0.01%)`	⬆️


nh13 · 2024-01-30T01:16:14Z

src/main/scala/com/fulcrumgenomics/bam/Bams.scala

@@ -41,6 +42,36 @@ import htsjdk.samtools.util.{CloserUtil, CoordMath, Murmur3, SequenceUtil}
 import java.io.Closeable
 import scala.math.{max, min}

+
+
+case class Supplementary(refName: String, start: Int, positiveStrand: Boolean, cigar: Cigar, mapq: Int, nm: Int) {


scaladocs needed later

nh13 · 2024-01-30T01:17:10Z

src/main/scala/com/fulcrumgenomics/bam/Bams.scala

+
+case class Supplementary(refName: String, start: Int, positiveStrand: Boolean, cigar: Cigar, mapq: Int, nm: Int) {
+  def negativeStrand: Boolean = !positiveStrand
+  def refIndex(header: SAMFileHeader): Int = header.getSequence(refName).getSequenceIndex


I may invert this and store refIndex instead, since when we have the SAM header we want refName (in toString)

nh13 · 2024-01-30T01:17:31Z

src/main/scala/com/fulcrumgenomics/bam/Bams.scala

+
+  def end: Int = start + cigar.lengthOnTarget - 1
+  def unclippedStart: Int = {
+    SAMUtils.getUnclippedStart(start, cigar.toHtsjdkCigar)


we probably could just compute these directly without having to route to htsjdk
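
A rough sketch of what computing these directly might look like, assuming the CIGAR is available as a string. `parseCigar`, `unclippedStart` and `unclippedEnd` are hypothetical helpers written for this note; they follow the usual rule of extending the aligned span by the leading or trailing soft/hard clips.

```scala
object UnclippedSketch {
  private val CigarElem = "(\\d+)([MIDNSHP=X])".r
  private val ClipOps   = Set('S', 'H')
  private val RefOps    = Set('M', 'D', 'N', '=', 'X') // operators that consume reference bases

  /** Parses a CIGAR string such as "5S90M5H" into (length, operator) pairs. */
  def parseCigar(cigar: String): Seq[(Int, Char)] =
    CigarElem.findAllMatchIn(cigar).map(m => (m.group(1).toInt, m.group(2).head)).toSeq

  /** 1-based start minus the total length of the leading clips. */
  def unclippedStart(start: Int, cigar: String): Int =
    start - parseCigar(cigar).takeWhile { case (_, op) => ClipOps(op) }.map(_._1).sum

  /** 1-based inclusive end plus the total length of the trailing clips. */
  def unclippedEnd(start: Int, cigar: String): Int = {
    val elems = parseCigar(cigar)
    val end   = start + elems.collect { case (len, op) if RefOps(op) => len }.sum - 1
    end + elems.reverse.takeWhile { case (_, op) => ClipOps(op) }.map(_._1).sum
  }
}
```

With `start = 100` and CIGAR `5S90M5H`, this gives an unclipped start of 95 and an unclipped end of 194.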

nh13 · 2024-01-30T01:18:19Z

src/main/scala/com/fulcrumgenomics/bam/Bams.scala

+
+
+  def apply(sa: String): Supplementary = {
+    val parts = sa.split(",")


probably good to check we get 6 parts

Agreed. Is validation of the type/value of each part also necessary?
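
One possible shape for the validation discussed here, as a sketch only: it checks the field count, that the integer fields parse, that pos is at least 1, and that strand is `+` or `-`. `Aln` and `parse` are hypothetical names, and returning `Either` rather than throwing is an assumption, not something the PR decided.

```scala
import scala.util.Try

object SaParseSketch {
  /** Hypothetical stand-in for the parsed fields; the CIGAR is left unparsed. */
  final case class Aln(refName: String, start: Int, positiveStrand: Boolean, cigar: String, mapq: Int, nm: Int)

  private def toInt(name: String, value: String): Either[String, Int] =
    Try(value.toInt).toOption.toRight(s"$name is not an integer: $value")

  /** Parses one `rname,pos,strand,CIGAR,mapQ,NM` entry, reporting the first problem found. */
  def parse(sa: String): Either[String, Aln] =
    sa.stripSuffix(";").split(",", -1) match {
      case Array(ref, pos, strand, cigar, mapq, nm) =>
        for {
          start    <- toInt("pos", pos).filterOrElse(_ >= 1, s"pos must be >= 1: $pos")
          positive <- strand match {
            case "+"   => Right(true)
            case "-"   => Right(false)
            case other => Left(s"strand must be '+' or '-': $other")
          }
          mq       <- toInt("mapQ", mapq)
          edits    <- toInt("NM", nm)
        } yield Aln(ref, start, positive, cigar, mq, edits)
      case parts =>
        Left(s"expected 6 comma-separated fields but found ${parts.length}: $sa")
    }
}
```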

nh13 · 2024-01-30T01:18:37Z

src/main/scala/com/fulcrumgenomics/bam/Bams.scala

+    for (primary <- r1; nonPrimary <- r2NonPrimary) {
+      SamPairUtil.setMateInformationOnSupplementalAlignment(nonPrimary.asSam, primary.asSam, true)
+      nonPrimary(SAMTag.MQ.name()) = primary.mapq
+      nonPrimary("mp") = Supplementary.toString(primary)


TODO: store these tag definitions somewhere else

nh13 · 2024-01-30T01:18:55Z

src/main/scala/com/fulcrumgenomics/bam/api/SamOrder.scala

+import scala.reflect.runtime.universe.Template
+


Suggested change

-import scala.reflect.runtime.universe.Template

nh13 · 2024-01-30T01:19:26Z

src/main/scala/com/fulcrumgenomics/bam/api/SamOrder.scala

-      }
-      else {
-        TemplateCoordinateKey(mateChrom, readChrom, matePos, readPos, mateNeg, readNeg, lib, mid, rec.name, true)
+      // For non-secondary/non-supplementary alignments, use the info in the record.  For secondary and supplementary


todo: how can we simplify these two branches, since they're very similar

if you additionally set mp on the primary alignments, not just the supplementaries, (and also take my suggestion to define an apply for SamRecord 🙂 ) you could do the following:

    val primary = if (!rec.secondary && !rec.supplementary) Supplementary(rec) else Supplementary(rec[String]("rp"))
    val mate = Supplementary(rec[String]("mp"))
    // Just the second branch, using the info from `Supplementary` instead of `SamRecord`
    ...

nh13 · 2024-01-31T21:17:20Z

src/main/scala/com/fulcrumgenomics/bam/api/SamOrder.scala

+        val primary   = Supplementary(rec[String]("rp"))
+        val mate      = Supplementary(rec[String]("mp"))


Todo, better error message or fallback

…nd supplementary reads

Secondary and supplementary reads must use the coordinates of the primary alignments within the template; otherwise they are not guaranteed to be next to the primary alignments in the file. Therefore, we've added the "rp" and "mp" tags to store the SA-tag-equivalent information for the primary alignment. This keeps information about the primary alignments with the secondary and supplementary alignments.
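
To illustrate why keying on the primary coordinates matters, here is a deliberately simplified sketch. It assumes a single contig and ignores strand, library and molecular identifier, all of which the real template-coordinate key also considers; `Rec` and `key` are hypothetical names.

```scala
object TemplateKeySketch {
  /** Hypothetical record: its own position plus the positions of its own and its mate's primary. */
  final case class Rec(name: String, pos: Int, primaryPos: Int, matePrimaryPos: Int)

  /** Simplified key: (earlier primary position, later primary position, read name). */
  def key(r: Rec): (Int, Int, String) = {
    val lo = math.min(r.primaryPos, r.matePrimaryPos)
    val hi = math.max(r.primaryPos, r.matePrimaryPos)
    (lo, hi, r.name)
  }
}
```

Under this key a supplementary alignment at position 5000 whose primaries sit at 100 and 300 sorts with those primaries, whereas sorting on its own position would place it after unrelated templates.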

msto · 2024-05-27T14:09:42Z

src/main/scala/com/fulcrumgenomics/bam/Bams.scala

+  def negativeStrand: Boolean = !positiveStrand
+  def refIndex(header: SAMFileHeader): Int = header.getSequence(refName).getSequenceIndex
+
+  def end: Int = start + cigar.lengthOnTarget - 1


question: Is end inclusive or exclusive?

(And maybe add scaladoc to clarify)

I recently added an equivalent property to fgpyo and made it exclusive; if this is the same I think we don't want to subtract 1

https://github.com/fulcrumgenomics/fgpyo/blob/8738a1de868fc6c76a59ad68b29b4c537e660b97/fgpyo/sam/__init__.py#L557

fgbio is generally 1-based inclusive, but I don't think we say that anywhere.
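
The two conventions in question can be sketched side by side; the function names are hypothetical. Under a 1-based inclusive convention the `- 1` is needed, while a 0-based half-open convention drops it.

```scala
object EndCoordinateSketch {
  /** 1-based closed interval: both start and end are aligned positions. */
  def endInclusive(start1: Int, lengthOnTarget: Int): Int = start1 + lengthOnTarget - 1

  /** 0-based half-open interval: end is the first position past the alignment. */
  def endExclusive(start0: Int, lengthOnTarget: Int): Int = start0 + lengthOnTarget
}
```

For a 50-base alignment starting at 1-based position 100 (0-based 99), both functions return 149: the same number, read as "last aligned base" in one convention and "one past the last base" in the other.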

msto · 2024-05-27T14:13:36Z

src/main/scala/com/fulcrumgenomics/bam/Bams.scala

+object Supplementary {
+  /** Returns a formatted alignment as per the SA tag: `(rname ,pos ,strand ,CIGAR ,mapQ ,NM ;)+` */
+  def toString(rec: SamRecord): String = {
+    val strand = if (rec.positiveStrand) '+' else '-'
+    f"${rec.refName},${rec.start},${strand},${rec.cigar},${rec.mapq},${rec.getOrElse(SAMTag.NM.name(),0)}"
+  }
+
+
+  def apply(sa: String): Supplementary = {


I think I would prefer to have two apply methods, one for SamRecord and one for String, and a class toString method that converts an instance of Supplementary to String
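
A sketch of the shape being proposed, under assumptions: `Rec` is a hypothetical minimal stand-in for `SamRecord`, and the CIGAR is kept as a string so the example is self-contained.

```scala
object SupplementaryApiSketch {
  /** Hypothetical, minimal stand-in for the record fields used here (not fgbio's SamRecord). */
  final case class Rec(refName: String, start: Int, positiveStrand: Boolean, cigar: String, mapq: Int, nm: Option[Int])

  final case class Supplementary(refName: String, start: Int, positiveStrand: Boolean, cigar: String, mapq: Int, nm: Int) {
    /** Instance-level formatting, replacing the companion's `toString(rec)`. */
    override def toString: String = {
      val strand = if (positiveStrand) '+' else '-'
      s"$refName,$start,$strand,$cigar,$mapq,$nm"
    }
  }

  object Supplementary {
    /** Builds from a record, defaulting NM to 0 when the tag is absent. */
    def apply(rec: Rec): Supplementary =
      Supplementary(rec.refName, rec.start, rec.positiveStrand, rec.cigar, rec.mapq, rec.nm.getOrElse(0))

    /** Builds from an SA-style string. */
    def apply(sa: String): Supplementary = {
      val parts = sa.split(",")
      require(parts.length == 6, s"expected 6 fields in '$sa'")
      Supplementary(parts(0), parts(1).toInt, parts(2) == "+", parts(3), parts(4).toInt, parts(5).toInt)
    }
  }
}
```

With this layout, `Supplementary(rec).toString` produces the tag value and `Supplementary(tagValue)` parses it back, so the two directions round-trip through one type.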

msto · 2024-05-27T14:14:33Z

src/main/scala/com/fulcrumgenomics/bam/Bams.scala

+
+
+  def apply(sa: String): Supplementary = {
+    val parts = sa.split(",")


Agreed. Is validation of the type/value of each part also necessary?

msto · 2024-05-27T14:14:47Z

src/main/scala/com/fulcrumgenomics/bam/Bams.scala

@@ -107,11 +138,21 @@ case class Template(r1: Option[SamRecord],

  /** Fixes mate information and sets mate cigar on all primary and supplementary (but not secondary) records. */
  def fixMateInfo(): Unit = {
-    for (primary <- r1; supp <- r2Supplementals) {
-      SamPairUtil.setMateInformationOnSupplementalAlignment(supp.asSam, primary.asSam, true)
+    // Set all mate info on BOTH secondary and supplementary records, not just supplementary records.  We also need to


The comment on line 139 should be updated (or removed) to reflect this

msto · 2024-05-27T15:31:15Z

src/main/scala/com/fulcrumgenomics/bam/Bams.scala

+
+  def apply(sa: String): Supplementary = {
+    val parts = sa.split(",")
+    Supplementary(parts(0), parts(1).toInt, parts(2) == "+", Cigar(parts(3)), parts(4).toInt, parts(5).toInt)


Suggested change

-    Supplementary(parts(0), parts(1).toInt, parts(2) == "+", Cigar(parts(3)), parts(4).toInt, parts(5).toInt)
+    Supplementary(parts(0), parts(1).toInt - 1, parts(2) == "+", Cigar(parts(3)), parts(4).toInt, parts(5).toInt)

suggestion: We need to subtract 1 if start is zero-based.

Without scaladoc I'm not sure, but I'm assuming it's zero based; and the SA tag is 1-based.

Pos is a 1-based coordinate.

https://samtools.github.io/hts-specs/SAMtags.pdf

ditto: fgbio is generally 1-based inclusive

msto · 2024-05-27T15:32:47Z

src/main/scala/com/fulcrumgenomics/bam/Bams.scala

+    for (primary <- r1; nonPrimary <- r2NonPrimary) {
+      SamPairUtil.setMateInformationOnSupplementalAlignment(nonPrimary.asSam, primary.asSam, true)
+      nonPrimary(SAMTag.MQ.name()) = primary.mapq
+      nonPrimary("mp") = Supplementary.toString(primary)
+      r2.foreach(r => nonPrimary("rp") = Supplementary.toString(r))
    }
-    for (primary <- r2; supp <- r1Supplementals) {
-      SamPairUtil.setMateInformationOnSupplementalAlignment(supp.asSam, primary.asSam, true)
+    for (primary <- r2; nonPrimary <- r1NonPrimary) {
+      SamPairUtil.setMateInformationOnSupplementalAlignment(nonPrimary.asSam, primary.asSam, true)
+      nonPrimary(SAMTag.MQ.name()) = primary.mapq
+      nonPrimary("mp") = Supplementary.toString(primary)
+      r1.foreach(r => nonPrimary("rp") = Supplementary.toString(r))


question: Would you find it more legible to extract these for loops into a helper so we don't repeat it twice?
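
As an illustration of such a helper, here is a sketch using a hypothetical `Rec` with a mutable tag map in place of `SamRecord`. It omits the htsjdk `SamPairUtil` call, so it only shows the tag bookkeeping, and `tagNonPrimaries` is an invented name.

```scala
import scala.collection.mutable

object FixMateInfoSketch {
  /** Hypothetical stand-in for a record: `sa` is its SA-style string, `tags` its mutable tags. */
  final class Rec(val sa: String, val mapq: Int) {
    val tags: mutable.Map[String, Any] = mutable.Map.empty
  }

  /** Copies the mate's primary info ("MQ", "mp") and the own-end primary info ("rp") onto each non-primary. */
  def tagNonPrimaries(matePrimary: Option[Rec], ownPrimary: Option[Rec], nonPrimaries: Seq[Rec]): Unit =
    for (mate <- matePrimary; rec <- nonPrimaries) {
      // The real code would also call SamPairUtil.setMateInformationOnSupplementalAlignment here.
      rec.tags("MQ") = mate.mapq
      rec.tags("mp") = mate.sa
      ownPrimary.foreach(own => rec.tags("rp") = own.sa)
    }
}
```

The two loops would then collapse to `tagNonPrimaries(r1, r2, r2NonPrimary)` and `tagNonPrimaries(r2, r1, r1NonPrimary)`.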

msto · 2024-05-27T15:39:12Z

src/main/scala/com/fulcrumgenomics/bam/api/SamOrder.scala

-      }
-      else {
-        TemplateCoordinateKey(mateChrom, readChrom, matePos, readPos, mateNeg, readNeg, lib, mid, rec.name, true)
+      // For non-secondary/non-supplementary alignments, use the info in the record.  For secondary and supplementary


if you additionally set mp on the primary alignments, not just the supplementaries, (and also take my suggestion to define an apply for SamRecord 🙂 ) you could do the following:

    val primary = if (!rec.secondary && !rec.supplementary) Supplementary(rec) else Supplementary(rec[String]("rp"))
    val mate = Supplementary(rec[String]("mp"))
    // Just the second branch, using the info from `Supplementary` instead of `SamRecord`
    ...

msto · 2024-05-27T15:40:53Z

src/main/scala/com/fulcrumgenomics/umi/GroupReadsByUmi.scala

@@ -719,7 +719,7 @@ class GroupReadsByUmi

      // Then output the records in the right order (assigned tag, read name, r1, r2)
      templatesByMi.keys.toSeq.sortBy(id => (id.length, id)).foreach(tag => {
-        templatesByMi(tag).sortBy(t => t.name).flatMap(t => t.primaryReads).foreach(rec => {
+        templatesByMi(tag).sortBy(t => t.name).flatMap(t => t.allReads).foreach(rec => {


question: Where are --include-supplementary and --include-secondary taken into consideration?

In the initial filter of the BAM file (see the _includeSecondaryReads variable)
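
A hedged sketch of what an up-front filter of this kind might look like; this is not the actual GroupReadsByUmi code. `Rec` and `keep` are hypothetical, and the two booleans stand in for the `--include-secondary` and `--include-supplementary` options.

```scala
object IncludeFilterSketch {
  /** Hypothetical stand-in for the two flags of interest on a record. */
  final case class Rec(name: String, secondary: Boolean, supplementary: Boolean)

  /** Keeps primaries always; keeps secondary/supplementary records only when the matching option is set. */
  def keep(r: Rec, includeSecondary: Boolean, includeSupplementary: Boolean): Boolean =
    (includeSecondary || !r.secondary) && (includeSupplementary || !r.supplementary)
}
```

Because records are dropped before grouping, emitting `allReads` downstream would only include the non-primary records that passed this filter.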

nh13 added 3 commits January 31, 2024 14:36

add a test to show issue with SamOrder#

9b30756

commit to show how GroupReadsByUmi fails

73a6b67

nh13 force-pushed the nh_markdup_order_issue branch from cd1460b to 1b5753c Compare January 31, 2024 21:37

nh13 mentioned this pull request Mar 29, 2024

How to retrieve the primary alignment for secondary and supplementary reads samtools/hts-specs#755

Open

nh13 mentioned this pull request May 22, 2024

GroupReadsByUmi duplicate marking may fail when secondary and supplementary alignments are included #961

Open

msto reviewed May 27, 2024
