Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UMI-tools dedup and heavily soft-clipped CIGAR strings due to overlapping PE reads... #621

Open
SJJHK opened this issue Jan 5, 2024 · 1 comment

Comments

@SJJHK
Copy link

SJJHK commented Jan 5, 2024

Hi Ian and Tom,

Hope you are well.

We are running a splice junction analysis using human plasma samples processed on a Novaseq 6000 ( RNA sequencing with UMIs).

Analysis Pipeline in brief :

Trimmomatic (Adapter trim) - HISAT2 Pass 1 (Paired End output ) - Failed Paired ends processed by FASTP (tail crop , 4bp sliding window, Min Qual 30, MinReadLen 75) - HISAT2 Pass 2 on Failed Paired Ends (Paired End and Singleton output) - BAMs merged - QORTS QC - BamUtils ClipOverlap - Samtools filter for MAPQ 1 and 60 primary alignments - Splice junction analysis .

In ClipOverlap (https://genome.sph.umich.edu/wiki/BamUtil:_clipOverlap) settings include marking entirely clipped reads as unmapped (--unmapped). These MAPQ0s are filtered out in the Samtools filter above. (QORTS QC fails a read with a full soft clipped cigar string, hence we selected the --unmapped option.)

The problem lies herein - ClipOverlap results in heavily soft-clipped CIGAR strings ~ 70% of reads are overlapping paired ends -
these remaining paired end reads have R1 or R2 with a soft clip at the 3'overlap (we think predominately, due to the proper pair flag) and a small percentage of singletons. (Those singletons come from Hisat2 Pass 2 and BamUtils.

Here are the options for UMI-tools that we think are appropriate for deduplicating our data :

umi_tools dedup --stdin=xyz.bam --extract-umi-method read_id --umi-separator=":" --spliced-is-unique --soft-clip-threshold=100 --method unique --unmapped-reads use --chimeric-pairs use --unpaired-reads use --paired --buffer-whole-contig --log=LOGFILE > output.bam

Our concern is the soft clipping that features heavily in our current dataset. We must soft-clip with ClipOverlap because splicing software will count overlapping paired end reads twice when they cross the same splice junction.

Should we run UMItools before or after we run ClipOverlap?

We would like to alleviate the complexities you may think will arise for a heavily clipped CIGAR string.

Also, does single end UMI-tools dedup look at both the 3' and 5'? I think that GATK has this behaviour as well.

Many thanks in advance for any insight.

@SJJHK SJJHK changed the title UMI-tools and heavily soft-clipped CIGAR strings due to overlapping PE reads... UMI-tools dedup and heavily soft-clipped CIGAR strings due to overlapping PE reads... Jan 5, 2024
@TomSmithCGAT
Copy link
Member

TomSmithCGAT commented Feb 19, 2024

Hi @SJJHK. Sorry for the delayed reply.

I tried scanning through the documentation for ClipOverlap, but could ascertain how it performs the clipping in every instance. I'd suggest running this after UMI-tools. To my mind, there's no downside to running UMI-tools first and it avoids any potential issues whereby ClipOverlap could make duplicates appear to be distinct.

UMI-tools deduplicates single end reads by considering their alignment start and read length (and optionally whether the read is spliced/unspliced). The alignment start take into account 3' soft-clipping to avoid issues with duplicate reads having alternative apparent alignment starts due to 3' soft clipping.

Paired-end reads are additionally deduplicated by considering the template length (TLEN) unless this behaviour is turned off. From SAM specs, TLEN = 'distance between the mapped end of the template and the mapped start of the template' ; '[..] exact definitions are implementation-defined.'

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants