Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unexpected results after cutadapt #1955

Closed
jesika0123 opened this issue May 13, 2024 · 4 comments
Closed

Unexpected results after cutadapt #1955

jesika0123 opened this issue May 13, 2024 · 4 comments

Comments

@jesika0123
Copy link

I am following the pipeline posted on "https://github.com/ErnakovichLab/dada2_ernakovichlab".

I am working with NovaSeq 2*250bp data.
Primer amplicon length: 347bp
Issue: rbind results (see below)
Solution: although I have coding experience, I am far from an expert, so please keep this in mind when reply. Thanks in advance for all help!

I have defined the primers as 1) FWD <- "18 bp with wobbles" and 2) REV <- "18 bp no wobbles" (sorry I can't share the actual bps).

Next I ran the following:

Function that creates a list of all orientations of the primers

allOrients <- function(primer) {
  # Create all orientations of the input sequence
  require(Biostrings)
  dna <- DNAString(primer)  #The Biostrings works w/ DNAString objects rather than character vectors
  orients <- c(Forward = dna, Complement = complement(dna), Reverse = reverse(dna), 
               RevComp = reverseComplement(dna))
  return(sapply(orients, toString))  #Convert back to character vector
}

# Save the primer orientations to pass to cutadapt
FWD.orients<- allOrients(FWD)
FWD.orients

REV.orients <- allOrients(REV)
REV.orients

Pre-filter to remove sequences with ambiguous bases (Ns)

fnFs.filtN <- file.path(preprocess, "filtN", basename(fnFs))
fnRs.filtN <- file.path(preprocess, "filtN", basename(fnRs))
filterAndTrim(fnFs, fnFs.filtN, fnRs, fnRs.filtN, maxN = 0, multithread = FALSE)

Count the number of times the primers appear in the forward and reverse read, while considering all possible primer orientations.

primerHits <- function(primer, fn) {
# Counts number of reads in which the primer is found
nhits <- vcountPattern(primer, sread(readFastq(fn)), fixed = FALSE)
return(sum(nhits > 0))
}

rbind(FWD.ForwardReads = sapply(FWD.orients, primerHits, fn = fnFs.filtN[[1]]), FWD.ReverseReads = sapply(FWD.orients,
primerHits, fn = fnRs.filtN[[1]]), REV.ForwardReads = sapply(REV.orients, primerHits,
fn = fnFs.filtN[[1]]), REV.ReverseReads = sapply(REV.orients, primerHits, fn = fnRs.filtN[[1]]))

Not the expect results:
Forward Complement Reverse RevComp
FWD.ForwardReads 314219 0 0 8724
FWD.ReverseReads 305418 0 0 8770
REV.ForwardReads 204238 0 0 424
REV.ReverseReads 210007 0 0 458

But...just to humor myself I keep going:

Create directory to hold the output from cutadapt

if (!dir.exists(trimmed)) dir.create(trimmed)
fnFs.cut <- file.path(trimmed, basename(fnFs))
fnRs.cut <- file.path(trimmed, basename(fnRs))

Save the reverse complements of the primers to variables

FWD.RC <- dada2:::rc(FWD)
REV.RC <- dada2:::rc(REV)

Trim FWD and the reverse-complement of REV off of R1 (forward reads)

R1.flags <- paste("-g", FWD, "-a", REV.RC)

Trim REV and the reverse-complement of FWD off of R2 (reverse reads)

R2.flags <- paste("-G", REV, "-A", FWD.RC)

Run Cutadapt

for (i in seq_along(fnFs)) {
system2(cutadapt, args = c(R1.flags, R2.flags, "-m", 20, #based on a suggestion by benjjneb "-n", 2, # -n 2 required to remove FWD and REV from reads
"-o", fnFs.cut[i], "-p", fnRs.cut[i], # output files
fnFs.filtN[i], fnRs.filtN[i])) # input files
}

rbind(FWD.ForwardReads = sapply(FWD.orients, primerHits, fn = fnFs.cut[[1]]),

  •   FWD.ReverseReads = sapply(FWD.orients, primerHits, fn = fnRs.cut[[1]]), 
    
  •   REV.ForwardReads = sapply(REV.orients, primerHits, fn = fnFs.cut[[1]]), 
    
  •   REV.ReverseReads = sapply(REV.orients, primerHits, fn = fnRs.cut[[1]]))
    

As expected: not the results I want:

                                Forward Complement Reverse RevComp

FWD.ForwardReads 0 0 0 8723
FWD.ReverseReads 305352 0 0 0
REV.ForwardReads 204238 0 0 0
REV.ReverseReads 0 0 0 458

@benjjneb
Copy link
Owner

Your primer-checking results suggest that you are working with a mixed orientation library. That is, the sequenced reads are coming from an even mixture of those in the expected orientation (FWD primer on R1 and rc(REV) primer on R2), and the opposite orientation (FWD primer on R2 and rc(REV) on R2). Your cutadapt command is only trimming primers when the reads are in the expected orientation.

The easiest solution is to discard untrimmed reads at the cutadapt step. This will throw away half the reads, but also avoid a number of complications downstream. For other approaches, see this previous thread #1357 and there are probably other examples on the forum as well.

@jesika0123
Copy link
Author

Thank you for your reply. I have continued to edit the cutadapt parameters and settings, but still cannot get all 0s at the "sanity check" after primer removal.

I have added your suggestion of adding --discard-untrimmed, but also have added other suggested parameters.

Questions:

  1. For the following script (below) primers are still present, but these are the best results so far (a full summary is below). What happens if it is not possible to get to all 0s at this step?

FWD.ForwardReads 0 0 0 0

FWD.ReverseReads 1 0 0 0

REV.ForwardReads 3 0 0 0

REV.ReverseReads 0 0 0 0

  1. Is there a way not to lose a large percentage of reads at this step? Or is just the beast of this type of sequence data?

Best script and summary of results:
for (i in seq_along(fnFs)) {
system2(cutadapt, args = c(R1.flags, R2.flags, "-j", 0, "--nextseq-trim=20", "--match-read-wildcards", "--discard-untrimmed", "-n", 2, "--minimum-length", 214, "-e", 0,
"-o", fnFs.cut[i], "-p", fnRs.cut[i], # output files
fnFs.filtN[i], fnRs.filtN[i])) # input files
}

Summary of testing different parameters

no minimal read length

Forward Complement Reverse RevComp

FWD.ForwardReads 0 0 0 7

FWD.ReverseReads 10 0 0 0

REV.ForwardReads 4 0 0 0

REV.ReverseReads 0 0 0 0

"--minimum-length", 214 (250 - 18 (forward primer) - 18 (reverse primer) vs 230) = same results and best results

Forward Complement Reverse RevComp

FWD.ForwardReads 0 0 0 0

FWD.ReverseReads 1 0 0 0

REV.ForwardReads 3 0 0 0

REV.ReverseReads 0 0 0 0

increasing -n from 2 to 16 did not change results

#Forward Complement Reverse RevComp
#FWD.ForwardReads 0 0 0 7
#FWD.ReverseReads 10 0 0 0
#REV.ForwardReads 4 0 0 0
#REV.ReverseReads 0 0 0 0

-e from 0 to default changed results

#Forward Complement Reverse RevComp
#FWD.ForwardReads 0 0 0 7
#FWD.ReverseReads 12 0 0 0
#REV.ForwardReads 5 0 0 0
#REV.ReverseReads 0 0 0 1

@benjjneb
Copy link
Owner

Don't worry about getting perfect zeros. The goal is to get rid of almost all the primers, and you're achieving that.

A few stray ones show up because there are minor differences in how primers are matched by cutadapt and the R code, and possibly some malformed amplicons where the primers might appear in the middle of the sequences. But these are at very low numbers, so not a concern.

@jesika0123
Copy link
Author

Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants