Drop in quality score in the middle of the reads #1935

Open
MadsBjornsen opened this issue Apr 25, 2024 · 8 comments

Comments

@MadsBjornsen

Hi, I am looking at some 2 x 150 bp data from the V4 region, and when I look at the quality plots, we see a drop in the middle of the read for both forward and reverse. Furthermore, many of our reads don't merge, and we have a relatively high percentage of chimeras (between 23% and 35%). I have tried different settings for the filtering:

out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs, truncLen=c(135,125), maxN=0, maxEE=c(3,2), truncQ=2, rm.phix=TRUE, compress=TRUE, multithread=TRUE)

       input filtered denoisedF denoisedR merged nonchim
S217   28311    27556     26812     27263   8361    7182
S218   71293    69817     68873     69245   5743    3848
S219   54009    52129     51461     51417   9244   18130
S220   42999    41111     40682     40562  18860   16787
S221   46393    44781     44175     44475  11427   10013
S222   44044    42420     42022     41865  18219   17392

sum(seqtab.nochim)/sum(seqtab)
0.770275
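
For context, a sketch of how a tracking table like the one above is typically assembled (object names such as out, dadaFs, dadaRs, mergers, and seqtab.nochim follow the standard DADA2 tutorial and are assumed here):

getN <- function(x) sum(getUniques(x))
track <- cbind(out, sapply(dadaFs, getN), sapply(dadaRs, getN),
               sapply(mergers, getN), rowSums(seqtab.nochim))
colnames(track) <- c("input", "filtered", "denoisedF", "denoisedR", "merged", "nonchim")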

Here, we left out the truncLen part:
out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs,
                     maxN=0, maxEE=c(3,2), truncQ=2, rm.phix=TRUE,
                     compress=TRUE, multithread=TRUE)

       input filtered denoisedF denoisedR merged nonchim
S217   28311    26963     26240     26518  18448   13343
S218   71293    68615     67456     67761  54402   26459
S219   54009    51042     50224     50386  31696   25513
S220   42999    40072     39625     39559  26085   22037
S221   46393    43918     42427     43474  31217   22158
S222   44044    41441     41008     40993  27124   22091

sum(seqtab.nochim)/sum(seqtab)
0.6565934

Forward quality plot
Reverse quality plot

Could the sudden drop in read quality in the middle of the reads be a sequencing error, and could that explain the high number of chimeras found in the data?

Thanks for the feedback.

@benjjneb (Owner)

I don't have a satisfying answer, but yes, the abnormal quality drop in the middle of the reads could be connected to the large number of reads lost at the merging and chimera-removal steps. But I don't know what the underlying mechanism would be, as I've never seen this type of quality profile before. One thing to check is whether the reads pre- and post-quality-drop both look like bacterial 16S sequences, and whether there is any low-complexity sequence in the data (see the dada2 plotComplexity and seqComplexity functions for a quick way to look for that).
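
For reference, a minimal sketch of that quick check on one of the samples attached below (the file name is taken from later in the thread):

library(dada2)
# Histogram of per-read sequence complexity scores for one sample;
# plotComplexity takes fastq/fastq.gz file paths directly
plotComplexity("S217_R1_001.fastq.gz")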

@MadsBjornsen (Author)

Thanks for the feedback. The output from plotComplexity for the first two samples looks like this:

S217_R2_001
S217_R1_001

S218_R1_001
S218_R2_001

When I try the seqComplexity I get this error:

hist(seqComplexity("S217_R1_001.fastq"), 100)
Error in .Call2("new_XString_from_CHARACTER", class(x0), string, start, :
key 50 (char '2') not in lookup table
In addition: Warning message:
In seqComplexity("S217_R1_001.fastq") :
Not all sequences were A/C/G/T only, which can interfere with the calculation of the Shannon richness.

I have attached the first two forward and reverse samples:

S217_R1_001.fastq.gz
S217_R2_001.fastq.gz

S218_R1_001.fastq.gz
S218_R2_001.fastq.gz

I might try to get the raw data and do the demultiplexing myself.

@benjjneb (Owner)

Those first two plots in particular look troubling. A bimodal distribution of complexity scores suggests a mixture of normal (high-complexity) sequences and sequences that partially contain low-complexity regions. This should not be observed in V4 data.

To use seqComplexity you'll need to read in the sequences first (it doesn't work on filenames), so sq <- getSequences("my/fastq/file.fq"); seqComplexity(sq) should get you a vector of complexity scores for the sequences in that file. Then, depending on how comfortable you are with R, you can poke around the sets of sequences in the low and high complexity modes to see if anything obvious jumps out.
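
A short sketch of that suggestion on one of the attached files (the cutoff of 13 between the low and high modes is just an assumed value for illustration):

library(dada2)
# Read the sequences in, then compute per-read complexity scores
sq <- getSequences("S217_R1_001.fastq.gz")
cmplx <- seqComplexity(sq)
hist(cmplx, 100)
# Poke around the low- and high-complexity modes
head(sq[cmplx < 13])
head(sq[cmplx >= 13])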

@MadsBjornsen (Author)

Ahh, thanks for that; I did as you suggested

S217_R1_001-seq
S217_R2_001-seq

S218_R1_001-seq
S218_R2_001-seq

I furthermore did a BLAST of some of the sequences from one of the samples, and they came back as bacteria.
I have read in other feedback you have given that you might recommend using the rm.lowcomplex argument in the filterAndTrim function; would that be an idea?

Thanks

@benjjneb (Owner)

I have read in other feedback you have given that you might recommend using the rm.lowcomplex argument in the filterAndTrim function; would that be an idea?

Yes, you would set the rm.lowcomplex threshold based on the plotComplexity histograms to remove the low-score mode. So maybe something like 13, judging from the plots above?

@MadsBjornsen (Author)

Thanks for the input so far!
I have tried different parameters to see what would give me the best results, but I keep losing a lot of reads at the merging step. Would it be better to work only with the reverse reads and skip the forward reads, as that seems to be where the problems are?

However, by doing so, will I then run into trouble when assigning taxonomy, as I am working with V4 2x150 bp?

Finally, if I use truncLen I seem to lose more reads when merging compared to when I leave it out; however, my seqtab.nochim/seqtab ratio is much better when using truncLen.

out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs,
                     maxN=0, maxEE=c(3,2), truncQ=2, rm.phix=TRUE,
                     compress=TRUE, multithread=TRUE, rm.lowcomplex = 13)

I get this ratio:
sum(seqtab.nochim)/sum(seqtab) = 0.6312996

       input filtered denoisedF denoisedR merged nonchim
S217   28311    12809     12289     12379  10879    7446
S218   71293    65764     64674     64858  53490   25977
S219   54009    18146     17493     17646  13899    8675
S220   42999    10944     10584     10577   7945    5242
S221   46393    23366     22046     22968  20470   12945
S222   44044    11088     10811     10838   9797    6395

whereas when I use truncLen I get this ratio:

out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs, truncLen=c(117,147),
                     maxN=0, maxEE=c(3,2), truncQ=2, rm.phix=TRUE,
                     compress=TRUE, multithread=TRUE, rm.lowcomplex = 13)

sum(seqtab.nochim)/sum(seqtab) = 0.7294229

       input filtered denoisedF denoisedR merged nonchim
S217   28311    21459     20811     21053   5206    4391
S218   71293    67665     65995     67084   9482    6057
S219   54009    32892     32441     32399  10286    8630
S220   42999    23861     23342     23476   7828    7076
S221   46393    32658     32128     32292   5764    4759
S222   44044    24482     24056     24157   9003    6568

Sorry for the long post and thanks again!

@benjjneb (Owner)

I should have caught this earlier, but the truncLen you are using is too small. If using the common EMP V4 primer set, the length of the sequenced amplicon is 251-255 nts. The mergePairs function requires a minimum of 12 nts of overlap to merge the forward and reverse reads by default. So it is necessary to have at least 255+12=267 nts between the forward and reverse reads after truncation, and some buffer is recommended. Your total nts post-truncation are 117+147=264 and 135+125=260 in the filterAndTrim calls you posted above. This is what is driving the very low merging rates. So if you want to explore truncLen, you need to set a hard requirement that the forward + reverse truncation lengths add up to a minimum of 270 (or you can adjust minOverlap, I guess).
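
A quick back-of-the-envelope version of that check (255 nts and the 12 nt default minOverlap are the numbers quoted above; the 145/130 combination is only an illustrative example):

amplicon_len <- 255   # upper end of the EMP V4 amplicon length range
min_overlap  <- 12    # default minOverlap in mergePairs
# Overlap remaining after truncation = truncLenF + truncLenR - amplicon length
c(135 + 125, 117 + 147) - amplicon_len       # 5 and 9 nts, both below 12
(145 + 130) - amplicon_len >= min_overlap    # TRUE: 20 nts of overlap, with buffer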

sum(seqtab.nochim)/sum(seqtab) = 0.6312996

This is a high rate of chimeric reads, but I would check whether it stays this high after fixing the truncLen issue. We do see up to 25% of reads flagged as chimeric in some amplicon sequencing data (although that suggests improved PCR protocols would be appropriate in the future).

@MadsBjornsen (Author)

Thanks for that; I had set minOverlap lower, all the way down to 3, in the examples I sent you, to see if that would make any difference; sorry for not mentioning that. I will run it again with an appropriate truncLen that adds up to a minimum of 270 and see how that goes.

The data is something I received from someone else to combine with other data for a paper, so I don't know how the laboratory protocols were done.
