Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Missing most reads after given r2 adapter #547

Open
Flying-Doggy opened this issue Feb 18, 2024 · 1 comment
Open

Missing most reads after given r2 adapter #547

Flying-Doggy opened this issue Feb 18, 2024 · 1 comment

Comments

@Flying-Doggy
Copy link

The r2 reads is specified with GTGAGTGATGGTTGAGGTAGTGTGGAG at 5'. So I use --adapter_sequence_r2 GTGAGTGATGGTTGAGGTAGTGTGGAG to identify the valid reads, and the command is shown below:

fastp \
--in1 ${sampleID}_R1.fq.gz \
--in2 ${sampleID}_R2.fq.gz \
--out1 ${sampleID}_clean_R1.fq.gz \
--out2 ${sampleID}_clean_R2.fq.gz \
--json ${sampleID}.json \
--html ${sampleID}.html \
--trim_poly_g --poly_g_min_len 10 \
--trim_poly_x --poly_x_min_len 10 \
--cut_front --cut_tail --cut_window_size 4 \
--qualified_quality_phred 30 \
--low_complexity_filter --complexity_threshold 30 \
--thread 10 \
--adapter_sequence_r2  GTGAGTGATGGTTGAGGTAGTGTGGAG

The result suggests only 1/6 reads(~6000000, R1+R2) remains, 5/6 reads were filtered because of short length.

Filtering result:
reads passed filter: 6881134
reads failed due to low quality: 60130
reads failed due to too many N: 0
reads failed due to too short: 30160650
reads failed due to low complexity: 1218
reads with adapter trimmed: 21527581
bases trimmed due to adapters: 2374478381
reads with polyX in 3' end: 472347
bases trimmed in polyX tail: 5270559

Then, I use grep to sum the number of reads with GTGAGTGATGGTTGAGGTAGTGTGGAG at R2 head and find 17606348 R2 reads are start with valid adapter. So I think there might be other reason caused missing reads.

zcat 0116A_R2.fq.gz | grep ^GTGAGTGATGGTTGAGGTAGTGTGGAG | wc -l
17606348

To discovery which factor caused low filtered reads, I removed --adapter_sequence_r2 GTGAGTGATGGTTGAGGTAGTGTGGAG and ran fastp again. However, much more reads were acquired.

fastp \
--in1 ${sampleID}_R1.fq.gz \
--in2 ${sampleID}_R2.fq.gz \
--out1 ${sampleID}_clean_R1.fq.gz \
--out2 ${sampleID}_clean_R2.fq.gz \
--json ${sampleID}.json \
--html ${sampleID}.html \
--trim_poly_g --poly_g_min_len 10 \
--trim_poly_x --poly_x_min_len 10 \
--cut_front --cut_tail --cut_window_size 4 \
--qualified_quality_phred 30 \
--low_complexity_filter --complexity_threshold 30 \
--thread 10 \
Filtering result:
reads passed filter: 30428586
reads failed due to low quality: 221476
reads failed due to too many N: 0
reads failed due to too short: 608
reads failed due to low complexity: 1332
reads with adapter trimmed: 5489442
bases trimmed due to adapters: 109359880
reads with polyX in 3' end: 819659
bases trimmed in polyX tail: 9294422

I want to know why many valid reads miss after adding --adapter_sequence_r2 GTGAGTGATGGTTGAGGTAGTGTGGAG .
And in the output clean data, many reads still have adapter sequence.

zcat 0116A_clean_R2.fq.gz | grep ^GTGAGTGATGGTTGAGGTAGTGTGGAG | wc -l
3047290
@Flying-Doggy
Copy link
Author

Flying-Doggy commented Feb 19, 2024

for example, target adapter 'GTGAGTGATGGTTGAGGTAGTGTGGAG' is located at the 5' of the E200012434L1C001R00100009337 read2. And it's expected to get trimmed reads2 like 'CGGGGTTATAGTGTGAGATTTTGTTTTAAGAATAAAAAAATTTTAAAATAAGATAATTTTATTTTTATATAAATTATTTTAGAGTATAATAAAAGGAAAATTTTTAAATTTATTATATAAAGT', but the fastp trimmed the whole E200012434L1C001R00100009337 read2.
---input---

@E200012434L1C001R00100009337/1
ACAACAAACCGAAATCGCGCCACTACACTCCAACCTAAAAAACAACGAAACTCCGTCTCAAAAAAAAAACAAAAAACAAAAAATTAAAATAAAACCAACTTTATATAATAAATTTAAAAATTTTCCTTTTATTATACTCTAAAATAATTT
+
GFGGFGGGFFGFGGGFGFGFGGFGGFGFGFGGGFGGGGGGGGFGGGGGGGGGGFGFGGGFGGGGGGGGGGGGGGGGFGGGGGGFGGGGGGGGGGGGGGFGGGGGGGGGGGGGGGGGGGGGGGGGBFGGGGFGGGGGFFFGGGGGGFGCGG
**@E200012434L1C001R00100009337/2**
GTGAGTGATGGTTGAGGTAGTGTGGAGCGGGGTTATAGTGTGAGATTTTGTTTTAAGAATAAAAAAATTTTAAAATAAGATAATTTTATTTTTATATAAATTATTTTAGAGTATAATAAAAGGAAAATTTTTAAATTTATTATATAAAGT
+
FEFFFFFF?GFFFFFFFFFFFFCGFFFEGFFFFFGEGFFFFFFFF>FFFF=FGGGFGFFFFFFFFGFFCFFGFGFFFFGF>GFEFDFGFEFFGFEFFGAFFFG?FFFFFFF>G>GFFGGFGFDFFF.#E@CCFF?<FFGFFGGGFGFGFF

----fastp_output----

@E200012434L1C001R00100009337/1 paired_read_is_failing
ACAACAAACCGAAATCGCGCCACTACACTCCAACCTAAAAAACAACGAAACTCCGTCTCAAAAAAAAAACAAAAAACAAAAAATTAAAATAAAACCAACTTTATATAATAAATTTAAAAATTTTCCTTTTATTATACTCTAAAATAATTT
+
GFGGFGGGFFGFGGGFGFGFGGFGGFGFGFGGGFGGGGGGGGFGGGGGGGGGGFGFGGGFGGGGGGGGGGGGGGGGFGGGGGGFGGGGGGGGGGGGGGFGGGGGGGGGGGGGGGGGGGGGGGGGBFGGGGFGGGGGFFFGGGGGGFGCGG
**@E200012434L1C001R00100009337/2 failed_too_short**

+

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant