Discrepancy in filtering results and reads after filtering #528

Open
xapple opened this issue Oct 12, 2023 · 2 comments
Comments

xapple commented Oct 12, 2023

Here is the head of the file stats_fastp.json for a random single-end Illumina sequencing sample:

{
        "summary": {
                "fastp_version": "0.23.4",
                "sequencing": "single end (75 cycles)",
                "before_filtering": {
                        "total_reads":19014947,
                        "total_bases":1426121025,
                        "q20_bases":1368126463,
                        "q30_bases":1340057991,
                        "q20_rate":0.959334,
                        "q30_rate":0.939652,
                        "read1_mean_length":75,
                        "gc_content":0.501123
                },
                "after_filtering": {
                        "total_reads":10933431,
                        "total_bases":780019338,
                        "q20_bases":758743644,
                        "q30_bases":744983654,
                        "q20_rate":0.972724,
                        "q30_rate":0.955084,
                        "read1_mean_length":71,
                        "gc_content":0.498169
                }
        },
        "filtering_result": {
                "passed_filter_reads": 18724357,
                "low_quality_reads": 1329,
                "too_many_N_reads": 7,
                "too_short_reads": 289254,
                "too_long_reads": 0
        },

after running it through fastp with the following command:

$ fastp --detect_adapter_for_pe --overrepresentation_analysis --dedup --correction --cut_right --thread 10 --in1 fwd.fastq.gz --out1 clean/fwd.fastq.gz --unpaired1 clean/fwd.fastq.singletons.fastq --html stats_fastp.html --json stats_fastp.json

The after_filtering section reports 10'933'431 reads left in the cleaned FASTQ. However, the filtering_result section says that 18'724'357 reads passed the filter. That is a gap of about 7.8 million reads. What happened to them? Why were they removed?
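
The arithmetic behind this question can be made explicit. The following is a minimal sketch, not part of fastp: the helper name `read_accounting` is hypothetical, and the numbers are copied from the report quoted above. It checks whether the filtering_result categories add up to the input read count, and how many passed reads are missing from after_filtering.

```python
import json

# Values copied from the head of stats_fastp.json quoted above.
REPORT = json.loads("""
{
  "summary": {
    "before_filtering": {"total_reads": 19014947},
    "after_filtering": {"total_reads": 10933431}
  },
  "filtering_result": {
    "passed_filter_reads": 18724357,
    "low_quality_reads": 1329,
    "too_many_N_reads": 7,
    "too_short_reads": 289254,
    "too_long_reads": 0
  }
}
""")


def read_accounting(report):
    """Compare the read counts reported by the different sections.

    Hypothetical helper for illustration only.
    """
    before = report["summary"]["before_filtering"]["total_reads"]
    after = report["summary"]["after_filtering"]["total_reads"]
    result = report["filtering_result"]
    passed = result["passed_filter_reads"]
    return {
        "input_reads": before,
        # Sum of every filtering_result category, passed and failed.
        "categories_sum": sum(result.values()),
        # Reads counted as passed but absent from after_filtering.
        "unaccounted": passed - after,
        "unaccounted_fraction": (passed - after) / passed,
    }


acct = read_accounting(REPORT)
print(acct)
```

With these figures, the filtering_result categories sum to exactly the 19014947 input reads, while 7790926 reads (about 41.6% of those that passed) are missing from after_filtering. None of the listed failure categories accounts for them, so the removal must be reported somewhere else, or not at all, in this part of the file.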

GaryZhangYue commented Oct 19, 2023

I have the same issue here. It happened after I included flags to filter out duplicated reads and low-complexity reads. Without the two flags, the numbers seemed to match each other.

Read1 after filtering:
total reads: 9899483
total bases: 969140514
Q20 bases: 930274034(95.9896%)
Q30 bases: 849902013(87.6965%)

Read2 after filtering:
total reads: 9899483
total bases: 968730404
Q20 bases: 922809589(95.2597%)
Q30 bases: 846947592(87.4286%)

Filtering result:
reads passed filter: 19798966
reads failed due to low quality: 3232674
reads failed due to too many N: 206
reads failed due to too short: 111888936
reads with adapter trimmed: 58014749
bases trimmed due to adapters: 1885062968

Duplication rate: 79.9009%

Maybe it is due to the deduplication?
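
For paired-end output, the Read1 and Read2 totals have to be summed before they can be compared with the "reads passed filter" line. The following is a rough sketch of that comparison on the log excerpt above. The helper names are hypothetical, and the regular expressions assume the log layout is exactly as pasted.

```python
import re

# Relevant lines copied from the log excerpt above.
LOG = """\
Read1 after filtering:
total reads: 9899483

Read2 after filtering:
total reads: 9899483

Filtering result:
reads passed filter: 19798966
"""


def after_filtering_totals(log):
    """Return the 'total reads' value under each 'ReadN after filtering' header."""
    pattern = r"^Read\d after filtering:\ntotal reads: (\d+)"
    return [int(n) for n in re.findall(pattern, log, flags=re.MULTILINE)]


def passed_filter(log):
    """Return the 'reads passed filter' count from the filtering result block."""
    match = re.search(r"^reads passed filter: (\d+)", log, flags=re.MULTILINE)
    return int(match.group(1))


per_read = after_filtering_totals(LOG)
gap = passed_filter(LOG) - sum(per_read)
print(per_read, gap)
```

On the excerpt as pasted, the gap comes out to zero: 2 × 9899483 = 19798966. The mismatch described in this thread is therefore not visible in these particular lines, and the same check would have to be rerun on the log from the run that showed it.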

rikrdo89 commented Oct 23, 2023

I am also seeing a discrepancy in those results. I do pass the --dedup parameter when I run fastp, but if duplicates are being removed, maybe the final results should reflect that.