Discrepancy in filtering results and reads after filtering #528

xapple · 2023-10-12T13:19:52Z

Here is the head of the file stats_fastp.json for a random single-end Illumina sequencing sample:

{
        "summary": {
                "fastp_version": "0.23.4",
                "sequencing": "single end (75 cycles)",
                "before_filtering": {
                        "total_reads":19014947,
                        "total_bases":1426121025,
                        "q20_bases":1368126463,
                        "q30_bases":1340057991,
                        "q20_rate":0.959334,
                        "q30_rate":0.939652,
                        "read1_mean_length":75,
                        "gc_content":0.501123
                },
                "after_filtering": {
                        "total_reads":10933431,
                        "total_bases":780019338,
                        "q20_bases":758743644,
                        "q30_bases":744983654,
                        "q20_rate":0.972724,
                        "q30_rate":0.955084,
                        "read1_mean_length":71,
                        "gc_content":0.498169
                }
        },
        "filtering_result": {
                "passed_filter_reads": 18724357,
                "low_quality_reads": 1329,
                "too_many_N_reads": 7,
                "too_short_reads": 289254,
                "too_long_reads": 0
        },

after running it through fastp with the following command:

$ fastp --detect_adapter_for_pe --overrepresentation_analysis --dedup --correction --cut_right --thread 10 --in1 fwd.fastq.gz --out1 clean/fwd.fastq.gz --unpaired1 clean/fwd.fastq.singletons.fastq --html stats_fastp.html --json stats_fastp.json

We can see that after_filtering reports 10'933'431 reads left in the cleaned FASTQ. However, filtering_result says that as many as 18'724'357 reads passed the filter. This is a huge mismatch of about 7.8 million reads. What happened to them? Why did they get removed?
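To make the mismatch concrete, here is a minimal Python sketch of the read accounting, using only the fields visible in the JSON excerpt above. The function name `read_accounting` and the keys of the dictionary it returns are made up for illustration, and the sketch assumes every value under filtering_result is a read count. It only measures the gap; it does not say which step removed the reads.

```python
import json

# Numbers copied from the stats_fastp.json excerpt above (only the fields needed here).
EXCERPT = """
{
  "summary": {
    "before_filtering": {"total_reads": 19014947},
    "after_filtering": {"total_reads": 10933431}
  },
  "filtering_result": {
    "passed_filter_reads": 18724357,
    "low_quality_reads": 1329,
    "too_many_N_reads": 7,
    "too_short_reads": 289254,
    "too_long_reads": 0
  }
}
"""


def read_accounting(report: dict) -> dict:
    """Compare the filtering_result counts with the before/after read totals."""
    summary = report["summary"]
    result = report["filtering_result"]
    before = summary["before_filtering"]["total_reads"]
    after = summary["after_filtering"]["total_reads"]
    # Assumption: every entry in filtering_result is a read count.
    categorized = sum(result.values())
    return {
        # Input reads that no filtering_result category accounts for.
        "uncategorized_input_reads": before - categorized,
        # Reads counted as passed but absent from the after_filtering total.
        "passed_but_missing_from_output": result["passed_filter_reads"] - after,
    }


acct = read_accounting(json.loads(EXCERPT))
print(acct)
```

With these numbers the filtering_result categories add up exactly to the 19'014'947 input reads, so the missing 7'790'926 reads are all counted as having passed the filter. To run the same check on a full report, replace `json.loads(EXCERPT)` with `json.load(open("stats_fastp.json"))`.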

GaryZhangYue · 2023-10-19T01:12:57Z

I have the same issue here. It happened after I included flags to filter out duplicated reads and low-complexity reads. Without the two flags, the numbers seemed to match each other.

Read1 after filtering:
total reads: 9899483
total bases: 969140514
Q20 bases: 930274034(95.9896%)
Q30 bases: 849902013(87.6965%)

Read2 after filtering:
total reads: 9899483
total bases: 968730404
Q20 bases: 922809589(95.2597%)
Q30 bases: 846947592(87.4286%)

Filtering result:
reads passed filter: 19798966
reads failed due to low quality: 3232674
reads failed due to too many N: 206
reads failed due to too short: 111888936
reads with adapter trimmed: 58014749
bases trimmed due to adapters: 1885062968

Duplication rate: 79.9009%

Maybe it is due to the deduplication?
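For a paired-end log like the one above, the same check means adding the Read1 and Read2 totals and comparing the sum with the "reads passed filter" line. A minimal sketch, with the numbers copied from this log and variable names chosen only for illustration:

```python
# Numbers copied from the paired-end log above.
read1_after = 9899483      # "Read1 after filtering: total reads"
read2_after = 9899483      # "Read2 after filtering: total reads"
passed_filter = 19798966   # "reads passed filter"

# Paired-end output holds both mates, so compare against the sum.
total_after = read1_after + read2_after
gap = passed_filter - total_after

print(f"reads after filtering (R1 + R2): {total_after}")
print(f"passed filter minus after filtering: {gap}")
```

For the figures quoted in this particular log, the two per-read totals sum exactly to the passed-filter count, so the gap is 0 here. The excerpt does not show the command or the pre-filtering totals, so this says nothing about other runs.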

rikrdo89 · 2023-10-23T13:48:14Z

I am also seeing a discrepancy in those results. I do use the --dedup parameter when I run fastp, but if duplicates are being removed, the final results should probably reflect that.
