Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cleaning by removing 'Dangling ends' fragments only? #164

Open
yermek2030 opened this issue Jul 28, 2023 · 7 comments
Open

Cleaning by removing 'Dangling ends' fragments only? #164

yermek2030 opened this issue Jul 28, 2023 · 7 comments

Comments

@yermek2030
Copy link

Dear members of Vaquerizas lab,

I was wondering if there is a way to perform cleaning of the FASTQ or BAM files from the 'Dangling ends" fragments only? The necessity stems from the fact that I have performed BAM file alignment using standard BWA MEME and SAMTOOLS. Now when I plot the 'fragment size vs density' plot using FAN-C I have a peak at around ~100-110bp (Figure1), which disrupts the expected bell-like shape of the distribution we get with data from public repositories. Given that all the adaptors were removed prior to generating BAM files, I suspect that the peak comes from 'Dangling ends' as classified by the QC plot output of FAN-C, because the barplot for this type of contaminant is the only one that significantly differs from the (presumably 'clean') public datasets we use. If possible I would like to test my hypothesis by finding and removing 'dangling ends' fragments only.
Many thanks.
github

@kaukrise
Copy link
Collaborator

kaukrise commented Aug 17, 2023

Yes you can use fanc pairs for this. First, create a .pairs file as outline in the docs.

Then use fanc pairs filter options again to filter the object. You will probably see the greatest effect with point 1:

  1. fanc pairs --filter-self-ligations removes all reads where both ends map to the same fragment
  2. fanc pairs --ligation-error-plot creates "ligation error plot" as shown here
  3. From the ligation error plot, choose appropriate cutoffs for ligation errors (see docs and filter using fanc pairs -i <inward cutoff> -o <outward cutoff>

@SimoneO98
Copy link

Hi ,

@yermek2030 I encountered a similar problem so I was wondering if you succeeded in removing the peak around 100 bp? (and with which settings -i, -o, -d). Because I followed the instructions @kaukrise suggested above but the peak is still present. (using -i 10000 -o 10000 -d 5000 -l)

Additionally, @kaukrise I was wondering If you maybe have an explanation why I also have a peak around 10 bp? I don't understand why this peak is present and how I can remove this.
image

Thank you in advance for helping me!

@yermek2030
Copy link
Author

yermek2030 commented Sep 5, 2023 via email

@kaukrise
Copy link
Collaborator

kaukrise commented Sep 5, 2023

Hi everyone,

could you please post the complete commands you used to do the filtering? It is difficult to guess where exactly the filtering might have failed otherwise.

Thank you!

@SOlsthoorn I don't know why you have a peak around 10bp, either. Your restriction profile looks a bit unusual in general to me - the mean insert size is quite small, and there are no large fragments at all. But this could be related to your particular protocol

@SimoneO98
Copy link

Hi @kaukrise,

First I created a pairs file using:
fanc pairs bam_1.bam bam_2.bam pairs_output.pairs -g hg38_DpnII_fragments.bed -m -us -q 3 -t 20

Then I made a ligation error plot and a restriction site distance plot:
image
image

and based on that I filtered the pairs file:
fanc pairs pairs_output.pairs -i 10000 -o 10000 -d 5000 -l -p 2 -s stats.txt
resulting in the previous restriction site distance plot I send earlier.

So it seems like the peak around 100bp decreases a bit, but not completely disappeared and the peak around 10bp only increased in density ?

Yes, the restriction profile does indeed appear unusual. I thought that perhaps the mean would shift a bit more towards 200 bp, and the pattern would resemble a more normal distribution if the high peaks at the beginning were removed.
In the protocol we only removed fragments <200bp and when we look at the distribution of the bioanalyzer a peak around 800 bp appears.

Thanks for your help!

@SimoneO98
Copy link

Hi @kaukrise,

Sorry to bother you again with my issue. But could you maybe provide a more detailed explanation of the restriction site distance plot, what it represents and how it's constructed? On the FAN-C website, there is mention of using the -d parameter to filter read pairs based on their cumulative distance to the nearest restriction site. It's suggested that generating this re-dist plot, using only 10,000 read pairs, is valuable for determining -d and identifying library issues. Could you elaborate on this process? Maybe if I understand this better it can help me figure out the origin of short range peaks and find a way to remove them.

Thanks again in advance for your help!

@yermek2030
Copy link
Author

Dear @kaukrise and @SimoneO98,

Thank you for your patience.

  1. I cleaned my fastq files using Trimmomatic:

trimmomatic PE CTKOTd6r1_S11_R1_001.fastq CTKOTd6r1_S11_R2_001.fastq CTKOTd6r1_S11_R1.paired.fastq CTKOTd6r1_S11_R1.unpaired.fastq CTKOTd6r1_S11_R2.paired.fastq CTKOTd6r1_S11_R2.unpaired.fastq -trimlog trimlog_test ILLUMINACLIP:adapters/TruSeq3-PE-2.fa:2:30:10:2:True LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:25
awk '$6>0 {print}' trimlog_test | wc
TruSeq2-PE
4900918 29405508 305969367
TruSeq3-PE-2
4900918 29405508 305969367
TruSeq3-PE-2_Minlen 120:
0 0 0
TruSeq2-PE_Minlen 120:
0 0 0
TruSeq2-PE_Minlen 101:
0 0 0
TruSeq3-PE-2_Minlen_75:
3045614 18273684 190132512
TruSeq3-PE_Minlen_75:
3045614 18273684 190132512

  1. Separately aligned forward and reverse FASTQ files using BWA:

bwa mem -A 1 -B 4 -E 50 -L 0 -t 8 ../0_genomes/mm39/mm39.fa cleaned_by_TruSeq3-PE-2-75/CTKOTd6r1_S11_R1.paired.fastq | samtools view -Shb - > mm39_bwa_out/CTKOTd6r1_S11_R1_75trimmed.bam
bwa mem -A 1 -B 4 -E 50 -L 0 -t 8 ../0_genomes/mm39/mm39.fa cleaned_by_TruSeq3-PE-2-75/CTKOTd6r1_S11_R2.paired.fastq | samtools view -Shb - > mm39_bwa_out/CTKOTd6r1_S11_R2_75trimmed.bam

  1. Created 'pairs' file according to the manual:

fanc pairs CTKOTd6r1_S11_R1_75trimmed.bam CTKOTd6r1_S11_R2_75trimmed.bam CTKOTd6r1_S11_R1_paired.pairs -g ~/myzdisk/0_genomes/mm39/mm39_DpnII.bed -us -q 3

  1. Plotted the graphs according to the manual:

fanc pairs --re-dist-plot CTKOTd6r1_S11_paired_re-dist.png CTKOTd6r1_S11_paired.pairs
fanc pairs --statistics-plot CTKOTd6r1_S11_paired.pairs_stats.png CTKOTd6r1_S11_paired.pairs
fanc pairs --ligation-error-plot CTKOTd6r1_S11_paired.pairs_ligation-err.png CTKOTd6r1_S11_paired.pairs

  1. Followed @kaukrise advice and tried to filter self-ligations:

fanc pairs --filter-self-ligations CTKOTd6r1_S11_paired.pairs

  1. Re-plotted the image after filtering the pairs file. No new file was generated as a result of filtering, so I guessed that the existing file was modified:

fanc pairs --re-dist-plot CTKOTd6r1_S11_paired_re-dist.png CTKOTd6r1_S11_paired.pairs

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants