Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Peak calling parameters might not be ideal #169

Open
ktrns opened this issue Feb 17, 2022 · 3 comments
Open

Peak calling parameters might not be ideal #169

ktrns opened this issue Feb 17, 2022 · 3 comments
Labels
enhancement New feature or request

Comments

@ktrns
Copy link

ktrns commented Feb 17, 2022

Dear all,

We had started to discuss this topic in Slack, but I decided to raise this issue so the comments won't get lost.

In brief, you currently use the macs2 settings originally suggested for ATAC-seq data: -f "BAMPE" --nomodel. I have spent some time on reading, and I find this discussion very helpful:
https://twitter.com/XiChenUoM/status/1336658454866325506

In brief, by using -f "BAMPE" we focus on only 1 of the 2 cut sites of the fragment, which does not seem to be the most appropriate way to handle ATAC-seq data.

Based on what I read, converting bam to bed, and then use -f bed (or bedpe, still have to figure that out) together with --shift -75 --extsize 150 is better. You would basically use both reads R1 and R2, and create 150bp reads with the mid-point on the cut-site of the transposase. Another twitter reply to the above thread said that Encode3 used --shift -37 --extsize 73.

I will likely start to try this to see if it gives an improvement.

Thanks a lot for your valuable work on the pipeline!
Katrin

@drpatelh drpatelh added the enhancement New feature or request label Nov 15, 2022
@ktrns ktrns added the bug Something isn't working label May 22, 2023
@ktrns
Copy link
Author

ktrns commented May 22, 2023

Dear all,

I have had a closer look into this issue. I believe that the way macs2 is used for this pipeline is wrong. To understand my point it is very helpful to look at the link I pasted above.

For ChIP-seq, we are interested in the region between paired reads. In contrast, for ATAC-seq, we are interested in the individual reads as they represent the cut sites of the Tn5 transposase and hence open chromatin.

As can be seen in the insert size distribution of ATAC-seq data (wavy pattern), many read pairs span one or more nucleosomes. In its current state, the pipeline produces bigWig files and calls macs2 based on fragments, i.e. regions spanned by read pairs. This means that often what we detect are nucleosomes rather than open chromatin.

One reason the issue was never really obvious in IGV is that also the bigWig files are generated based on fragments in the nf-core code. This way the "wrong" peaks match the "wrong” coverage tracks.

I gave it a try and called macs2 as suggested above in my first post:

# bam to bed
# We want to treat each cut site as a single event
bedtools bamtobed -i sampleA.sorted.bam >> sampleA.sorted.bed

# macs2
macs2 \
  callpeak \
  --keep-dup all --nomodel \
  -f BED \
  --shift -75 --extsize 150 \
  --name sampleA \
  --treatment sampleA.bed

In the attached image, you see the following tracks from top to bottom:

  • consensus peaks produced by the pipeline (black; there were more samples than the one shown)
  • bigWig produced by the pipeline (dark blue)
  • narrow peaks detected by the pipeline (lighter blue)
  • coverage track generated by me based on a single-end bed file (which itself is based on the bam file by the pipeline) (salmon coloured)
  • macs peaks generated by me (pink) with --shift -75 --extsize 150
  • macs peaks generated by me (pink) with --shift -100 --extsize 200
  • loaded bam file produced by the pipeline
Screenshot 2023-05-22 at 20 29 48

I should say that also this representation is not ideal, as in theory the visualised reads are running into the nucleosome, whereas macs2 uses one end of the read and generates artificial reads around the cut site itself. Therefore reads and peaks still don’t match exactly.

The example still shows the effect of using fragments rather than the cut sites. The fragments (blue tracks) pretty much include/span (very likely) one nuclesome.

I hope all of this makes some sense to you :-). Please share your opinion on this, am I misinterpreting the results? I am curious to see what you think.

I will try a few more examples, and see if I can fix this systematically for the pipeline.

Best & many thanks
Katrin

ktrns added a commit to ktrns/atacseq that referenced this issue Jul 3, 2023
…ixed nf-core#168 (write genome fa and fai for IGV), fixed nf-core#169 (peak calling).
@JoseEspinosa JoseEspinosa removed the bug Something isn't working label Jul 25, 2023
@olechnwin
Copy link

@ktrns and @JoseEspinosa,

This may be a silly question but in the region that is shown above, we are assuming the depleted region between the two peaks is where the nucleosome is, right? Can't it be where TF bind instead? How do you tell the difference?

Also, if it is TF binding, having a peak that only capture the open chromatin region on either side of the TF will not include their motifs. When we do motif discovery on the peaks, then the expected TF motifs won't be found?

Also in this suggested macs callpeak parameter, we no longer call broad peaks?

TIA for your insights.

@SergioRodLla
Copy link

Hello,

I saw that some papers in which they use macs2 for ATAC-seq they previously filter the fragment sizes to get a set corresponding to nucleosome free regions (NFR) and then they apply macs2 callpeak to them. Do you think this would solve the issue that @ktrns mentions about "visualised reads are running into the nucleosome"?
Let me know what you think about this.

Best regards,
Sergio

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

5 participants