Peak calling parameters might not be ideal #169

ktrns · 2022-02-17T13:43:45Z

Dear all,

We had started to discuss this topic in Slack, but I decided to raise this issue so the comments won't get lost.

In brief, you currently use the macs2 settings originally suggested for ATAC-seq data: -f "BAMPE" --nomodel. I have spent some time on reading, and I find this discussion very helpful:
https://twitter.com/XiChenUoM/status/1336658454866325506

In brief, by using -f "BAMPE" we focus on only 1 of the 2 cut sites of the fragment, which does not seem to be the most appropriate way to handle ATAC-seq data.

Based on what I read, converting bam to bed, and then use -f bed (or bedpe, still have to figure that out) together with --shift -75 --extsize 150 is better. You would basically use both reads R1 and R2, and create 150bp reads with the mid-point on the cut-site of the transposase. Another twitter reply to the above thread said that Encode3 used --shift -37 --extsize 73.

I will likely start to try this to see if it gives an improvement.

Thanks a lot for your valuable work on the pipeline!
Katrin

The text was updated successfully, but these errors were encountered:

ktrns · 2023-05-22T18:58:47Z

Dear all,

I have had a closer look into this issue. I believe that the way macs2 is used for this pipeline is wrong. To understand my point it is very helpful to look at the link I pasted above.

For ChIP-seq, we are interested in the region between paired reads. In contrast, for ATAC-seq, we are interested in the individual reads as they represent the cut sites of the Tn5 transposase and hence open chromatin.

As can be seen in the insert size distribution of ATAC-seq data (wavy pattern), many read pairs span one or more nucleosomes. In its current state, the pipeline produces bigWig files and calls macs2 based on fragments, i.e. regions spanned by read pairs. This means that often what we detect are nucleosomes rather than open chromatin.

One reason the issue was never really obvious in IGV is that also the bigWig files are generated based on fragments in the nf-core code. This way the "wrong" peaks match the "wrong” coverage tracks.

I gave it a try and called macs2 as suggested above in my first post:

# bam to bed
# We want to treat each cut site as a single event
bedtools bamtobed -i sampleA.sorted.bam >> sampleA.sorted.bed

# macs2
macs2 \
  callpeak \
  --keep-dup all --nomodel \
  -f BED \
  --shift -75 --extsize 150 \
  --name sampleA \
  --treatment sampleA.bed

In the attached image, you see the following tracks from top to bottom:

consensus peaks produced by the pipeline (black; there were more samples than the one shown)
bigWig produced by the pipeline (dark blue)
narrow peaks detected by the pipeline (lighter blue)
coverage track generated by me based on a single-end bed file (which itself is based on the bam file by the pipeline) (salmon coloured)
macs peaks generated by me (pink) with --shift -75 --extsize 150
macs peaks generated by me (pink) with --shift -100 --extsize 200
loaded bam file produced by the pipeline

I should say that also this representation is not ideal, as in theory the visualised reads are running into the nucleosome, whereas macs2 uses one end of the read and generates artificial reads around the cut site itself. Therefore reads and peaks still don’t match exactly.

The example still shows the effect of using fragments rather than the cut sites. The fragments (blue tracks) pretty much include/span (very likely) one nuclesome.

I hope all of this makes some sense to you :-). Please share your opinion on this, am I misinterpreting the results? I am curious to see what you think.

I will try a few more examples, and see if I can fix this systematically for the pipeline.

Best & many thanks
Katrin

…ixed nf-core#168 (write genome fa and fai for IGV), fixed nf-core#169 (peak calling).

olechnwin · 2023-08-03T17:51:45Z

@ktrns and @JoseEspinosa,

This may be a silly question but in the region that is shown above, we are assuming the depleted region between the two peaks is where the nucleosome is, right? Can't it be where TF bind instead? How do you tell the difference?

Also, if it is TF binding, having a peak that only capture the open chromatin region on either side of the TF will not include their motifs. When we do motif discovery on the peaks, then the expected TF motifs won't be found?

Also in this suggested macs callpeak parameter, we no longer call broad peaks?

TIA for your insights.

Revert "DO NOT MERGE yet: Discuss fixes of #164 #168 #169 "

SergioRodLla · 2024-01-24T08:48:34Z

Hello,

I saw that some papers in which they use macs2 for ATAC-seq they previously filter the fragment sizes to get a set corresponding to nucleosome free regions (NFR) and then they apply macs2 callpeak to them. Do you think this would solve the issue that @ktrns mentions about "visualised reads are running into the nucleosome"?
Let me know what you think about this.

Best regards,
Sergio

drpatelh added the enhancement New feature or request label Nov 15, 2022

ktrns added the bug Something isn't working label May 22, 2023

ktrns added a commit to ktrns/atacseq that referenced this issue Jul 3, 2023

Merged ktrns/master into ktrns/dev. Fixed nf-core#164 (4/5 bp shift), f…

f14eeb8

…ixed nf-core#168 (write genome fa and fai for IGV), fixed nf-core#169 (peak calling).

ktrns mentioned this issue Jul 5, 2023

DO NOT MERGE yet: Discuss fixes of #164 #168 #169 #301

Merged

JoseEspinosa removed the bug Something isn't working label Jul 25, 2023

samuelruizperez mentioned this issue Aug 22, 2023

Adding peak-calling with Genrich (fix #108) and multi-mapping read analysis #331

Draft

8 tasks

bjlang added a commit that referenced this issue Sep 6, 2023

Revert "DO NOT MERGE yet: Discuss fixes of #164 #168 #169 "

1030a66

JoseEspinosa added a commit that referenced this issue Sep 6, 2023

Merge pull request #333 from nf-core/revert-301-dev

e034e1e

Revert "DO NOT MERGE yet: Discuss fixes of #164 #168 #169 "

bjlang added a commit that referenced this issue Sep 6, 2023

Revert "DO NOT MERGE yet: Discuss fixes of #164 #168 #169 "

17cc561

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Peak calling parameters might not be ideal #169

Peak calling parameters might not be ideal #169

ktrns commented Feb 17, 2022

ktrns commented May 22, 2023

olechnwin commented Aug 3, 2023

SergioRodLla commented Jan 24, 2024

Peak calling parameters might not be ideal #169

Peak calling parameters might not be ideal #169

Comments

ktrns commented Feb 17, 2022

ktrns commented May 22, 2023

olechnwin commented Aug 3, 2023

SergioRodLla commented Jan 24, 2024