Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BR: #79

Open
EllieDuan opened this issue Nov 19, 2020 · 12 comments
Open

BR: #79

EllieDuan opened this issue Nov 19, 2020 · 12 comments

Comments

@EllieDuan
Copy link

Describe the bug

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 1
at HMMR_ATAC.Main_HMMR_Driver.main(Main_HMMR_Driver.java:425)

To Reproduce
This error only show up when used --trim 3 option

System (please complete the following information):

  • OS: Linux
  • java version: 1.8.0_111
  • HMMRATAC Version 1.2.10

Additional context

@HagaiHargil
Copy link

Hi, any ideas how can one fix it, or go around it? I'm obviously only interested in the open areas (i.e. no nucleosomes) of the chromosome I'm analyzing.

@EllieDuan
Copy link
Author

Hi, any ideas how can one fix it, or go around it? I'm obviously only interested in the open areas (i.e. no nucleosomes) of the chromosome I'm analyzing.

Not really helpful here regarding HMMRATAC.

I ended up selected open regions using Samtools for insert size less than 100bp: https://accio.github.io/bioinformatics/2020/03/10/filter-bam-by-insert-size.html

Then calling peaks on these open regions bam files using Macs2.

@HagaiHargil
Copy link

Thanks for the reply, I'll try that out.

Why didn't you call the peaks on the new file using HMMRATAC? Is the output of your workflow incompatible with what HMMRATAC is looking for?

@EvanTarbell
Copy link
Collaborator

Sorry for the delay. Let me start by asking if this data was size selected? Ie on a gel etc.

@EllieDuan
Copy link
Author

Thanks for the reply, I'll try that out.

Why didn't you call the peaks on the new file using HMMRATAC? Is the output of your workflow incompatible with what HMMRATAC is looking for?

I think HMMRATAC works better for input without size selection for their model, based on their paper: https://academic.oup.com/nar/article/47/16/e91/5519166, therefore, I decided to go back the traditional way. The bam file after Samtool size selection should be compatible with HMMRATAC.

@EllieDuan
Copy link
Author

Sorry for the delay. Let me start by asking if this data was size selected? Ie on a gel etc.

The bam file I input in HMMRATAC that run into this error message was not being size selected, the library for sequencing is also not selected for insert size. It contains all short and long fragments. Thank you.

@EvanTarbell
Copy link
Collaborator

The bam file I input in HMMRATAC that run into this error message was not being size selected, the library for sequencing is also not selected for insert size. It contains all short and long fragments. Thank you.

I think there is a misunderstanding about the —trim option. It isn’t a parameter for excluding nucleosomes or isolating open regions. It is meant to allow the processing of libraries that were subjected to some size selection. So if your data wasn’t, I don’t recommend using that option. If you are only interested in open regions, and not the flanking nucleosomes, then use the coordinates at columns 7-8 of the gappedPeak file, which designates the open regions alone. But the algorithm should be run with the whole library and without —trim.

As an aside, regardless of the software, you shouldn’t do any size selection with ATAC-seq, physical or computational. We were able to show, in the HMMRATAC paper, that size selection results in a loss of precision compared to non selected data

@EllieDuan
Copy link
Author

The bam file I input in HMMRATAC that run into this error message was not being size selected, the library for sequencing is also not selected for insert size. It contains all short and long fragments. Thank you.

I think there is a misunderstanding about the —trim option. It isn’t a parameter for excluding nucleosomes or isolating open regions. It is meant to allow the processing of libraries that were subjected to some size selection. So if your data wasn’t, I don’t recommend using that option. If you are only interested in open regions, and not the flanking nucleosomes, then use the coordinates at columns 7-8 of the gappedPeak file, which designates the open regions alone. But the algorithm should be run with the whole library and without —trim.

As an aside, regardless of the software, you shouldn’t do any size selection with ATAC-seq, physical or computational. We were able to show, in the HMMRATAC paper, that size selection results in a loss of precision compared to non selected data

Thank you for the clarification, this makes much sense now! Many thanks!

@HagaiHargil
Copy link

@EvanTarbell Thanks a lot for the info! I'll be sure to look at the gappedPeaks file. BTW, regardless, the trim option is still buggy and causing the crash @EllieDuan described here.

@EvanTarbell
Copy link
Collaborator

Yes, i understand there is an issue with —trim
I’m going to reopen this issue so I can address whatever bug caused the original error

@EvanTarbell EvanTarbell reopened this Dec 31, 2020
@HagaiHargil
Copy link

HagaiHargil commented Jan 1, 2021

Hello again @EvanTarbell,

I just want to make sure I got your previous explanation regarding the open chromatin regions right.

What I did was to run HMMRATAC on an ENCODE ATAC-Seq dataset to see if I can re-create standard fragment length distribution plots as seen in many ATAC-Seq-centered works. The results show that the average fragament size is much larger than expected, and I hardly see any fragments with a length below 150 bp, which should point at the nucleosome-free regions.

Frag length

I'm reaching out since I'm (obviously) very new to this field and I'm not sure if I'm missing something basic. I'd be happy if you could make sure that the dataset I'm looking at should indeed produce that fragment size histogram I was referring to.

Thanks again and sorry if this is out of scope.

@HagaiHargil
Copy link

For future reference, I did arrive at some sort of a solution. That same BAM file already contains the correct fragment length distribution, we just have to extract it out. I found this blog post with the following intimidating one-liner:

samtools view ATAC_f2q30_sorted.bam | awk '$9>0' | cut -f 9 | sort | uniq -c | sort -b -k2,2n | sed -e 's/^[ \t]*//' > fragment_length_count.txt

This generates a text file that can be histogrammed, and the resulting histogram does indeed contain the necessary periodicity. As a caveat, I'm not sure that this is best-practice, or for that matter even correct :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants