Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

background noise too high for my data? #98

Open
zmz1988 opened this issue May 16, 2022 · 2 comments
Open

background noise too high for my data? #98

zmz1988 opened this issue May 16, 2022 · 2 comments

Comments

@zmz1988
Copy link

zmz1988 commented May 16, 2022

Hello, thanks for the nice and user-friendly tool! I have atac-seq data (150 bp pair-end) from small plant genome, and have been trying HMMRATAC for a while. So the problem for me is that I can't get the model right, as I had tried many combination of -l, -u and even -z. I showed one of the example as below.

Version:        1.2.10
Arguments Used:
-b      Sample.atac.forHMMRATAC.bam
-i     Sample.atac.forHMMRATAC.bam.bai
-g      Sample.genome
-o      Sample_60_25
--bedgraph      True
-u      60
-l      25
Fragment Expectation Maximum Done
Mean    50.0    StdDevs 20.0
Mean    190.63425024911237      StdDevs 62.502097145190604
Mean    400.60895549134216      StdDevs 50.93694387902569
Mean    729.022089951699        StdDevs 145.67453137113827
ScalingFactor   103.086173
Training Regions found and Zscore regions for exclusion found
Training Fragment Pileup completed
Kmeans Model:
HMM with 3 state(s)

State 0
  Pi: 0.3333333333333333
  Aij: 0.333 0.333 0.333
  Opdf: Multi-variate Gaussian distribution --- Mean: [ 0.531 1.169 1.604 4.339 ]

State 1
  Pi: 0.3333333333333333
  Aij: 0.333 0.333 0.333
  Opdf: Multi-variate Gaussian distribution --- Mean: [ 1.075 1.678 1.373 0.931 ]

State 2
  Pi: 0.3333333333333333
  Aij: 0.333 0.333 0.333
  Opdf: Multi-variate Gaussian distribution --- Mean: [ 2.349 4.864 4.519 4.93 ]

Model created and refined. See Can_60_25.model
Model:
HMM with 3 state(s)

State 0
  Pi: 0.3333333333333333
  Aij: 0.979 0.015 0.006
  Opdf: Multi-variate Gaussian distribution --- Mean: [ 0.501 0.855 0.831 1.056 ]

State 1
  Pi: 0.3333333333333333
  Aij: 0.012 0.985 0.003
  Opdf: Multi-variate Gaussian distribution --- Mean: [ 1.495 2.338 1.724 0.847 ]

State 2
  Pi: 0.3333333333333333
  Aij: 0.015 0.009 0.976
  Opdf: Multi-variate Gaussian distribution --- Mean: [ 1.065 1.737 2.032 2.792 ]

Genome split and subtracted masked regions
0 round viterbi done
37 round viterbi done
Total time (seconds)=   4519

I had tried at least ten different combinations of -l and -u, and none of them can get the ideal model you described in your paper. After checking some of the log files showed in the issues channel, I realised that most of people have mean values of 0 for state 0 (which is the background or starting model if I understand correctly?). So does this mean that the background noise of my data is pretty high?

I also attach a insert size picture here (our data were generated though the purification of nuclei, and the insert size calculation is for the clean bam after multiple filtering)
Can.dedup.clean.bam.insertsize.hist.pdf

Could you please give me some advice how I could solve the problem of my data? Thanks a lot in advance!

@EvanTarbell
Copy link
Collaborator

Hi,
A few things. 1. the fragment length distribution is off. You don't see any multi-modal behavior in your plot. Generally, this indicates that there is something wrong with the data. This is one of the ways people QC ATAC-seq data (including with ATACQC and from the original ATAC-seq authors). Because of this, I can't say for sure that any of my following suggestions will help nor can I be sure that any other tool wouldn't produce misleading results.
2. You are right that the first state does seem to be causing trouble. Not so much in that the values aren't zeroes, but that they aren't the smallest values. For example, values 3 and 4 for that state are higher than values 3 and 4 for state 2 and value 4 is almost as high as value 4 in state 3. One option might be to remove the fourth (and maybe 3rd) value from the estimation. You would use the --trim option to remove the fourth value (--trim 1) or both 3rd and 4th (--trim 2).
3. Another option is that the algorithm is having trouble finding training sites that don't show this behavior, in which case you could give it a list of training sites to use to train a more appropriate model. If you happen to have some known enhancers/promoters/cis-regulatory elements, you could input them with the -t option.

@zmz1988
Copy link
Author

zmz1988 commented May 16, 2022

Thanks a lot! here is the distribution figure. Before I was not sure why I can't see the pattern of multi-modal. Now with your confirmation, I'm kinda sure now. Thanks! I will try with the --trim and see what happen. Unfortunately we don't have other data for the known elements to feed -t option.
Can dedup clean FragSize

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants