background noise too high for my data? #98

zmz1988 · 2022-05-16T11:55:13Z

Hello, thanks for the nice and user-friendly tool! I have atac-seq data (150 bp pair-end) from small plant genome, and have been trying HMMRATAC for a while. So the problem for me is that I can't get the model right, as I had tried many combination of -l, -u and even -z. I showed one of the example as below.

Version:        1.2.10
Arguments Used:
-b      Sample.atac.forHMMRATAC.bam
-i     Sample.atac.forHMMRATAC.bam.bai
-g      Sample.genome
-o      Sample_60_25
--bedgraph      True
-u      60
-l      25
Fragment Expectation Maximum Done
Mean    50.0    StdDevs 20.0
Mean    190.63425024911237      StdDevs 62.502097145190604
Mean    400.60895549134216      StdDevs 50.93694387902569
Mean    729.022089951699        StdDevs 145.67453137113827
ScalingFactor   103.086173
Training Regions found and Zscore regions for exclusion found
Training Fragment Pileup completed
Kmeans Model:
HMM with 3 state(s)

State 0
  Pi: 0.3333333333333333
  Aij: 0.333 0.333 0.333
  Opdf: Multi-variate Gaussian distribution --- Mean: [ 0.531 1.169 1.604 4.339 ]

State 1
  Pi: 0.3333333333333333
  Aij: 0.333 0.333 0.333
  Opdf: Multi-variate Gaussian distribution --- Mean: [ 1.075 1.678 1.373 0.931 ]

State 2
  Pi: 0.3333333333333333
  Aij: 0.333 0.333 0.333
  Opdf: Multi-variate Gaussian distribution --- Mean: [ 2.349 4.864 4.519 4.93 ]

Model created and refined. See Can_60_25.model
Model:
HMM with 3 state(s)

State 0
  Pi: 0.3333333333333333
  Aij: 0.979 0.015 0.006
  Opdf: Multi-variate Gaussian distribution --- Mean: [ 0.501 0.855 0.831 1.056 ]

State 1
  Pi: 0.3333333333333333
  Aij: 0.012 0.985 0.003
  Opdf: Multi-variate Gaussian distribution --- Mean: [ 1.495 2.338 1.724 0.847 ]

State 2
  Pi: 0.3333333333333333
  Aij: 0.015 0.009 0.976
  Opdf: Multi-variate Gaussian distribution --- Mean: [ 1.065 1.737 2.032 2.792 ]

Genome split and subtracted masked regions
0 round viterbi done
37 round viterbi done
Total time (seconds)=   4519

I had tried at least ten different combinations of -l and -u, and none of them can get the ideal model you described in your paper. After checking some of the log files showed in the issues channel, I realised that most of people have mean values of 0 for state 0 (which is the background or starting model if I understand correctly?). So does this mean that the background noise of my data is pretty high?

I also attach a insert size picture here (our data were generated though the purification of nuclei, and the insert size calculation is for the clean bam after multiple filtering)
Can.dedup.clean.bam.insertsize.hist.pdf

Could you please give me some advice how I could solve the problem of my data? Thanks a lot in advance!

The text was updated successfully, but these errors were encountered:

EvanTarbell · 2022-05-16T14:04:16Z

Hi,
A few things. 1. the fragment length distribution is off. You don't see any multi-modal behavior in your plot. Generally, this indicates that there is something wrong with the data. This is one of the ways people QC ATAC-seq data (including with ATACQC and from the original ATAC-seq authors). Because of this, I can't say for sure that any of my following suggestions will help nor can I be sure that any other tool wouldn't produce misleading results.
2. You are right that the first state does seem to be causing trouble. Not so much in that the values aren't zeroes, but that they aren't the smallest values. For example, values 3 and 4 for that state are higher than values 3 and 4 for state 2 and value 4 is almost as high as value 4 in state 3. One option might be to remove the fourth (and maybe 3rd) value from the estimation. You would use the --trim option to remove the fourth value (--trim 1) or both 3rd and 4th (--trim 2).
3. Another option is that the algorithm is having trouble finding training sites that don't show this behavior, in which case you could give it a list of training sites to use to train a more appropriate model. If you happen to have some known enhancers/promoters/cis-regulatory elements, you could input them with the -t option.

zmz1988 · 2022-05-16T15:26:19Z

Thanks a lot! here is the distribution figure. Before I was not sure why I can't see the pattern of multi-modal. Now with your confirmation, I'm kinda sure now. Thanks! I will try with the --trim and see what happen. Unfortunately we don't have other data for the known elements to feed -t option.

zmz1988 added the General Question label May 16, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

background noise too high for my data? #98

background noise too high for my data? #98

zmz1988 commented May 16, 2022 •

edited

EvanTarbell commented May 16, 2022

zmz1988 commented May 16, 2022

background noise too high for my data? #98

background noise too high for my data? #98

Comments

zmz1988 commented May 16, 2022 • edited

EvanTarbell commented May 16, 2022

zmz1988 commented May 16, 2022

zmz1988 commented May 16, 2022 •

edited