About FIMO finding putative TFBS in the tutorial. #12

Open
alexyfyf opened this issue Aug 25, 2018 · 3 comments

@alexyfyf

I read the CENTIPEDE manuscript. I think the author scanned all putative TFBS across the genome.

However, in your tutorial, you suggested using FIMO to obtain TFBS only in peak regions.

Is it proper to do that? Do you have any comments on this?

Thank you.

@slowkow
Owner

slowkow commented Aug 25, 2018

Thanks for the question!

At the time of writing, I was mostly concerned with how to get the data into the right format so that we can run CENTIPEDE in the first place.

  1. It would be fantastic if you could provide a concrete example from the text that highlights the difference between the authors' manuscript and my tutorial. When I created this tutorial, it was not very clear to me exactly how the method was used. I tried my best, but it is possible that I made several mistakes.

  2. You might consider updating this tutorial to be more similar to the original manuscript, as you indicated. If you'd like to make a pull request, I'd be very happy to review it. Thanks for your consideration!

@alexyfyf
Author

Thanks for your reply.

CENTIPEDE applies a hierarchical Bayesian mixture model to infer regions of the genome that are bound by particular transcription factors. It starts by identifying a set of candidate binding sites (e.g., sites that match a certain position weight matrix (PWM)), and then aims to classify the sites according to whether each site is bound or not bound by a TF. CENTIPEDE is an unsupervised learning algorithm that discriminates between two different types of motif instances using as much relevant information as possible. In brief, the procedure is as follows:

1. Scan the genome for all approximate matches to a target PWM of interest. Each site that matches the PWM is considered a candidate binding site (Section 2.1).

We scanned the human genome sequence (hg18) for matches to each PWM using our implementation of the following commonly used formula [2]:

This is from the supplementary material of the CENTIPEDE paper. So I assume the authors scanned the whole genome rather than only peak regions (hotspots).

But I'm not sure how much difference it will make. I haven't tested on any data yet.
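For readers following along, here is a minimal sketch of what "scan for approximate matches to a PWM" can look like. It assumes a standard log-odds score against a uniform background, which may not be the exact formula used in the supplement (the formula is not reproduced in the excerpt above). The PWM, the background, the threshold, and the function names are all made up for illustration, and only the forward strand is scanned.

```python
import math

# Hypothetical 4-column PWM (base probabilities per position); not from the paper.
PWM = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
]

# Assumed uniform background base frequencies.
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}


def log_odds_score(site, pwm, background):
    """Sum of log2(PWM probability / background probability) over the site."""
    return sum(math.log2(col[base] / background[base]) for col, base in zip(pwm, site))


def scan(sequence, pwm, background, threshold):
    """Return (start, site, score) for every forward-strand window scoring >= threshold."""
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        site = sequence[start:start + width]
        # Skip windows containing ambiguous bases such as N.
        if any(base not in background for base in site):
            continue
        score = log_odds_score(site, pwm, background)
        if score >= threshold:
            hits.append((start, site, score))
    return hits


hits = scan("TTACGTTT", PWM, BACKGROUND, threshold=0.0)
for start, site, score in hits:
    print(start, site, round(score, 2))
```

Running the same scan on the whole genome versus only on peak sequences changes nothing about the scoring; it only changes which windows are offered as candidates.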

@slowkow
Owner

slowkow commented Aug 27, 2018

After reading the supplement again, I think you're right. Thanks for pointing out the difference.

It seems the authors consider all sites that have an approximate PWM match, regardless of other evidence such as ChIP-seq data.

In contrast, my tutorial only considers sites that have strong evidence of a DNase-seq peak.

Looking back on this, I probably found it a bit odd to consider that a site can be classified as "bound by a TF" even though it does not have any DNase-seq data. That might be the reason that I decided to run the analysis only on genomic loci with DNase-seq peaks.
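The difference between the two setups can be illustrated with a small sketch: start from genome-wide candidate sites and keep only those overlapping a peak. This is not code from the tutorial or the paper. The coordinates and the function name are hypothetical, and intervals are assumed to be 0-based and half-open, as in BED files.

```python
def sites_overlapping_peaks(sites, peaks):
    """Return the motif sites that overlap at least one peak.

    Both inputs are lists of (chrom, start, end) tuples, 0-based half-open.
    """
    # Group peaks by chromosome so each site is only compared to relevant peaks.
    peaks_by_chrom = {}
    for chrom, start, end in peaks:
        peaks_by_chrom.setdefault(chrom, []).append((start, end))

    kept = []
    for chrom, start, end in sites:
        chrom_peaks = peaks_by_chrom.get(chrom, [])
        # Two half-open intervals overlap when each starts before the other ends.
        if any(start < p_end and p_start < end for p_start, p_end in chrom_peaks):
            kept.append((chrom, start, end))
    return kept


# Hypothetical coordinates for illustration only.
genome_wide_sites = [("chr1", 100, 110), ("chr1", 500, 510), ("chr2", 50, 60)]
dnase_peaks = [("chr1", 90, 200), ("chr2", 300, 400)]

peak_sites = sites_overlapping_peaks(genome_wide_sites, dnase_peaks)
print(len(genome_wide_sites), "candidate sites genome-wide")
print(len(peak_sites), "candidate sites inside peaks")
```

In the genome-wide setup all three sites would be passed to the model; in the peak-restricted setup only the first one would.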

I think it might be interesting to see how you decide to set up your own analysis. Please feel free to share your findings!
