
Releases: kundajelab/tfmodisco

Small fixes

16 Dec 03:46
Pre-release

Corresponds to PR #76

Fixes an error when slicing coordinates for the reverse complement in cases where the coordinates extend past the edges of the sequence

Also adds a fix for backwards compatibility with numpy versions in which np.pad requires the mode argument to be provided explicitly
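
As a minimal illustration of the numpy compatibility point above (the actual call site in the codebase may differ), passing mode explicitly keeps np.pad working on older numpy versions, where mode is required, as well as newer ones, where it defaults to "constant":

import numpy as np

onehot = np.zeros((8, 4))  # toy one-hot sequence track (length 8, ACGT columns)
# Older numpy versions treat `mode` as a required argument of np.pad, while
# newer versions default to mode="constant"; passing it explicitly works on both.
padded = np.pad(onehot, pad_width=((2, 2), (0, 0)), mode="constant")
print(padded.shape)  # (12, 4)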

New cluster merging strategy, less aggressive seqlet pruning

13 Nov 00:13

Corresponds to PR #73

Changes:

  • Seqlet pruning updated (function trim_to_positions_with_min_support in modisco.core). Previously, the limits of min_support were determined by making a histogram of the locations to which seqlet centers align, and then trimming away positions that didn't have some minimum support. But a location may be supported by the flanks of seqlets even if it is not supported by seqlet centers. Updating this to look at the support from any seqlet overlap greatly reduces the number of seqlets that get unnecessarily trimmed away (see the sketch after this list).
  • The previous merging strategy had two components: it looked at the similarity of motifs as measured by cross-correlation of their contribution score tracks, as well as the density of the clusters (clusters that are less tightly packed should be merged more readily). The density was measured using a t-SNE-like strategy, which was a bit ad hoc and produced values that were hard to interpret intuitively. The cross-correlation-like similarity is retained, but the 'density' notion is now quantified by looking at the distribution of within-cluster and between-cluster pairwise seqlet similarities.
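
A minimal sketch of the overlap-based support idea in the first bullet (hypothetical helper, not the actual trim_to_positions_with_min_support implementation): count how many seqlets overlap each position at all, rather than how many seqlet centers land there, and keep only the span of positions that clears the support threshold.

import numpy as np

def trim_by_overlap_support(seqlet_starts, seqlet_length, track_length, min_support):
    # Count how many seqlets overlap each position (not just seqlet centers).
    support = np.zeros(track_length, dtype=int)
    for start in seqlet_starts:
        support[max(start, 0):min(start + seqlet_length, track_length)] += 1
    # Keep the span of positions with at least min_support overlapping seqlets.
    passing = np.nonzero(support >= min_support)[0]
    if len(passing) == 0:
        return None
    return passing[0], passing[-1] + 1  # retained [start, end) along the motif

# Toy example: 3 seqlets of length 10 aligned within a 20-position window
print(trim_by_overlap_support([0, 2, 8], 10, 20, min_support=2))  # (2, 12)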

Other small changes:

  • Previously, the aforementioned cross-correlation metric in the pattern merging function was implemented by calling scipy.signal.correlate2d, which doesn't do a normalization (thus, correlation values weren't limited to the range -1 to 1). This was ok because each track was normalized prior to calling scipy.signal.correlate2d, but as a result the values were scaled according to the number of tracks (e.g. if there were two tasks, each task would generate a contribution score track, and the correlation values had to be divided by 2 to put them in the -1 to 1 range). Previously, this scaling was all adjusted for under the hood; now, scipy.signal.correlate2d is no longer used, so none of that adjustment is needed (see the sketch after this list).
  • plot_weights_given_ax now has default values specified for many of the arguments, so it is easier to call
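
A minimal sketch of a normalized cross-correlation along the lines described in the first bullet (illustrative only; the actual merging code has more machinery): normalize the overlapping windows to unit norm at each offset so the similarity lands in [-1, 1] regardless of how many tracks are stacked.

import numpy as np

def max_normalized_xcorr(track1, track2, min_overlap=4):
    # track1, track2: (length, 4) contribution-score tracks.
    # Slide track2 along track1 and return the best similarity across offsets,
    # normalizing each overlapping window so the value falls in [-1, 1].
    best = -1.0
    l1, l2 = len(track1), len(track2)
    for offset in range(-(l2 - min_overlap), l1 - min_overlap + 1):
        s1, e1 = max(0, offset), min(l1, offset + l2)
        w1 = track1[s1:e1].ravel()
        w2 = track2[s1 - offset:e1 - offset].ravel()
        denom = np.linalg.norm(w1) * np.linalg.norm(w2)
        if denom > 0:
            best = max(best, float(np.dot(w1, w2) / denom))
    return best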

Bugfix, TF2 compatibility, access to motifs pre final reassignment

08 Oct 04:33
c25d398

Corresponds to PR #70

Description of changes:

  • When I did refactoring to include support for MEME initialization, I had a stray line that effectively caused the "sign consistency check" to be bypassed. This check discards motifs for which the sign of the overall contribution scores disagrees with what is expected for the metacluster (such motifs can arise because seqlets get recentered during the various intermediate processing steps), so bypassing it meant a few extra motifs that seemed to have the wrong sign could have been returned. Related to the error encountered in #66. A sketch of the check is after this list.
  • Made some minor fixes for tensorflow 2 support
  • The final step of tf-modisco is a "reassignment" step where motifs that have a small number of seqlets are disbanded, and an attempt is made to "reassign" their seqlets to the other motifs. Users who wish to can now access the tfmodisco motifs prior to this final reassignment step.
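
A minimal sketch of what a sign consistency check like the one described in the first bullet might look like (hypothetical helper; the actual check lives in the workflow code):

import numpy as np

def passes_sign_consistency(pattern_contrib_scores, expected_sign):
    # pattern_contrib_scores: (length, 4) aggregated contribution scores for a motif.
    # expected_sign: +1 or -1, the sign expected for this metacluster.
    # A motif is discarded if the overall sign of its contributions disagrees
    # with what the metacluster expects.
    return np.sign(np.sum(pattern_contrib_scores)) == expected_sign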

Agkm implementation, ic-based motif centering

26 Aug 20:18

Corresponds to PR #63. Should fix some issues where modisco seemed to produce very low-IC motifs. The problem arose during motif post-processing, when the motif was recentered around the region of highest average importance; this would sometimes go awry because the high average importance might have been driven by only a few seqlets. Now, the motif centering is done based on information content (see the sketch below).
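
A minimal sketch of information-content-based centering (illustrative; not the exact post-processing code): compute the per-position IC of the motif's probability matrix and center on the window with the highest total IC.

import numpy as np

def ic_center(ppm, window, background=0.25, pseudocount=1e-3):
    # ppm: (length, 4) position probability matrix for the motif.
    # Per-position information content relative to a uniform background.
    p = (ppm + pseudocount) / (1.0 + 4 * pseudocount)
    ic = np.sum(p * np.log2(p / background), axis=1)
    # Pick the start of the window with the largest summed IC.
    window_ic = np.convolve(ic, np.ones(window), mode="valid")
    start = int(np.argmax(window_ic))
    return start, start + window  # [start, end) of the recentered motif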

There's also support for computing advanced gapped k-mer embeddings (which work better than the regular gapped k-mer embeddings and also use less memory), but the implementation is still in pure Python and I am looking at ways to speed it up.

Interactive plots for visualizing heterogeneity within a motif

09 Jul 08:45
cd53d27

Corresponds to Pull Request #62. Seqlets comprising a motif are visualized in a t-SNE plot, and the user can select a subset of the seqlets (by dragging a rectangle around them on the plot) to aggregate and visualize on the fly. Good for dissecting heterogeneity within a motif.

Screenshot: visualizing a subset of seqlets within the TAL motif from the TAL-GATA toy dataset.

Can have seqlet embeddings based on filter activations

14 May 22:53
3697287

Corresponds to PR #61. Instead of deriving the embedding for coarse-grained similarity from gapped k-mers, the embedding can now be derived from a neural network model (e.g. by averaging the conv filter activations). Example notebook in https://github.com/kundajelab/tfmodisco/blob/36972870853e6631b2d32f1e489676a8241b385c/examples/simulated_TAL_GATA_deeplearning/TF_MoDISco_TAL_GATA_With_Filter_Embeddings.ipynb.
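
A rough sketch of the idea, assuming a Keras-style model (the linked notebook shows the actual usage; the function and layer names here are illustrative): run the one-hot seqlet sequences through a chosen conv layer and average the filter activations across positions to get one fixed-length embedding per seqlet.

import numpy as np
import tensorflow as tf

def filter_activation_embeddings(model, onehot_seqlets, conv_layer_name):
    # onehot_seqlets: (num_seqlets, length, 4) one-hot encoded seqlets.
    # Build a sub-model that outputs the activations of the chosen conv layer.
    conv_out = model.get_layer(conv_layer_name).output
    embedder = tf.keras.Model(inputs=model.inputs, outputs=conv_out)
    activations = embedder.predict(onehot_seqlets)  # (num_seqlets, positions, filters)
    # Average over positions -> one embedding vector per seqlet.
    return np.mean(activations, axis=1)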

Minor fixes, travis tests running successfully

28 Apr 02:37
e6fbbbb

Changes:

  • Fixed .travis.yml such that the continuous integration works (including tests that involve invoking MEME)
  • Cleaned up obsolete tests
  • Added fix for case where reverse-complement tracks aren't present: fc8370e
  • Added python 2 fixes: c285e7f and a267cbe

Updated MEME arguments, leiden init, dependency list

27 Apr 14:07
  • Incorporates changes from PR #60, which added the -revcomp flag to MEME if "revcomp=True" was specified in TfModiscoWorkflow (it is true by default), and also switched -mod to zoops (zoops stands for "zero or one occurrences per sequence"; this matches the default for the MEME web server and also seems more appropriate for seqlets than the anr mode, which stands for "any number of repetitions"). See the sketch after this list.
  • Updated the Leiden clustering to take the best result over both the singleton initialization (i.e. what is done without preclustering using MEME) and the MEME-based initialization.
  • Updated dependency list in setup.py to be more complete
  • Updated the test suite. Attempted to add a travis build but it looks like installing MEME via travis is nontrivial.
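
A rough sketch of how the MEME invocation described in the first bullet might be assembled (illustrative only; the actual call in the codebase may differ, and the flag values here are just examples):

def build_meme_command(fasta_path, outdir, nmotifs, n_jobs, revcomp=True):
    # Mirrors the arguments described above: -mod zoops (zero or one
    # occurrence per sequence), and -revcomp when revcomp=True.
    cmd = ["meme", fasta_path, "-dna", "-oc", outdir,
           "-mod", "zoops", "-nmotifs", str(nmotifs), "-p", str(n_jobs)]
    if revcomp:
        cmd.append("-revcomp")
    return cmd

print(" ".join(build_meme_command("seqlets.fa", "meme_out", nmotifs=10, n_jobs=4)))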

Support for MEME-based initialization, Leiden community detection

22 Apr 04:44
1bfc63a

Corresponds to PR #57, notes duplicated below:

An initial clustering can be specified using the initclusterer_factory argument of TfModiscoSeqletsToPatternsFactory. See this notebook for an example. Here's an example for MEME-based initialization (which is what's supported at the time of writing):

initclusterer_factory = modisco.clusterinit.memeinit.MemeInitClustererFactory(
    meme_command="meme",
    base_outdir="meme_out",
    max_num_seqlets_to_use=10000,
    nmotifs=10,
    n_jobs=4)

Explanation of the arguments:

  • meme_command: this is just meme if the meme executable is in the PATH; if it's not in the path, then meme_command should specify the full path to the executable, e.g. /software/meme/5.0.1/bin/meme on the kundajelab servers.
  • base_outdir: output directory for writing the meme results (will be relative to the current working directory unless an absolute path is provided). Within this directory, subdirectories will be created for each metacluster.
  • max_num_seqlets_to_use: to prevent MEME from taking too long, the number of seqlets to use for running MEME will be capped to this.
  • nmotifs: the number of motifs for MEME to find. Only significant motifs (E-value < 0.05) will be used for the clustering.
  • n_jobs: specifies the value of the -p argument of MEME, and also specifies the number of parallel jobs to launch when doing motif scanning with the MEME PWMs.

The cluster initialization with MEME is achieved as follows: the PWMs produced by MEME are used to scan all the seqlets, and only PWM matches that exceed the Bayes optimal threshold specified by MEME are considered. Seqlets that contain no PWM matches are assigned to their own cluster. The remaining seqlets are each assigned to a cluster corresponding to the PWM for which they had the strongest match by log-odds score.
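
A minimal sketch of the assignment logic described above (hypothetical helper names; the real implementation uses the thresholds reported by MEME rather than recomputing them): score each seqlet against each PWM by log-odds, keep only matches above the PWM's threshold, and assign each seqlet to its best-matching PWM or to its own singleton cluster.

import numpy as np

def assign_initial_clusters(onehot_seqlets, log_odds_pwms, thresholds):
    # onehot_seqlets: (num_seqlets, length, 4). log_odds_pwms: list of (width, 4)
    # log-odds matrices. thresholds: per-PWM score cutoffs (e.g. the Bayes
    # optimal thresholds reported by MEME). Returns one cluster index per seqlet.
    assignments = []
    next_singleton = len(log_odds_pwms)
    for seqlet in onehot_seqlets:
        best_pwm, best_score = None, -np.inf
        for pwm_idx, pwm in enumerate(log_odds_pwms):
            width = len(pwm)
            # Best log-odds match of this PWM anywhere along the seqlet.
            scores = [np.sum(seqlet[i:i + width] * pwm)
                      for i in range(len(seqlet) - width + 1)]
            score = max(scores) if scores else -np.inf
            if score >= thresholds[pwm_idx] and score > best_score:
                best_pwm, best_score = pwm_idx, score
        if best_pwm is None:
            # No PWM match above threshold -> the seqlet gets its own cluster.
            assignments.append(next_singleton)
            next_singleton += 1
        else:
            assignments.append(best_pwm)
    return assignments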

The cluster initialization affects the TF-MoDISco workflow in two places:

  • First, the fine-grained similarity is computed not just on the set of nearest-neighbors that have the highest coarse-grained similarity across all seqlets, but also on the set of nearest-neighbors that have the highest coarse-grained similarity within each initialized cluster.
  • Second, it is used to initialize Leiden community detection.

Empirically, this seems to result in TF-MoDISco clusters that get the "best of both worlds" from MEME and TF-MoDISco.

Other changes:

  • Moved from Louvain to Leiden for the main community detection step. Note that I am no longer doing consensus clustering with Leiden because it didn't appear to work well (consistent with this discussion on twitter); instead, I am just taking the best modularity over 50 runs of Leiden with different random seeds (see the sketch after this list). To go back to using Louvain for the main community detection step, set the use_louvain argument to True in TfModiscoSeqletsToPatternsFactory - but note that the cluster initialization functionality isn't supported with Louvain.*
  • Updated the Nanog notebook to showcase the MEME initialization functionality
  • Updated the Nanog notebook to use better normalization (I'm now just doing mean normalization across ACGT at each position, which I think is more intuitive and has a similar effect as the normalization I described in the GkmExplain paper). Also updated the notebook to apply normalization to the importance scores of the dinuc-shuffled null (previously, the scores for the null distribution weren't normalized)
  • Added tests for the MEME-based initialization
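
A minimal sketch of the "best modularity over multiple Leiden runs" idea from the first bullet, using the leidenalg package (illustrative; the actual workflow builds the graph from seqlet affinities and wires this into the larger pipeline):

import igraph as ig
import leidenalg

def best_leiden_partition(graph, n_runs=50):
    # Run Leiden with different random seeds and keep the partition whose
    # quality (modularity) is highest.
    best_partition, best_quality = None, float("-inf")
    for seed in range(n_runs):
        partition = leidenalg.find_partition(
            graph, leidenalg.ModularityVertexPartition, seed=seed)
        quality = partition.quality()
        if quality > best_quality:
            best_partition, best_quality = partition, quality
    return best_partition

# Toy example on a small random graph
membership = best_leiden_partition(ig.Graph.Erdos_Renyi(n=50, p=0.1)).membership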

*The reason I don't support cluster initialization with Louvain is that, with Louvain, the number of clusters can only decrease from one iteration to the next, whereas with Leiden the number of clusters can go up because there's a cluster-splitting step. In other words, if initialization were used with Louvain, the number of discovered clusters would be capped at the number of clusters present during initialization, which is undesirable. By the way, Louvain is still used in the "spurious merging detection" step of the post-processing; the reason is that in this step I attempt to split each cluster into two subclusters, and when using Louvain this cap on the number of subclusters can be achieved by initializing Louvain to have only 2 clusters (since the number of clusters in Louvain only decreases with each iteration).

Dependency fix for leidenalg and tqdm

22 Apr 11:52
Pre-release

Corresponds to PR #59. Updated setup.py to include leidenalg and tqdm.