
Releases: kundajelab/tfmodisco

Small fixes

16 Dec 03:46
Pre-release

Corresponds to PR #76

Fixes an error when slicing coordinates for the reverse complement in cases where the coordinates extend past the edges of the sequence

Also adds a fix for backwards compatibility with numpy versions in which np.pad requires the mode argument to be provided explicitly
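
As a minimal illustration of the numpy compatibility point above (the actual call site in the codebase may differ), passing mode explicitly keeps np.pad working on older numpy versions, where mode is required, as well as newer ones, where it defaults to "constant":

import numpy as np

onehot = np.zeros((8, 4))  # toy one-hot sequence track (length 8, ACGT columns)
# Older numpy versions treat `mode` as a required argument of np.pad, while
# newer versions default to mode="constant"; passing it explicitly works on both.
padded = np.pad(onehot, pad_width=((2, 2), (0, 0)), mode="constant")
print(padded.shape)  # (12, 4)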

New cluster merging strategy, less aggressive seqlet pruning

13 Nov 00:13

Corresponds to PR #73

Changes:

  • Seqlet pruning updated (function trim_to_positions_with_min_support in modisco.core). Previously, the limits of min_support were determined by making a histogram of the locations to which seqlet centers align, and then trimming away positions that didn't have some minimum support. But a location may be supported by the flanks of seqlets even if it is not supported by seqlet centers. Updating this to look at the support from any seqlet overlap greatly reduces the number of seqlets that get unnecessarily trimmed away (see the sketch after this list).
  • The previous merging strategy had two components: it looked at the similarity of motifs as measured by cross-correlation of their contribution score tracks, as well as the density of the clusters (clusters that are less tightly packed should be merged more readily). The density was measured using a t-SNE-like strategy, which was a bit ad hoc and produced values that were hard to interpret intuitively. The cross-correlation-like similarity is retained, but the 'density' notion is now quantified by looking at the distribution of within-cluster and between-cluster pairwise seqlet similarities.
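
A minimal sketch of the overlap-based support idea in the first bullet (hypothetical helper, not the actual trim_to_positions_with_min_support implementation): count how many seqlets overlap each position at all, rather than how many seqlet centers land there, and keep only the span of positions that clears the support threshold.

import numpy as np

def trim_by_overlap_support(seqlet_starts, seqlet_length, track_length, min_support):
    # Count how many seqlets overlap each position (not just seqlet centers).
    support = np.zeros(track_length, dtype=int)
    for start in seqlet_starts:
        support[max(start, 0):min(start + seqlet_length, track_length)] += 1
    # Keep the span of positions with at least min_support overlapping seqlets.
    passing = np.nonzero(support >= min_support)[0]
    if len(passing) == 0:
        return None
    return passing[0], passing[-1] + 1  # retained [start, end) along the motif

# Toy example: 3 seqlets of length 10 aligned within a 20-position window
print(trim_by_overlap_support([0, 2, 8], 10, 20, min_support=2))  # (2, 12)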

Other small changes:

  • Previously, the aforementioned cross-correlation metric in the pattern merging function was implemented by calling scipy.signal.correlate2d, which doesn't do a normalization (thus, correlation values weren't limited to the range -1 to 1). This was ok because each track was normalized prior to calling scipy.signal.correlate2d, but as a result the values were scaled according to the number of tracks (e.g. if there were two tasks, each task would generate a contribution score track, and the correlation values had to be divided by 2 to put them in the -1 to 1 range). Previously, this scaling was all adjusted for under the hood; now, scipy.signal.correlate2d is no longer used, so none of that adjustment is needed (see the sketch after this list).
  • plot_weights_given_ax now has default values specified for many of the arguments, so it is easier to call
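
A minimal sketch of a normalized cross-correlation along the lines described in the first bullet (illustrative only; the actual merging code has more machinery): normalize the overlapping windows to unit norm at each offset so the similarity lands in [-1, 1] regardless of how many tracks are stacked.

import numpy as np

def max_normalized_xcorr(track1, track2, min_overlap=4):
    # track1, track2: (length, 4) contribution-score tracks.
    # Slide track2 along track1 and return the best similarity across offsets,
    # normalizing each overlapping window so the value falls in [-1, 1].
    best = -1.0
    l1, l2 = len(track1), len(track2)
    for offset in range(-(l2 - min_overlap), l1 - min_overlap + 1):
        s1, e1 = max(0, offset), min(l1, offset + l2)
        w1 = track1[s1:e1].ravel()
        w2 = track2[s1 - offset:e1 - offset].ravel()
        denom = np.linalg.norm(w1) * np.linalg.norm(w2)
        if denom > 0:
            best = max(best, float(np.dot(w1, w2) / denom))
    return best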

Bugfix, TF2 compatibility, access to motifs pre final reassignment

08 Oct 04:33
c25d398

Corresponds to PR #70

Description of changes:

  • When I did refactoring to include support for MEME initialization, I had a stray line that effectively caused the "sign consistency check" to be bypassed. This check discards motifs for which the sign of the overall contribution scores disagrees with what is expected for the metacluster (such motifs can arise because seqlets get recentered during the various intermediate processing steps), so bypassing it meant a few extra motifs that seemed to have the wrong sign could have been returned. Related to the error encountered in #66. A sketch of the check is after this list.
  • Made some minor fixes for tensorflow 2 support
  • The final step of tf-modisco is a "reassignment" step where motifs that have a small number of seqlets are disbanded, and an attempt is made to "reassign" their seqlets to the other motifs. Users who wish to can now access the tfmodisco motifs prior to this final reassignment step.
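
A minimal sketch of what a sign consistency check like the one described in the first bullet might look like (hypothetical helper; the actual check lives in the workflow code):

import numpy as np

def passes_sign_consistency(pattern_contrib_scores, expected_sign):
    # pattern_contrib_scores: (length, 4) aggregated contribution scores for a motif.
    # expected_sign: +1 or -1, the sign expected for this metacluster.
    # A motif is discarded if the overall sign of its contributions disagrees
    # with what the metacluster expects.
    return np.sign(np.sum(pattern_contrib_scores)) == expected_sign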

Agkm implementation, ic-based motif centering

26 Aug 20:18

Corresponds to PR #63. Should fix some issues where modisco seemed to produce very low-IC motifs. The problem arose during motif post-processing, when the motif was recentered around the region of highest average importance; this would sometimes go awry because the high average importance might have been driven by only a few seqlets. Now, the motif centering is done based on information content (see the sketch below).
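
A minimal sketch of information-content-based centering (illustrative; not the exact post-processing code): compute the per-position IC of the motif's probability matrix and center on the window with the highest total IC.

import numpy as np

def ic_center(ppm, window, background=0.25, pseudocount=1e-3):
    # ppm: (length, 4) position probability matrix for the motif.
    # Per-position information content relative to a uniform background.
    p = (ppm + pseudocount) / (1.0 + 4 * pseudocount)
    ic = np.sum(p * np.log2(p / background), axis=1)
    # Pick the start of the window with the largest summed IC.
    window_ic = np.convolve(ic, np.ones(window), mode="valid")
    start = int(np.argmax(window_ic))
    return start, start + window  # [start, end) of the recentered motif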

There's also support for computing advanced gapped k-mer embeddings (which work better than the regular gapped k-mer embeddings and also use less memory), but the implementation is still in pure Python and I am looking at ways to speed it up.

Interactive plots for visualizing heterogeneity within a motif

09 Jul 08:45
cd53d27

Corresponds to Pull Request #62. Seqlets comprising a motif are visualized in a t-SNE plot, and the user can select a subset of the seqlets (by dragging a rectangle around them on the plot) to aggregate and visualize on the fly. Good for dissecting heterogeneity within a motif.

Screenshot: visualizing a subset of seqlets within the TAL motif from the TAL-GATA toy dataset.

Can have seqlet embeddings based on filter activations

14 May 22:53
3697287

Corresponds to PR #61. Instead of deriving the embedding for coarse-grained similarity from gapped k-mers, the embedding can now be derived from a neural network model (e.g. by averaging the conv filter activations). Example notebook in https://github.com/kundajelab/tfmodisco/blob/36972870853e6631b2d32f1e489676a8241b385c/examples/simulated_TAL_GATA_deeplearning/TF_MoDISco_TAL_GATA_With_Filter_Embeddings.ipynb.
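
A rough sketch of the idea, assuming a Keras-style model (the linked notebook shows the actual usage; the function and layer names here are illustrative): run the one-hot seqlet sequences through a chosen conv layer and average the filter activations across positions to get one fixed-length embedding per seqlet.

import numpy as np
import tensorflow as tf

def filter_activation_embeddings(model, onehot_seqlets, conv_layer_name):
    # onehot_seqlets: (num_seqlets, length, 4) one-hot encoded seqlets.
    # Build a sub-model that outputs the activations of the chosen conv layer.
    conv_out = model.get_layer(conv_layer_name).output
    embedder = tf.keras.Model(inputs=model.inputs, outputs=conv_out)
    activations = embedder.predict(onehot_seqlets)  # (num_seqlets, positions, filters)
    # Average over positions -> one embedding vector per seqlet.
    return np.mean(activations, axis=1)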

Minor fixes, travis tests running successfully

28 Apr 02:37
e6fbbbb

Changes:

  • Fixed .travis.yml such that the continuous integration works (including tests that involve invoking MEME)
  • Cleaned up obsolete tests
  • Added fix for case where reverse-complement tracks aren't present: fc8370e
  • Added python 2 fixes: c285e7f and a267cbe

Updated MEME arguments, leiden init, dependency list

27 Apr 14:07
  • Incorporates changes from PR #60, which added the -revcomp flag to MEME if "revcomp=True" was specified in TfModiscoWorkflow (it is true by default), and also switched -mod to zoops (zoops stands for "zero or one occurrences per sequence"; this matches the default for the MEME web server and also seems more appropriate for seqlets than the anr mode, which stands for "any number of repetitions"). See the sketch after this list.
  • Updated the Leiden clustering to take the best result over both the singleton initialization (i.e. what is done without preclustering using MEME) and the MEME-based initialization.
  • Updated dependency list in setup.py to be more complete
  • Updated the test suite. Attempted to add a travis build but it looks like installing MEME via travis is nontrivial.
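
A rough sketch of how the MEME invocation described in the first bullet might be assembled (illustrative only; the actual call in the codebase may differ, and the flag values here are just examples):

def build_meme_command(fasta_path, outdir, nmotifs, n_jobs, revcomp=True):
    # Mirrors the arguments described above: -mod zoops (zero or one
    # occurrence per sequence), and -revcomp when revcomp=True.
    cmd = ["meme", fasta_path, "-dna", "-oc", outdir,
           "-mod", "zoops", "-nmotifs", str(nmotifs), "-p", str(n_jobs)]
    if revcomp:
        cmd.append("-revcomp")
    return cmd

print(" ".join(build_meme_command("seqlets.fa", "meme_out", nmotifs=10, n_jobs=4)))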

Support for MEME-based initialization, Leiden community detection

22 Apr 04:44
1bfc63a

Corresponds to PR #57, notes duplicated below:

An initial clustering can be specified using the initclusterer_factory argument of TfModiscoSeqletsToPatternsFactory. See this notebook for an example. Here's an example for MEME-based initialization (which is what's supported at the time of writing):

initclusterer_factory = modisco.clusterinit.memeinit.MemeInitClustererFactory(
    meme_command="meme",
    base_outdir="meme_out",
    max_num_seqlets_to_use=10000,
    nmotifs=10,
    n_jobs=4)

Explanation of the arguments:

  • meme_command: this is just meme if the meme executable is in the PATH; if it's not in the path, then meme_command should specify the full path to the executable, e.g. /software/meme/5.0.1/bin/meme on the kundajelab servers.
  • base_outdir: output directory for writing the meme results (will be relative to the current working directory unless an absolute path is provided). Within this directory, subdirectories will be created for each metacluster.
  • max_num_seqlets_to_use: to prevent MEME from taking too long, the number of seqlets to use for running MEME will be capped to this.
  • nmotifs: the number of motifs for MEME to find. Only significant motifs (E-value < 0.05) will be used for the clustering.
  • n_jobs: specifies the value of the -p argument of MEME, and also specifies the number of parallel jobs to launch when doing motif scanning with the MEME PWMs.

The cluster initialization with MEME is achieved as follows: the PWMs produced by MEME are used to scan all the seqlets, and only PWM matches that exceed the Bayes optimal threshold specified by MEME are considered. Seqlets that contain no PWM matches are assigned to their own cluster. The remaining seqlets are each assigned to a cluster corresponding to the PWM for which they had the strongest match by log-odds score.
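
A minimal sketch of the assignment logic described above (hypothetical helper names; the real implementation uses the thresholds reported by MEME rather than recomputing them): score each seqlet against each PWM by log-odds, keep only matches above the PWM's threshold, and assign each seqlet to its best-matching PWM or to its own singleton cluster.

import numpy as np

def assign_initial_clusters(onehot_seqlets, log_odds_pwms, thresholds):
    # onehot_seqlets: (num_seqlets, length, 4). log_odds_pwms: list of (width, 4)
    # log-odds matrices. thresholds: per-PWM score cutoffs (e.g. the Bayes
    # optimal thresholds reported by MEME). Returns one cluster index per seqlet.
    assignments = []
    next_singleton = len(log_odds_pwms)
    for seqlet in onehot_seqlets:
        best_pwm, best_score = None, -np.inf
        for pwm_idx, pwm in enumerate(log_odds_pwms):
            width = len(pwm)
            # Best log-odds match of this PWM anywhere along the seqlet.
            scores = [np.sum(seqlet[i:i + width] * pwm)
                      for i in range(len(seqlet) - width + 1)]
            score = max(scores) if scores else -np.inf
            if score >= thresholds[pwm_idx] and score > best_score:
                best_pwm, best_score = pwm_idx, score
        if best_pwm is None:
            # No PWM match above threshold -> the seqlet gets its own cluster.
            assignments.append(next_singleton)
            next_singleton += 1
        else:
            assignments.append(best_pwm)
    return assignments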

The cluster initialization affects the TF-MoDISco workflow in two places:

  • First, the fine-grained similarity is computed not just on the set of nearest-neighbors that have the highest coarse-grained similarity across all seqlets, but also on the set of nearest-neighbors that have the highest coarse-grained similarity within each initialized cluster.
  • Second, it is used to initialize Leiden community detection.

Empirically, this seems to result in TF-MoDISco clusters that get the "best of both worlds" from MEME and TF-MoDISco.

Other changes:

  • Moved from Louvain to Leiden for the main community detection step. Note that I am no longer doing consensus clustering with Leiden because it didn't appear to work well (consistent with this discussion on twitter); instead, I am just taking the best modularity over 50 runs of Leiden with different random seeds (see the sketch after this list). To go back to using Louvain for the main community detection step, set the use_louvain argument to True in TfModiscoSeqletsToPatternsFactory - but note that the cluster initialization functionality isn't supported with Louvain.*
  • Updated the Nanog notebook to showcase the MEME initialization functionality
  • Updated the Nanog notebook to use better normalization (I'm now just doing mean normalization across ACGT at each position, which I think is more intuitive and has a similar effect as the normalization I described in the GkmExplain paper). Also updated the notebook to apply normalization to the importance scores of the dinuc-shuffled null (previously, the scores for the null distribution weren't normalized)
  • Added tests for the MEME-based initialization
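
A minimal sketch of the "best modularity over multiple Leiden runs" idea from the first bullet, using the leidenalg package (illustrative; the actual workflow builds the graph from seqlet affinities and wires this into the larger pipeline):

import igraph as ig
import leidenalg

def best_leiden_partition(graph, n_runs=50):
    # Run Leiden with different random seeds and keep the partition whose
    # quality (modularity) is highest.
    best_partition, best_quality = None, float("-inf")
    for seed in range(n_runs):
        partition = leidenalg.find_partition(
            graph, leidenalg.ModularityVertexPartition, seed=seed)
        quality = partition.quality()
        if quality > best_quality:
            best_partition, best_quality = partition, quality
    return best_partition

# Toy example on a small random graph
membership = best_leiden_partition(ig.Graph.Erdos_Renyi(n=50, p=0.1)).membership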

*The reason I don't support cluster initialization with Louvain is that, with Louvain, the number of clusters can only decrease from one iteration to the next, whereas with Leiden the number of clusters can go up because there's a cluster-splitting step. In other words, if initialization were used with Louvain, the number of discovered clusters would be capped at the number of clusters present during initialization, which is undesirable. By the way, Louvain is still used in the "spurious merging detection" step of the post-processing; the reason is that in this step I attempt to split each cluster into two subclusters, and when using Louvain this cap on the number of subclusters can be achieved by initializing Louvain to have only 2 clusters (since the number of clusters in Louvain only decreases with each iteration).

Dependency fix for leidenalg and tqdm

22 Apr 11:52
Pre-release

Corresponds to PR #59. Updated setup.py to include leidenalg and tqdm.