Releases · kundajelab/tfmodisco

08 Mar 06:52

v0.5.14.0-devfixed

6a45caa

(in dev) new spurious merging detection, on-the-fly flank filling, exploring different merging criteria

Pre-release

Corresponds to PR #88. The previous edition of this tag did not specify the correct target (i.e. the dev branch) so I'm re-doing it.

Three main changes:

"DynamicDistanceSimilarPatternsCollapser" had a component that compared within-cluster similarities to between-cluster similarities, using a metric derived from the auROC for differentiating within-cluster motifs from between-cluster motifs. Previously, for a pair of motifs, the average of this auROC-derived similarity was taken. However, that can cause motifs with low within-cluster similarity to "swallow" motifs with high within-cluster similarity (evidence in this notebook: https://nbviewer.jupyter.org/github/kundajelab/tfmodisco_bio_experiments/blob/1f633b03c22860e5822e9ca5b013e68d0286332c/bpnet/trial1/TryBpNet_v0.5.14.0_studymergecriteria.ipynb - look at the pairwise similarities in the "Cross-contamination matrix" and notice that the similarities of the first motif to everything else are very high even though the reciprocal similarities are very low). Thus, I decided to take the min rather than the average.
Previously, when seqlets were merged together in a greedy fashion, the on-the-fly aggregated motif only averaged the importance scores over the seqlets that aligned to a particular position, and only at the end would "expand" all the seqlets to fill out the full motif. The caveat here is that for motifs like Nanog, which can have two instances in the same seqlet (due to periodic binding), this can create two "modes" in the on-the-fly aggregated seqlet. Each mode looks very strong as it's being computed on-the-fly (because the aggregation is only done over the seqlets that align to the position), but when the flank expansion is conducted at the very end, both modes end up with degraded information content (because the seqlets that were aligning to one mode don't always turn out to have a motif instance at the other mode after expansion). The solution is to perform the flank expansion on-the-fly. Compare the IC of the motifs, especially the Nanog motif, in https://nbviewer.jupyter.org/github/kundajelab/tfmodisco_bio_experiments/blob/1f633b03c22860e5822e9ca5b013e68d0286332c/bpnet/trial1/TryBpNet_v0.5.14.0_ontheflyflankfill.ipynb, which has on-the-fly flank filling, vs. https://nbviewer.jupyter.org/github/kundajelab/tfmodisco_bio_experiments/blob/1f633b03c22860e5822e9ca5b013e68d0286332c/bpnet/trial1/TryBpNet_v0.5.14.0_noontheflyflankfill.ipynb, which didn't have the flank filling.
"Spurious merging detection" is now performed by running the subclustering procedure for each motif, immediately followed by the "DynamicDistanceSimilarPatternsCollapser" procedure on the submotifs. Thus, DynamicDistanceSimilarPatternsCollapser is run in two places: during the spurious merging detection and during the overall redundancy reduction step. This substantially increases the diversity of motifs returned (contrast the above results with the previous 0.5.13.0 results at: https://nbviewer.jupyter.org/github/kundajelab/tfmodisco_bio_experiments/blob/b3b4d7b240b8e398597100581ae791eec0a13b61/bpnet/trial1/TryBpNet_v0.5.13.0.ipynb).

However, I'm not fully satisfied with the motif merging criteria, which is why this will be an "in dev" release.
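
To make the first change above concrete, here is a toy sketch of why taking the min of the two directional similarities is more conservative than taking their average. The matrix values, the merge threshold, and the helper name `pair_scores` are all hypothetical; this is not the tfmodisco implementation, only an illustration of the asymmetry described for the "Cross-contamination matrix".

```python
# Toy illustration only: the matrix values and the 0.5 threshold are
# hypothetical, and this is not the tfmodisco implementation.

def pair_scores(sim, i, j):
    """Return (mean, min) of the two directional similarities for motifs i and j."""
    a, b = sim[i][j], sim[j][i]
    return (a + b) / 2.0, min(a, b)

# sim[i][j]: similarity of motif j as seen from motif i.
# Motif 0 has low within-cluster similarity, so everything looks similar to it,
# while the other motifs do not see motif 0 as similar to themselves.
sim = [
    [1.0, 0.9, 0.9],
    [0.2, 1.0, 0.3],
    [0.1, 0.3, 1.0],
]

threshold = 0.5  # hypothetical merge threshold
mean_01, min_01 = pair_scores(sim, 0, 1)
merge_by_mean = mean_01 >= threshold  # the high one-sided value dominates the average
merge_by_min = min_01 >= threshold    # the low reciprocal value blocks the merge
```

With the average, motif 0 would absorb motif 1 on the strength of a one-sided similarity; with the min, both directions have to agree before a merge is considered.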

27 Feb 11:35

AvantiShri

v0.5.13.2

98c1ebf

Fixed n_jobs bug that arose with old gapped kmer embedder

Pre-release

Corresponds to PR #87

I had changed the API a bit so that the user didn't have to specify the number of threads in two different places (once for the main tfmodisco workflow and once for the advanced gapped kmer embedders). However, the API change that I made caused an error with the old gapped kmer embedder.

I also updated the gkmexplain example notebooks to use the new "advanced" gapped kmer embedders. To recap: the old "gapped kmer" embedding method was looking at all gapped kmers of a fixed "word length", and became very memory inefficient when this word length was increased; the new gapped kmer method (which I refer to as "advanced gapped kmer embeddings") can look at much wider gapped kmer lengths while remaining memory efficient.
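
As a rough illustration of why a fixed-word-length approach gets expensive, the count below assumes a gkm-SVM-style definition (a window of `word_len` positions with `num_informative` fixed bases) and a dense embedding with one dimension per gapped k-mer. Both are assumptions made for this sketch; the old embedder's exact parameterization may differ.

```python
from math import comb

def num_gapped_kmers(word_len, num_informative, alphabet_size=4):
    """Count gapped k-mers with `num_informative` fixed bases in a window of `word_len`."""
    # Choose which positions are informative, then choose a base for each of them.
    return comb(word_len, num_informative) * alphabet_size ** num_informative

# Number of distinct gapped k-mers as the word length grows (6 informative bases).
for word_len in (6, 8, 11, 15):
    print(word_len, num_gapped_kmers(word_len, num_informative=6))
```

Under these assumptions the feature count grows combinatorially with the word length, which is the kind of blow-up the "advanced" embeddings are meant to avoid.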

25 Feb 17:44

AvantiShri

v0.5.13.1

6bce0a3

Fixed v0.5.13 package install, fixed num threads for agkm embedding

Pre-release

Corresponds to PR #86

Package installation was broken because run_leiden wasn't listed under scripts
Number of threads used for Agkm embedding had to be specified separately from the number of threads used elsewhere, leading to the default number of threads being different than the user-specified number. This has been fixed.

19 Feb 09:23

AvantiShri

v0.5.13.0

2ecd704

Parallelizing Leiden Runs

Pre-release

Corresponds to PR #85.

Leiden is run with multiple different random seeds (and the best partition is used) for robustness. Prior to this PR, those runs were not parallelized because trying to parallelize leidenalg.find_partition naively via joblib results in the error "TypeError: cannot pickle 'PyCapsule' object". In this PR, parallelism is achieved by making calls to a dedicated script that runs Leiden community detection (one that is called using subprocess.Popen).
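
The general pattern can be sketched as follows. This is a minimal stand-in, not the actual run_leiden script: the worker code, the JSON output format, and the "quality" score are hypothetical, and the worker fakes a score from the seed instead of calling leidenalg. The point is only that each run lives in its own process and hands back plain text, so nothing unpicklable has to cross a process boundary.

```python
import json
import subprocess
import sys

# Stand-in worker: a real worker script would run Leiden with the given seed
# and report the partition quality. Here it just derives a fake quality score.
WORKER_CODE = """
import json, random, sys
seed = int(sys.argv[1])
rng = random.Random(seed)
print(json.dumps({"seed": seed, "quality": rng.random()}))
"""

def run_seeds_in_parallel(seeds):
    """Launch one subprocess per seed, then collect each result from stdout."""
    procs = [
        subprocess.Popen(
            [sys.executable, "-c", WORKER_CODE, str(seed)],
            stdout=subprocess.PIPE,
            text=True,
        )
        for seed in seeds
    ]
    results = []
    for proc in procs:
        out, _ = proc.communicate()  # waits for this process to finish
        if proc.returncode != 0:
            raise RuntimeError("worker exited with code %d" % proc.returncode)
        results.append(json.loads(out))
    return results

results = run_seeds_in_parallel(seeds=[0, 1, 2, 3])
best = max(results, key=lambda r: r["quality"])  # keep the best-scoring run
```

All processes are started before any of them is waited on, so the runs overlap in time; the best partition is then picked from the collected results.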

Results on bpnet nanog task are here (gives the same results as before, but spends noticeably less time on the Leiden clustering steps): http://nbviewer.jupyter.org/github/kundajelab/tfmodisco_bio_experiments/blob/b3b4d7b240b8e398597100581ae791eec0a13b61/bpnet/trial1/TryBpNet_v0.5.13.0.ipynb
(Contrast with https://nbviewer.jupyter.org/github/kundajelab/tfmodisco_bio_experiments/blob/2ba855b85eddc4c4d7b5e3296c6e12cce04a705d/bpnet/trial1/TryBpNet_v0.5.11.0_reducemem.ipynb)

18 Feb 20:23

AvantiShri

v0.5.12.0

09e8fec

Motif subclustering to reveal within-motif heterogeneity

Pre-release

Corresponds to PR #84

Performs density-adapted clustering (Leiden) + computes a t-SNE embedding. By default, a perplexity of 30 is used for both. If TF-MoDISco is run with this version from the beginning, the subclustering will automatically be computed for each motif. Otherwise, motifs computed with a previous version can be loaded and the subclustering computed post-hoc.
Uses all pairwise continuous jaccard similarities for computing the similarity matrix, but is memory efficient because the number of nearest neighbors used to perform the density adaptation is far smaller (it's perplexity*3 + 1) than the total number of nodes in the graph (only the similarities for the necessary number of nearest neighbors are kept in memory).
Added support for saving and loading the subclustering from file
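
The memory-saving idea in the second item can be sketched like this. The similarity function is a simplified continuous Jaccard for non-negative vectors, and `top_neighbors` is a hypothetical helper written for this illustration; neither is the tfmodisco code, and whether the "+ 1" accounts for the node itself is an assumption of this sketch.

```python
import heapq

def continuous_jaccard(a, b):
    """Simplified continuous Jaccard for non-negative vectors: sum(min) / sum(max)."""
    denom = sum(max(x, y) for x, y in zip(a, b))
    return sum(min(x, y) for x, y in zip(a, b)) / denom if denom > 0 else 0.0

def top_neighbors(vectors, perplexity):
    """For each vector, keep only its (perplexity*3 + 1) most similar vectors."""
    k = perplexity * 3 + 1
    neighbors = []
    for vec in vectors:
        # Similarities for this row are generated lazily and only the k largest
        # are retained, so memory is O(n * k) rather than O(n^2).
        sims = ((continuous_jaccard(vec, other), j) for j, other in enumerate(vectors))
        neighbors.append(heapq.nlargest(k, sims))
    return neighbors

vectors = [
    [1.0, 0.0, 2.0],
    [1.0, 0.5, 2.0],
    [0.0, 3.0, 0.0],
    [0.9, 0.0, 2.1],
    [0.0, 2.5, 0.2],
]
# perplexity=1 keeps 4 entries per row (the node itself is included in this sketch).
nn = top_neighbors(vectors, perplexity=1)
```

Every pairwise similarity is still computed, but each row's full set is discarded as soon as its top entries have been selected.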

Notebook demonstrating the change on the example tal-gata notebook: https://github.com/kundajelab/tfmodisco/blob/c2c6001b8a2608ee5224ac7faeb69d5fd72f78f5/examples/simulated_TAL_GATA_deeplearning/TF_MoDISco_TAL_GATA.ipynb

Notebook demonstrating how to compute the subclustering post-hoc:
http://mitra.stanford.edu/kundaje/avanti/tfmodisco_bio_experiments/bpnet/trial1/TryBpNet_v0.5.12.0_add_in_subclustering.html
(github permalink: https://github.com/kundajelab/tfmodisco_bio_experiments/blob/54df6faa20773d91107e7b645649e2145e3fb0de/bpnet/trial1/TryBpNet_v0.5.12.0_add_in_subclustering.ipynb)

11 Feb 21:52

AvantiShri

v0.5.11.0

a0ec95b

Memory reduction with sparse matrices

Pre-release

Corresponds to PR #83. Results should remain identical compared to v0.5.10.2.

03 Feb 22:34

AvantiShri

v0.5.10.2

3e5f5e8

Reverting to previous defaults for seqlet identification, some minor fixes

Pre-release

Corresponds to PR #82

Default settings for variable-length seqlet identification seem not-great, so reverting to the previous default of fixed-length seqlet identification (which has been battle-tested more)
Putting in the fix caught by #81
Another minor fix for cases where there are lots of ties in the scores during seqlet identification (which could occur with the variable-length seqlet identification)

22 Jan 23:05

AvantiShri

v0.5.10.1

1058bac

Changing sum-to-1 error to a warning for calculating information content

Pre-release

Corresponds to PR #80

This error typically occurs when the user has sequences that are not perfectly one-hot encoded - i.e. some columns are all-zeros (usually because the one-hot encoding procedure mapped Ns to all-zeros). This can result in a PPM where the probabilities don't sum to 1 in all the rows. The error is thrown when computing the information content for visualization purposes. The information content can still be calculated by simply renormalizing the rows to sum to 1, so that's what this workaround does (after printing a warning). The user should still make sure that they are ok with this behavior.
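
A minimal sketch of the renormalize-and-warn behavior described above. The function name, the tolerance, and the uniform background are assumptions made for this illustration, and the information content is computed here as the per-position KL divergence from the background; tfmodisco's own visualization code may differ in its details.

```python
import math
import warnings

def information_content(ppm, background=(0.25, 0.25, 0.25, 0.25), tol=1e-6):
    """Per-position information content (bits) of a PPM, renormalizing rows if needed."""
    ic = []
    for row in ppm:
        total = sum(row)
        if total <= 0:
            raise ValueError("cannot renormalize a row that sums to zero")
        if abs(total - 1.0) > tol:
            warnings.warn("PPM row sums to %g, not 1; renormalizing" % total)
            row = [p / total for p in row]
        # KL divergence from the background; terms with p == 0 contribute 0.
        ic.append(sum(p * math.log2(p / b) for p, b in zip(row, background) if p > 0))
    return ic

# The second row sums to 0.8, e.g. because some sequences had an all-zero (N) column.
ppm = [[1.0, 0.0, 0.0, 0.0], [0.2, 0.2, 0.2, 0.2]]
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    ic = information_content(ppm)
```

Note that renormalizing hides how much probability mass was missing from a row, which is why the user should still check that they are ok with this behavior.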

20 Jan 22:16

AvantiShri

v0.5.10.0

150f4a4

Lower Mem AGKM + Variable Len Seqlet ID + Initial Exemplar-Based Hit Scoring

Pre-release

Corresponds to PR #78

AGKM embeddings are now the default (also avoids importing tensorflow unless user wants to use the previous gapped kmer embeddings)
Cap on the AGKM embedding size, which reduces memory use
Implementation of variable-length seqlet identification
Initial implementation of exemplar-based hit-scoring

20 Dec 21:29

AvantiShri

v0.5.9.2

17cbafe

Fix for nan/division-by-zero errors in case where there are regions of all-zero importance

Pre-release

Corresponds to PR #77
