
Releases: kundajelab/tfmodisco

Make pairwise distances passed to scikit nearest neighbors nonnegative

23 May 22:33

Minor bugfix release corresponding to pull request #40

@rosaxma received the error `ValueError: Negative values in data passed to 'pairwise_distances'. Precomputed distance need to have non-negative values` when scikit-learn's `NearestNeighbors` functions were called. This fix shifts all the distances upward so that they are all nonnegative, which appears to eliminate the error without affecting the results. I am not sure why this error wasn't encountered before; it may have to do with the particular scikit-learn version.
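The idea behind the fix can be sketched as follows (a minimal illustration, not the actual tfmodisco code; the distance matrix here is made up):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Made-up affinity-derived distance matrix; correlation-style affinities
# can yield slightly negative "distances", which scikit-learn rejects
# when metric='precomputed'.
dists = np.array([[ 0.0, -0.2,  0.5],
                  [-0.2,  0.0,  0.3],
                  [ 0.5,  0.3,  0.0]])

# The fix: shift all distances up by the most negative value so the
# precomputed matrix is nonnegative. A uniform shift preserves the
# ordering of neighbors, so the results are unaffected.
shifted = dists - np.min(dists)

nn = NearestNeighbors(n_neighbors=2, metric="precomputed").fit(shifted)
neighbor_dists, neighbor_idxs = nn.kneighbors(shifted)
```

Because every entry moves by the same constant, nearest-neighbor rankings are identical before and after the shift.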

Created v0.2.1-alpha tag at the request of people using older version

22 Mar 17:05
e2c536e

The major changes since this release concern thresholding. Specifically, some key changes are:

  • At the time of this release, Laplace distribution thresholding was not in use. Rather, thresholding was based on finding a point of high curvature.
  • At the time of this release, the 20K-seqlet limit was applied per task, keeping only the most important 20K seqlets for each task. Since then, the limit of 20K has been applied per metacluster (as this is most directly related to clustering time), and it is no longer guaranteed to keep the most important seqlets. Instead, the first 20K seqlets are taken in the order produced when the SeqletsOverlapResolver OrderedDict is unrolled, which effectively orders seqlets by the index of the sequence they originate from, with priority given to the first task specified by the user. I know this aspect is opaque; I only realized it recently myself, because the per-metacluster limit was implemented in an external pull request and I didn't drill into how the ordering was done when I approved the feature. The reason I have not yet forced ordering by importance is that I am concerned doing so might under-sample weaker-affinity motifs that may be of interest. My hope is to go straight to scaling up TF-MoDISco with a "subsample, soak & repeat" strategy: subsample seqlets, find highly represented motifs, "soak up" seqlets from the full set that match these motifs, then repeat with the remaining seqlets.

Another major difference was that this release used the Theano backend, though this should not alter the results.

@suragnair

Added support for different manual positive and negative thresholds

18 Mar 01:45
a12b47b

Release corresponds to pull request #39

Percentile-based thresholding is triggered if the fraction of passing windows produced through null-distribution-based thresholding does not fall between `min_passing_windows_frac` and `max_passing_windows_frac`. By default, the percentiles are taken w.r.t. the absolute values. This feature adds an argument `separate_pos_neg_thresholds`, which can be set to True when instantiating a TfModiscoWorkflow object to take the percentiles for positive values and negative values separately, as opposed to taking them w.r.t. the absolute values. The default value of the argument is False, for backward compatibility. A notebook testing out the feature is at https://github.com/kundajelab/tfmodisco/blob/68bef1575ddec5f55e7605f64fd3753d43d2ca5c/test/nb_test/NoRevcompAndSepPosNegThresh.ipynb
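A toy illustration of the difference between the two thresholding modes (plain NumPy, not the actual TfModiscoWorkflow internals; the scores and percentile arithmetic are illustrative only):

```python
import numpy as np

# Illustrative scores with asymmetric positive and negative tails.
scores = np.array([-5.0, -1.0, -0.5, 0.2, 0.4, 0.6, 2.0, 8.0])
frac = 0.25  # keep the most extreme 25% of windows

# Default (separate_pos_neg_thresholds=False): one threshold on |score|.
abs_thresh = np.percentile(np.abs(scores), 100 * (1 - frac))
passing_abs = scores[np.abs(scores) >= abs_thresh]

# separate_pos_neg_thresholds=True: percentiles taken per sign, so each
# sign gets a threshold from its own distribution.
pos, neg = scores[scores > 0], scores[scores < 0]
pos_thresh = np.percentile(pos, 100 * (1 - frac))
neg_thresh = np.percentile(neg, 100 * frac)
passing_sep = scores[(scores >= pos_thresh) | (scores <= neg_thresh)]
```

In this toy example the absolute-value mode keeps only -5.0 and 8.0, while the separate mode also keeps 2.0, since the positive threshold is computed from the positive values alone.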

There were a couple of other very minor changes that can cause differences within numerical precision. The first was that in `window_sum_function` in line 103 of coordproducers.py, the running window sums are now computed using `np.cumsum` rather than with a Python loop. The second was that in lines 548 and 549 of coordproducers.py, the criterion for meeting the threshold has been changed to `y > pos_threshold` and `y < neg_threshold`, whereas previously it was `y >= pos_threshold` and `y <= neg_threshold`.
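The cumsum-based window sum can be sketched as follows (a minimal reconstruction of the idea, not the actual `window_sum_function`):

```python
import numpy as np

def window_sum_loop(arr, window):
    # Old approach: running window sums via a Python loop.
    return np.array([arr[i:i + window].sum()
                     for i in range(len(arr) - window + 1)])

def window_sum_cumsum(arr, window):
    # New approach: the same sums from a cumulative-sum difference;
    # results can differ from the loop within numerical precision.
    cs = np.cumsum(np.concatenate([[0.0], arr]))
    return cs[window:] - cs[:-window]

x = np.random.RandomState(0).randn(100)
print(np.allclose(window_sum_loop(x, 10), window_sum_cumsum(x, 10)))
```

The cumsum version does O(n) work instead of O(n·window), at the cost of a slightly different floating-point summation order.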

Added support for NOT using reverse complements when computing the similarity matrix

04 Mar 01:39
ffe7952

Pull request here: #38

To avoid using reverse complements (e.g. if working with splicing motifs), set the argument `revcomp=False` when calling a TfModiscoWorkflow instance on your data. If reloading a saved TfModisco results object, then you also have to set `revcomp=False` when calling `prep_track_set`. Otherwise, the `revcomp` argument defaults to True (for backwards compatibility). Permalink to a notebook demonstrating the functionality is here: https://github.com/kundajelab/tfmodisco/blob/d88a1dba7f59f6dc8f62aa267ac42eb5e53037d4/test/nb_test/NoRevcomp.ipynb
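A toy sketch of what skipping reverse complements means for the similarity computation (the `similarity` function here is a made-up stand-in for the actual affinity computation, not tfmodisco's code):

```python
import numpy as np

def revcomp(onehot):
    # Reverse complement of a one-hot (length x ACGT) sequence matrix:
    # reverse the positions and swap A<->T, C<->G.
    return onehot[::-1, ::-1]

def similarity(a, b, use_revcomp=True):
    # Toy similarity: best elementwise dot product over b and
    # (optionally) its reverse complement.
    fwd = np.sum(a * b)
    if not use_revcomp:
        return fwd
    return max(fwd, np.sum(a * revcomp(b)))

# "TACG" vs its reverse complement "CGTA": the same motif on opposite
# strands scores 0 forward but matches perfectly with revcomp enabled.
A, C, G, T = np.eye(4)
seq = np.array([T, A, C, G])
print(similarity(seq, revcomp(seq), use_revcomp=True))   # 4.0
print(similarity(seq, revcomp(seq), use_revcomp=False))  # 0.0
```

With `revcomp=False`, strand-flipped instances of a motif (meaningless for RNA-level splicing motifs) no longer count as similar.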

Added min_metacluster_size_frac

26 Feb 00:59
2164657
Pre-release

From @Avsecz's pull request: #37

Improved null distributions

12 Feb 02:26
f4f94d6
Pre-release

Key changes:


  • Previously, for metaclustering, scores for different tasks would be normalized using the CDF of the Laplace distribution that was fit to that task. Because a Laplace distribution is no longer necessarily fit, I now just use the percentile of the magnitude of the score for normalization. The key lines are:
    https://github.com/kundajelab/tfmodisco/blob/f4f94d6dbb82d7d320068d91ffa30a12e9faadf3/modisco/coordproducers.py#L452-L453

  • For the case where the user does want to use the Laplace distribution for the null, I fixed its over-aggressive tendency: previously, the fitted Laplace curve would often lie above the true distribution, which is clearly inappropriate. This is now addressed by looking at percentiles along the entire distribution, computing the Laplace curve that would best fit each percentile, and taking the curve with the steepest decrease. Of course, the Laplace distribution may still not be an appropriate fit, but at least it will be less aggressive.

  • Previously, I would determine the FDR by looking at the proportion of null values above a particular threshold relative to the proportion of true values above that threshold. One drawback of lumping everything above a threshold together is that the FDR for values just barely above the threshold may be considerably worse than the FDR for values well above it. To get around this, I fit an isotonic regression curve to obtain point estimates of the probability that a seqlet is a true positive given its importance score, and use those point estimates to draw the FDR threshold. That way, the FDR is controlled both for values at the threshold and for values above it.

  • The default FDR cutoff is now 0.2 rather than 0.05, as in my experience the 0.05 cutoff tends to miss low-affinity seqlets even when the null distribution is a good fit.
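The percentile-of-magnitude normalization described in the first bullet can be sketched as follows (a minimal illustration of the idea, not the linked implementation):

```python
import numpy as np

def normalize_by_magnitude_percentile(scores):
    # Map each score to the fraction of scores with magnitude at most
    # its own, keeping the original sign. This replaces the earlier
    # Laplace-CDF normalization, which assumed a fitted Laplace null.
    mags = np.abs(scores)
    ranks = np.searchsorted(np.sort(mags), mags, side="right")
    return np.sign(scores) * ranks / len(scores)

scores = np.array([-3.0, -0.5, 0.1, 2.0])
print(normalize_by_magnitude_percentile(scores))
```

Unlike the Laplace-CDF version, this puts scores from different tasks on a common [-1, 1] scale without assuming any particular null distribution.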
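The isotonic-regression approach to point-estimate FDRs can be sketched roughly as follows (a simplified illustration using labeled toy draws in place of the actual null/true decomposition; the real implementation differs):

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.RandomState(0)
# Toy labeled draws: 0 = null scores, 1 = "true" scores with fatter tails.
null_scores = rng.laplace(0.0, 1.0, 5000)
true_scores = rng.laplace(0.0, 3.0, 5000)
scores = np.concatenate([null_scores, true_scores])
labels = np.concatenate([np.zeros(5000), np.ones(5000)])

# Monotone curve giving, at each |score|, a point estimate of
# P(true positive | score), instead of one lumped tail proportion.
iso = IsotonicRegression(y_min=0.0, y_max=1.0, increasing=True,
                         out_of_bounds="clip")
iso.fit(np.abs(scores), labels)

# Threshold: the smallest |score| whose point-estimate FDR (1 - P(tp))
# is at or below the target, so the FDR holds at the threshold itself,
# not just averaged over everything above it.
target_fdr = 0.2
grid = np.linspace(0.0, np.abs(scores).max(), 1000)
passing = grid[1.0 - iso.predict(grid) <= target_fdr]
threshold = passing.min() if len(passing) else np.inf
```

The monotonicity constraint encodes the assumption that larger-magnitude scores are never less likely to be true positives.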

As an aside, I am very amused to discover that although the version corresponding to the TF-MoDISco arXiv technical note was 0.4.2.2, and that is clearly the version that the hyperlinks point to, the title and abstract both say 0.4.4.2. I was clearly very sleep deprived when I wrote that up.

First version on pypi

29 Nov 08:31
Pre-release

Had to add a MANIFEST.in to make sure the louvain binaries got included. Works on Google Colab, as demonstrated in this notebook where it's used in conjunction with gkmexplain: https://github.com/kundajelab/gkmexplain/blob/6782c7b6dfc077962c59d60b60bc23ddbdf9f61a/lsgkmexplain_NFE2.ipynb
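For reference, a MANIFEST.in entry of roughly this shape is what pulls non-Python files into the distribution (the path below is illustrative only, not necessarily the repo's actual layout):

```
# Illustrative only -- the actual path in the repo may differ.
# recursive-include copies non-Python files (e.g. compiled binaries)
# into the source distribution.
recursive-include modisco/cluster *
```

Without such a line, setuptools packages only the Python sources, so bundled binaries silently go missing from the pypi install.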

This version can be installed and run on Colaboratory

06 Oct 22:26
e5268ab

Reverse-complement seqlet loading bugfix

06 Oct 20:40
95fba25

Corresponding to this pull request: #29

Tensorflow backend that actually works

19 Sep 04:41
f0124e8
Pre-release

Fixes a bug in the TensorFlow backend caused by the difference in dimension ordering between Theano and TensorFlow.