NaN or High value for dtype float32 #33

Open
TommyJW opened this issue Oct 25, 2018 · 4 comments

@TommyJW

TommyJW commented Oct 25, 2018

I've repeatedly encountered motif discovery failing in Round 2 with a NaN or value too high exception.

To isolate the cause, I've tried several models with different layers and layer structures, each fitted and then scored with DeepLIFT. I've also tried varying the parameters, starting both from the defaults (as noted in the notebook) and from the values used by the notebook. I have yet to find a pattern in the failures.

Additionally, I've tried different subsets of the same data, as well as the complete data set. I've also tried alternate sequence data sets. The only thing I've noticed is that the smaller subset tends to produce the error less often, but this is inconsistent with the alternate datasets, which are inherently small.

For comparison:
I have a HOTAIR dataset, "GSE31332_hotair_oe_peaks", of 832 sequences; we'll call this the full set.
Longest sequence: 2551
Shortest sequence: 756, padded with 0s
I have subsetted it to 155 sequences; we'll call this one the 'small' subset.

I'm not sure what information would be helpful to identify what I can improve in preprocessing or parameters passed to avoid the exception.
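
One thing that could be worth checking before the TF-MoDISco call is whether the score arrays already contain NaN, infinite, or very large values, and how many positions are pure padding. A minimal sketch, assuming the scores are a NumPy array of shape (sequences, positions, 4); `summarize_scores` is a hypothetical helper name, not part of DeepLIFT or TF-MoDISco:

```python
import numpy as np


def summarize_scores(scores):
    """Count values that could plausibly lead to a NaN / too-large-for-float32 error.

    Assumption: `scores` has shape (n_sequences, n_positions, 4).
    """
    arr = np.asarray(scores, dtype=np.float64)
    float32_max = np.finfo(np.float32).max
    finite = arr[np.isfinite(arr)]
    return {
        "n_nan": int(np.isnan(arr).sum()),
        "n_inf": int(np.isinf(arr).sum()),
        # Finite values that would overflow when cast to float32.
        "n_over_float32": int((np.abs(finite) > float32_max).sum()),
        # Positions where every channel is exactly zero (e.g. zero padding).
        "n_all_zero_positions": int((arr == 0).all(axis=-1).sum()),
    }


# Toy example: two sequences of 5 positions each.
example_scores = np.zeros((2, 5, 4))
example_scores[0, :3, 0] = 0.5      # sequence 0 has 3 real positions, 2 padded
example_scores[1, :, 1] = 1.0       # sequence 1 has no padding
example_scores[1, 2, 1] = np.nan    # one corrupted value
summary = summarize_scores(example_scores)
print(summary)
```

Running this separately on the full set and on the 'small' subset would show whether the failing inputs differ in any of these counts.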

If it will help, I can also bundle a notebook and dataset that includes the whole workflow from classification to motif discovery. I could also provide the raw output from the motif discovery calls.

Thanks

@AvantiShri
Collaborator

Hi Tommy,

Could you send me a small sample dataset that produces the errors? I can then use Google Colab to try it out and figure out the source of the issue.

@TommyJW
Author

TommyJW commented Oct 26, 2018

NaN_Zipped.zip

Attached is a notebook that goes through the whole workflow, and a small sample dataset.

In the Build Keras Model cell, try switching between the different models. I've gotten different results as to which ones work or don't work with TF-MoDISco depending on the system environment, the dataset used, and how the data is subsetted. I've also built a randomizing function that passes different parameters to the main TF-MoDISco call in another notebook, and I'm currently analyzing that data for any pattern.

@AvantiShri
Collaborator

AvantiShri commented Oct 27, 2018

Hi Tommy,

Here's a notebook where I was able to run TF-MoDISco using the model that it was supposed to fail with. The key thing I did was to trim out the zeros from sequences that were shorter than the maximum length (TF-MoDISco can handle variable length sequences): https://gist.github.com/AvantiShri/6428ca274e55c8d242f3429ee9ca42be
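
The trimming step can be sketched roughly as follows. This is an illustration rather than the code from the linked gist: it assumes each sequence is a one-hot NumPy array of shape (max_len, 4), that the padding is all-zero rows at the end of the sequence, and that any score tracks share the same first dimension. `trim_zero_padding` is a hypothetical helper name.

```python
import numpy as np


def trim_zero_padding(onehot, *tracks):
    """Drop trailing all-zero positions from one padded sequence.

    onehot: array of shape (max_len, 4); padding rows are assumed to be all zeros.
    tracks: optional arrays with the same first dimension (e.g. importance
            scores), cut to the same length as the trimmed one-hot array.
    Returns a tuple (trimmed_onehot, *trimmed_tracks).
    """
    onehot = np.asarray(onehot)
    # Indices of positions that contain at least one non-zero channel.
    nonpad = np.flatnonzero(onehot.any(axis=-1))
    # Only trailing zeros are removed; interior all-zero rows are kept.
    true_len = int(nonpad[-1]) + 1 if nonpad.size else 0
    return tuple(np.asarray(a)[:true_len] for a in (onehot,) + tracks)


# Toy example: two sequences padded to length 6, true lengths 4 and 6.
padded_onehot = np.zeros((2, 6, 4))
padded_onehot[0, np.arange(4), [0, 1, 2, 3]] = 1
padded_onehot[1, np.arange(6), [3, 2, 1, 0, 1, 2]] = 1
padded_scores = np.ones((2, 6, 4))

trimmed = [trim_zero_padding(x, s) for x, s in zip(padded_onehot, padded_scores)]
onehot_list = [t[0] for t in trimmed]
score_list = [t[1] for t in trimmed]
print([a.shape[0] for a in onehot_list])
```

The result is a list of per-sequence arrays with different lengths instead of one rectangular array; the gist shows how such inputs are actually passed to TF-MoDISco.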

I also made some tweaks to the parameters so that it produced motifs for both metacluster 0 (negative activity) and metacluster 1 (positive activity). Maybe the results will make more sense to you since you are more familiar with the biology of the problem. The main pattern that jumps out, which you also seemed to find based on your visualizations, is that different segments of the sequence have different GC-content preferences. Beyond that, it's hard to tell what may be real with such a small number of sequences. In general, the patterns that have more seqlets mapping to them are more likely to be real.

Let me know if you have more questions.

@AvantiShri
Collaborator

AvantiShri commented Oct 27, 2018

Also, it sounds like you are studying lncRNAs. This is obviously a very different kind of dataset from the TF-binding datasets I developed TF-MoDISco on, so if there are assumptions made in TF-MoDISco that don't apply in these other contexts, I'd be interested to hear about them. (I may not have the bandwidth to work on other applications at this stage, but I'd be happy to give advice on how the algorithm could be tweaked for different purposes.)
