Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

question about hypothetical contribs + "cannot do a non-empty take from an empty axes" error #95

Open
Sakurag1l opened this issue Sep 10, 2021 · 9 comments

Comments

@Sakurag1l
Copy link

Hi Avanti, When I modify part of the example code and then have an error to be reported here. I don't know why. Could you help me out?

tfmodisco_results = modisco.tfmodisco_workflow.workflow.TfModiscoWorkflow(
                    #Slight modifications from the default settings
                    sliding_window_size=5,
                    flank_size=5,
                    target_seqlet_fdr=0.15,
                    seqlets_to_patterns_factory=
                     modisco.tfmodisco_workflow.seqlets_to_patterns.TfModiscoSeqletsToPatternsFactory(
                        #Note: as of version 0.5.6.0, it's possible to use the results of a motif discovery
                        # software like MEME to improve the TF-MoDISco clustering. To use the meme-based
                        # initialization, you would specify the initclusterer_factory as shown in the
                        # commented-out code below:
                        # initclusterer_factory=modisco.clusterinit.memeinit.MemeInitClustererFactory(
                        #    meme_command="meme", base_outdir="meme_out",
                        #    max_num_seqlets_to_use=10000, nmotifs=10, n_jobs=1),
                        trim_to_window_size=5,
                        initial_flank_to_add=5,
                        final_flank_to_add=5,
                        final_min_cluster_size=6,
                        n_cores=10)
                )(
                 task_names=["0"],#, "task1", "task2"],
                 contrib_scores=task_to_scores,
                 hypothetical_contribs=task_to_hyp_scores,
                 one_hot=one_hot_seqs,
                 null_per_pos_scores=null_per_pos_scores)

IndexError: cannot do a non-empty take from an empty axes.

@AvantiShri
Copy link
Collaborator

Hi @Sakurag1l, can you post the full stack trace (i.e. the full error message, which will pinpoint the line of code that threw the error)?

@Sakurag1l
Copy link
Author

Thank you. I've solved that problem. But I have a question, if my hypothetical_contribs is the same as contrib_scores, does it matter?

@AvantiShri
Copy link
Collaborator

You will still get motifs, but you should expect they will be more split up. Say a motif can have either an A or a T at a particular position; without hypothetical scores, the version with A may be reported as a separate motif from the version with T. The hypothetical scores are meant to help tf-modisco recognize when the underlying motif is the same even though the actual kmer sequence may be different, because the hypothetical scores act as an "autocomplete" of the sequence (in that they reveal what type of pattern the network was detecting at a particular location).

@Sakurag1l
Copy link
Author

Thanks! Excuse me, are there other ways to generate hypothesis scores?

@AvantiShri
Copy link
Collaborator

There are. What is the method you are currently using to generate your contribution scores?

@Sakurag1l
Copy link
Author

Saliency , DeepLIFT and Saturation Mutagenesis

@Sakurag1l
Copy link
Author

Sakurag1l commented Oct 20, 2021

task_to_hyp_contrib_scores = hypothetical_contribs_many_refs_func( task_idx=1, input_data_sequences=seqs, num_refs_per_seq=10, batch_size=50, progress_update=4000, )

Does this seqs one-hot encoding only support N*4 when I use Deeplift to generate hypothetical scores.
When I was training the model, my one-hot seqs shape is4*N*1.

@AvantiShri
Copy link
Collaborator

So the basic idea of hypothetical importance scores is that they are an estimate of what the contribution score would be if a different base were present in the sequence at a given position.

If "saliency" = "gradient * input", then the gradient on all the possible bases (i.e. not just the bases that are present in the actual sequence) would be your hypothetical scores. I assume you compute your saliency score right now by calculating the gradients on the input and then masking by the actual one-hot encoded sequence; just don't do the masking step and you have "hypothetical" scores. These scores literally give you the model's sensitivity to the other bases that could be present in the sequence.

If using DeepLIFT, then there are a few ways to generate hypothetical scores:

If you used the original deeplift repository, you can get hypothetical scores as demonstrated in: https://github.com/kundajelab/tfmodisco/blob/master/examples/simulated_TAL_GATA_deeplearning/Generate%20Importance%20Scores.ipynb

To your question, yes, I assumed "N x channels". Can you add some permute/reshape layers to the beginning of your model to get it to take inputs of the "N x 4" shape? If you do that, then you can use the following DeepSHAP notebook to get hypothetical scores (you won't be able to use the deeplift implementation in the deeplift repository as I haven't implemented permute and reshape layers in it, but the DeepSHAP implementation of deeplift will work): https://colab.research.google.com/github/AvantiShri/shap/blob/5fdad0651176cdbf1acd6c697604a71529695211/notebooks/deep_explainer/Tensorflow%20DeepExplainer%20Genomics%20Example%20With%20Hypothetical%20Importance%20Scores.ipynb

If using Saturation Mutagenesis: assuming you mutate each base to all 3 other possible bases, and you record the model output for all the bases, then it should be straightforward to compute what the mutagenesis score would have been if the other possible bases were present. You should compute your saturation mutagenesis scores in a way that retains sign information. What I like to do is to compute the value of the output logit over all 4 bases at each position (for one of the 4 bases this will be the same as the original model output) , and then subtract the mean logit over all 4 from each of the 4 values. Bases that increase the output relative to the mean will get a positive score, and bases that decrease the output will get a negative score.

Does that make sense?

@AvantiShri AvantiShri changed the title cannot do a non-empty take from an empty axes. question about hypothetical contribs + "cannot do a non-empty take from an empty axes" error Oct 20, 2021
@Sakurag1l
Copy link
Author

OK! Thanks!! I'm going to try to use DeepLIFT.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants