
sf.define_domains() - ValueError: The number of observations cannot be determined on an empty distance matrix. #17

Open
m-petersen opened this issue Apr 29, 2024 · 5 comments

@m-petersen

Hi,

thanks for this great package! I am trying to use SAFE for the annotation of metadata on a low-dimensional graph representation of microbiome abundance information (similar to: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1871-4). For this I have used Mapper to obtain a graph representation of the microbiome information (nodes represent subject groups, edges represent overlapping subjects between the nodes). Now, I want to annotate this graph with node-level information on age, sex and other covariates with safepy. Loading the graph and annotation information works fine and I can also obtain the enrichment landscapes of individual covariates with sf.plot_sample_attributes().

However, when I try to plot a composite landscape, I get an error I do not fully understand. sf.define_top_attributes() runs without an error, but sf.define_domains() fails with the following:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/tmp/ipykernel_6330/3930251677.py in <module>
----> 1 sf.define_domains(attribute_distance_threshold = 0.65)

~/miniconda3/envs/brainstat/lib/python3.7/site-packages/safepy/safe.py in define_domains(self, **kwargs)
    661 
    662         m = self.nes_binary[:, self.attributes['top']].T
--> 663         Z = linkage(m, method='average', metric=self.attribute_distance_metric)
    664         max_d = np.max(Z[:, 2] * self.attribute_distance_threshold)
    665         domains = fcluster(Z, max_d, criterion='distance')

~/miniconda3/envs/brainstat/lib/python3.7/site-packages/scipy/cluster/hierarchy.py in linkage(y, method, metric, optimal_ordering)
   1066                          "finite values.")
   1067 
-> 1068     n = int(distance.num_obs_y(y))
   1069     method_code = _LINKAGE_METHODS[method]
   1070 

~/miniconda3/envs/brainstat/lib/python3.7/site-packages/scipy/spatial/distance.py in num_obs_y(Y)
   2570     k = Y.shape[0]
   2571     if k == 0:
-> 2572         raise ValueError("The number of observations cannot be determined on "
   2573                          "an empty distance matrix.")
   2574     d = int(np.ceil(np.sqrt(k * 2)))

ValueError: The number of observations cannot be determined on an empty distance matrix.

Based on my understanding of the code, no top attributes were identified by sf.define_top_attributes() because the criterion of exactly one connected component was not met (in my analysis, the metadata variables have more than one connected component). What I am aiming for is a composite contour plot to investigate, based on the enrichment landscapes, whether the dominance of certain microbes is linked to specific covariates. Is it possible to change some of the default parameters to obtain composite maps for my use case? And do you think annotating subject-level networks with your package is valid in general, or am I missing something?

Your help would be highly appreciated. Many thanks in advance! You can find the corresponding Jupyter notebook containing the code and the error via this link. :)

@abarysh
Contributor

abarysh commented May 3, 2024

Hi Marvin, thank you for flagging this. First of all, beautiful network! :) Second, it might be worth skipping sf.define_top_attributes() altogether and using all attributes to define the domains. To do that, run sf.compute_pvalues() and then set sf.attributes['top'] = True (instead of running sf.define_top_attributes()). As you can probably imagine, sf.define_top_attributes() is a way to filter down the attributes when you have many and expect some of them to be redundant. But in your case, you may want to keep all of them (or, if the overall number is not too high, pick manually which ones to keep). In that case, just set the top column in sf.attributes to True for the attributes you want to keep, as in the sketch below. Let me know if you have other questions, I can help troubleshoot.
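A minimal sketch of both options (the 'name' column used for hand-picking is an assumption; check your sf.attributes table for the actual label column):

```python
sf.compute_pvalues()

# Option A: keep ALL attributes instead of filtering with sf.define_top_attributes()
sf.attributes['top'] = True

# Option B: hand-pick the attributes to keep
# (assumes the attribute labels live in a 'name' column)
# sf.attributes['top'] = sf.attributes['name'].isin(['age', 'sex'])

sf.define_domains(attribute_distance_threshold=0.65)
```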

@m-petersen
Author

m-petersen commented May 6, 2024

Hi Anastasia,

many thanks for your reply! Setting sf.attributes['top'] = True worked. :)

If you don't mind, I have additional questions regarding SAFE in the context of my analysis. Many variables appear to exhibit extensive enrichment across the network, with over 90% of the nodes being significant. Given that previous analyses using SAFE have concentrated on localized effects, I am unsure how to interpret this widespread enrichment. Would it be correct to conclude that if a variable is enriched across most of the network, the network topology captures the interindividual variance of that variable very effectively (as the graph is at the participant level)? Furthermore, the results differ slightly every time I rerun the analysis script. I have set randomSeed in safe_default.ini; do I need to set an additional random seed somehow? Again, thank you in advance!

@abarysh
Contributor

abarysh commented May 15, 2024

Hi Marvin,

Here're my thoughts:

  1. I'm a bit skeptical about 90% of the nodes in the network being significant. Are your attributes binary or quantitative? How does it look if you plot the raw values (rather than enrichments), e.g., sf.plot_sample_attributes(show_raw_data=True)? I'm happy to have a look myself (if you're ok sharing one of the attributes showing this behavior).

  2. It depends on how different the results are. If your attributes are quantitative, enrichment is computed through random permutations, which will differ slightly from run to run (this difference is minimized by increasing the number of permutations, e.g., from the default 1,000 to 10,000; see the sketch below). If your attributes are binary, enrichment is computed with a hypergeometric test, which should produce exactly the same result every time. Unfortunately, so far, the randomSeed in safe_default.ini only affects the network layout (if you don't provide node coordinates yourself), but we should use it to seed the permutations as well (thank you for highlighting this).
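A minimal sketch of both workarounds (the num_permutations keyword follows the safepy examples, and seeding NumPy's global RNG only helps if safepy draws its permutations from it; both are assumptions to verify against your version):

```python
import numpy as np

# Workaround until randomSeed also seeds the permutations: seed NumPy's
# global RNG before recomputing (assumes safepy permutes via np.random)
np.random.seed(0)

# More permutations -> more stable enrichment scores for quantitative attributes
sf.compute_pvalues(num_permutations=10000)
```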

@m-petersen
Author

Hi Anastasia,

many thanks for your insights. Much appreciated!

  1. The attributes are quantitative values. The raw value distribution largely corresponds to the enrichment pattern (lower raw values -> negative enrichment / significantly lower than chance; higher raw values -> positive enrichment / significantly higher than chance). Overall, the strongly enriched variables follow a left-right gradient on the network, and when comparing the enrichment of all the variables we are investigating, the pattern makes sense: correlated variables follow matching left-right (or right-left) enrichment trajectories on the network. While I am not able to share the variables that show the strongest enrichment due to data privacy reasons, I have put together the code, the network, and another attribute/variable (free-water, an imaging marker of microstructural brain integrity, which correlates with the other variables and shows relatively strong enrichment in 50-60% of nodes) in this drive (https://drive.google.com/drive/folders/1rSyJ9KzHEGBoo2XTfsAdoJTeE-qQu08z?usp=sharing) for you to test. Many thanks in advance.

  2. I see. The differences I observe between runs are very minor, but it would be great to have them fully stable anyway. I will take a look at the code and see whether I can find where to set the random seeds. Increasing the number of permutations to 10,000 has further stabilized the results.

@abarysh
Contributor

abarysh commented May 28, 2024

I had a look at safe.ipynb (didn't run it, though) and noticed that you're defining safe_significance_thresh as -np.log10(p_value) / -np.log10(1/(num_permutations+1)). It's not clear whether it's used anywhere, but, if it is, are you sure that's what you want to do? Typically, a significance threshold corrected for multiple permutations (Bonferroni-style) would be -np.log10(p_value/num_permutations). I wonder if this is what's causing so many nodes on your network to appear significant; see the comparison below.
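For concreteness, a quick numeric comparison of the two formulas in -log10(p) units (illustrative values only):

```python
import numpy as np

p_value = 0.05
num_permutations = 1000

# threshold as defined in the notebook
thresh_notebook = -np.log10(p_value) / -np.log10(1 / (num_permutations + 1))

# Bonferroni-style threshold suggested above
thresh_bonferroni = -np.log10(p_value / num_permutations)

print(thresh_notebook)    # ~0.43: a very permissive bar, so many nodes pass
print(thresh_bonferroni)  # ~4.30: a much stricter bar
```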
