passing your own cluster labels: VIA performance with different clustering methods. #29

Open
barveaditya opened this issue Dec 6, 2022 · 10 comments

Comments

@barveaditya

Hi Shobi,

Excellent work here! I have a question rather than an issue: how do I pass cluster labels from a separate clustering? This is referenced in the article's supplementary material (Supplementary Note 6: VIA performance with different clustering methods).

I am using VIA for exploratory purposes on entirely new kinds of cell types, so I do not know much about the population. I would like to understand the population first and then pass that clustering on to VIA. Is this documented anywhere, or is there an example?

Thank you
Adi

@ShobiStassen
Owner

hi Adi,

Thanks for bringing this up. Yes, for the paper we tested using separate cluster labels, and this works fine, though it is usually better to have a fairly granular (not too coarse) clustering. The current pip version of via does not yet allow a different clustering, but I can easily fix that for you if you give me a day to make sure it runs without any glitches. We would effectively bypass the PARC clustering stage and use your own clusters. Alternatively, while you wait for me to work on this, you can pass your own cluster labels in the true_label parameter and let via do its built-in PARC clustering.

Shobi

@wangjiawen2013

Looking forward to this feature!
+1

@barveaditya
Author

Hi Shobi,

Yes, sure, let me know when you fix it; it would be great functionality to have.
When passing clusters via true_label, VIA would still re-cluster, right? So I would basically get an accuracy readout.

As a suggestion: one enhancement would be to make both PARC and VIA single-cell agnostic. To give you context, I also work in the area of patient electronic health records, where one analyses baseline patient characteristics (like a snapshot of single-cell readouts) as well as longitudinal data. There aren't many methods that allow this, except for ClinTrajan (ref: https://academic.oup.com/gigascience/article/9/11/giaa128/6006352). I think making the methods agnostic of single cells would be pretty great. I have tried running PARC on the PCs of patient data and it runs well. You could test the functionality using the two open datasets in the ClinTrajan paper above and see how your methods do. I am happy to jump on a call to discuss this further if you wish. It could result in a pretty nice publication as well. I work at Novartis and am reachable at barveaditya@gmail.com.

Hope this helps,
Adi

@ShobiStassen
Owner

ShobiStassen commented Dec 12, 2022

hi @barveaditya Adi,

Thank you for sharing the paper - Let me have a read through and yes of course happy to discuss further!
In the meantime, to not keep you waiting, please try v0.1.64 of via by installing again on pip and let me know if the label passing works for you. Let me know if you run into any problems with this.
Basically, when you initialize via, pass your labels using labels = [your list of cluster labels that you wish to use instead of the built-in clustering], so that each sample has an integer label (its cluster membership). Currently this parameter must be a list of integers.
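As a minimal sketch of preparing such labels, assuming the precomputed clustering uses arbitrary names (strings or non-contiguous numbers): `np.unique` with `return_inverse=True` maps each distinct name to an integer code in 0..k-1, one per sample. The helper name `to_integer_labels` and the example labels are hypothetical and not part of VIA.

```python
import numpy as np

def to_integer_labels(raw_labels):
    """Map arbitrary cluster names to integer codes 0..k-1 (hypothetical helper)."""
    raw = np.asarray(raw_labels)
    # names: sorted distinct labels; codes: index into names for each sample
    names, codes = np.unique(raw, return_inverse=True)
    return codes.reshape(-1).astype(int), names

raw = ["T cell", "B cell", "T cell", "NK", "B cell"]
int_labels, names = to_integer_labels(raw)

print(int_labels)  # one integer per sample
print(names)       # lookup table from integer code back to the original name
```

The returned `names` array lets you translate integer codes back to the original cluster names when reading plots. Call `int_labels.tolist()` if a plain list is needed.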

What I meant by passing your precomputed clustering into the "true_label" parameter was that in the viagraph/milestone plots you will be able to compare the composition of your clusters against each of the via clusters in the clustergraph plot. As you said, for exploratory data the true labels are often just a "best guess annotation" based on the DEGs of a certain clustering output, to give some indication of the cell types in the dataset.
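The same comparison can be tabulated outside the plots. The sketch below counts, for each cluster in one labeling, how many samples carry each label from the other. It uses only the standard library; `cluster_composition` and the toy inputs are hypothetical names for illustration, not part of VIA.

```python
from collections import Counter, defaultdict

def cluster_composition(via_clusters, reference_labels):
    """Count reference labels inside each cluster (hypothetical helper)."""
    if len(via_clusters) != len(reference_labels):
        raise ValueError("both labelings need one entry per sample")
    composition = defaultdict(Counter)
    for cluster, ref in zip(via_clusters, reference_labels):
        composition[cluster][ref] += 1
    # Convert to plain dicts for easy printing and comparison
    return {cluster: dict(counts) for cluster, counts in composition.items()}

via_clusters = [0, 0, 1, 1, 1, 2]                 # toy cluster memberships
reference_labels = ["a", "a", "a", "b", "b", "c"]  # toy precomputed labels
table = cluster_composition(via_clusters, reference_labels)

for cluster, counts in sorted(table.items()):
    print(cluster, counts)
```

A cluster whose counts are spread over many reference labels suggests the two clusterings disagree in that region.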

Shobi

@wangjiawen2013

wangjiawen2013 commented Dec 19, 2022

Hi @ShobiStassen ,
the "labels" parameter didn't work for me. When I passed a list of integer labels, the following error occurred:

TypeError Traceback (most recent call last)
in
2 too_big_factor=0.3, root_user=root_user, preserve_disconnected=True, pseudotime_threshold_TS=30, num_threads=num_threads,
3 dataset=dataset, random_seed=random_seed, resolution_parameter=0.2)
----> 4 v0.run_VIA()

~/programs/miniconda3/envs/py37/lib/python3.7/site-packages/pyVIA/core.py in run_VIA(self)
2963 self.knn_struct = _construct_knn(self.data, knn=self.knn, distance=self.distance, num_threads=self.num_threads)
2964 st = time.time()
-> 2965 self.run_subPARC()
2966 run_time = time.time() - st
2967 print(f'{datetime.now()}\tTime elapsed {round(run_time,1)} seconds')

~/programs/miniconda3/envs/py37/lib/python3.7/site-packages/pyVIA/core.py in run_subPARC(self)
2360 self.node_degree_list = node_deg_list
2361 print(f"{datetime.now()}\tBegin projection of pseudotime and lineage likelihood")
-> 2362 self.single_cell_bp, self.single_cell_pt_markov = self.project_branch_probability_sc(bp_array, df_graph['markov_pt'].values)
2363 #print('scmarkov', self.single_cell_pt_markov[0:10])
2364

~/programs/miniconda3/envs/py37/lib/python3.7/site-packages/pyVIA/core.py in project_branch_probability_sc(self, bp_array_clus, pt)
902 rows, cols, weights = [], [], []
903 for i, row in enumerate(neighbors):
--> 904 neighboring_clus = self.labels[row]
905 for c in set(list(neighboring_clus)):
906 rows.append(i)

TypeError: only integer scalar arrays can be converted to a scalar index

@wangjiawen2013

Does this link help:
https://www.jianshu.com/p/4c4039aa6020

@MinatoKobashi
Contributor

This is probably caused by how the list is indexed: a plain Python list cannot be indexed with an array of indices. You can convert the list to a NumPy array and pass the array to labels.
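A small reproduction of that behavior, independent of VIA: indexing a Python list with a NumPy index array raises `TypeError`, while the same indexing on an `ndarray` returns the selected entries. The variable names here are illustrative only.

```python
import numpy as np

labels_list = [0, 0, 1, 2, 1]          # integer cluster label per sample
neighbor_idx = np.array([0, 2, 3])     # indices of one sample's neighbors

# A plain list cannot be indexed with an array of indices
try:
    labels_list[neighbor_idx]
    list_indexing_failed = False
except TypeError as err:
    list_indexing_failed = True
    print("list indexing failed:", err)

# Converting to an ndarray of shape (n_samples,) makes fancy indexing work
labels_array = np.asarray(labels_list)
neighbor_labels = labels_array[neighbor_idx]
print(labels_array.shape, neighbor_labels)
```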

@ShobiStassen
Owner

ShobiStassen commented Dec 20, 2022

Hi, I think Minato is right. Try converting your list to an ndarray of shape (n_samples,) using np.asarray().

In the examples.py code there is a short example on the toy data at line 748.

Also, please note that if you are specifying terminal groups or cells, there are two parameters, depending on whether you are specifying single-cell indices or group-level labels based on the true label:
[Screenshot: Screenshot_20221221-062540_Chrome]

@wangjiawen2013

wangjiawen2013 commented Dec 21, 2022

Converting the list to a NumPy array solved the problem. I tried a set of k-means labels and this time VIA ran successfully, but when I used other labels, VIA still failed. I think VIA has some prerequisites for the labels.

2022-12-21 13:56:11.494170 Running VIA over input data of 564 (samples) x 30 (features)
2022-12-21 13:56:11.494384 Knngraph has 30 neighbors
2022-12-21 13:56:11.632841 Finished global pruning of 30-knn graph used for clustering at level of 0.5. Kept 65.5 % of edges.
2022-12-21 13:56:11.635684 Number of connected components used for clustergraph is 1
<built-in method now of type object at 0x7fe868736960> Using predfined labels provided by user
2022-12-21 13:56:11.655055 Making cluster graph. Global cluster graph pruning level: 0.5
/home/wangjw/programs/miniconda3/envs/py37/lib/python3.7/site-packages/pyVIA/utils_via.py:239: RuntimeWarning: divide by zero encountered in double_scalars
weights = [(w + w_min) / scale_factor for w in weights]
/home/wangjw/programs/miniconda3/envs/py37/lib/python3.7/site-packages/pyVIA/utils_via.py:87: RuntimeWarning: invalid value encountered in subtract
Tcsr.data -= np.min(Tcsr.data) - 1
2022-12-21 13:56:11.656922 Graph has 17 connected components before pruning
2022-12-21 13:56:11.657907 Graph has 18 connected components before pruning
/home/wangjw/programs/miniconda3/envs/py37/lib/python3.7/site-packages/numpy/core/_methods.py:230: RuntimeWarning: invalid value encountered in subtract
x = asanyarray(arr - arrmean)
2022-12-21 13:56:11.658318 0.0% links trimmed from local pruning relative to start
2022-12-21 13:56:11.659886 Starting make edgebundle viagraph...
2022-12-21 13:56:11.659916 Make via clustergraph edgebundle
2022-12-21 13:56:11.817418 Hammer dims: Nodes shape: (18, 2) Edges shape: (2, 3)
2022-12-21 13:56:11.820362 Computing lazy-teleporting expected hitting times
/home/wangjw/programs/miniconda3/envs/py37/lib/python3.7/site-packages/numpy/core/fromnumeric.py:3441: RuntimeWarning: Mean of empty slice.
out=out, **kwargs)
/home/wangjw/programs/miniconda3/envs/py37/lib/python3.7/site-packages/numpy/core/_methods.py:189: RuntimeWarning: invalid value encountered in double_scalars
ret = ret.dtype.type(ret / rcount)
/home/wangjw/programs/miniconda3/envs/py37/lib/python3.7/site-packages/numpy/core/_methods.py:263: RuntimeWarning: Degrees of freedom <= 0 for slice
keepdims=keepdims, where=where)
/home/wangjw/programs/miniconda3/envs/py37/lib/python3.7/site-packages/numpy/core/_methods.py:223: RuntimeWarning: invalid value encountered in true_divide
subok=False)
/home/wangjw/programs/miniconda3/envs/py37/lib/python3.7/site-packages/numpy/core/_methods.py:254: RuntimeWarning: invalid value encountered in double_scalars
ret = ret.dtype.type(ret / rcount)
Process Process-134:
Traceback (most recent call last):
File "/home/wangjw/programs/miniconda3/envs/py37/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
self.run()
File "/home/wangjw/programs/miniconda3/envs/py37/lib/python3.7/multiprocessing/process.py", line 99, in run
self._target(*self._args, **self._kwargs)
File "/home/wangjw/programs/miniconda3/envs/py37/lib/python3.7/site-packages/pyVIA/core.py", line 144, in simulate_markov_sub
nextState = np.random.choice(range(P.shape[0]), p=P[currentState])
File "mtrand.pyx", line 935, in numpy.random.mtrand.RandomState.choice
ValueError: probabilities contain NaN
Process Process-135:
Traceback (most recent call last):
File "/home/wangjw/programs/miniconda3/envs/py37/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
self.run()
File "/home/wangjw/programs/miniconda3/envs/py37/lib/python3.7/multiprocessing/process.py", line 99, in run
self._target(*self._args, **self._kwargs)
File "/home/wangjw/programs/miniconda3/envs/py37/lib/python3.7/site-packages/pyVIA/core.py", line 144, in simulate_markov_sub
nextState = np.random.choice(range(P.shape[0]), p=P[currentState])
File "mtrand.pyx", line 935, in numpy.random.mtrand.RandomState.choice
ValueError: probabilities contain NaN
Process Process-136:
Traceback (most recent call last):
File "/home/wangjw/programs/miniconda3/envs/py37/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
self.run()
File "/home/wangjw/programs/miniconda3/envs/py37/lib/python3.7/multiprocessing/process.py", line 99, in run
self._target(*self._args, **self._kwargs)
File "/home/wangjw/programs/miniconda3/envs/py37/lib/python3.7/site-packages/pyVIA/core.py", line 144, in simulate_markov_sub
nextState = np.random.choice(range(P.shape[0]), p=P[currentState])
File "mtrand.pyx", line 935, in numpy.random.mtrand.RandomState.choice
ValueError: probabilities contain NaN
Process Process-137:
Traceback (most recent call last):
File "/home/wangjw/programs/miniconda3/envs/py37/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
self.run()
File "/home/wangjw/programs/miniconda3/envs/py37/lib/python3.7/multiprocessing/process.py", line 99, in run
self._target(*self._args, **self._kwargs)
File "/home/wangjw/programs/miniconda3/envs/py37/lib/python3.7/site-packages/pyVIA/core.py", line 144, in simulate_markov_sub
nextState = np.random.choice(range(P.shape[0]), p=P[currentState])
File "mtrand.pyx", line 935, in numpy.random.mtrand.RandomState.choice
ValueError: probabilities contain NaN
Process Process-138:
Traceback (most recent call last):
File "/home/wangjw/programs/miniconda3/envs/py37/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
self.run()
File "/home/wangjw/programs/miniconda3/envs/py37/lib/python3.7/multiprocessing/process.py", line 99, in run
self._target(*self._args, **self._kwargs)
File "/home/wangjw/programs/miniconda3/envs/py37/lib/python3.7/site-packages/pyVIA/core.py", line 144, in simulate_markov_sub
nextState = np.random.choice(range(P.shape[0]), p=P[currentState])
File "mtrand.pyx", line 935, in numpy.random.mtrand.RandomState.choice
ValueError: probabilities contain NaN
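The `ValueError` in the log is what NumPy raises when a probability vector passed to `np.random.choice` contains NaN. One way this can arise (an assumption about this case, not a confirmed diagnosis of VIA's internals) is a cluster-graph node with no edges: its row of the adjacency matrix sums to zero, so row-normalizing it divides zero by zero. The log's 18 connected components for an 18-node graph would be consistent with that. The sketch below reproduces the mechanism on a toy matrix; `row_normalize` and `isolated_nodes` are hypothetical helpers.

```python
import numpy as np

def row_normalize(adjacency):
    """Turn an adjacency matrix into row-stochastic transition probabilities."""
    adjacency = np.asarray(adjacency, dtype=float)
    row_sums = adjacency.sum(axis=1, keepdims=True)
    # Silence the divide warnings so the NaN rows can be inspected afterwards
    with np.errstate(divide="ignore", invalid="ignore"):
        return adjacency / row_sums

def isolated_nodes(adjacency):
    """Indices of nodes whose row has no outgoing edge weight."""
    return np.flatnonzero(np.asarray(adjacency).sum(axis=1) == 0)

# Toy cluster graph: nodes 0 and 1 are connected, node 2 has no edges
adjacency = np.array([[0, 1, 0],
                      [1, 0, 0],
                      [0, 0, 0]])
P = row_normalize(adjacency)
print(P[2])                       # all-NaN row for the isolated node
print(isolated_nodes(adjacency))  # which clusters to inspect

try:
    np.random.choice(range(P.shape[0]), p=P[2])
    choice_failed = False
except ValueError as err:
    choice_failed = True
    print("np.random.choice failed:", err)
```

If a check like this flags clusters in your own labeling, a more granular or better-connected clustering may be worth trying before passing the labels on.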

@wangjiawen2013

wangjiawen2013 commented Dec 21, 2022

I have an anndata object, and I want to subset some clusters for trajectory inference.
I ran into an issue with this when using Palantir: when I subset the anndata object to infer pseudotime for a subset of clusters, the neighbors had to be recomputed.
So in this case, does VIA need to recompute the kNN graph when users pass labels, in order to run successfully?
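The traceback earlier in this thread shows run_VIA calling `_construct_knn(self.data, ...)`, which suggests the kNN graph is built from whatever data matrix is passed in. Under that reading (an assumption, not confirmed by the maintainers here), the main thing to get right when subsetting is that the data rows and the labels are filtered with the same mask and the labels are renumbered afterwards. A minimal sketch with a hypothetical helper `subset_for_trajectory` and toy data:

```python
import numpy as np

def subset_for_trajectory(data, labels, keep_clusters):
    """Filter rows and labels with one mask, then renumber labels 0..k-1."""
    data = np.asarray(data)
    labels = np.asarray(labels)
    if data.shape[0] != labels.shape[0]:
        raise ValueError("data and labels need one entry per sample")
    mask = np.isin(labels, keep_clusters)
    # Renumber the surviving cluster ids so they are contiguous integers
    sub_labels = np.unique(labels[mask], return_inverse=True)[1].reshape(-1)
    return data[mask], sub_labels

data = np.arange(12, dtype=float).reshape(6, 2)   # 6 samples x 2 features (toy)
labels = np.array([0, 3, 3, 5, 0, 5])             # original cluster ids
sub_data, sub_labels = subset_for_trajectory(data, labels, keep_clusters=[3, 5])
print(sub_data.shape, sub_labels)
```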

Repository owner deleted a comment from wangjiawen2013 Jan 3, 2023