passing your own cluster labels: VIA performance with different clustering methods. #29

barveaditya · 2022-12-06T16:13:04Z

Hi Shobi,

Excellent work here!! I have a question, rather than an issue. I want to pass cluster labels from a separate clustering. How do I do that? This is also referenced in the article supplementary material - Supplementary Note 6: VIA performance with different clustering methods.

I am using VIA for exploratory purposes with an entirely new kind of cell type, so I do not know much about the population. I would like to understand that first and then pass it on to VIA. Do you have it referenced anywhere, or maybe an example?

Thank you
Adi

ShobiStassen · 2022-12-07T10:18:28Z

hi Adi,

thanks for bringing this up. Yes indeed, for the paper we tested using separate cluster labels and this works fine, though it's usually nice to have a fairly granular (not too coarse) clustering. In the current pip version of via we haven't yet allowed for a different clustering, but I can very easily fix that for you if you give me a day to make sure that it runs without any glitches. We would effectively just bypass the PARC clustering stage and use your own clusters. Alternatively, while you wait for me to work on this, you can pass your own cluster labels in the true_label parameter and then let via do its inbuilt PARC clustering.

Shobi

wangjiawen2013 · 2022-12-10T16:19:00Z

Looking forward to this feature!
+1

barveaditya · 2022-12-11T12:23:08Z

Hi Shobi,

Yes sure, let me know when you fix it, would be a great functionality to have.
On passing clusters using true_label, VIA would again re-cluster right? I would basically have an accuracy readout.

As a suggestion - one of the enhancements would be to make both PARC and VIA single-cell agnostic. To give you context, I also work in the patient electronic health records area, where one analyses baseline patient characteristics (like a snapshot of single-cell readouts) as well as longitudinal data. There aren't many methods that allow this, except for ClinTrajan (ref - https://academic.oup.com/gigascience/article/9/11/giaa128/6006352). I think making this agnostic of single cells would be pretty great. I have tried running PARC on patient data PCs and it runs well. You could test functionality using the two open datasets in the ClinTrajan paper above and see how your methods do. I am happy to jump on a call to discuss this further if you wish. It could result in a pretty nice publication as well. I work at Novartis and am reachable at barveaditya@gmail.com.

Hope this helps,
Adi

ShobiStassen · 2022-12-12T05:07:50Z

hi @barveaditya Adi,

Thank you for sharing the paper - Let me have a read through and yes of course happy to discuss further!
In the meantime, to not keep you waiting, please try v0.1.64 of via by installing again on pip and let me know if the label passing works for you. Let me know if you run into any problems with this.
Basically, when you initialize via you need to pass a list of labels using labels = [your list of cluster labels that you wish to use instead of the inbuilt clustering] such that each sample has an integer label (cluster membership). Currently you need to provide a list of integers in this parameter.
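
As a rough sketch of preparing such labels, assuming your own clustering produced arbitrary cluster names (e.g. strings) with one entry per sample: `to_integer_labels` below is a hypothetical helper written for illustration, not part of VIA, and it only uses NumPy.

```python
import numpy as np

def to_integer_labels(cluster_labels):
    """Map arbitrary cluster names to integer codes 0..k-1, one per sample.

    Hypothetical helper for illustration; not part of VIA.
    """
    # np.unique sorts the distinct names; return_inverse gives, for each
    # sample, the position of its name in that sorted array.
    names, codes = np.unique(np.asarray(cluster_labels), return_inverse=True)
    return codes.reshape(-1).astype(int), names

codes, names = to_integer_labels(["T", "B", "T", "NK", "B"])
print(codes)  # integer cluster membership per sample
print(names)  # names[code] recovers the original cluster name
```

Keeping `names` around makes it possible to map the integer codes back to the original cluster names when reading the plots.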

What I meant by passing your precomputed clustering into the "true_label" parameter was that in the plots of the viagraph/milestone etc. you will be able to compare the composition of your clusters for each of the via clusters in the clustergraph plot. Like you said, for exploratory data, the true labels are often just a "best guess annotation" based on DEGs of a certain clustering output, to provide some indication of the cell types in the dataset.

Shobi

wangjiawen2013 · 2022-12-19T13:43:36Z

Hi @ShobiStassen ,
the parameter "labels" didn't work for me, when I set an integer labels list, the following error occurred:

TypeError Traceback (most recent call last)
in
2 too_big_factor=0.3, root_user=root_user, preserve_disconnected=True, pseudotime_threshold_TS=30, num_threads=num_threads,
3 dataset=dataset, random_seed=random_seed, resolution_parameter=0.2)
----> 4 v0.run_VIA()

~/programs/miniconda3/envs/py37/lib/python3.7/site-packages/pyVIA/core.py in run_VIA(self)
2963 self.knn_struct = _construct_knn(self.data, knn=self.knn, distance=self.distance, num_threads=self.num_threads)
2964 st = time.time()
-> 2965 self.run_subPARC()
2966 run_time = time.time() - st
2967 print(f'{datetime.now()}\tTime elapsed {round(run_time,1)} seconds')

~/programs/miniconda3/envs/py37/lib/python3.7/site-packages/pyVIA/core.py in run_subPARC(self)
2360 self.node_degree_list = node_deg_list
2361 print(f"{datetime.now()}\tBegin projection of pseudotime and lineage likelihood")
-> 2362 self.single_cell_bp, self.single_cell_pt_markov = self.project_branch_probability_sc(bp_array, df_graph['markov_pt'].values)
2363 #print('scmarkov', self.single_cell_pt_markov[0:10])
2364

~/programs/miniconda3/envs/py37/lib/python3.7/site-packages/pyVIA/core.py in project_branch_probability_sc(self, bp_array_clus, pt)
902 rows, cols, weights = [], [], []
903 for i, row in enumerate(neighbors):
--> 904 neighboring_clus = self.labels[row]
905 for c in set(list(neighboring_clus)):
906 rows.append(i)

TypeError: only integer scalar arrays can be converted to a scalar index

wangjiawen2013 · 2022-12-19T13:50:01Z

Does this link help:
https://www.jianshu.com/p/4c4039aa6020

MinatoKobashi · 2022-12-20T13:11:56Z

This might be caused by indexing a list with an array of indices, which a Python list does not support. You can convert the list to a numpy array and pass the array to labels.

ShobiStassen · 2022-12-20T21:13:53Z

Hi, I think Minato is right. Try converting your list to an ndarray of shape (n_samples,) using np.asarray().
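
The behaviour can be reproduced with plain NumPy, independent of VIA. A minimal sketch, where the variable names are illustrative and only the indexing pattern (`self.labels[row]` in the traceback above) comes from this thread:

```python
import numpy as np

labels_list = [0, 1, 1, 2, 0]        # cluster label per sample, as a plain list
neighbors_row = np.array([0, 2, 3])  # indices of one sample's neighbors

# A Python list cannot be indexed with an array of several indices.
error_message = None
try:
    labels_list[neighbors_row]
except TypeError as err:
    error_message = str(err)
print(error_message)

# An ndarray of shape (n_samples,) supports this kind of indexing.
labels_arr = np.asarray(labels_list)
neighboring_clus = labels_arr[neighbors_row]
print(labels_arr.shape, neighboring_clus)
```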

In the examples.py code there is a short example on the toy data at line 748.

Also please note that if you are specifying terminal groups or cells there are two params depending on if you are specifying single cell indices or group level based on true label:

wangjiawen2013 · 2022-12-21T06:01:06Z

Converting the list to a numpy array solved the problem. I tried using a set of kmeans labels, and this time VIA ran successfully, but when I used other labels, VIA still failed. I think VIA has some prerequisites for the labels.

2022-12-21 13:56:11.494170 Running VIA over input data of 564 (samples) x 30 (features)
2022-12-21 13:56:11.494384 Knngraph has 30 neighbors
2022-12-21 13:56:11.632841 Finished global pruning of 30-knn graph used for clustering at level of 0.5. Kept 65.5 % of edges.
2022-12-21 13:56:11.635684 Number of connected components used for clustergraph is 1
<built-in method now of type object at 0x7fe868736960> Using predfined labels provided by user
2022-12-21 13:56:11.655055 Making cluster graph. Global cluster graph pruning level: 0.5
/home/wangjw/programs/miniconda3/envs/py37/lib/python3.7/site-packages/pyVIA/utils_via.py:239: RuntimeWarning: divide by zero encountered in double_scalars
weights = [(w + w_min) / scale_factor for w in weights]
/home/wangjw/programs/miniconda3/envs/py37/lib/python3.7/site-packages/pyVIA/utils_via.py:87: RuntimeWarning: invalid value encountered in subtract
Tcsr.data -= np.min(Tcsr.data) - 1
2022-12-21 13:56:11.656922 Graph has 17 connected components before pruning
2022-12-21 13:56:11.657907 Graph has 18 connected components before pruning
/home/wangjw/programs/miniconda3/envs/py37/lib/python3.7/site-packages/numpy/core/_methods.py:230: RuntimeWarning: invalid value encountered in subtract
x = asanyarray(arr - arrmean)
2022-12-21 13:56:11.658318 0.0% links trimmed from local pruning relative to start
2022-12-21 13:56:11.659886 Starting make edgebundle viagraph...
2022-12-21 13:56:11.659916 Make via clustergraph edgebundle
2022-12-21 13:56:11.817418 Hammer dims: Nodes shape: (18, 2) Edges shape: (2, 3)
2022-12-21 13:56:11.820362 Computing lazy-teleporting expected hitting times
/home/wangjw/programs/miniconda3/envs/py37/lib/python3.7/site-packages/numpy/core/fromnumeric.py:3441: RuntimeWarning: Mean of empty slice.
out=out, **kwargs)
/home/wangjw/programs/miniconda3/envs/py37/lib/python3.7/site-packages/numpy/core/_methods.py:189: RuntimeWarning: invalid value encountered in double_scalars
ret = ret.dtype.type(ret / rcount)
/home/wangjw/programs/miniconda3/envs/py37/lib/python3.7/site-packages/numpy/core/_methods.py:263: RuntimeWarning: Degrees of freedom <= 0 for slice
keepdims=keepdims, where=where)
/home/wangjw/programs/miniconda3/envs/py37/lib/python3.7/site-packages/numpy/core/_methods.py:223: RuntimeWarning: invalid value encountered in true_divide
subok=False)
/home/wangjw/programs/miniconda3/envs/py37/lib/python3.7/site-packages/numpy/core/_methods.py:254: RuntimeWarning: invalid value encountered in double_scalars
ret = ret.dtype.type(ret / rcount)
Process Process-134:
Traceback (most recent call last):
File "/home/wangjw/programs/miniconda3/envs/py37/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
self.run()
File "/home/wangjw/programs/miniconda3/envs/py37/lib/python3.7/multiprocessing/process.py", line 99, in run
self._target(*self._args, **self._kwargs)
File "/home/wangjw/programs/miniconda3/envs/py37/lib/python3.7/site-packages/pyVIA/core.py", line 144, in simulate_markov_sub
nextState = np.random.choice(range(P.shape[0]), p=P[currentState])
File "mtrand.pyx", line 935, in numpy.random.mtrand.RandomState.choice
ValueError: probabilities contain NaN
Process Process-135:
Traceback (most recent call last):
File "/home/wangjw/programs/miniconda3/envs/py37/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
self.run()
File "/home/wangjw/programs/miniconda3/envs/py37/lib/python3.7/multiprocessing/process.py", line 99, in run
self._target(*self._args, **self._kwargs)
File "/home/wangjw/programs/miniconda3/envs/py37/lib/python3.7/site-packages/pyVIA/core.py", line 144, in simulate_markov_sub
nextState = np.random.choice(range(P.shape[0]), p=P[currentState])
File "mtrand.pyx", line 935, in numpy.random.mtrand.RandomState.choice
ValueError: probabilities contain NaN
Process Process-136:
Traceback (most recent call last):
File "/home/wangjw/programs/miniconda3/envs/py37/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
self.run()
File "/home/wangjw/programs/miniconda3/envs/py37/lib/python3.7/multiprocessing/process.py", line 99, in run
self._target(*self._args, **self._kwargs)
File "/home/wangjw/programs/miniconda3/envs/py37/lib/python3.7/site-packages/pyVIA/core.py", line 144, in simulate_markov_sub
nextState = np.random.choice(range(P.shape[0]), p=P[currentState])
File "mtrand.pyx", line 935, in numpy.random.mtrand.RandomState.choice
ValueError: probabilities contain NaN
Process Process-137:
Traceback (most recent call last):
File "/home/wangjw/programs/miniconda3/envs/py37/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
self.run()
File "/home/wangjw/programs/miniconda3/envs/py37/lib/python3.7/multiprocessing/process.py", line 99, in run
self._target(*self._args, **self._kwargs)
File "/home/wangjw/programs/miniconda3/envs/py37/lib/python3.7/site-packages/pyVIA/core.py", line 144, in simulate_markov_sub
nextState = np.random.choice(range(P.shape[0]), p=P[currentState])
File "mtrand.pyx", line 935, in numpy.random.mtrand.RandomState.choice
ValueError: probabilities contain NaN
Process Process-138:
Traceback (most recent call last):
File "/home/wangjw/programs/miniconda3/envs/py37/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
self.run()
File "/home/wangjw/programs/miniconda3/envs/py37/lib/python3.7/multiprocessing/process.py", line 99, in run
self._target(*self._args, **self._kwargs)
File "/home/wangjw/programs/miniconda3/envs/py37/lib/python3.7/site-packages/pyVIA/core.py", line 144, in simulate_markov_sub
nextState = np.random.choice(range(P.shape[0]), p=P[currentState])
File "mtrand.pyx", line 935, in numpy.random.mtrand.RandomState.choice
ValueError: probabilities contain NaN
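
What VIA actually requires of the labels is not established in this thread. As a generic sanity check one might run before passing a label array, here is a sketch assuming (not confirmed) that a 1-D integer array with contiguous values 0..k-1 is the safest input; `summarize_labels` is a hypothetical helper, not part of VIA.

```python
import numpy as np

def summarize_labels(labels, n_samples):
    """Return (problems, cluster_sizes) for a candidate label array.

    Hypothetical helper; the checks are assumptions, not VIA's documented rules.
    """
    labels = np.asarray(labels)
    problems = []
    if labels.shape != (n_samples,):
        problems.append(f"expected shape ({n_samples},), got {labels.shape}")
    is_int = np.issubdtype(labels.dtype, np.integer)
    if not is_int:
        problems.append(f"labels have dtype {labels.dtype}, not an integer dtype")
    values, counts = np.unique(labels, return_counts=True)
    if is_int and not np.array_equal(values, np.arange(len(values))):
        problems.append("label values are not contiguous 0..k-1")
    sizes = dict(zip(values.tolist(), counts.tolist()))
    return problems, sizes

problems, sizes = summarize_labels([0, 0, 2, 2, 5], n_samples=5)
print(problems)  # lists any checks that failed
print(sizes)     # number of samples per cluster; very small clusters stand out
```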

wangjiawen2013 · 2022-12-21T06:09:15Z

I have an anndata object and want to subset some clusters for trajectory inference.
I ran into an issue when using palantir: when I subset the anndata object to infer pseudotime for a subset of clusters, the neighbors need to be recomputed.
So in this case, does VIA need to recompute the knn when users pass labels, in order to run successfully?
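
The question is not answered in this thread, though the traceback above shows run_VIA calling `_construct_knn` on `self.data`, which suggests the graph is built from whatever matrix is passed in. Under that assumption, the data matrix and the labels would need to be subset together. A sketch with plain NumPy arrays, where `subset_by_clusters` is a hypothetical helper and not part of VIA or anndata:

```python
import numpy as np

def subset_by_clusters(X, labels, keep):
    """Keep only samples whose label is in `keep`; relabel them 0..k-1.

    Hypothetical helper for illustration only.
    """
    labels = np.asarray(labels)
    mask = np.isin(labels, keep)          # True for samples in the kept clusters
    X_sub = np.asarray(X)[mask]           # rows and labels stay aligned
    kept_values, new_labels = np.unique(labels[mask], return_inverse=True)
    return X_sub, new_labels.reshape(-1), kept_values

X = np.arange(12).reshape(6, 2)           # 6 samples x 2 features (toy data)
labels = [0, 3, 3, 5, 0, 5]
X_sub, new_labels, kept = subset_by_clusters(X, labels, keep=[3, 5])
print(X_sub.shape, new_labels, kept)
```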

Repository owner deleted a comment from wangjiawen2013 Jan 3, 2023
