allpairs output not congruent with sphere clustering output? #13

claudiuskerth · 2016-03-10T13:48:08Z

Hi Eduard,

I have run the latest starcode version from the allpairs branch on an input file of distinct sequences (same as referenced in "cluster size" bug report) with the command line:

starcode -d8 -t22 -i sort_uniqed_reads.txt -o starcode_edges.out

I count 96,071 distinct sequences (nodes) in the output file. There are 134,492 distinct sequences in the input file. So I infer from that 38,421 (134492 - 96,071) sequences are singletons, i. e. have no tau-match with any other sequence in the input.

To check that, I looked into the output from a run with the newest (i. e. cluster size corrected) version of starcode from the master branch created with the following command:

starcode -d8 -t22 -s --print-clusters -i sort_uniqed_reads.txt -o starcode.out

I assume that with sphere clustering, any sequence that does not have a tau-match will be put in a cluster on its own. Counting the number of "clusters" with only one sequence I get 41,908. Shouldn't this rather be 38,421?

many thanks for your help,

claudius

The text was updated successfully, but these errors were encountered:

ezorita · 2016-03-10T16:08:19Z

Hi Claudius,

well, this is not necessarily true. Spheres clustering works as follows:

Select the node with greater count (the cluster centroid).
Visit its matching nodes and claim their count.
Set count = 0 at the matching nodes.
Repeat 1 until all nodes with count > 0 have been processed.

So now imagine that we have the following network after all pairs search:

A - B - C

i.e. A matches B, B matches C, but d(A,C) > d. If A,B and C have the same number of counts, step 1 will select a sequence at random. Therefore, if either A or C are selected first, they will claim B and the other will be left alone (thus generating a singleton). Otherwise if B is selected first, the three sequences will cluster together.

The point that you raised is interesting because starcode is not optimizing the centroids in spheres clustering. The optimal centroid choice would be the one that minimizes the final number of clusters and this can be achieved by modifying step 1:

Select the node that generates a greater cluster.

Precomputing this is not straightforward given the current implementation of starcode but I will work on it during the following days.

Note, however, that even if the centroid choice is optimal the A,B,C example still applies and the number of singletons won't necessarily coincide with the result of all pairs search.

Thank you for contributing!

Eduard

claudiuskerth · 2016-03-11T17:38:00Z

Hi Eduard,

thank you for the clarification!

I know this is not the intended use case for starcode, but I try to cluster sequences that come from the same locus while avoiding as much as possible clustering sequences from different but similar (paralogous) loci. I don't know which graph clustering algorithm does best for this problem, but in any case I would also be interested in a raw clusters output from starcode, i. e. before applying any clustering algorithm that severs edges between nodes.

cheers

claudius

ezorita · 2016-03-15T12:01:26Z

Yes, I can make a raw version that prints all the edges found for each node. In the A,B,C example it would be something like:
B A
B C
A B
C B
is this what you mean?

Eduard

claudiuskerth · 2016-03-15T13:20:56Z

Hi Eduard,

that's not quite what I mean. I am thinking of a "clustering" output with a canonical sequence, cluster size and list of child nodes (and grandchild nodes and great grandchild nodes ...), but from the raw clusters of the all-pairs search. For instance, if there was a path from every input sequence to any other input sequence, that would output just one big cluster. So, in principle, the cluster radius could be much bigger than tau if there were intermediate nodes between the canonical sequence and a sequence that is more than tau edits apart.

claudius

ezorita · 2016-03-16T08:13:55Z

I see, this feature should be easy to add. I am busy analyzing experimental data right now, but I will get back to this asap.

Eduard

…s clustering centroid selection (not yet optimal). Changed Eduard's e-mail address in src files. Added article citation to README files.

ezorita · 2016-03-18T11:59:50Z

Please pull the update in master branch.

Eduard

claudiuskerth · 2016-11-03T13:08:10Z

Hi Eduard,

thank you for adding the --connected-comp output to starcode.

I am just wondering how the canonical sequence is selected for this output, i. e. random or sequence with highest coverage or even some kind of majority consensus?

claudius

ezorita · 2016-12-07T11:44:58Z

Sorry Claudius, I missed this one.

Yes, the canonical sequence with --connected-comp is the one in the cluster with more counts. In case of count tie, the sequence with greater number of edges is the canonical. If two or more sequences have maximum count and the same number of edges, one of them is returned at random.

Eduard

claudiuskerth · 2016-12-07T19:32:46Z

Thank you Eduard!

ezorita added the enhancement label Mar 10, 2016

ezorita self-assigned this Mar 10, 2016

ezorita added a commit that referenced this issue Mar 18, 2016

Added connected components output as discussed in #13. Updated sphere…

87cf36d

…s clustering centroid selection (not yet optimal). Changed Eduard's e-mail address in src files. Added article citation to README files.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

allpairs output not congruent with sphere clustering output? #13

allpairs output not congruent with sphere clustering output? #13

claudiuskerth commented Mar 10, 2016

ezorita commented Mar 10, 2016

claudiuskerth commented Mar 11, 2016

ezorita commented Mar 15, 2016

claudiuskerth commented Mar 15, 2016

ezorita commented Mar 16, 2016

ezorita commented Mar 18, 2016

claudiuskerth commented Nov 3, 2016

ezorita commented Dec 7, 2016

claudiuskerth commented Dec 7, 2016

allpairs output not congruent with sphere clustering output? #13

allpairs output not congruent with sphere clustering output? #13

Comments

claudiuskerth commented Mar 10, 2016

ezorita commented Mar 10, 2016

claudiuskerth commented Mar 11, 2016

ezorita commented Mar 15, 2016

claudiuskerth commented Mar 15, 2016

ezorita commented Mar 16, 2016

ezorita commented Mar 18, 2016

claudiuskerth commented Nov 3, 2016

ezorita commented Dec 7, 2016

claudiuskerth commented Dec 7, 2016