Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

allpairs output not congruent with sphere clustering output? #13

Open
claudiuskerth opened this issue Mar 10, 2016 · 9 comments
Open

allpairs output not congruent with sphere clustering output? #13

claudiuskerth opened this issue Mar 10, 2016 · 9 comments
Assignees

Comments

@claudiuskerth
Copy link

Hi Eduard,

I have run the latest starcode version from the allpairs branch on an input file of distinct sequences (same as referenced in "cluster size" bug report) with the command line:

starcode -d8 -t22 -i sort_uniqed_reads.txt -o starcode_edges.out

I count 96,071 distinct sequences (nodes) in the output file. There are 134,492 distinct sequences in the input file. So I infer from that 38,421 (134492 - 96,071) sequences are singletons, i. e. have no tau-match with any other sequence in the input.

To check that, I looked into the output from a run with the newest (i. e. cluster size corrected) version of starcode from the master branch created with the following command:

starcode -d8 -t22 -s --print-clusters -i sort_uniqed_reads.txt -o starcode.out

I assume that with sphere clustering, any sequence that does not have a tau-match will be put in a cluster on its own. Counting the number of "clusters" with only one sequence I get 41,908. Shouldn't this rather be 38,421?

many thanks for your help,

claudius

@ezorita ezorita self-assigned this Mar 10, 2016
@ezorita
Copy link
Collaborator

ezorita commented Mar 10, 2016

Hi Claudius,

well, this is not necessarily true. Spheres clustering works as follows:

  1. Select the node with greater count (the cluster centroid).
  2. Visit its matching nodes and claim their count.
  3. Set count = 0 at the matching nodes.
  4. Repeat 1 until all nodes with count > 0 have been processed.

So now imagine that we have the following network after all pairs search:

A - B - C

i.e. A matches B, B matches C, but d(A,C) > d. If A,B and C have the same number of counts, step 1 will select a sequence at random. Therefore, if either A or C are selected first, they will claim B and the other will be left alone (thus generating a singleton). Otherwise if B is selected first, the three sequences will cluster together.

The point that you raised is interesting because starcode is not optimizing the centroids in spheres clustering. The optimal centroid choice would be the one that minimizes the final number of clusters and this can be achieved by modifying step 1:

  1. Select the node that generates a greater cluster.

Precomputing this is not straightforward given the current implementation of starcode but I will work on it during the following days.

Note, however, that even if the centroid choice is optimal the A,B,C example still applies and the number of singletons won't necessarily coincide with the result of all pairs search.

Thank you for contributing!

Eduard

@claudiuskerth
Copy link
Author

Hi Eduard,

thank you for the clarification!

I know this is not the intended use case for starcode, but I try to cluster sequences that come from the same locus while avoiding as much as possible clustering sequences from different but similar (paralogous) loci. I don't know which graph clustering algorithm does best for this problem, but in any case I would also be interested in a raw clusters output from starcode, i. e. before applying any clustering algorithm that severs edges between nodes.

cheers

claudius

@ezorita
Copy link
Collaborator

ezorita commented Mar 15, 2016

Yes, I can make a raw version that prints all the edges found for each node. In the A,B,C example it would be something like:
B A
B C
A B
C B
is this what you mean?

Eduard

@claudiuskerth
Copy link
Author

Hi Eduard,

that's not quite what I mean. I am thinking of a "clustering" output with a canonical sequence, cluster size and list of child nodes (and grandchild nodes and great grandchild nodes ...), but from the raw clusters of the all-pairs search. For instance, if there was a path from every input sequence to any other input sequence, that would output just one big cluster. So, in principle, the cluster radius could be much bigger than tau if there were intermediate nodes between the canonical sequence and a sequence that is more than tau edits apart.

claudius

@ezorita
Copy link
Collaborator

ezorita commented Mar 16, 2016

I see, this feature should be easy to add. I am busy analyzing experimental data right now, but I will get back to this asap.

Eduard

ezorita added a commit that referenced this issue Mar 18, 2016
…s clustering centroid selection (not yet optimal). Changed Eduard's e-mail address in src files. Added article citation to README files.
@ezorita
Copy link
Collaborator

ezorita commented Mar 18, 2016

Please pull the update in master branch.

Eduard

@claudiuskerth
Copy link
Author

Hi Eduard,

thank you for adding the --connected-comp output to starcode.

I am just wondering how the canonical sequence is selected for this output, i. e. random or sequence with highest coverage or even some kind of majority consensus?

claudius

@ezorita
Copy link
Collaborator

ezorita commented Dec 7, 2016

Sorry Claudius, I missed this one.

Yes, the canonical sequence with --connected-comp is the one in the cluster with more counts. In case of count tie, the sequence with greater number of edges is the canonical. If two or more sequences have maximum count and the same number of edges, one of them is returned at random.

Eduard

@claudiuskerth
Copy link
Author

Thank you Eduard!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants