Report a bug about cluster distance #27

Open
zhangfei1947 opened this issue Oct 9, 2018 · 1 comment

@zhangfei1947

Dear developer,
Sorry to bother you. I think I found a bug in starcode, and it really confuses me.
I ran this command: "starcode -d 1 -t 4 --input b1.fa --print-clusters > b1.cluster"
but the output looks like this. It seems that the distance setting did not take effect?
CATACCAA 2310939 AAAACCAA,AAAACCAC,AAAACCAG,AAAACCAT,AAAACCCA,AAAACCCG,......
...
...

@ezorita
Collaborator

ezorita commented Oct 9, 2018

Hi, thanks for reporting.

Well, this is not a bug; this is how starcode is supposed to work. The output of starcode is the result of clustering the sequences after matching them at the specified distance. To keep things simple, there are two steps:

  1. In the first step, starcode finds all the pairs of sequences that match each other at distance 1.
  2. In the second step, it builds a network from the matching sequences and clusters them with the specified algorithm (message passing by default).

That means that even if you set -d to 1, some clusters may contain sequences that are more than 1 mismatch apart, because a cluster can be built from a chain of distance-1 matches. This happens especially when the input data set is very dense (almost all combinations of nucleotides are present).
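To illustrate the effect of a dense input, here is a minimal sketch in Python. It is not starcode's implementation: it assumes Hamming distance on equal-length sequences and replaces message passing with plain connected components over the distance-1 network. The helper names `hamming` and `connected_components` are made up for this illustration.

```python
from itertools import combinations, product


def hamming(a, b):
    """Number of mismatching positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def connected_components(seqs, max_dist=1):
    """Group sequences that are linked by chains of matches within max_dist.

    Simplification: this is plain connected components (union-find),
    not the message passing algorithm that starcode uses by default.
    """
    seqs = list(seqs)
    parent = {s: s for s in seqs}

    def find(s):
        # Follow parent links to the root, compressing the path on the way.
        while parent[s] != s:
            parent[s] = parent[parent[s]]
            s = parent[s]
        return s

    # Step 1: find all matching pairs. Step 2: merge them into one group.
    for a, b in combinations(seqs, 2):
        if hamming(a, b) <= max_dist:
            parent[find(a)] = find(b)

    groups = {}
    for s in seqs:
        groups.setdefault(find(s), []).append(s)
    return list(groups.values())


# A fully dense input: every possible 4-mer over A, C, G, T.
dense = ["".join(p) for p in product("ACGT", repeat=4)]
clusters = connected_components(dense, max_dist=1)

# Largest distance between any two members of the first cluster.
widest = max(hamming(a, b) for a, b in combinations(clusters[0], 2))
print(len(clusters), len(clusters[0]), widest)
```

Under these assumptions, all 256 sequences end up in a single group even though some members differ at every position. The real message passing algorithm also takes sequence counts into account, so starcode's actual clusters can be split differently.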

To prevent this, you can set a more restrictive clustering ratio, but whether that is appropriate depends on the nature of the biological data you are trying to cluster.

I can try to help you further, but I would need more information about the sequences you are feeding to starcode.

Here is an example of what is going on.

Say that you have an input with three sequences:

1. ATTTGAC
2. ATTCGAC
3. ATTCCAC

We set starcode to find matches at distance 1, and it finds the following matches:

1 matches 2 at distance 1
2 matches 3 at distance 1

So the network is:

ATTTGAC <-> ATTCGAC <-> ATTCCAC

This results in a single cluster that contains all three sequences, with ATTCGAC as the centroid, even though ATTTGAC and ATTCCAC are at distance 2 from each other.
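The distances in this example can be checked with a few lines of Python. This is only a sketch: it counts mismatches (Hamming distance), which is enough here because the three sequences have the same length, and the names `hamming`, `dist` and `edges` are illustrative.

```python
from itertools import combinations

seqs = ["ATTTGAC", "ATTCGAC", "ATTCCAC"]


def hamming(a, b):
    """Number of mismatching positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


# Distance for every pair of input sequences.
dist = {(a, b): hamming(a, b) for a, b in combinations(seqs, 2)}

# Only pairs at distance 1 or less become links in the network.
edges = [pair for pair, d in dist.items() if d <= 1]

for (a, b), d in dist.items():
    print(a, b, d, "link" if d <= 1 else "no link")
```

The two outer sequences have no direct link, but both are linked to ATTCGAC, so all three fall into the same connected network.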
