##1 mismatch by cluster #26

Open
penglbio opened this issue Oct 8, 2018 · 7 comments

@penglbio

penglbio commented Oct 8, 2018

Sorry to trouble you. In a paper, I saw someone use your software (starcode) to cluster sequences within 1 nt mismatch. The following is the paper title and description:
Title: Single-cell profiling of the developing mouse brain and spinal cord with split-pool barcoding
Description: We then used Starcode (45) to collapse UMIs of aligned reads that were within 1 nt mismatch of another UMI

I am confused because I couldn't find a parameter in your software to set this. Can you tell me whether there is a way to do it?

@ezorita
Collaborator

ezorita commented Oct 8, 2018

The parameter -d specifies the clustering distance (the number of mismatched nucleotides you want to allow). So, in your example with distance 1, you'd run starcode as follows:

starcode -d1 input-file.fastq

Hope it helps.

@penglbio
Author

penglbio commented Oct 9, 2018

I will try. Thank you very much.

@penglbio
Author

penglbio commented Oct 9, 2018

What about FASTA input? I tested it as follows, but it didn't work.
$ starcode -d 1 test_file.fasta
running starcode with 1 thread
reading input files
FASTA format detected
sorting
progress: 100.00%
message passing clustering
AGGGCTTACAAGTATAGGCC 2
CCTCATTATTTGTCGCAATG 1
TGCGCCAAGTACGATTTCCG 1
TGGGCTTACAAGTATAGGCC 1

The last sequence has just 1 mismatch with the first, but they were not clustered together.

@ezorita
Collaborator

ezorita commented Oct 9, 2018

Note that you are using the message passing algorithm for clustering. Message passing has a parameter called --cluster-ratio, which is set to 5 by default. This parameter restricts the ratio of counts needed to cluster one sequence with another. In other words, by default two sequences will only be clustered together if the count of one is at least 5 times larger than the count of the other.

In your example, you are running starcode with just a few sequences and default parameters. The last and the first sequence did not cluster together because the ratio of their counts is 2 (the first has 2 counts and the last has only 1), which is below the default of 5.
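
The check described above can be sketched in a few lines of Python. This is only an illustration, not starcode's actual code: the function names `mismatches` and `would_merge` are hypothetical, the distance is a plain mismatch count between equal-length sequences (a simplification), and the "at least 5 times" rule is taken literally from the explanation above.

```python
def mismatches(a: str, b: str) -> int:
    """Count mismatched positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def would_merge(count_big: int, count_small: int, dist: int,
                max_dist: int = 1, cluster_ratio: float = 5.0) -> bool:
    """Assumed rule: merge only if the sequences are within max_dist and
    the larger count is at least cluster_ratio times the smaller count."""
    return dist <= max_dist and count_big >= cluster_ratio * count_small


# First and last sequences from the output posted above.
first, last = "AGGGCTTACAAGTATAGGCC", "TGGGCTTACAAGTATAGGCC"
dist = mismatches(first, last)

print(dist)                      # 1
print(would_merge(2, 1, dist))   # False: 2 < 5 * 1
print(would_merge(10, 1, dist))  # True: 10 >= 5 * 1
```

With counts of 2 and 1 the distance condition is met but the ratio condition is not, which matches the output reported in this thread.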

So, to solve this, do one of the following:

  1. Run starcode with the whole dataset (this assumes that each canonical sequence is over-represented compared to the others).
  2. Run starcode with a smaller --cluster-ratio.
  3. Use the sphere clustering algorithm (set with the -s parameter).

Hope it helps.
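
As a rough sketch of options 2 and 3, the snippet below only builds and prints the corresponding command lines. The helper `starcode_command` is hypothetical, the ratio value of 1 is an arbitrary illustration rather than a recommendation, and only the flags mentioned in this thread (-d, --cluster-ratio, -s) are used.

```python
def starcode_command(input_path, distance=1, cluster_ratio=None, spheres=False):
    """Build a starcode command line as a list of arguments.

    Hypothetical helper: it only assembles the flags named in this thread
    and does not run anything.
    """
    cmd = ["starcode", "-d", str(distance)]
    if spheres:
        # Option 3: sphere clustering instead of message passing.
        cmd.append("-s")
    elif cluster_ratio is not None:
        # Option 2: message passing with a smaller count ratio.
        cmd += ["--cluster-ratio", str(cluster_ratio)]
    cmd.append(input_path)
    return cmd


variants = {
    "default": starcode_command("test_file.fasta"),
    "smaller ratio": starcode_command("test_file.fasta", cluster_ratio=1),
    "spheres": starcode_command("test_file.fasta", spheres=True),
}

for name, cmd in variants.items():
    print(f"{name}: {' '.join(cmd)}")
```

Each printed line can be pasted into a shell, or the list can be passed to `subprocess.run` if starcode is installed.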

@bettycatherine

I am really confused. Starcode was used in that paper for UMI collapse, so I think they used starcode-umi but not starcode. Am I correct? I am also wondering if there is any advice on how to set sequence distance when we use starcode-umi. Thank you very much!

@wangjianing-web

> I am really confused. Starcode was used in that paper for UMI collapse, so I think they used starcode-umi but not starcode. Am I correct? I am also wondering if there is any advice on how to set sequence distance when we use starcode-umi. Thank you very much!

But the UMI (10 bp) is in the R2.fq file. The paper says the cDNA reads (Read 1) were mapped to the genome, and then Starcode (45) was used to collapse UMIs of aligned reads that were within 1 nt mismatch of another UMI, assuming the two aligned reads were also from the same UBC. I don't know whether I should combine the UMI and Read 1, because the combined sequence cannot be mapped to the genome. I don't know what the correct method is.

@ezorita
Collaborator

ezorita commented Aug 21, 2020

Hi @wangjianing-web. I can't tell which is the correct method they used. You should contact the authors for more details on how they used starcode in their work. What I understand from your description is that they followed these steps:

  • Map reads to genome
  • Take mapped reads and append UMI to them
  • Use starcode to cluster reads with similar UMI (1 mismatch)
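
A minimal Python sketch of the second step, under several assumptions: mapping has already been done, reads and UMIs are matched by read ID, and the function `build_starcode_input` and the toy data are hypothetical. It is not the method used in the paper.

```python
def build_starcode_input(mapped_reads, umis):
    """Return one 'read + UMI' string per mapped read that has a UMI.

    mapped_reads: dict of read ID -> Read 1 sequence (mapped reads only)
    umis:         dict of read ID -> UMI taken from Read 2
    """
    lines = []
    for read_id, seq in mapped_reads.items():
        umi = umis.get(read_id)
        if umi is None:
            continue  # no UMI found for this read, skip it
        lines.append(seq + umi)
    return lines


# Toy data: read4 has a UMI but did not map, so it is left out.
mapped_reads = {
    "read1": "ACGTACGTAC",
    "read2": "ACGTACGTAC",
    "read3": "TTGACCAGTA",
}
umis = {
    "read1": "AAAAAAAAAA",
    "read2": "AAAAAAAAAT",  # 1 mismatch from the UMI of read1
    "read3": "CCCCCCCCCC",
    "read4": "GGGGGGGGGG",
}

for line in build_starcode_input(mapped_reads, umis):
    print(line)
```

The resulting lines could be written to a file, one sequence per line, and clustered with `starcode -d1`, assuming that input format is accepted. Note that the distance then applies to the whole concatenated string, so a mismatch in the read part counts the same as one in the UMI.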
