Consensus step : genome size lower than pre-consensus stage #231

Open
shri1984 opened this issue Mar 25, 2021 · 10 comments

Comments

@shri1984

Hi,
I am getting 12% fewer bases after the consensus step for my genome (complex and large, 100X coverage). I checked whether any contigs are missing between the lay and cns.raw.fa files and found none. What is driving this? Or is it normal to lose that many bases at the consensus stage? Are there any parameters in wtpoa-cns that I can tweak?
Thank you.

@ruanjue
Owner

ruanjue commented Mar 26, 2021

The first step is to check whether the polished contigs are more accurate using WGS short reads.

@shri1984
Author

I did that. I used polca.sh for polishing, with 1 billion Illumina PE reads (150 PE).
This is the report:
Substitution Errors: 991295
Insertion/Deletion Errors: 663392
Assembly Size: 5498989353
Consensus Quality: 99.9699
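
As a sanity check on the report above, the consensus quality figure appears to be 100 × (1 − total errors / assembly size). That formula is an assumption inferred from the numbers themselves, not something stated in this thread. A minimal sketch of the arithmetic:

```python
# Figures copied from the report above.
substitutions = 991295
indels = 663392
assembly_size = 5498989353

# ASSUMPTION: quality = 100 * (1 - errors / assembly size).
errors = substitutions + indels
error_rate = errors / assembly_size
consensus_quality = 100.0 * (1.0 - error_rate)

print(f"total errors: {errors}")
print(f"errors per base: {error_rate:.3e}")
print(f"consensus quality: {consensus_quality:.4f}")
```

Under that assumption the result matches the reported 99.9699, which corresponds to roughly 3 errors per 10 kb.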

@ruanjue
Owner

ruanjue commented Mar 26, 2021

So the reason should be that many repeats were collapsed during assembly, not a problem with wtpoa-cns. One option is to add -R to wtdbg2, which will be 2X slower. Another option is to try flye or another assembler on this dataset and pick the best assembly.

@shri1984
Author

shri1984 commented Mar 26, 2021

Thanks.
I see. I used the options you suggested (-R, aln-dovetail -1 or 1024, -l 500 etc, K 2000) for repetitive genomes (in issue #230). It worked beautifully, but things go wrong at the cns stage. Is there any other consensus-calling tool like wtpoa-cns that I can try and compare?

@ruanjue
Owner

ruanjue commented Mar 26, 2021

In the wtdbg2 step, the assembly size is reported from the uncorrected sequence length; it usually becomes smaller after wtpoa-cns.

@shri1984
Author

Do you know what an acceptable limit for this reduction is? In my case it is 12%. The data come from 7 cells of Sequel CLR. I am also using the RS preset; the assembly started to look good with this preset. Again, I got this information from other issues you addressed here. So you think I have no way out of this problem?

@ruanjue
Owner

ruanjue commented Mar 26, 2021

If the genome size was correctly estimated and the genome is complex, maybe there is no way. However, please find some contigs whose size differs substantially before and after polishing, then align the CLR long reads to their consensus sequences to see whether there are large insertions/deletions. If you find many such cases, there are probably errors when wtpoa-cns concatenates the cns sequence pieces.
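
One way to find the contigs that changed most in size is to compare per-contig lengths before and after consensus. The sketch below is only an illustration: `fasta_lengths`, `lay_lengths`, and `largest_changes` are hypothetical helper names, and `lay_lengths` assumes that contig header lines in the layout file carry a `len=` field (for example `>ctg1 nodes=12 len=34567`), which should be checked against the actual file.

```python
import gzip
import re


def _open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    return gzip.open(path, "rt") if str(path).endswith(".gz") else open(path, "rt")


def fasta_lengths(path):
    """Return {contig_name: sequence_length} for a FASTA file."""
    lengths = {}
    name = None
    with _open_text(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                name = line[1:].split()[0]
                lengths[name] = 0
            elif name is not None:
                lengths[name] += len(line)
    return lengths


def lay_lengths(path):
    """ASSUMPTION: layout headers look like '>ctg1 nodes=12 len=34567'."""
    lengths = {}
    with _open_text(path) as handle:
        for line in handle:
            if line.startswith(">"):
                match = re.search(r"\blen=(\d+)", line)
                if match:
                    lengths[line[1:].split()[0]] = int(match.group(1))
    return lengths


def largest_changes(before, after, top=10):
    """Rank contigs present in both mappings by relative length change."""
    rows = []
    for name, old in before.items():
        new = after.get(name)
        if new is None or old == 0:
            continue
        rows.append((name, old, new, (new - old) / old))
    rows.sort(key=lambda row: abs(row[3]), reverse=True)
    return rows[:top]


# Demo with made-up lengths; real input would come from the two files.
demo_before = {"ctg1": 1000, "ctg2": 2000, "ctg3": 500}
demo_after = {"ctg1": 900, "ctg2": 1000}
for name, old, new, change in largest_changes(demo_before, demo_after):
    print(f"{name}\t{old}\t{new}\t{change:+.1%}")
```

The contigs at the top of that ranking are the ones worth inspecting by aligning the CLR reads back to their consensus sequences.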

@goblin290272908

Hi,
Thank you for providing such excellent tools. We rely on wtdbg2 to assemble our genome from CCS data. At present, for our data, its result is clearly better than hifiasm's. Using the default parameters (and -g 1.3g), the direct output has 1892 contigs with an N50 of 3 Mb, and the BUSCO completeness is above 95%. However, the assembly is only 880 Mb, still too small compared with the estimated genome size. How can we adjust the parameters so that the assembly size is closer to the estimate?
Thank you!

@ruanjue
Owner

ruanjue commented Feb 25, 2022

wtdbg2 tends to collapse similar regions. For your case, please try increasing '-s 0.5' to '-s 0.8' or another value.
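
A small sweep over `-s` makes it easier to compare assembly sizes side by side. The sketch below only builds and prints the command lines; it does not run anything. It assumes the `ccs` preset (`-x ccs`) is what "default parameters" meant here, and the input file `reads.fa.gz`, the output prefixes, the thread count, and the helper name `wtdbg2_command` are all hypothetical.

```python
import shlex


def wtdbg2_command(reads, prefix, similarity, genome_size="1.3g", threads=16):
    """Build a wtdbg2 argument list for one value of -s.

    ASSUMPTION: the ccs preset is in use; file names and thread count
    are placeholders to be replaced with real values.
    """
    return [
        "wtdbg2",
        "-x", "ccs",
        "-g", genome_size,
        "-s", str(similarity),
        "-t", str(threads),
        "-i", reads,
        "-fo", prefix,
    ]


# Print one command per similarity value, from the current 0.5 up to 0.8.
for similarity in (0.5, 0.6, 0.7, 0.8):
    command = wtdbg2_command("reads.fa.gz", f"asm_s{similarity}", similarity)
    print(shlex.join(command))
```

Each run writes to its own prefix, so the resulting assembly sizes and contig counts can be compared before choosing a value.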

@goblin290272908

Thank you very much! I will try it.
