Homotetraploid, super-large genome, with different parameters, the size of p_utg varies greatly? #632

Open
GLking123 opened this issue Apr 1, 2024 · 2 comments

GLking123 commented Apr 1, 2024

Dear author,
Thank you for developing such a landmark tool, which has greatly improved the efficiency of genome assembly.

I am currently assembling the large genome of a plant species, a homotetraploid with a genome size of approximately 55 gigabases (G). I only have HiFi data available. I have tried three assembly strategies:

  1. hifiasm -t 120 -l 0: the generated .p_ctg.gfa file is 55G, and the .p_utg.gfa file is 75G.
  2. hifiasm --n-hap 2 -t 120 -l 0: the generated .p_ctg.gfa file is 55G, and the .p_utg.gfa file is 56G.
  3. hifiasm --n-hap 4 -t 120 -l 0: the generated .p_ctg.gfa file is 56G, and the .p_utg.gfa file is 76G.

Using flow cytometry, the estimated genome size is approximately 50 G.

I used HapHiC to scaffold chromosomes, but encountered numerous errors. Would using p_utg yield better results?

The p_utg size produced with --n-hap 2 matches expectations. Can this p_utg be used?

What is the difference between running without --n-hap and running with --n-hap 4? Why is p_utg so much larger with --n-hap 4 than with --n-hap 2?

The following are the k-mer plots generated by hifiasm:
[Image: Snipaste_2024-04-01_12-14-23]
[Image: Snipaste_2024-04-01_12-14-45]

For the above question, could you provide some debugging suggestions? Thank you for your valuable time and assistance. I sincerely look forward to your response!

chhylp123 (Owner) commented

--n-hap is used to determine the coverage of heterozygous nodes or contigs. For your sample, hifiasm estimates the homozygous coverage as 26, so the heterozygous coverage is 26/2 = 13 with --n-hap 2 and 26/4 = 6 with --n-hap 4. Hifiasm treats any node in the assembly graph whose coverage is above the heterozygous coverage threshold as a real node rather than a sequencing error. This is why --n-hap 4 leads to a larger graph. Could you please try --hom-cov 55 together with --n-hap 2? Looking at the k-mer plot, there are only two peaks, so the homozygous coverage should be 55.
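To make the arithmetic above concrete, here is a small Python sketch of the threshold as described in this reply. It is only an illustration, not hifiasm's actual code: the function names `het_cov_threshold` and `is_kept` are hypothetical, and rounding down with integer division is an assumption made to match the "26/4 = 6" figure.

```python
def het_cov_threshold(hom_cov: int, n_hap: int) -> int:
    """Heterozygous coverage as described in the reply: hom_cov / n_hap.

    Rounding down is an assumption chosen to reproduce 26/4 = 6.
    """
    if n_hap < 1:
        raise ValueError("n_hap must be at least 1")
    return hom_cov // n_hap


def is_kept(node_cov: int, hom_cov: int, n_hap: int) -> bool:
    """A node counts as real when its coverage is above the threshold."""
    return node_cov > het_cov_threshold(hom_cov, n_hap)


if __name__ == "__main__":
    # The three settings discussed in this thread.
    for hom_cov, n_hap in [(26, 2), (26, 4), (55, 2)]:
        threshold = het_cov_threshold(hom_cov, n_hap)
        print(f"hom_cov={hom_cov} n_hap={n_hap} -> het threshold {threshold}")

    # A node with coverage 10 is kept under --n-hap 4 but not under --n-hap 2
    # when the homozygous coverage is taken to be 26.
    print(is_kept(10, 26, 4), is_kept(10, 26, 2))
```

Under these assumptions, a lower threshold lets more low-coverage nodes survive, which is consistent with the larger p_utg seen with --n-hap 4.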

GLking123 (Author) commented Jun 2, 2024

Dear author,
I tried your suggestions, and here are the results:

hifiasm --n-hap 2 -t 120 -l 0 --hom-cov 55: the generated .p_ctg.gfa file is 56G, and the .p_utg.gfa file is 66G.

Since my sample is a homotetraploid, which output should I use for the assembly, p_ctg or p_utg?

p_ctg N50: 100 Mb
p_utg N50: 1 Mb

For the above question, could you provide some debugging suggestions? Thank you for your valuable time and assistance. I sincerely look forward to your response!
