Michael Alonge edited this page Jun 5, 2019 · 12 revisions

RaGOO Documentation

Alonge M, Soyk S, Ramakrishnan S, Wang X, Goodwin S, Sedlazeck FJ, Lippman ZB, Schatz MC: Fast and accurate reference-guided scaffolding of draft genomes. bioRxiv 2019.

Description

RaGOO is a tool to order and orient genome assembly contigs via alignments to a reference genome.

FAQ

Will RaGOO work with large genome assemblies?

Yes, and it should be relatively fast. Any assembly that can be aligned to the provided reference with Minimap2 in a reasonable amount of time should work. For example, I have successfully used RaGOO to order and orient a few human genome assemblies, and tomato genomes (~800 Mbp) take about 10 minutes to run using 8 cores. To estimate roughly how long RaGOO will take to complete, you can run the following command on its own:

$ minimap2 -k19 -w19 reference.fasta contigs.fasta

Please note that the above refers specifically to ordering and orienting. Calling structural variants, on the other hand, requires alignment to produce CIGAR strings, which takes longer with Minimap2. Still, this is relatively fast, even for human genome assemblies.
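
The timing test above can also be scripted. The following is a minimal Python sketch, not part of RaGOO: the helper names `build_minimap2_cmd` and `time_command` are hypothetical, and it assumes `minimap2` is on your `PATH`. The optional `-t` (threads) and `-c` (compute CIGAR in PAF output) flags are standard Minimap2 options, but whether RaGOO passes exactly these options internally is an assumption, so treat the result as a rough estimate only.

```python
import subprocess
import time


def build_minimap2_cmd(reference, contigs, threads=None, cigar=False):
    """Build the Minimap2 command from the FAQ as an argument list.

    threads and cigar are optional extras for illustration; they are
    not necessarily the settings RaGOO itself uses.
    """
    cmd = ["minimap2", "-k19", "-w19"]
    if threads is not None:
        cmd += ["-t", str(threads)]  # Minimap2's thread-count option
    if cigar:
        cmd.append("-c")  # also compute CIGAR strings (slower)
    cmd += [reference, contigs]
    return cmd


def time_command(cmd):
    """Run cmd, discard its standard output, and return (exit_code, seconds)."""
    start = time.perf_counter()
    result = subprocess.run(cmd, stdout=subprocess.DEVNULL)
    return result.returncode, time.perf_counter() - start
```

For example, `time_command(build_minimap2_cmd("reference.fasta", "contigs.fasta", threads=8))` times the ordering and orienting alignment, and adding `cigar=True` gives a feel for the extra cost of the CIGAR-producing alignment needed for structural variant calling.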

How closely related does my assembly have to be to the reference genome?

RaGOO will run given any pair of FASTA files (one draft assembly, one reference assembly). However, it was designed for ordering and orienting contigs from an individual that is closely related to (the same species as) the provided reference genome. Closely related individuals are more likely to have similar chromosomal structure, which is one of the factors influencing erroneous reference bias. RaGOO confidence scores can help determine whether the two genotypes are too divergent for reference-guided scaffolding. These scores are metrics associated with each of the three stages of scaffolding: assigning contigs to a chromosome, ordering contigs relative to each other along the chromosome, and orienting contigs. Confidence scores can be found in the groupings and orderings files in the RaGOO output. Divergent individuals should have more ambiguous alignments to the reference and, therefore, lower confidence scores.
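
As a rough illustration of screening those confidence scores, here is a minimal Python sketch. It is not part of RaGOO, and it makes assumptions you should check against your own output: that each record is a tab-separated line with the contig name in the first column, and that you know which column holds the score you care about. The function name `low_confidence_contigs`, the column indices, and the thresholds are all hypothetical.

```python
def low_confidence_contigs(lines, score_column, threshold):
    """Return (contig, score) pairs whose score is below threshold.

    lines: iterable of tab-separated records, contig name in column 0
           (an assumed layout; inspect your files before relying on it).
    score_column: zero-based index of the confidence score column.
    """
    flagged = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment-style lines
        fields = line.split("\t")
        score = float(fields[score_column])
        if score < threshold:
            flagged.append((fields[0], score))
    return flagged
```

Passing an open file handle for one of the groupings or orderings files as `lines` would list the contigs that fall below the chosen cutoff. A large share of flagged contigs would suggest that the assembly may be too divergent from the reference.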

Won't this introduce erroneous reference bias into my de novo assembly?

Erroneous reference bias is the phenomenon of introducing misassemblies by virtue of using a reference genome to guide scaffolding. This could lead to masking true biological variation between the genotype of the assembly and that of the reference, or it could lead to errors in the reference being passed on to the new ordered and oriented assembly. The way I see it, there are three main factors that influence erroneous reference bias:

  1. Draft assembly contiguity and accuracy
  2. Reference assembly accuracy
  3. Shared chromosomal structure between the two genotypes

RaGOO is best used when these three conditions are favorable, which minimizes reference bias. For an example, please read about our tomato assemblies in our paper. We argue that when the above conditions are favorable, de novo scaffolding methods may be more expensive and less accurate than our reference-guided strategy.

Still, even if the three outlined conditions are ideal, some reference bias is likely to occur. If possible, validate the accuracy of your assembly with independent data, such as long sequencing reads, Hi-C, or optical maps.