
Some things to think about for tuning assembly parameters


General Genome Assembly Considerations

Intuitively, making a good, highly contiguous assembly depends on a few major factors:

(1) enough coverage to cover every piece of the genome, (2) reads longer than the majority of the repeats in the genome, and (3) errors that are random.

How do we know if the data we collect for genome assembly satisfies these conditions? In reality, if we sequence a new genome that no one has studied before, we will have no clue. Well, that is not totally true. There are usually methods to estimate the genome size, and some short read data might give an idea about various aspects of the repeats in the genome. However, unless we truly reconstruct the genome, we don't have the full information. This is a bit of a chicken-and-egg problem, and it also poses some challenges for optimizing the assembly parameters. This short discussion focuses on general considerations for using DAligner/FALCON to assemble genomes.

Checking raw read length distribution

Most of the time, it is important to check the length distribution of the read set. There are many ways to check the read lengths before or during a (FALCON) assembly run. It is really easy to use DBstats from the DAZZ_DB package to check the length distribution if the Dazzler database of the reads has already been built.

For example,

DBstats 0-rawread/raw_reads.db

will show the read length distribution. What you need to focus on is the "% Bases" column. The relevant parameter in the FALCON configuration is length_cutoff. If you are coverage-limited, you should choose a lower length_cutoff to ensure you get enough reads to be corrected to cover the genome. In some later versions of FALCON, a length cutoff can be chosen automatically. However, it is always a good idea to check it and make sure everything actually makes sense. Typically, one probably needs 20x or more of the expected genome size.

On the other hand, be aware that estimated genome sizes can be off by a lot in some cases. Also, not all sequences may come from the genomic DNA. Depending on the DNA extraction process, one might have excessive reads from organelles (e.g. mitochondrial or chloroplast DNA) and contaminants (bacterial or fungal). You will probably not know how much of the data they account for before doing the assembly. What one can try to do is get rid of them during the sample preparation steps; however, that is not always possible. In such cases, you will need to calibrate the length cutoff. You might be able to estimate the amount of reads belonging to organelle DNA by doing an assembly from a subset of your data. Once you identify those sequences, you can use the assembled sequences for the estimation. It is also possible to remove them before assembly.
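To make the cutoff choice concrete, here is a minimal Python sketch (not part of FALCON itself) that picks the largest length cutoff still yielding a target raw coverage. The read lengths and genome size are placeholders you would fill in from your own data, e.g. from the DBstats output above.

    def pick_length_cutoff(read_lengths, genome_size, target_coverage=20):
        """Largest cutoff such that reads >= cutoff still give at least
        target_coverage x genome_size bases."""
        target = target_coverage * genome_size
        total = 0
        for length in sorted(read_lengths, reverse=True):
            total += length
            if total >= target:
                # Keeping every read at least this long meets the target.
                return length
        return None  # not enough bases even with all reads

    # Toy demo with made-up numbers: a 5 Mb genome.
    import random
    random.seed(0)
    reads = [random.randint(500, 30000) for _ in range(20000)]
    print(pick_length_cutoff(reads, genome_size=5_000_000))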

Checking p-read length distribution

Not every raw read longer than the length_cutoff will be fully corrected; there will be some loss. The final coverage and read length distribution that one should really check, after the computationally intensive error correction step, is the "p-read" (pre-assembled read, a.k.a. error-corrected read) length distribution. Here is how to check that:

DBstats 1-preads_ovl/preads.db

Ideally, you should have more than 10x-15x (just a guesstimated number) of the genome size in p-reads longer than length_cutoff_pr. If the initial length_cutoff is large, e.g. 15kb, and one only gets less than 10x of p-reads longer than 15kb, it is possible to lower length_cutoff_pr and repeat the 1-preads_ovl step to get more coverage. Otherwise, you will have to re-adjust length_cutoff, falcon_sense_option and other options to increase the sensitivity of the initial read-to-read overlap finding step (0-rawreads). We will discuss this later.
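Checking the resulting coverage is the same kind of arithmetic; here is a small sketch, with pread_lengths standing in for whatever you extract from the DBstats output of the preads database.

    def coverage_above_cutoff(pread_lengths, genome_size, cutoff):
        """Fold coverage contributed by p-reads at least cutoff long."""
        return sum(l for l in pread_lengths if l >= cutoff) / genome_size

    # Aim for roughly >= 10x-15x above length_cutoff_pr, per the
    # guesstimate above; lower the cutoff if you fall short.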

If, unfortunately, one doesn't get enough long reads and needs to lower length_cutoff_pr to get enough coverage, the assembly contiguity will likely be bounded by the repeat structure of the genome. In that case, the best way to get a better assembly, if possible, is to add more data sequenced from longer DNA molecules. There are other possible bioinformatics methods, or combinations with other data types, to improve the assembly; those are beyond the scope of this discussion note. Note that there are some fundamental theoretical limits on assembly contiguity as a function of the read length distribution.

Sensitivity and specificity during the first overlap stage for error correction

One important factor that affects the assembly quality is whether we find all true overlaps, both during the error correction stage and during the stage of finding overlaps for constructing the assembly graph.

Let's focus on the overlap stage for error correction first.

DAligner is a very high performance overlapper, and it can be very sensitive at finding all overlaps between reads. However, there are a number of parameters that affect the sensitivity. Also, in certain scenarios, specificity will be more important than sensitivity. We will discuss this later.

What parameters affect the sensitivity and specificity of DAligner? We need to know a little bit more about the internal workings of DAligner.

Here is how we run DAligner:

(1) import all the reads into a binary database, (2) "logically" split the reads into "blocks", where each block can be as large as 400Mb, and (3) compare the reads within each pair of blocks in the database to each other.

When the reads in blocks are compared to each other, this is what happens:

(1) k-mers of a certain size k are extracted from the query block and the target block, (2) the k-mers are sorted and compared between the query and target blocks to find candidate hits between the reads of the two blocks, and (3) detailed alignments are performed for those candidate hits; a sketch of steps (1) and (2) follows below. If the reads in a candidate pair do not align to each other, the pair is discarded. Otherwise, a summary of the alignment information is written into the .las (I guess it stands for "Local Alignment Storage") files.
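To make steps (1) and (2) concrete, here is a deliberately naive Python sketch of k-mer seeding between two blocks. It is only an illustration of the idea; DAligner itself sorts and merges k-mer lists rather than hashing, and step (3), the detailed alignment, is omitted.

    def kmers(seq, k=14):
        """Yield every k-mer of a read together with its position."""
        for i in range(len(seq) - k + 1):
            yield seq[i:i + k], i

    def candidate_hits(query_block, target_block, k=14):
        """Toy steps (1)-(2): index the target block's k-mers, then look
        up the query block's k-mers to produce candidate read pairs."""
        index = {}
        for tid, seq in target_block.items():
            for kmer, pos in kmers(seq, k):
                index.setdefault(kmer, []).append(tid)
        hits = set()
        for qid, seq in query_block.items():
            for kmer, pos in kmers(seq, k):
                for tid in index.get(kmer, ()):
                    hits.add((qid, tid))
        return hits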

Step (3) is slower than (1) and (2). (I have not really benchmarked this to confirm; Gene mentioned it to me before, and it also makes sense from an algorithmic point of view.) So, to get the best performance, we want the candidates from (1) and (2) to reflect the real probability that the two reads align to each other. If we have a truly "random genome", the problem is very simple: if k is large enough, it is unlikely that we have duplicated k-mers in the same block.

Here is some simple math. Assume we have a 3Gb random genome and choose a block size of 300Mb, so each block is one-tenth of the genome. Let's choose k = 14. There are 4**14 = 268,435,456 ~ 300M distinct 14-mers, so each 14-mer occurs roughly 10 times in the genome. With one-tenth of the genome, each 14-mer occurs no more than about once in a 300Mb block, and the expected number of randomly matched identical 14-mers in the block is about one.

Now, suppose the genome is not 100% random and there are some repetitive sequences, for example long chunks of simple tandem repeats, e.g. ATAT repeated many, many times in long stretches at multiple locations in the genome. In such a case, when you count ATATATATATATAT in each block, the expected occurrence will be much higher than one. If we don't filter those k-mers out, there will be many false positive predictions from steps (1) and (2).
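The back-of-the-envelope calculation above can be written out directly; all numbers are the ones assumed in the text.

    genome_size = 3_000_000_000   # 3 Gb random genome
    block_size = 300_000_000      # 300 Mb block, one-tenth of the genome
    k = 14

    distinct_kmers = 4 ** k                      # 268,435,456, roughly 300M
    per_genome = genome_size / distinct_kmers    # ~11, i.e. roughly 10x per genome
    per_block = block_size / distinct_kmers      # ~1.1, about once per block
    print(distinct_kmers, per_genome, per_block)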

True genomes have many repeats of many kinds. To avoid wasting time in fruitless computation, most overlappers that use k-mers as seeds ignore "uninformative" k-mers. In DAligner's design, there are two parameters controlling this: -t and -M.

Let's focus on -t for now. If -t 12 is passed to daligner and the number of paired k-mer matches for a k-mer is larger than 12*12=144, then that k-mer is dropped from the index and not used as a seed for potential matches. If the number set for -t is high, daligner will need more memory to store the k-mer indexes, but we will likely not miss any alignments. If the number set for -t is lower, daligner will use less memory and storage, but it could miss some alignments. So, what is the right -t for a specific genome? Well, it depends on the size of the genome, the block size, and the repeat content of the genome. For the 3Gb random genome example shown above, we can probably comfortably set -t 8 for a less repetitive genome. In fact, I have obtained some good human genome assemblies with -t 8 and 400Mb as the block size. (Yes, one might miss some repetitive content in the final assembly; we need a different way to get that information out separately.)
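As a toy illustration of the filtering rule described above (the counts are made up, and this is not DAligner's actual implementation):

    from collections import Counter

    def filtered_kmers(query_kmers, target_kmers, t=12):
        """Keep only k-mers whose paired match count, i.e.
        (count in query block) * (count in target block), is <= t*t."""
        qc, tc = Counter(query_kmers), Counter(target_kmers)
        return {k for k in qc if k in tc and qc[k] * tc[k] <= t * t}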

A common question I get all the time comes from people who fail to get an assembly when using FALCON for small genomes, due to a wrong block size and -t setting. As an example, if we have 200x of a 2Mb genome, there is 400Mb of sequence in total. Everything fits into one block if we set the block size to 400Mb. However, if there were no errors in the sequencing data, every k-mer would occur 200 times for any reasonable k. The default setting for -t is 100, so in that case no k-mer would be stored in the index and one would get zero overlaps for error correction. Even with sequencing errors, many k-mers will likely occur more than 100 times if k is small enough. To solve this problem for smaller genomes, one needs to use a smaller block size to ensure that the proper k-mers are indexed and the overlaps are found, in order to get a good assembly.
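The failure mode, as arithmetic (the 50Mb block size below is just an illustrative choice, not an official recommendation):

    genome_size = 2_000_000     # 2 Mb genome
    block_size = 400_000_000    # all 200x of data in one block
    per_kmer = block_size / genome_size            # 200 > default -t 100: k-mers dropped

    smaller_block = 50_000_000  # hypothetical smaller block size
    per_kmer_small = smaller_block / genome_size   # 25 < 100: k-mers kept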

What about -M? In earlier code, the default -t 100 could use too much memory and sometimes caused memory errors. One way to solve this problem is to automatically compute the k-mer filtering threshold given the amount of memory available to a single daligner process. This is also very good for distributed computing, since we can set an upper bound on memory usage. When we submit jobs to compute nodes with a known upper bound on memory usage, we can arrange the processes to reach maximum efficiency for both memory and CPU usage where possible. This is all good; however, it can sometimes be hard to predict the exact threshold daligner decides on. If the threshold is, for any reason, too small, there will be a problem with overlap sensitivity. Fortunately, one can use "-v" (verbose) to log what daligner is actually doing. Here is an example of a portion of the log file from a human assembly run.

With the command

$ daligner -v  -t16 -H10000 -e0.75 -k18 -w8 -h480 -l2400 -M24 raw_reads.1 raw_reads.2

 Allocated 399244167 of 16 (6387906672 bytes) at 0x7ff85a593010
   Kmer count = 399,244,166
   Using 11.90Gb of space
   Revised kmer count = 391,905,810
   Index occupies 5.84Gb

Comparing raw_reads.1 to raw_reads.2

   Capping mutual k-mer matches over 10000 (effectively -t100)
   Hit count = 69,928,759
   Highwater of 12.73Gb space

     69,928,759 18-mers (4.370435e-10 of matrix)
          7,547 seed hits (4.716753e-14 of matrix)
          6,737 confirmed hits (4.210516e-14 of matrix)

You can see that in this case, daligner reports an effective -t100 and uses about 12.73Gb at maximum for the comparison. It also shows the number of seed hits and how many seed hits are confirmed as legitimate hits. If the number of seed hits is much greater than the number of confirmed hits, it means the k-mer filter is too loose and a lot of computational time is wasted. If the two numbers are close, that is good; however, in that case the k-mer filtering can also be too stringent and one can lose true overlaps. The question is whether the number of confirmed hits matches the expected value.

Let's do some simple math. There are actually about 45,000 reads in each block, and we are aligning reads from two blocks of 400Mb of data each. Let's call the two blocks A-block and B-block. These reads come from 400Mb/3Gb = 0.13 of the genome. The expected number of hits of each read in A-block against any read in B-block is about 0.13. Roughly, the expected number of hits is approximately 45,000 * 0.13 = 5,850, which is actually smaller than the reported 6,737. What does this mean? It means we probably found all the relevant alignments between A-block and B-block even with these stringent parameters. Of course, one should ask why the observed number of hits is bigger than the expected one. Well, there are always repeats in a real genome, and those repeats can introduce spurious hits. Ideally, if we could detect overlaps caused by repeats, that would be great. There are some possible methods for that, but that is another topic.
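The expectation in the paragraph above, spelled out:

    reads_per_block = 45_000                      # read count quoted in the text
    block_fraction = 400_000_000 / 3_000_000_000  # ~0.13 of the genome per block

    expected_hits = reads_per_block * 0.13        # = 5,850, as in the text
    confirmed_hits = 6_737                        # from the daligner log above
    # the ~900 extra hits are plausibly repeat-induced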

Let's try another set of parameters for daligner. Instead of using -k18 -w8 -h480 -l2400, we use:

daligner -v  -t16 -H10000 -e0.75 -k14 -h80 -l500 -M24 raw_reads.1 raw_reads.2

Here is the output of the same block:

 Allocated 399423407 of 16 (6390774512 bytes) at 0x7f9c4c284010
   Kmer count = 399,423,406
   Using 11.90Gb of space
   Revised kmer count = 348,179,358
   Index occupies 5.19Gb

Comparing raw_reads.1 to raw_reads.2

   Capping mutual k-mer matches over 36 (effectively -t6)
   Hit count = 589,621,592
   Highwater of 22.79Gb space

     589,621,592 14-mers (3.685040e-09 of matrix)
          24,886 seed hits (1.555335e-13 of matrix)
          24,557 confirmed hits (1.534773e-13 of matrix)

In this case, you see there are 24,557 confirmed hits, even though daligner decides to cap mutual k-mer matches over 36. The number 24,557 is much bigger than the previous number 6,737 and the expected 5,850 hits. Why? The parameters -k14 -h80 -l500 allow us to find shorter homologies between the reads, and many shorter repeats can be aligned between the reads; thus we get more seed hits and more confirmed hits. However, since 24,557 is much larger than 5,850, it is likely that many of the hits are repeat-induced. Note that because we use a smaller k-mer, the memory usage is bigger. Since we also get more confirmed hits, we need more disk space to store the results and more downstream computation to filter out the weeds.

Let's set -l to 2,400 to see what changes:

 Allocated 399423407 of 16 (6390774512 bytes) at 0x7f0b0e017010
   Kmer count = 399,423,406
   Using 11.90Gb of space
   Revised kmer count = 348,179,358
   Index occupies 5.19Gb

Comparing raw_reads.1 to raw_reads.2

   Capping mutual k-mer matches over 36 (effectively -t6)
   Hit count = 589,621,592
   Highwater of 22.79Gb space

     589,621,592 14-mers (3.685040e-09 of matrix)
          24,886 seed hits (1.555335e-13 of matrix)
          11,022 confirmed hits (6.888572e-14 of matrix)

We still get the same number of seed hits but only 11,022 confirmed hits, as the overlap lengths of many hits are shorter than 2,400 bp.

For large and repetitive genomes, my advice is to start with longer molecules and use more specific daligner parameters for overlapping, to avoid excessive computation caused by repeats.

However, for smaller genomes, it actually becomes trickier.

How about small genomes?

Let's consider another case: small genome assembly. Say we are interested in a 4.6 Mb E. coli assembly; assume you sequence ~320Mb (~70x genome) of data and try to assemble it. If we put all the data into one block,

TO BE FINISHED