
Plan for Next Falcon Release (v0.3)

  • Integrate the latest (Apr. 2015) daligner code from Gene Myers
  • Support read lengths up to 100 kb, up from the current 64 kb limit
  • New consensus code that processes reads from diploid genomes better
  • Logging for tracking job submission

Tips for Restarting Jobs with Falcon

Falcon uses a workflow engine to track dependencies. For a small workflow, one can track all files, but in the context of genome assembly, given the design of Gene Myers' daligner code, it may not be a good idea to track every output file. Instead, each task generates sentinel files that record its progress. The fc_run.py code tracks the progress of all tasks in the working directory and only submits jobs whose dependencies are not yet satisfied.

If you want to re-run the workflow after some jobs fail, or to try different parameters, you can restart by deleting the relevant sentinel files and running fc_run.py again. However, it is very important to make sure that all jobs you have submitted, or are running locally, are killed first. If you skip this check, multiple jobs may end up writing to the same files, and the dependency structure tracked by the sentinel files will be corrupted. You can then get error messages that are hard to interpret because the system is in an inconsistent state.
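For example, on an SGE cluster, a minimal sketch of a safe restart (qstat/qdel are SGE commands and may differ on your scheduler; the sentinel path shown is hypothetical, but sentinel files follow the "job*done" pattern shown later on this page):

$ qstat -u $USER                         # make sure no Falcon jobs are still queued or running
$ qdel <job_id>                          # kill any that are
$ rm 0-rawreads/job_0001/job_0001_done   # hypothetical: remove the sentinel of the task to redo
$ fc_run.py fc_run.cfg                   # resubmit; only unsatisfied dependencies are rerun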

Here are some recipes I typically use for my own work:

Regenerate the error-corrected reads

$ rm -rf 0-rawreads/preads/ # or `mv 0-rawreads/preads/ 0-rawreads/preads_old`
$ rm -rf 1-preads_ovl/      # or `mv 1-preads_ovl 1-preads_ovl_old`
$ rm -rf 2-asm-falcon       # or `mv 2-asm-falcon 2-asm-falcon_old` 
$ fc_run.py fc_run.cfg      

Redo pread overlaps

$ rm -rf 1-preads_ovl/      # or `mv 1-preads_ovl 1-preads_ovl_old`
$ rm -rf 2-asm-falcon       # or `mv 2-asm-falcon 2-asm-falcon_old` 
$ fc_run.py fc_run.cfg      

Redo the overlap-to-graph-to-contig step

$ rm -rf 2-asm-falcon       # or `mv 2-asm-falcon 2-asm-falcon_old`
$ fc_run.py fc_run.cfg      

For this step, I typically modify the script run_falcon_asm.sh inside 2-asm-falcon instead of deleting the directory. This is useful for testing different overlap-filtering parameters of fc_ovlp_filter.py by editing run_falcon_asm.sh.
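For example, a sketch of that workflow (the fc_ovlp_filter.py options named in the comment are typical of v0.2.* configurations; check the actual line in your run_falcon_asm.sh):

$ cd 2-asm-falcon
# edit the fc_ovlp_filter.py call in run_falcon_asm.sh, e.g. the
# --max_diff 100 --max_cov 100 --min_cov 20 values (yours may differ)
$ bash run_falcon_asm.sh    # re-run only the graph and contig steps with the new parameters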

Other Tips

  • Get p-read 5'- and 3'-overlap counts:
$ fc_ovlp_stats.py --n_core 20 --fofn las.fofn # dump overlap counts for the .las files listed in las.fofn, using 20 cores; this only works for the v0.2.* branch

000000000 13329 8 8    
000000002 10096 2 0    
000000003 11647 5 7    
000000004 14689 2 1    
000000005 13854 0 1    

The columns are (1) read_identifier, (2) length, (3) 5'-overlap count, and (4) 3'-overlap count.

To get a coverage histogram in one line:

$ cat ovlp.stats | awk '{print $3}'  | sort -g | uniq -c 
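The same one-liner works for the 3'-end counts in column 4:

$ cat ovlp.stats | awk '{print $4}'  | sort -g | uniq -c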

See the overlap-filtering page to understand how filtering impacts the assembly and to get ideas about how to set the parameters for overlap_filtering_setting and fc_ovlp_filter.py.

  • Get a sense of how many overlap jobs have finished
$ cd 0-rawreads
$ ls -d job_* | wc
     59      59     767
$ find . -name "job*done"  | wc
     59      59    1947

59 of 59 overlap jobs have finished in this example.
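The two commands can be combined into one line (a sketch, assuming the same 0-rawreads layout):

$ echo "$(find . -name 'job*done' | wc -l) of $(ls -d job_* | wc -l) overlap jobs finished"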

  • Memory usage control

You need to check how much memory and how many cores are available on your cluster nodes. With -t 16 for ovlp_HPCdaligner_option and -s 400 for pa_DBsplit_option, each daligner job takes about 20 GB to 25 GB (for the Dec. 2014 daligner code used by Falcon v0.2.*; newer code needs a different strategy). daligner is hard-coded to use 4 threads for now. If you have a 48-core blade server, you will need 48/4 * 25 GB = 300 GB of RAM to utilize all cores for computation. If you don't have that much RAM, you can reduce the chunk size by lowering the -s value. The tradeoff is that you will have more tasks, jobs, and files to track.
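A quick back-of-the-envelope check of that arithmetic (plain shell, no Falcon assumptions):

$ cores=48; threads_per_job=4; gb_per_job=25
$ echo "$(( cores / threads_per_job * gb_per_job )) GB RAM to keep all cores busy"   # prints: 300 GB RAM to keep all cores busy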

I also sometimes use a small -s value to test the job scheduler. For example, you can create a lot of small jobs for an E. coli assembly with -s 50 to test out the configuration.
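For example (a sketch; the -x read-length cutoff and any other flags in your config may differ), the relevant fc_run.cfg line would look like:

pa_DBsplit_option = -x500 -s50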

  • Local mode

One can try local mode for a small assembly, but unless you have a machine with a large amount of RAM and a high core count, it is not recommended for larger genomes (>100 Mb).