First, I cloned and installed canu.
$ git clone git@github.com:marbl/canu.git
Cloning into 'canu'...
...
$ cd canu/src/
$ make -j 2
...
See here for install logs.
I also added alias canu="/Users/ksb/computation/science/canu/Darwin-amd64/bin/canu"
to my ~/.bash_profile
to make it easier to run the script.
I downloaded the following files from google drive
and moved them to a raw_reads/
subfolder.
Copy of JB418_ACTGAT.R1.fastq.gz
(renamed tor1.fastq.gz
)Copy of JB418_ACTGAT.R2.fastq.gz
(renamed tor2.fastq.gz
)filtered_subreads_pacbiojb418.fastq
(renamed topb.fastq
)- I compressed this file with gzip, creating
pb.fastq.gz
- I compressed this file with gzip, creating
$ mv raw_reads/Copy\ of\ JB418_ACTGAT.R1.fastq.gz raw_reads/r1.fastq.gz
$ mv raw_reads/Copy\ of\ JB418_ACTGAT.R2.fastq.gz raw_reads/r2.fastq.gz
$ mv raw_reads/filtered_subreads_pacbiojb418.fastq raw_reads/pb.fastq
$ gzip raw_reads/pb.fastq
I didn't have java installed, and java and gnuplot are required to make plots. I installed
using homebrew
.
$ brew cask install java
...
$ brew install gnuplot
...
I made a subfolder runs/
, moved there, and ran the assembler based on this
command in the "quick start" section of the canu docs.
I used 4.7
for the genomeSize
parameter based on the median genome length for
Hafnia alvei in genbank.
$ mkdir runs
$ cd runs/
$ canu -p jb418 -d jb418-auto genomeSize=4.7m -pacbio-raw ../raw_reads/pb.fastq.gz
This step took about 2.75 hours on my ~3 year old MacBook Pro. For complete run log, see here.
The assembly ends up here.
We can use the illumina paired reads for polishing/error correction. I used Pilon, which in turn requires Bam alignment files which can be generated by Bowtie. I used Bowtie2. More precisely, bowtie makes Sam files, which can be converted to Bam using SamTools.
All of these can also be installed with homebrew from the homebrew/science cask.
$ brew tap homebrew/science
...
$ brew install pilon
...
$ brew install bowtie2
...
$ brew install samtools
Bowtie2 maps illumina reads onto the draft assembly produced by Canu, generating a Sam (sequence alignment map) file. Using samtools, we can convert this to a bam (binary alignment map) file that pilon can use, and then pilon uses that map file to polish the assembly - correcting errors and filling in gaps (if possible).
See here for bowtie2 example.
First, we need a bowtie2 index. I backed out to the root of the project, and then:
$ mkdir polishing
$ cd polishing/
$ mkdir jb418
$ cd jb418
$ bowtie2-build ../../runs/jb418-auto/jb418.contigs.fasta jb418
This generates 6 .bt2
files in the jb418
directory. Next, we want to
generate the alignments. Bowtie doesn't work with compressed fastq files so
first we've got to unzip them.
$ gunzip ../../raw_reads/r*
Next, run the aligner to generate sam files.
$ bowtie2 -x jb418 -1 ../../raw_reads/r1.fastq -2 ../../raw_reads/r2.fastq -S jb418.sam
16051550 reads; of these:
16051550 (100.00%) were paired; of these:
1755602 (10.94%) aligned concordantly 0 times
13645476 (85.01%) aligned concordantly exactly 1 time
650472 (4.05%) aligned concordantly >1 times
----
1755602 pairs aligned concordantly 0 times; of these:
1032651 (58.82%) aligned discordantly 1 time
----
722951 pairs aligned 0 times concordantly or discordantly; of these:
1445902 mates make up the pairs; of these:
985662 (68.17%) aligned 0 times
335313 (23.19%) aligned exactly 1 time
124927 (8.64%) aligned >1 times
96.93% overall alignment rate
This took ~ 1 hour to run on my macbook. There was no output until the very end, be patient!
Now we need to convert this to Bam using samtools.
$ samtools view -bS jb418.sam > jb418.bam
This step took ~ 20 min.
Now that we've got our Bam alignment file, we can run Pilon. Turns out, pilon needs a sorted and indexed bam file, so first we need to do some more with samtools.
$ samtools sort jb418.bam -o sorted_jb418
$ samtools index sorted_jb418.bam sorted_jb418.bai
I then tried to run pilon, but got an error that I don't have enough memory or something.
$ pilon --genome ../../runs/jb418-auto/jb418.contigs.fasta --bam sorted_jb418.bam --outdir pilon/
Pilon version 1.22 Wed Mar 15 16:38:30 2017 -0400
Genome: ../../runs/jb418-auto/jb418.contigs.fasta
Fixing snps, indels, gaps, local
Input genome size: 5959283
Scanning BAMs
Exception in thread "main" java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.simontuffs.onejar.Boot.run(Boot.java:340)
at com.simontuffs.onejar.Boot.main(Boot.java:166)
Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
at htsjdk.samtools.DefaultSAMRecordFactory.createBAMRecord(DefaultSAMRecordFactory.java:36)
at htsjdk.samtools.BAMRecordCodec.decode(BAMRecordCodec.java:200)
at htsjdk.samtools.BAMFileReader$BAMFileIterator.getNextRecord(BAMFileReader.java:660)
at htsjdk.samtools.BAMFileReader$BAMFileIterator.advance(BAMFileReader.java:634)
at htsjdk.samtools.BAMFileReader$BAMFileIterator.next(BAMFileReader.java:628)
at htsjdk.samtools.BAMFileReader$BAMFileIterator.next(BAMFileReader.java:598)
at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:544)
at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:518)
at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:43)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at org.broadinstitute.pilon.BamFile.scan(BamFile.scala:285)
at org.broadinstitute.pilon.GenomeFile$$anonfun$processRegions$3.apply(GenomeFile.scala:92)
at org.broadinstitute.pilon.GenomeFile$$anonfun$processRegions$3.apply(GenomeFile.scala:92)
at scala.collection.parallel.AugmentedIterableIterator$class.map2combiner(RemainsIterator.scala:115)
at scala.collection.parallel.immutable.ParVector$ParVectorIterator.map2combiner(ParVector.scala:62)
at scala.collection.parallel.ParIterableLike$Map.leaf(ParIterableLike.scala:1054)
at scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply$mcV$sp(Tasks.scala:49)
at scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:48)
at scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:48)
at scala.collection.parallel.Task$class.tryLeaf(Tasks.scala:51)
at scala.collection.parallel.ParIterableLike$Map.tryLeaf(ParIterableLike.scala:1051)
at scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask$class.compute(Tasks.scala:152)
at scala.collection.parallel.AdaptiveWorkStealingForkJoinTasks$WrappedTask.compute(Tasks.scala:443)
at scala.concurrent.forkjoin.RecursiveAction.exec(RecursiveAction.java:160)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinTask.doJoin(ForkJoinTask.java:341)
at scala.concurrent.forkjoin.ForkJoinTask.join(ForkJoinTask.java:673)
at scala.collection.parallel.ForkJoinTasks$WrappedTask$class.sync(Tasks.scala:378)
at scala.collection.parallel.AdaptiveWorkStealingForkJoinTasks$WrappedTask.sync(Tasks.scala:443)
at scala.collection.parallel.ForkJoinTasks$class.executeAndWaitResult(Tasks.scala:426)
at scala.collection.parallel.ForkJoinTaskSupport.executeAndWaitResult(TaskSupport.scala:56)
This is the same issue noted in Pilon's issue tracker.