Unicycler sample data

I've put together a few small read sets so users can test that Unicycler works.

The synthetic Shigella plasmid reads are the smallest in size and included in the Unicycler repo – try these if you're in a hurry.

The other three are real read sets from small bacterial genomes from the FDA-ARGOS project and are available to download via figshare. The Helicobacter pylori and Streptococcus pyogenes genomes are relatively simple and easy to assemble. The Neisseria gonorrhoeae genome is more complex and tougher to assemble. I subsampled each Illumina read set to create smaller files. The PacBio read sets were subsampled based on quality (i.e. they are a high-quality subset of the original reads).

I'd recommend looking at the resulting assembly graphs in Bandage to get an idea of how well the assemblies completed – especially useful for comparing hybrid assemblies made with low-depth vs high-depth long reads.
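If you'd rather not open every graph by hand, Bandage can also render images from the command line. The following is a minimal sketch, assuming each Unicycler output directory contains the final graph as assembly.gfa and that Bandage is on your PATH. The function name render_graphs, the DRY_RUN variable and the assembly.png output name are made up for this example.

```shell
#!/bin/sh
# Sketch: render each Unicycler assembly graph to a PNG with Bandage.
# Assumes every directory passed in holds Unicycler's final graph (assembly.gfa).
# render_graphs, DRY_RUN and assembly.png are illustrative names only.

render_graphs() {
    for dir in "$@"; do
        graph="$dir/assembly.gfa"
        if [ ! -f "$graph" ]; then
            # Nothing to draw for this directory.
            echo "no graph in $dir, skipping" >&2
            continue
        fi
        if [ "${DRY_RUN:-1}" = "1" ]; then
            # Default: only print the command that would be run.
            echo "Bandage image $graph $dir/assembly.png"
        else
            Bandage image "$graph" "$dir/assembly.png"
        fi
    done
}
```

Calling `render_graphs out_low out_high` prints the Bandage commands for two hypothetical output directories, and `DRY_RUN=0 render_graphs out_low out_high` actually draws the images so the low-depth and high-depth hybrid graphs can be compared side by side.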

Shigella sonnei plasmids (synthetic reads)

These are synthetic reads from plasmids A, B and E from the Shigella sonnei 53G genome assembly:

Download reads from the figshare page or via these direct links:

These plasmids are small compared to a bacterial genome, but insertion sequences create many repeats. Only the smallest plasmid assembles completely with short reads alone. Hybrid assemblies with low-depth long reads manage to complete the medium-sized plasmid, and it takes high-depth long reads to complete all three.

Helicobacter pylori

These are real Illumina and PacBio reads from Helicobacter pylori sample FDAARGOS_300:

Download reads from the figshare page or via these direct links:

The Helicobacter pylori genome is small and simple. It has only two copies of the rRNA operon and no other large repeats, making it very easy to assemble compared to most bacterial genomes. A hybrid assembly with the high-depth long reads should produce a nice completed chromosome. A hybrid assembly with the low-depth long reads comes very close to completion, with just a couple of slightly ambiguous spots remaining.

Streptococcus pyogenes

These are real Illumina and PacBio reads from Streptococcus pyogenes sample FDAARGOS_190:

Download reads from the figshare page or via these direct links:

The Streptococcus pyogenes genome is also small and simple, and it is relatively easy to assemble with Illumina reads. It does have a few repetitive elements, however, including five copies of the rRNA operon and six copies of IS1548. A hybrid assembly with the high-depth long reads should produce a nice completed chromosome. A hybrid assembly with the low-depth long reads will not quite complete, leaving a bit of ambiguity around some of the rRNA operons.

Neisseria gonorrhoeae

These are real Illumina and PacBio reads from Neisseria gonorrhoeae sample FDAARGOS_204:

Download reads from the figshare page or via these direct links:

While the Neisseria gonorrhoeae genome is small, it is difficult to assemble, with many copies of IS1016, ISNgo2 and other repeats. A hybrid assembly with the high-depth long reads should produce a nice completed chromosome. A hybrid assembly with the low-depth long reads, while still a large improvement over the Illumina-only assembly, leaves a number of regions unresolved. This demonstrates that more complex genomes require higher long-read depth to achieve complete assemblies.

Assembly commands

Illumina-only assembly:
unicycler -1 short_reads_1.fastq.gz -2 short_reads_2.fastq.gz -o output_dir

Long-read-only assembly:
unicycler -l long_reads_high_depth.fastq.gz -o output_dir

Hybrid assembly (low-depth long reads):
unicycler -1 short_reads_1.fastq.gz -2 short_reads_2.fastq.gz -l long_reads_low_depth.fastq.gz -o output_dir

Hybrid assembly (high-depth long reads):
unicycler -1 short_reads_1.fastq.gz -2 short_reads_2.fastq.gz -l long_reads_high_depth.fastq.gz -o output_dir
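
To try all four assemblies on one sample in a single go, the commands above can be wrapped in a small script. This is only a sketch: it assumes the read files sit in the current directory under the names used above, and the names unicycler_cmd, run_all, DRY_RUN and the per-mode output directories (out_illumina_only and so on) are made up for this example. It defaults to a dry run so nothing is launched by accident.

```shell
#!/bin/sh
# Sketch: build (and optionally run) the four Unicycler commands shown above.
# Assumes the read files are in the current directory with the names used above.
# unicycler_cmd, run_all, DRY_RUN and the out_* directories are illustrative.

unicycler_cmd() {
    short="-1 short_reads_1.fastq.gz -2 short_reads_2.fastq.gz"
    case "$1" in
        illumina_only) echo "unicycler $short -o out_$1" ;;
        long_only)     echo "unicycler -l long_reads_high_depth.fastq.gz -o out_$1" ;;
        hybrid_low)    echo "unicycler $short -l long_reads_low_depth.fastq.gz -o out_$1" ;;
        hybrid_high)   echo "unicycler $short -l long_reads_high_depth.fastq.gz -o out_$1" ;;
        *)             echo "unknown mode: $1" >&2; return 1 ;;
    esac
}

run_all() {
    for mode in illumina_only long_only hybrid_low hybrid_high; do
        cmd="$(unicycler_cmd "$mode")" || return 1
        echo "$cmd"
        if [ "${DRY_RUN:-1}" != "1" ]; then
            # Word splitting is intended here: the file names contain no spaces.
            $cmd || return 1
        fi
    done
}
```

Running `run_all` prints the four commands, and `DRY_RUN=0 run_all` executes them one after another. Giving each assembly its own output directory keeps the resulting graphs separate for comparison.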