RNA-Bloom: reference-free transcriptome assembly for short and long reads

RNA-Bloom is a fast and memory-efficient de novo transcript sequence assembler. It is designed for the following sequencing data types:

single-end/paired-end bulk RNA-seq (strand-specific/agnostic)
paired-end single-cell RNA-seq (strand-specific/agnostic)
long-read RNA-seq (ONT cDNA/direct RNA, PacBio cDNA)

Written by Ka Ming Nip 📧

©️ 2018-present Canada's Michael Smith Genome Sciences Centre, BC Cancer

Dependency 📌

Java SE Development Kit (JDK) 11 (JDK 17 is slightly faster)
External software used:

| software | short reads | long reads |
| --- | --- | --- |
| minimap2 >=2.22 | required | required |
| Racon | not used | required |
| ntCard >=1.2.1 | required | required |

⚠️ Their executables must be accessible from your PATH!
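A quick way to confirm the PATH requirement before starting a run is sketched below. The helper name `missing_tools` is hypothetical, and the executable names `minimap2` and `racon` in the usage comment are assumptions about how those tools are installed on your system.

```shell
#!/usr/bin/env bash
# Print every named tool that cannot be found on PATH.
# Returns non-zero if at least one tool is missing.
missing_tools() {
    local rc=0
    local tool
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            rc=1
        fi
    done
    return "$rc"
}

# Usage (long-read assembly; executable names are assumptions):
#   missing_tools java minimap2 racon ntcard || echo "install the tools listed above"
```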

Installation 🔧

RNA-Bloom can be installed in two ways:

(A) install with `conda` or `mamba`:

conda install -c bioconda rnabloom

mamba install -c bioconda rnabloom

All dependent software (listed above) will be installed. RNA-Bloom can be run as rnabloom ...

(B) download from GitHub:

Download the binary tarball rnabloom_vX.X.X.tar.gz from the releases section.
Extract the downloaded tarball with the command:

tar -zxf rnabloom_vX.X.X.tar.gz

RNA-Bloom can be run as java -jar /path/to/RNA-Bloom.jar ...

Quick Start for Short Reads 🏃

⚠️ Input reads must be in either FASTQ or FASTA format and may be compressed with GZIP.

ℹ️ Note that -left, -right, -sef, and -ser can each accept multiple file paths separated by whitespace.

(A) assemble bulk RNA-seq data:

paired-end reads only
- when left reads are sense and right reads are antisense, use -revcomp-right to reverse-complement right reads
- when left reads are antisense and right reads are sense, use -revcomp-left to reverse-complement left reads
- for non-stranded data, use either -revcomp-right or -revcomp-left

java -jar RNA-Bloom.jar -left LEFT.fastq -right RIGHT.fastq -revcomp-right -t THREADS -outdir OUTDIR

single-end reads only
- use -sef for forward reads and -ser for reverse reads

java -jar RNA-Bloom.jar -sef SE.fastq -t THREADS -outdir OUTDIR

paired-end and single-end reads

java -jar RNA-Bloom.jar -left LEFT.fastq -right RIGHT.fastq -revcomp-right -sef SE.fastq -t THREADS -outdir OUTDIR

final output files:

| file name | description |
| --- | --- |
| `rnabloom.transcripts.fa` | assembled transcripts longer than length threshold (default: 200) |
| `rnabloom.transcripts.short.fa` | assembled transcripts shorter than length threshold |
| `rnabloom.transcripts.nr.fa` | assembled transcripts with redundancy reduced |
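As a sanity check on any of these FASTA files, the number of sequences and the total number of bases can be counted with a short helper. This is only a sketch; `fasta_summary` is a hypothetical name and not part of RNA-Bloom.

```shell
#!/usr/bin/env bash
# Count sequences (header lines starting with ">") and total bases
# (all characters on non-header lines) in an uncompressed FASTA file.
fasta_summary() {
    awk '/^>/ { n++; next }
         { bases += length($0) }
         END { printf "%d sequences, %d bases\n", n, bases }' "$1"
}

# Usage:
#   fasta_summary OUTDIR/rnabloom.transcripts.fa
```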

(B) assemble multi-sample RNA-seq data with pooled assembly mode:

java -jar RNA-Bloom.jar -pool READSLIST.txt -revcomp-right -t THREADS -outdir OUTDIR

This is especially useful for single-cell datasets. RNA-Bloom was tested on Smart-seq2 and SMARTer datasets. Pooled assembly mode is not supported for long-read data (-long) at this time.

file format for the `-pool` option:

This is a tabular file that describes the read file paths for all cells/samples to be used in pooled assembly.

Column header is on the first line, leading with #
Columns are separated by space/tab characters
Each sample can have more than one line; lines sharing the same name will be grouped together during assembly

| column | description |
| --- | --- |
| `name` | sample name |
| `left` | path to one left read file |
| `right` | path to one right read file |
| `sef` | path to one single-end forward read file |
| `ser` | path to one single-end reverse read file |

(i) paired-end reads only:

Only name, left, and right columns are specified for a total of 3 columns. The legacy header-less tri-column format is still supported.

#name left right
cell1 /path/to/cell1/left.fastq /path/to/cell1/right.fastq
cell2 /path/to/cell2/left.fastq /path/to/cell2/right.fastq
cell3 /path/to/cell3/left.fastq /path/to/cell3/right.fastq

(ii) paired and unpaired reads:

In addition to the name, left, and right columns, either sef, ser, or both are specified for a total of 4 or 5 columns.

#name left right sef ser
cell1 /path/to/cell1/left.fastq /path/to/cell1/right.fastq /path/to/cell1/sef.fastq /path/to/cell1/ser.fastq
cell2 /path/to/cell2/left.fastq /path/to/cell2/right.fastq /path/to/cell2/sef.fastq /path/to/cell2/ser.fastq
cell3 /path/to/cell3/left.fastq /path/to/cell3/right.fastq /path/to/cell3/sef.fastq /path/to/cell3/ser.fastq
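For datasets with many cells, writing the read list by hand is tedious. The sketch below generates the paired-end format described above, assuming a hypothetical directory layout in which each cell has its own sub-directory containing `left.fastq` and `right.fastq`, as in the example paths. The function name `make_pool_list` is illustrative, not part of RNA-Bloom.

```shell
#!/usr/bin/env bash
# Write a paired-end -pool read list to stdout for every sub-directory of $1
# that contains both left.fastq and right.fastq (assumed layout).
# Paths must not contain whitespace, because columns are whitespace-separated.
make_pool_list() {
    local root=$1
    local dir name
    printf '#name left right\n'
    for dir in "$root"/*/; do
        dir=${dir%/}        # drop the trailing slash
        name=${dir##*/}     # sub-directory name becomes the sample name
        if [ -f "$dir/left.fastq" ] && [ -f "$dir/right.fastq" ]; then
            printf '%s %s %s\n' "$name" "$dir/left.fastq" "$dir/right.fastq"
        fi
    done
}

# Usage:
#   make_pool_list /path/to > READSLIST.txt
```

Sub-directories that lack either file are skipped silently, so it is worth comparing the number of lines written against the number of cells you expect.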

final output files per cell:

| file name | description |
| --- | --- |
| `rnabloom.transcripts.fa` | assembled transcripts longer than length threshold (default: 200) |
| `rnabloom.transcripts.short.fa` | assembled transcripts shorter than length threshold |
| `rnabloom.transcripts.nr.fa` | assembled transcripts with redundancy reduced |

(C) strand-specific assembly:

java -jar RNA-Bloom.jar -stranded ...

The -stranded option indicates that input reads are strand-specific.

Strand-specific reads are typically in the F2R1 orientation, where /2 denotes left reads in forward orientation and /1 denotes right reads in reverse orientation.

Configure the read file paths accordingly for bulk RNA-seq data and indicate read orientation:

-stranded -left /path/to/reads_2.fastq -right /path/to/reads_1.fastq -revcomp-right

and for scRNA-seq data:

cell1 /path/to/cell1/reads_2.fastq /path/to/cell1/reads_1.fastq

(D) reference-guided assembly:

java -jar RNA-Bloom.jar -ref TRANSCRIPTS.fasta ...

The -ref option specifies the reference transcriptome FASTA file for guiding short-read assembly. It is not supported for long-read data (-long) at this time.

Quick Start for Long Reads 🏃

⚠️ It is strongly recommended to trim adapters in your reads before assembly. For example, see Porechop for more information.

⚠️ Input reads must not have purely integer IDs (e.g. 1, 2, 3), which could be in conflict with RNA-Bloom's sequence IDs. Please rename your read IDs (with seqtk rename) if necessary.
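To check whether a read file is affected, the purely integer IDs can be counted before assembly. The sketch below assumes an uncompressed FASTQ file with exactly four lines per record; `count_integer_ids` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Count FASTQ records whose ID (the first word of the header, without "@")
# consists only of digits. Assumes strict 4-line FASTQ records.
count_integer_ids() {
    awk 'NR % 4 == 1 {
             id = substr($1, 2)
             if (id ~ /^[0-9]+$/) n++
         }
         END { print n + 0 }' "$1"
}

# Usage:
#   count_integer_ids LONG.fastq    # prints 0 when no renaming is needed
```

Header lines are located by position rather than by a leading `@`, since quality strings may also begin with `@`.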

ℹ️ Note that -long, -sef, and -ser can each accept multiple file paths separated by whitespace.

(A) assemble long-read cDNA sequencing data:

Default presets for -long are intended for ONT data. Please add the -lrpb flag for PacBio data.

java -jar RNA-Bloom.jar -long LONG.fastq -t THREADS -outdir OUTDIR

Input reads are expected to be in a mix of both forward and reverse orientations.

Options -pool and -ref are not supported for long-read data at this time.

(B) assemble nanopore direct RNA sequencing data:

java -jar RNA-Bloom.jar -long LONG.fastq -stranded -t THREADS -outdir OUTDIR

Input reads are expected to be only in the forward orientation.

By default, uracil (U) is written as T. Use the -uracil option to write U instead of T in the output assembly.

ntCard v1.2.1 supports uracil in reads.

(C) assemble long-read sequencing data with short-read polishing:

cDNA data:

java -jar RNA-Bloom.jar -long LONG.fastq -sef SHORT.fastq -t THREADS -outdir OUTDIR

direct RNA data:

java -jar RNA-Bloom.jar -stranded -long LONG.fastq -sef SHORT_FORWARD.fastq -ser SHORT_REVERSE.fastq -t THREADS -outdir OUTDIR

final output files:

| file name | description |
| --- | --- |
| `rnabloom.transcripts.fa` | assembled transcripts longer than min. length threshold (default: 200) |
| `rnabloom.transcripts.short.fa` | assembled transcripts shorter than min. length threshold |

General Settings ⚙️

(A) set Bloom filter sizes automatically:

If ntcard is found in your PATH, then the -ntcard option is automatically turned on to count the number of unique k-mers in your reads.

java -jar RNA-Bloom.jar -fpr 0.01 ...

This sets the size of Bloom filters automatically to accommodate a false positive rate (FPR) of ~1%.

Alternatively, you can specify the exact number of unique k-mers:

java -jar RNA-Bloom.jar -fpr 0.01 -nk 28077715 ...

This sets the size of Bloom filters automatically to accommodate 28,077,715 unique k-mers for a FPR of ~1%.

As a rule of thumb, a lower FPR may result in a better assembly but requires more memory for a larger Bloom filter.
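To get a feel for this trade-off, the textbook size of a single Bloom filter holding n items at false positive rate p, with an optimal number of hash functions, is m = -n ln(p) / (ln 2)^2 bits. The sketch below evaluates that formula; it is not RNA-Bloom's own sizing code, and because RNA-Bloom maintains more than one Bloom filter, its actual total will differ. The function name `bloom_bits` is hypothetical.

```shell
#!/usr/bin/env bash
# Textbook estimate of the bits needed by ONE Bloom filter:
#   m = -n * ln(p) / (ln 2)^2
# $1 = number of unique k-mers (n), $2 = target false positive rate (p).
# This is a generic approximation, not RNA-Bloom's internal calculation.
bloom_bits() {
    awk -v n="$1" -v p="$2" 'BEGIN { printf "%.0f\n", -n * log(p) / (log(2) ^ 2) }'
}

# Usage: compare the estimate for two FPR settings
#   bloom_bits 28077715 0.01
#   bloom_bits 28077715 0.001
```

Under this formula, each tenfold reduction in FPR adds roughly 4.8 bits per k-mer, which is why tightening the FPR raises memory use steadily rather than dramatically.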

(B) set the total size of Bloom filters:

java -jar RNA-Bloom.jar -mem 10 ...

This sets the total size to 10 GB. If none of -nk, -ntcard, or -mem is used, then the total size is configured based on the size of the input read files.

(C) stop at an intermediate stage:

java -jar RNA-Bloom.jar -stage N ...

| N | short reads | long reads |
| --- | --- | --- |
| 1 | construct graph | construct graph |
| 2 | assemble fragments | correct reads |
| 3 | assemble transcripts | assemble transcripts |

This option is useful if you only want to assemble fragments or correct long reads (i.e. with -stage 2).

(D) list all available options in RNA-Bloom:

java -jar RNA-Bloom.jar -help

(E) limit the size of Java heap:

java -Xmx2g -jar RNA-Bloom.jar ...

or if you installed with conda:

export JAVA_TOOL_OPTIONS="-Xmx2g"
rnabloom ...

This limits the maximum Java heap to 2 GB with the -Xmx option. Note that Java options have no effect on Bloom filter sizes.

See documentation for other JVM options.

Implementation 📝

RNA-Bloom is written in Java with Apache NetBeans IDE. It uses the following libraries:

Citing RNA-Bloom 📜

If you use RNA-Bloom in your work, please cite our manuscript(s).

Long-read RNA-seq assembly:

Ka Ming Nip, Saber Hafezqorani, Kristina K. Gagalova, Readman Chiu, Chen Yang, René L. Warren, and Inanc Birol. Reference-free assembly of long-read transcriptome sequencing data with RNA-Bloom2. Nature Communications. 2023 May 22;14(1):2940. doi: 10.1038/s41467-023-38553-y

Short-read RNA-seq assembly:

Ka Ming Nip, Readman Chiu, Chen Yang, Justin Chu, Hamid Mohamadi, René L. Warren, and Inanc Birol. RNA-Bloom enables reference-free and reference-guided sequence assembly for single-cell transcriptomes. Genome Research. 2020 Aug;30(8):1191-1200. doi: 10.1101/gr.260174.119. Epub 2020 Aug 17.
