cubseq-nf

What is CUBseq?

Codon Usage Bias from RNA-sequencing data (CUBseq) is a fully automatic pipeline that produces robust estimates of codon usage frequencies at the transcriptome level. CUBseq can be used for any organism with an NCBI taxonomy ID, available RNA-sequencing data and a reference genome/annotation. The end result is a dataset of transcriptome-wide sequences with variants built in, allowing CUBseq to provide codon relative frequencies as well as raw counts at codon and amino acid resolution for custom downstream codon usage analysis.

What can CUBseq be used for?

Large-scale transcriptome-wide codon usage analysis.
Generation of transcriptome-derived codon usage tables (expressed as relative frequency and frequency per thousand).
Transcriptome-wide quantification of gene expression.
Robust identification of highly expressed genes.
Reconstruction of transcriptomes per sample using variant calls.
Analysis of mutation frequency per sample across the transcriptome and at gene level.
Comparison of codon frequency with tRNA abundance.
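
As an illustration of what a codon usage table contains, the sketch below counts codons in a coding sequence and reports each codon's raw count, relative frequency and frequency per thousand. It is a minimal example and not CUBseq's own code: the function names and the toy sequence are hypothetical, the standard genetic code is assumed, and "relative frequency" is taken here to mean the fraction among synonymous codons for the same amino acid, which may differ from the definition CUBseq uses.

```python
from collections import Counter
from itertools import product

# Standard genetic code in NCBI order (bases T, C, A, G); '*' marks stop codons.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def count_codons(cds):
    """Count in-frame codons of a coding sequence (hypothetical helper)."""
    cds = cds.upper().replace("U", "T")
    counts = Counter()
    for i in range(0, len(cds) - len(cds) % 3, 3):
        codon = cds[i:i + 3]
        if codon in CODON_TABLE:  # skip codons containing ambiguous bases
            counts[codon] += 1
    return counts


def codon_usage_table(counts):
    """Turn raw codon counts into relative and per-thousand frequencies."""
    total = sum(counts.values())
    aa_totals = Counter()
    for codon, n in counts.items():
        aa_totals[CODON_TABLE[codon]] += n
    table = {}
    for codon, n in counts.items():
        aa = CODON_TABLE[codon]
        table[codon] = {
            "amino_acid": aa,
            "count": n,
            # assumption: fraction among codons encoding the same amino acid
            "relative_frequency": n / aa_totals[aa],
            "per_thousand": 1000 * n / total,
        }
    return table


# Toy sequence: ATG GCT GCT GCA TAA
usage = codon_usage_table(count_codons("ATGGCTGCTGCATAA"))
for codon, row in sorted(usage.items()):
    print(codon, row)
```

In this toy sequence GCT accounts for two of the three alanine codons and two of the five codons overall, so it gets a relative frequency of 2/3 and a frequency per thousand of 400.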

Running CUBseq: quick-guide

Note

Before running the workflow, you need to have Nextflow installed; see the Nextflow installation instructions.

Install or update the workflow

nextflow pull stracquadaniolab/cubseq-nf -r main

Define the configuration file

Create a nextflow.config configuration file that defines the parameters described below in Configuring CUBseq. This file must be created in the same directory where the pipeline will be run. An example configuration file is provided in example-nextflow.config.

Run the analysis

Once the configuration file is in place, the minimal command to run CUBseq is:

nextflow run stracquadaniolab/cubseq-nf -r main -profile singularity -c conf/nextflow.config

Alternatively, you can define parameters and call custom profiles (examples are available in example-nextflow.config) directly in the nextflow run command:

nextflow run stracquadaniolab/cubseq-nf -r main -profile singularity,cell -c conf/nextflow.config --resultsDir ./results/test-run

Here we call a profile named cell, defined in our config file, which specifies the executor, RAM/CPU requirements and error strategy for each process. We also set a custom results directory in which to save output files.

Configuring CUBseq

To run CUBseq you will need to specify a number of paths for storing results, and provide appropriate parameter options for the organism being analysed. These parameters need to be defined in a configuration file called nextflow.config. Required parameters are marked with an asterisk; all other parameters are optional.

Parameter	Description
`resultsDir`	Directory where all results are stored [default: `"./results/"`].
Paths to genome files
`genome.reference` *	Path to genome reference (fasta) file [example: `"data/genome/ecoli.fa"`].
`genome.annotation` *	Path to genome annotation (GTF/GFF/GFF3) file [example: `"data/genome/ecoli.gff"`].
ENA metadata retrieval parameters
`taxonId` *	NCBI taxonomy ID of organism to be analysed [default: `"562"`].
`limitSearch`	Limit number of records output from ENA search query [default: `0`].
`removeRun`	Remove run by specifying its run accession [default: `"NULL"`, example: `"SRR13894889"`].
`max_sra_bytes`	Remove runs whose size (sra_bytes) exceeds this value [default: `"55000000000"`].
`dateMin`	Set minimum date (YYYY-MM-DD) to filter runs by (inclusive) [default: `"1950-01-01"`].
`dateMax`	Set maximum date (YYYY-MM-DD) to filter runs by (inclusive); the current date is used by default [default: `"FALSE"`].
STAR align parameters
`star.sjdbOverhang`	The "--sjdbOverhang" option of STAR, specifies length of genomic sequence on each side of the junctions, refer to STAR documentation for more detail. Here, we use STAR's default option [default: `"100"`].
`star.genomeSAindexNbases` *	The "--genomeSAindexNbases" option of STAR, specifying the length (bases) of SA pre-indexing string. This must be scaled down for small genomes, using formula: min(14, log2(GenomeLength)/2 - 1). [default: `"10"`].
`star.alignIntronMax`	The "--alignIntronMax" option of STAR, specifying maximum intron size [default: `"1"`].
`star.limitBAMsortRAM`	The "--limitBAMsortRAM" option of STAR, specifying maximum available RAM (bytes) [default: `"2342750981"`].
`star.outBAMsortingBinsN`	The "--outBAMsortingBinsN" option of STAR, specifying the number of genome bins for coordinate-sorting [default: `"50"`].
featureCounts parameters
`featureCounts.type.feature`	The "-t" option of featureCounts, specifying feature type(s) in a GTF annotation to be used for read mapping. Multiple types should be separated by "," with no space in between [default: `"exon"`].
`featureCounts.type.attribute`	The "-g" option of featureCounts, specifying attribute type in the GTF annotation [default: `"gene_id"`].
Freebayes parameters
`freebayes.ploidy` *	The "--ploidy" option of Freebayes, specifying the default ploidy for the organism used in the analysis. [default: `"1"`].
`freebayes.args`	Additional Freebayes arguments, refer to their documentation [default: `""`].
bcftools parameters
`bcftools.filter_vcf.args`	Additional bcftools filter arguments for filtering the VCF file, refer to their documentation [default: `'QUAL>20 && TYPE="snp"'`, note the use of quotation marks here].
Salmon indexing parameters
`salmon.index.args`	Additional arguments for salmon indexing, refer to their documentation [default: `""`].
Salmon quantification parameters
`salmon.quant.libtype`	The "--libType" option of Salmon quant, specifying library type. CUBseq sets this to "Automatic" detection by default. Refer to their documentation for more information [default: `"A"`].
`salmon.quant.args`	Additional arguments for salmon quant, refer to their documentation [example: `"--writeUnmappedNames"`].
tximport parameters
`summarize_to_gene.counts_from_abundance`	Generate counts from abundances in tximport [default: `"no"`].
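
The value of `star.genomeSAindexNbases` follows from the formula quoted in the table, min(14, log2(GenomeLength)/2 - 1). The sketch below works it through for two genome sizes. The function name is hypothetical, and rounding down to an integer is an assumption made here because the option takes a whole number.

```python
import math


def genome_sa_index_nbases(genome_length):
    """Suggest a --genomeSAindexNbases value for a genome of the given length (bp).

    Implements min(14, log2(GenomeLength)/2 - 1); rounding down is an assumption.
    """
    if genome_length < 1:
        raise ValueError("genome_length must be a positive number of bases")
    return min(14, math.floor(math.log2(genome_length) / 2 - 1))


# A bacterial genome of about 4.6 Mb, roughly the size of an E. coli genome
print(genome_sa_index_nbases(4_600_000))
# A genome of about 3.1 Gb is capped at 14
print(genome_sa_index_nbases(3_100_000_000))
```

For the 4.6 Mb genome this gives 10, which agrees with the default listed above. The larger genome hits the cap of 14.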

CUBseq results

CUBseq results are stored in the following directories:

results/metadata/metadata.csv: file containing the ENA metadata of RNA sequencing runs.
results/bams/: directory containing the bam files, as processed by STAR.
results/featureCounts/: directory containing featureCounts gene quantification results per sample and summary statistics.
results/freebayes-vcf/: directory containing vcf files, as processed by Freebayes.
results/vcf/: directory containing filtered vcf files, as processed by bcftools norm and bcftools filter.
results/transcriptome-consensus/: directory containing consensus transcriptomes in fasta format.
results/wt-transcriptome/: directory containing the wild-type transcriptome, as generated by gffread.
results/mut-transcriptome/: directory containing the reconstructed mutated transcriptomes per sequencing run, as processed by gffread.
results/salmon-quant/: directory containing gene abundance results per sequencing run, as processed by salmon quantification.
results/dataset/: directory containing the tximport RDS file that summarises salmon quantification results at the gene level (expressed as a TPM matrix).
results/gene-rank-analysis/: directory containing results of CUBseq's gene rank analysis.
results/heg-mut-transcriptome/: directory of fasta files per sequencing run, containing only highly expressed genes.
results/protein-mut-transcriptome/: directory of fasta files per sequencing run, containing transcriptome-wide (i.e. all protein-coding) genes.
results/cu-data/: directory containing codon usage count data for highly expressed genes and protein-coding genes, as well as from the Kazusa and CoCoPUTs databases (if available).
results/summarise-codon-counts/: directory containing codon counts summarised at codon and amino acid resolution.
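
To illustrate how the per-run sequences in results/mut-transcriptome/ relate to the wild-type transcriptome, the sketch below substitutes filtered SNP calls into a reference sequence. It is a simplified illustration and not CUBseq's implementation, which relies on Freebayes, bcftools and gffread. The function name, the tuple layout and the example variants are hypothetical, and the quality threshold mirrors the default `QUAL>20 && TYPE="snp"` filter from the configuration table.

```python
def apply_snps(reference, variants, min_qual=20.0):
    """Return `reference` with passing SNPs substituted in.

    variants: iterable of (pos, ref, alt, qual) tuples with 1-based positions.
    Calls that are not single-base substitutions, or whose quality is not
    above `min_qual`, are skipped.
    """
    seq = list(reference.upper())
    for pos, ref, alt, qual in variants:
        is_snp = len(ref) == 1 and len(alt) == 1
        if not is_snp or qual <= min_qual:
            continue
        if seq[pos - 1] != ref:
            raise ValueError(f"reference mismatch at position {pos}")
        seq[pos - 1] = alt
    return "".join(seq)


wild_type = "ATGGCTTAA"
calls = [
    (6, "T", "C", 87.0),   # passes the filter: GCT -> GCC (synonymous)
    (8, "A", "G", 12.5),   # skipped: quality not above 20
    (4, "G", "GA", 60.0),  # skipped: an insertion, not a SNP
]
mutated = apply_snps(wild_type, calls)
print(mutated)
```

Only the first call is applied, so the second codon changes from GCT to GCC while the rest of the sequence stays identical to the wild type.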

Authors

Anima Sutradhar (A.Sutradhar@sms.ed.ac.uk): developer and maintainer.
Giovanni Stracquadanio (giovanni.stracquadanio@ed.ac.uk): principal investigator.

Contact us about CUBseq

If you have any questions, issues or feature requests, please get in touch using the emails above or by posting an Issue.
