By Patrícia A. G. Ferrareze and Rodrigo S. A. Streit
The C3Q pipeline is a Cufflinks- and CodingQuarry-based gene prediction pipeline optimized for intron-rich fungal genomes, such as those of Cryptococcus species. The C3Q pipeline was developed and tested with the C. neoformans H99 and C. deneoformans JEC21 genomes, using RNA-Seq as the primary source of information for the gene prediction. The C3Q pipeline was then used for gene prediction in C. deuterogattii R265. The selection of parameters and the results are described in the paper:
Application of an optimized annotation pipeline to the Cryptococcus deuterogattii genome reveals dynamic primary metabolic gene clusters and genomic impact of RNAi loss
Patricia A. G. Ferrareze, Corinne Maufrais, Rodrigo Silva Araujo Streit, Shelby J. Priest, Christina A. Cuomo, Joseph Heitman, Charley C. Staats, Guilhem Janbon
G3 Genes|Genomes|Genetics, Volume 11, Issue 2, February 2021, jkaa070
https://doi.org/10.1093/g3journal/jkaa070
The C3Q pipeline performs the gene prediction using RNA-Seq alignment (.bam) and genome (.fna/.fa) files. Adding a file of protein sequences from closely related species (.faa/.fa) is optional but recommended.
The pipeline works as described below:
- The Cufflinks transcripts assembly (input: bam files from reads mapping - subsampled¹)
- The Cuffmerge combination of the assembled transcripts (input: the Cufflinks generated GTFs)
- The CodingQuarry gene training and prediction with the merged assembled transcripts file (GFF) and the genome file (input: the Cuffmerge combined GFF file and the genome)
- The filtering of small dubious sequences (spliced sequences up to 150 nt, intronless sequences up to 300 nt) (input: the CodingQuarry gene prediction file)
- The HTSeq-count of reads for the predicted sequences and the filtering of genome-predicted sequences without reads (input: the filtered CodingQuarry gene prediction file and the mapped BAM files)
- The filtering of alternative sequences from multitranscripts loci (selection of the best gene model) (input: the full filtered CodingQuarry gene prediction file)
- The recovery of deleted and non-predicted loci with Exonerate mapping of orthologous genes (input: the multifiltered CodingQuarry gene prediction file and the fasta file of protein sequences from related species)
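The small-sequence filter in the list above can be illustrated with a minimal sketch. It is not the pipeline's actual code: the function name `is_small_dubious` is hypothetical, the length is assumed to be the summed length of a model's coding segments, and "up to" is assumed to be inclusive.

```python
# Illustrative sketch only; not taken from the C3Q source code.
# Assumptions: length = sum of coding segment lengths (nt), thresholds inclusive.

SPLICED_MAX_NT = 150      # spliced sequences up to 150 nt are discarded
INTRONLESS_MAX_NT = 300   # intronless sequences up to 300 nt are discarded


def is_small_dubious(segment_lengths):
    """Return True if a gene model would be filtered out as small and dubious.

    segment_lengths: list of coding segment lengths in nucleotides;
    more than one segment means the model is spliced.
    """
    if not segment_lengths:
        raise ValueError("a gene model needs at least one segment")
    total = sum(segment_lengths)
    limit = SPLICED_MAX_NT if len(segment_lengths) > 1 else INTRONLESS_MAX_NT
    return total <= limit


# Example: keep only the models that pass the filter.
models = {"gene_a": [100, 40], "gene_b": [100, 60], "gene_c": [300], "gene_d": [301]}
kept = [name for name, segs in models.items() if not is_small_dubious(segs)]
```

Under these assumptions, `gene_a` (spliced, 140 nt) and `gene_c` (intronless, 300 nt) are dropped, while `gene_b` and `gene_d` are kept.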
¹In the paper we show that subsampled BAM alignments generate better results than the large whole files in the Cufflinks transcripts assembly. For our model organisms, the optimal subsampled library size was 7.5 million reads for each RNA-Seq replicate, but the optimal size may differ for other organisms. If you want to subsample your BAM files before using the pipeline, see https://broadinstitute.github.io/picard/ for DownsampleSam tool installation and usage. Keep in mind that DownsampleSam subsamples libraries based on a probability, so the probability value required to obtain a subsample of 7.5 million reads will depend on your original library size.
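As a rough guide, the probability needed for a given library is the target size divided by the original library size. The sketch below shows this arithmetic; the function name `downsample_probability` is hypothetical, and the read count of your library has to be obtained separately (it is not computed here).

```python
# Illustrative sketch only: computes the keep-probability for a subsample.
# The 7.5 million default is the optimum reported for the organisms in the paper.

def downsample_probability(total_reads, target_reads=7_500_000):
    """Return the fraction of reads to keep to reach about target_reads."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    # Libraries that are already at or below the target are kept whole.
    return min(1.0, target_reads / total_reads)


# Example: a library of 30 million reads needs a probability of 0.25.
probability = downsample_probability(30_000_000)
```

Because the subsampling is probabilistic, the resulting library will contain approximately, not exactly, the target number of reads.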
The C3Q pipeline was tested on x64 Linux systems. Therefore, we do not guarantee that it will work on other operating systems.
The C3Q pipeline is built on Python and depends on the Cufflinks suite (cufflinks, cuffmerge and gffread), CodingQuarry, gffcompare, HTSeq-count and Exonerate.
The tested versions of the dependencies and their repositories are listed below:
- Python v3.6.9
- Cufflinks v2.2.1
- CodingQuarry v2.0
- GFFcompare v0.10.1 and v0.10.6
- HTSeq-count of the 'HTSeq' framework v0.12.4
- Exonerate v2.3.0 with GFF3 support
With the exception of Exonerate², all the dependencies are available from the Bioconda channel for Conda-based installation.
After installing the required programs and dependencies, make the C3Qpipeline script executable with
chmod +x C3Qpipeline
It can then be run from its directory as
./C3Qpipeline
You may also add the code directory to your $PATH
export PATH=$PATH:/path/to/C3Qpipeline/directory/
Or simply move it to a bin directory that is already on your $PATH
mv C3Qpipeline /path/to/chosen/bin/
So that it may be called from any directory as
C3Qpipeline
²The Exonerate required by C3Q is a modified version whose output format differs from that of the regular Exonerate, and C3Q relies on this modified format. Thus, running C3Q with a regular version of Exonerate, such as the ones available on Bioconda, will likely cause it to crash or even to produce inaccurate results.
C3Qpipeline [OPTIONS]

Required arguments:
  -genome        Genome file in FASTA format.
  -libs          List of RNA-seq libraries as specified in READ ME³.
  -strandness    Strandedness of the RNA-seq library. Must be either "yes" (stranded), "no" (unstranded) or "reverse" (reversely stranded).

Optional arguments:
  --sublibs      List of sub-sampled libraries as specified in READ ME⁴.
  --exo          Protein fasta file for exonerate guidance.
  --refine-exo   Sets exonerate to refine its alignments. This is very memory and time consuming.
  --o            Output name.
  --p            Number of cpu cores to be used. Default: 1
  -h, --help     Show this help message.

³A file containing the absolute paths of the BAM files from the reads alignment, one path per line. These MUST NOT be subsampled, as they are used by HTSeq to count mapped reads.
⁴A file containing the absolute paths of the subsampled BAM files from the reads alignment, one path per line. Those libraries optimize Cufflinks transcripts assembly, although they are optional. In the absence of subsampled libraries, Cufflinks will use the same libraries provided for HTSeq. Check "How does C3Q pipeline works?" for subsampling instructions.
Example structure for both -libs and --sublibs files:
/home/user/Documents/my_experiment/library_1.bam
/home/user/Documents/my_experiment/library_2.bam
/home/user/Documents/my_experiment/library_3.bam
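A file with this structure can be written by hand or generated. The following is a minimal sketch, assuming all BAM files of one kind sit in a single directory; the function name `write_library_list` and the example paths are hypothetical and not part of C3Q.

```python
# Illustrative helper, not part of C3Q: lists the BAM files of a directory
# as absolute paths, one per line, in the layout shown above.
from pathlib import Path


def write_library_list(bam_dir, out_file):
    """Write the absolute paths of all *.bam files in bam_dir to out_file."""
    paths = sorted(str(p.resolve()) for p in Path(bam_dir).glob("*.bam"))
    Path(out_file).write_text("".join(path + "\n" for path in paths))
    return paths


# Example usage (hypothetical paths):
# write_library_list("/home/user/Documents/my_experiment", "libs_path.txt")
```

Keep the full-size and the subsampled BAM files in separate directories if you use a helper like this, so that the -libs file never lists subsampled libraries.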
To run the complete pipeline with the option to refine the Exonerate alignments (memory and time consuming):
C3Qpipeline -genome genome_file.fna -libs libs_path.txt -strandness yes --sublibs subsampled_libs_path.txt --exo protein_file.faa --refine-exo
To perform the complete pipeline with reduced memory and time:
C3Qpipeline -genome genome_file.fna -libs libs_path.txt -strandness yes --sublibs subsampled_libs_path.txt --exo protein_file.faa
To perform the pipeline with the full size BAM libraries (no subsampling):
C3Qpipeline -genome genome_file.fna -libs libs_path.txt -strandness yes --exo protein_file.faa --refine-exo
To perform the pipeline without the mapping of related protein sequences by Exonerate (RNA-Seq and genome based prediction only):
C3Qpipeline -genome genome_file.fna -libs libs_path.txt -strandness yes --sublibs subsampled_libs_path.txt
If you use this pipeline, please cite the paper:
Application of an optimized annotation pipeline to the Cryptococcus deuterogattii genome reveals dynamic primary metabolic gene clusters and genomic impact of RNAi loss
Patricia A. G. Ferrareze, Corinne Maufrais, Rodrigo Silva Araujo Streit, Shelby J. Priest, Christina A. Cuomo, Joseph Heitman, Charley C. Staats, Guilhem Janbon
bioRxiv 2020.09.01.278374
https://doi.org/10.1101/2020.09.01.278374