C3Q pipeline

Cufflinks- and CodingQuarry-based gene prediction pipeline for intron-rich fungal genomes

By Patrícia A. G. Ferrareze and Rodrigo S. A. Streit

What does the C3Q pipeline do?

The C3Q pipeline is a Cufflinks- and CodingQuarry-based gene prediction pipeline optimized for intron-rich fungal genomes, such as those of Cryptococcus species. It was developed and tested with the C. neoformans H99 and C. deneoformans JEC21 genomes, using RNA-Seq as the primary source of information for gene prediction, and was then used for gene prediction in C. deuterogattii R265. The selection of parameters and the results are described in the paper:

Application of an optimized annotation pipeline to the Cryptococcus deuterogattii genome reveals dynamic primary metabolic gene clusters and genomic impact of RNAi loss
Patricia A. G. Ferrareze, Corinne Maufrais, Rodrigo Silva Araujo Streit, Shelby J. Priest, Christina A. Cuomo, Joseph Heitman, Charley C. Staats, Guilhem Janbon
G3 Genes|Genomes|Genetics, Volume 11, Issue 2, February 2021, jkaa070 https://doi.org/10.1093/g3journal/jkaa070

How does the C3Q pipeline work?

The C3Q pipeline performs gene prediction using RNA-Seq alignment (.bam) and genome (.fna/.fa) files. Adding a file of protein sequences from closely related species (.faa/.fa) is optional but recommended.
The pipeline works as described below:

Cufflinks transcript assembly (input: BAM files from read mapping, subsampled¹)
Cuffmerge combination of the assembled transcripts (input: the GTF files generated by Cufflinks)
CodingQuarry gene training and prediction with the merged assembled transcripts file (GFF) and the genome file (input: the Cuffmerge combined GFF file and the genome)
Filtering of small dubious sequences (spliced sequences up to 150 nt, intronless sequences up to 300 nt) (input: the CodingQuarry gene prediction file)
HTSeq-count of reads for the predicted sequences and filtering of genome-predicted sequences without reads (input: the filtered CodingQuarry gene prediction file and the mapped BAM files)
Filtering of alternative sequences from multi-transcript loci (selection of the best gene model) (input: the full filtered CodingQuarry gene prediction file)
Recovery of deleted and non-predicted loci with Exonerate mapping of orthologous genes (input: the multifiltered CodingQuarry gene prediction file and the FASTA file of protein sequences from related species)
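As an illustration of the small-sequence filtering step, the length rule can be sketched in Python. This is only a sketch, not the code used by C3Q: the function name, the representation of a gene model as a list of exon lengths, and the reading of "up to" as an inclusive limit are all assumptions.

```python
def is_small_dubious(exon_lengths, spliced_max=150, intronless_max=300):
    """Return True if a gene model would fall under the small-sequence filter.

    exon_lengths: lengths (nt) of the coding exons of one predicted gene model.
    The thresholds come from the README; treating them as inclusive limits
    is an assumption of this sketch.
    """
    if not exon_lengths:
        raise ValueError("a gene model needs at least one exon")
    total = sum(exon_lengths)
    if len(exon_lengths) > 1:
        # More than one exon means the sequence is spliced.
        return total <= spliced_max
    # A single exon means the sequence is intronless.
    return total <= intronless_max
```

For example, a two-exon model of 60 + 90 nt would be flagged, while a single-exon model of 301 nt would be kept.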

¹In the paper we show that subsampled BAM alignments generate better results than the large whole files in the Cufflinks transcript assembly. For our model organisms, the optimal subsampled library size was 7.5 million reads for each RNA-Seq replicate, but the optimal size may differ for other organisms. If you want to subsample your BAM files before using the pipeline, see https://broadinstitute.github.io/picard/ for installation and usage of the DownsampleSam tool. Keep in mind that DownsampleSam subsamples libraries based on a probability, so the probability value required to obtain a 7.5 million read subsample will depend on your original library size.
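Since the probability depends on the original library size, it can be estimated as the target size divided by the total number of reads. The helper below is a minimal sketch of that arithmetic; the function name is hypothetical, and whether "reads" should be counted as single reads or as read pairs for your library is an assumption you should check against your own data.

```python
def downsample_probability(total_reads, target_reads=7_500_000):
    """Estimate the keep-probability needed to reach target_reads.

    total_reads: number of reads in the original BAM file.
    target_reads: desired subsample size (7.5 million in the paper's models).
    Returns a value between 0 and 1; libraries that are already smaller
    than the target are kept whole (probability 1.0).
    """
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return min(1.0, target_reads / total_reads)
```

A library of 30 million reads would need a probability of about 0.25. Because the subsampling is probabilistic, the resulting file will hold approximately, not exactly, the target number of reads.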

What does C3Q require?

The C3Q pipeline was tested on 64-bit (x64) Linux systems. Therefore, we do not guarantee it will work on other operating systems.

The C3Q pipeline is built on Python and depends on the Cufflinks suite (cufflinks, cuffmerge and gffread), gffcompare, HTSeq-count and Exonerate.
The tested versions of the dependencies and their repositories are listed below:

With the exception of Exonerate², all the dependencies are available in the Bioconda channel for Conda-based installation.

After installing the required programs and dependencies, make the C3Qpipeline script executable with

chmod +x C3Qpipeline

So it can be run in its directory as

./C3Qpipeline

You may also add the code directory to your $PATH

export PATH=$PATH:/path/to/C3Qpipeline/directory/

Or simply move it to a bin directory that is already on your $PATH

mv C3Qpipeline /path/to/chosen/bin/

So that it may be called from any directory as

C3Qpipeline

²The Exonerate required by C3Q is a modified version whose output format differs from that of the regular Exonerate. Thus, running C3Q with a regular version of Exonerate, such as the ones available on Bioconda, will likely cause it to crash or even produce inaccurate results.
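Before a first run, it may help to confirm that the dependencies are reachable on $PATH. The following is a minimal sketch using Python's standard library; the executable names in ASSUMED_TOOLS are assumptions about how the programs are installed on your system, not a list taken from the C3Q source.

```python
import shutil

# Assumed executable names; adjust to match your installation.
ASSUMED_TOOLS = [
    "cufflinks", "cuffmerge", "gffread", "gffcompare",
    "htseq-count", "exonerate", "CodingQuarry",
]

def missing_tools(tools=ASSUMED_TOOLS, which=shutil.which):
    """Return the tools from the list that cannot be found on $PATH."""
    return [tool for tool in tools if which(tool) is None]

if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on $PATH: " + ", ".join(missing))
    else:
        print("All listed tools were found on $PATH.")
```

Note that this check only looks for an executable by name. It cannot tell the modified Exonerate apart from a regular one.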

C3Q PIPELINE USAGE

C3Qpipeline [OPTIONS]  

Required arguments:
  -genome                     Genome file in FASTA format.
  -libs                       List of RNA-seq libraries as specified in READ ME³.
  -strandness                 Strandedness of the RNA-seq library. Must be either "yes" (stranded), "no" (unstranded) or "reverse" (reversely stranded).

Optional arguments:
  --sublibs                   List of sub-sampled libraries as specified in READ ME⁴.
  --exo                       Protein fasta file for exonerate guidance.
  --refine-exo                Sets exonerate to refine its alignments. This is very memory and time consuming.
  --o                         Output name.
  --p                         Number of cpu cores to be used. Default: 1
  -h, --help                  Show this help message.

³A file containing the absolute paths of the BAM files from the read alignment, one path per line. These files MUST NOT be subsampled, as they are used by HTSeq to count mapped reads.
⁴A file containing the absolute paths of the subsampled BAM files from the read alignment, one path per line. These libraries are optional, but they improve the Cufflinks transcript assembly. In the absence of subsampled libraries, Cufflinks will use the same libraries provided for HTSeq. See the footnote on subsampling in the section describing how the pipeline works for subsampling instructions.

Example structure for both -libs and --sublibs files:

/home/user/Documents/my_experiment/library_1.bam
/home/user/Documents/my_experiment/library_2.bam
/home/user/Documents/my_experiment/library_3.bam
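A list file can be checked against this structure before it is passed to the pipeline. The sketch below is only illustrative: the function name is hypothetical, and treating blank lines and paths without a .bam extension as problems is a conservative assumption, not documented C3Q behavior.

```python
import os
import posixpath

def check_library_list(lines, must_exist=False):
    """Return (line_number, problem) pairs for a -libs or --sublibs file.

    lines: the lines of the list file, one BAM path per line.
    must_exist: also report paths that are not files on this machine.
    """
    problems = []
    for number, raw in enumerate(lines, start=1):
        path = raw.strip()
        if not path:
            problems.append((number, "empty line"))
        elif not posixpath.isabs(path):
            # The README asks for absolute paths (Linux-style).
            problems.append((number, "not an absolute path"))
        elif not path.endswith(".bam"):
            problems.append((number, "does not end in .bam"))
        elif must_exist and not os.path.isfile(path):
            problems.append((number, "file not found"))
    return problems
```

To check a file on disk, its lines can be read with `open("libs_path.txt").read().splitlines()` and passed to the function with `must_exist=True`.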

Some usage examples:

To run the complete pipeline with the option to refine the Exonerate alignments (memory and time consuming):

C3Qpipeline -genome genome_file.fna -libs libs_path.txt -strandness yes --sublibs subsampled_libs_path.txt --exo protein_file.faa --refine-exo

To run the complete pipeline with reduced memory and time (without Exonerate refinement):

C3Qpipeline -genome genome_file.fna -libs libs_path.txt -strandness yes --sublibs subsampled_libs_path.txt --exo protein_file.faa

To run the pipeline with the full-size BAM libraries (no subsampling):

C3Qpipeline -genome genome_file.fna -libs libs_path.txt -strandness yes --exo protein_file.faa --refine-exo

To run the pipeline without the mapping of related protein sequences by Exonerate (RNA-Seq and genome based prediction only):

C3Qpipeline -genome genome_file.fna -libs libs_path.txt -strandness yes --sublibs subsampled_libs_path.txt
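When the pipeline is launched from a Python script, the command lines above can be assembled programmatically. This is a sketch under stated assumptions: the function and its parameter names are hypothetical, and only the options documented in the usage section are used. It builds the argument list without running anything.

```python
import shlex

def build_c3q_command(genome, libs, strandness, sublibs=None, exo=None,
                      refine_exo=False, output=None, cores=None):
    """Build a C3Qpipeline argument list from the documented options."""
    if strandness not in ("yes", "no", "reverse"):
        raise ValueError('strandness must be "yes", "no" or "reverse"')
    cmd = ["C3Qpipeline", "-genome", genome, "-libs", libs,
           "-strandness", strandness]
    if sublibs is not None:
        cmd += ["--sublibs", sublibs]
    if exo is not None:
        cmd += ["--exo", exo]
    if refine_exo:
        cmd.append("--refine-exo")
    if output is not None:
        cmd += ["--o", output]
    if cores is not None:
        cmd += ["--p", str(cores)]
    return cmd

if __name__ == "__main__":
    cmd = build_c3q_command("genome_file.fna", "libs_path.txt", "yes",
                            sublibs="subsampled_libs_path.txt")
    # Print the command for inspection instead of executing it.
    print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once C3Qpipeline is on your $PATH.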

If you use this pipeline, please cite the paper:
Application of an optimized annotation pipeline to the Cryptococcus deuterogattii genome reveals dynamic primary metabolic gene clusters and genomic impact of RNAi loss
Patricia A. G. Ferrareze, Corinne Maufrais, Rodrigo Silva Araujo Streit, Shelby J. Priest, Christina A. Cuomo, Joseph Heitman, Charley C. Staats, Guilhem Janbon
bioRxiv 2020.09.01.278374
https://doi.org/10.1101/2020.09.01.278374
