Clean up CAT (Comparative Annotation Toolkit) annotations
- samtools
- gtfToGenePred, genePredToGtf, and liftOver from UCSC Kent utils
- biomaRt
- pandas
- natsort
- liftoff
The easiest thing to do is to use conda to install all the requirements using the included yml file:
conda env create --file litterbox.yml
CAT makes good annotations for cross-species analysis. But its GFF3 files can pose problems for cross-species DE analysis.
- Mitochondrial annotations tend to be messed up
- Mitochondrial genes can appear on the autosomes, especially where there are nuclear mitochondrial insertions (NUMTs)
- Source annotations can include features that steal reads from other genes and confound RNA-seq quantification, including read-through transcripts that are filtered out by default in the 10X mkref pipeline.
- De novo annotations can be wrong and cause the same problem
- GFF3s are not necessarily sorted and formatted in the way expected by indexing programs (like
STAR genomeGenerate
andcellranger mkref
) - When using tools like scanpy or Seurat, genes are usually keyed to their human-readable name, not their unique IDs. If you are using annotations from multiple species, gene synonyms might mean that genes are dropped when they don't have the same name for one or more species.
The GTF file format often includes features marking 5' and 3' UTRs, but GFF3 does not. These scripts start with a CAT GFF3 file and output a GTF file, but this does not contain UTR annotations. If this turns out to be important later, it might be added in.
litterbox
needs the following things:
- The CAT annotation you want to clean up (GFF3 format)
- A recent Gencode (Human) annotation (GTF format)
- The name of the mitochondrial sequence in the genome assembly the CAT annotation corresponds to (it's probably chrM; you can check by indexing the FASTA using
samtools faidx
and looking at the sequence names in the first column of the resultingfai
file - The genome (FASTA format) the HAL annotation corresponds to. If you have a set of aligned genomes in HAL format, you can extract the genome you want with the HAL toolkit command
hal2fasta
followed by the name of the genome in the HAL file. - The genome (FASTA format) that Gencode corresponds to: this is hg38
- a tab-separated file, mapping (in three columns)
HGNC ID
->Approved symbol
->Ensembl Gene ID
. You can generate an up-to-date one here or use the one included in thedata
directory (default).
Recommended other things, to rescue some of the genes that will otherwise get dropped:
litterbox
can also try to pull in records for dropped genes from an Ensembl annotation. This is a last resort: if all transcripts that CAT used to place a gene get filtered out by their tags/biotypes in the latest Gencode annotation, all we can do is try to pull some of those genes back from a different annotation. There are two possibilities:
If there is an Ensembl annotation of the same organism on the same genome assembly version (you can check dates against those in the UCSC genome browser, or if still uncertain, compare sequence lengths in the .fai files), then you will need to download the GTF of the Ensembl annotation and the genome assembly the annotation used, in FASTA format.
You will also need to know the name of the organism of the CAT assembly as it exists in the Ensembl database. This should be lowercase first letter of species name + lowercase genus name, with no spaces. Example: hsapiens
If the most recent Ensembl annotation for the organism is on a different assembly version than the one that CAT annotated, then you will need to download that assembly (in GTF format), the UCSC chain file for lifting from that assembly version to the one used for CAT (UCSC downloads), and a FASTA file of the UCSC version of the same assembly as was used for CAT.
You will also need to know the name of the organism of the CAT assembly as it exists in the Ensembl database. This should be lowercase first letter of species name + lowercase genus name, with no spaces. Example: hsapiens
Everything is set up so you can just run one program (in the main directory): filter_annotation.py
. It has a lot of arguments that will show up if you run it with --help
or -h.
You must provide an output file name prefix, and the final annotation will be in [output_prefix].gtf
.
If you installed and activated the conda
environment litterbox
, then you can ignore all of the arguments that provide paths to programs.
You can optionally filter a human Gencode annotation (recommended: the version of Gencode that was used to create the CAT annotation) to contain the same genes, and the same gene names, as the final CAT annotation. This is a good idea to make sure that gene names match up and that genes that exist in both annotations, but left included in only one, don't drive up/down expression of other genes as a result of being left in or out. If you choose this option, the resulting annotation will be [output_prefix].human.gtf
.
- Goes through the latest Gencode GTF and makes a list of allowed gene IDs and allowed transcript IDs, based on values of
gene_type
,transcript_type
, andtag
fields. Allowed type fields were taken from the 10X Genomicscellranger mkref
instructions:protein_coding
,lncRNA
,IG_C_gene
,IG_D_gene
,IG_LV_gene
,IG_V_gene
,IG_V_pseudogene
,IG_J_pseudogene
,IG_C_pseudogene
,TR_C_gene
,TR_D_gene
,TR_J_gene
,TR_V_gene
,TR_V_pseudogene
, andTR_J_pseudogene
. Disallowed tag fields were also based on this guide, with the addition of several more:readthrough_gene
,readthrough_transcript
,PAR
,fragmented_locus
, andlow_sequence_quality
. - Goes through the CAT GFF3, converts to GTF, and removes all genes and transcripts that were lifted from the human gencode annotation and did not pass the filters in step 1. If a transcript was a prediction not based on homology with human (i.e. the Gencode annotation), it's removed if
transcript_class
ispoor_alignment
, or iftranscript_class
ispossible_paralog
and eitherreference_support
is not true, or bothrna_support
andpacbio_isoform_supported
are not true. If all transcripts of a gene are removed, then the gene is also removed. - Renames transcripts that pass filters to their preferred symbol/common name, according to the latest HGNC data (included in
data
directory) - Removes all mitochondrial genes (as determined from the recent Gencode annotation) from the CAT annotation, to avoid cases where NUMTs are annotated as mitochondrial genes instead of the actual mitochondrial genes
- Uses liftoff and the mitochondrial sequences from hg38 and the CAT-annotated assembly to re-annotate the mitochondrial genome in the CAT-annotated assembly
- Optionally attempts to rescue some of the eliminated genes by taking their annotations from an Ensembl gene annotation. First, takes the list of genes that were eliminated because of all their transcripts being eliminated (and not directly filtered out themselves). Then converts these human Ensembl gene IDs to IDs for the species of interest, using biomaRt. Only genes with a one-to-one mapping from human ID -> other species ID are considered.
- If the Ensembl gene annotation you downloaded is on the same assembly coordinates, pulls the relevant genes out of the annotation and converts sequence names to those used in CAT by mapping chromosome names to each other across assemblies (matching them up based on their sequence lengths in
fai
files for both genomes). - If the Ensembl gene annotation you downloaded is on different/older assembly coordinates, pulls the relevant genes out of the annotation and then lifts this annotation to the correct assembly, using the UCSC Kent tools
gtfToGenePred
,liftOver
, andgenePredToGtf
. Then converts sequence names to those used in CAT by mapping chromosome names to each other across assemblies (matching them up based on their sequence lengths infai
files for both genomes). For a gene to survive make it into the final annotation, it must have agene
feature and at least onetranscript
and at least oneexon
feature on the new genome coordinates.
- If the Ensembl gene annotation you downloaded is on the same assembly coordinates, pulls the relevant genes out of the annotation and converts sequence names to those used in CAT by mapping chromosome names to each other across assemblies (matching them up based on their sequence lengths in
- Combines the new main GTF, the mitochondrial GTF, and (optionally) the "rescued" GTF and sorts these in a way designed to make
cellranger-arc mkref
happy. The sort order matters to this program, as documented here, and the code to sort properly was taken from chbk's suggestion in that issue (it's in the repository inscripts/sort_annotation.py
). - Optionally also creates a new human annotation with the same filters: if you provide a human annotation (such as Gencode of the same version used to make the CAT annotation), filters the genes the same way as the CAT annotation, and sets the
gene_name
field to the HGNC approved symbol for each gene, provided one exists. If you are cleaning up a set of CAT annotations for multiple species all based on the same Gencode, you only need to do this once to have a corresponding human annotation. The human annotation is adjusted according to the HGNC IDs and the information in the latest Gencode release, but does not depend on the provided CAT annotation.
You got yourself a clean annotation