This repository has been archived by the owner on May 3, 2024. It is now read-only.

How to create a pigeon‐compatible annotation GTF

Elizabeth Tseng edited this page Sep 29, 2023 · 9 revisions

Last updated: 09/29/2023

Please use the latest pigeon version, which includes pigeon prepare, to help validate custom annotation GTFs and reference genomes!

Pigeon is designed to work with the Gencode annotation GTF format. Other GTF formats will need to be modified to work with pigeon classify.

## pigeon GTF format requirements

The pigeon GTF format requirements are:

A tab-delimited, 9-column file in GFF/GTF file format:

  • Column 1 must be the chromosome
  • Column 2 is ignored
  • Column 3 will only be processed if it is gene, transcript, or exon. All other types (e.g. CDS) are ignored.
  • Columns 4 and 5 are the 1-based start and end
  • Columns 6 and 8 are ignored
  • Column 7 is the strand, which must be + or -
  • Column 9 is the attribute field, a free-text string, but to be processed properly it must contain at least the following, separated by semicolons. Ex: gene_id "ENSG0001"; transcript_id "ENST000A"; gene_name "TP53";
  • No extra blank lines at the beginning or end of the file
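
The requirements above can be turned into a quick pre-check before running pigeon. The following Python sketch is only an illustration: the function names are hypothetical, and three details are assumptions rather than documented pigeon behavior (lines starting with # are skipped as comments, start must not exceed end, and "gene" records need only gene_id while "transcript" and "exon" records also need transcript_id). pigeon prepare remains the authoritative check.

```python
import re

# Feature types that pigeon processes; everything else (e.g. CDS) is ignored.
PROCESSED_FEATURES = {"gene", "transcript", "exon"}

# Matches quoted pairs in column 9 such as: gene_id "ENSG0001";
ATTR_RE = re.compile(r'(\S+)\s+"([^"]*)"')


def parse_attributes(text):
    """Return a dict of the quoted key "value" pairs found in column 9."""
    return {key: value for key, value in ATTR_RE.findall(text)}


def check_line(line):
    """Return a list of problems found in one GTF line (empty if none)."""
    if not line.strip():
        return ["blank line"]
    if line.startswith("#"):
        return []  # assumption: header/comment lines are skipped
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        return [f"expected 9 tab-separated columns, found {len(cols)}"]
    chrom, _source, feature, start, end, _score, strand, _frame, attrs = cols
    if feature not in PROCESSED_FEATURES:
        return []  # ignored feature type, nothing to validate
    problems = []
    if not chrom:
        problems.append("column 1 (chromosome) is empty")
    # assumption: start <= end is required in addition to being 1-based
    if not (start.isdecimal() and end.isdecimal()
            and 1 <= int(start) <= int(end)):
        problems.append("columns 4/5 must be 1-based integers, start <= end")
    if strand not in ("+", "-"):
        problems.append("column 7 (strand) must be + or -")
    parsed = parse_attributes(attrs)
    # assumption: gene records carry no transcript_id (as in Gencode)
    required = ["gene_id"] if feature == "gene" else ["gene_id", "transcript_id"]
    for key in required:
        if key not in parsed:
            problems.append(f"column 9 is missing {key}")
    return problems
```

A file can be checked by calling check_line on each line and reporting the line number of any non-empty result.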

An isoform record is one "gene" record line followed by one or more "transcript" records. Each "transcript" record is followed by one or more "exon" records. "Gene" records are only considered during pigeon prepare, to check for unique IDs. During pigeon classify, only "transcript" records are considered, for both collapsed isoforms and annotations. pigeon uses a "transcript" entry to start the next batch and reads the following 1..N exons as its children.
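
The batching described above can be sketched in a few lines of Python. This is an illustration of the described behavior, not pigeon's actual implementation; the function name group_transcripts is hypothetical.

```python
def group_transcripts(lines):
    """Yield (transcript_columns, [exon_columns, ...]) in file order.

    A "transcript" line starts a new batch; the "exon" lines that follow
    are collected as its children. "gene" lines and all other feature
    types are skipped.
    """
    current = None
    exons = []
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # skip comments, blank lines, malformed rows
        feature = cols[2]
        if feature == "transcript":
            if current is not None:
                yield current, exons
            current, exons = cols, []
        elif feature == "exon" and current is not None:
            exons.append(cols)
    # Emit the final batch once the input is exhausted.
    if current is not None:
        yield current, exons
```

One consequence of this layout is that exons must directly follow their own transcript line: an exon placed before any transcript has no parent to attach to and is dropped by this sketch.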

## Pigeon GTF examples

Example 1: Gencode annotation

Below is a snippet of a Gencode annotation as a reference:

chr1    ENSEMBL gene    17369   17436   .       -       .       gene_id "ENSG00000278267.1"; gene_type "miRNA"; gene_status "KNOWN"; gene_name "MIR6859-1"; level 3;
chr1    ENSEMBL transcript      17369   17436   .       -       .       gene_id "ENSG00000278267.1"; transcript_id "ENST00000619216.1"; gene_type "miRNA"; gene_status "KNOWN"; gene_name "MIR6859-1"; transcript_type "miRNA"; transcript_status "KNOWN"; transcript_name "MIR6859-1-201"; level 3; tag "basic"; transcript_support_level "NA";
chr1    ENSEMBL exon    17369   17436   .       -       .       gene_id "ENSG00000278267.1"; transcript_id "ENST00000619216.1"; gene_type "miRNA"; gene_status "KNOWN"; gene_name "MIR6859-1"; transcript_type "miRNA"; transcript_status "KNOWN"; transcript_name "MIR6859-1-201"; exon_number 1; exon_id "ENSE00003746039.1"; level 3; tag "basic"; transcript_support_level "NA";
chr1    HAVANA  gene    29554   31109   .       +       .       gene_id "ENSG00000243485.3"; gene_type "lincRNA"; gene_status "KNOWN"; gene_name "RP11-34P13.3"; level 2; tag "ncRNA_host"; havana_gene "OTTHUMG00000000959.2";
chr1    HAVANA  transcript      29554   31097   .       +       .       gene_id "ENSG00000243485.3"; transcript_id "ENST00000473358.1"; gene_type "lincRNA"; gene_status "KNOWN"; gene_name "RP11-34P13.3"; transcript_type "lincRNA"; transcript_status "KNOWN"; transcript_name "RP11-34P13.3-001"; level 2; tag "not_best_in_genome_evidence"; tag "basic"; transcript_support_level "5"; havana_gene "OTTHUMG00000000959.2"; havana_transcript "OTTHUMT00000002840.1";
chr1    HAVANA  exon    29554   30039   .       +       .       gene_id "ENSG00000243485.3"; transcript_id "ENST00000473358.1"; gene_type "lincRNA"; gene_status "KNOWN"; gene_name "RP11-34P13.3"; transcript_type "lincRNA"; transcript_status "KNOWN"; transcript_name "RP11-34P13.3-001"; exon_number 1; exon_id "ENSE00001947070.1"; level 2; tag "not_best_in_genome_evidence"; tag "basic"; transcript_support_level "5"; havana_gene "OTTHUMG00000000959.2"; havana_transcript "OTTHUMT00000002840.1";
chr1    HAVANA  exon    30564   30667   .       +       .       gene_id "ENSG00000243485.3"; transcript_id "ENST00000473358.1"; gene_type "lincRNA"; gene_status "KNOWN"; gene_name "RP11-34P13.3"; transcript_type "lincRNA"; transcript_status "KNOWN"; transcript_name "RP11-34P13.3-001"; exon_number 2; exon_id "ENSE00001922571.1"; level 2; tag "not_best_in_genome_evidence"; tag "basic"; transcript_support_level "5"; havana_gene "OTTHUMG00000000959.2"; havana_transcript "OTTHUMT00000002840.1";
chr1    HAVANA  exon    30976   31097   .       +       .       gene_id "ENSG00000243485.3"; transcript_id "ENST00000473358.1"; gene_type "lincRNA"; gene_status "KNOWN"; gene_name "RP11-34P13.3"; transcript_type "lincRNA"; transcript_status "KNOWN"; transcript_name "RP11-34P13.3-001"; exon_number 3; exon_id "ENSE00001827679.1"; level 2; tag "not_best_in_genome_evidence"; tag "basic"; transcript_support_level "5"; havana_gene "OTTHUMG00000000959.2"; havana_transcript "OTTHUMT00000002840.1";

Example 2: modified non-model organism annotation for Pigeon

Here is an example of a pigeon-compatible annotation after it's been manually modified.

Pf3D7_13_v3     VEuPathDB       gene    21364   28787   .       +       .       gene_id "PF3D7_1300100"; transcript_id "PF3D7_1300100.1"; gene_name "PF3D7_1300100"; transcript_name "PF3D7_1300100.1"; biotype "test";
Pf3D7_13_v3     VEuPathDB       transcript      21364   28787   .       +       .       gene_id "PF3D7_1300100"; transcript_id "PF3D7_1300100.1"; gene_name "PF3D7_1300100"; transcript_name "PF3D7_1300100.1"; biotype "test";
Pf3D7_13_v3     VEuPathDB       exon    21364   26538   .       +       .       gene_id "PF3D7_1300100"; transcript_id "PF3D7_1300100.1"; gene_name "PF3D7_1300100"; transcript_name "PF3D7_1300100.1"; biotype "test";
Pf3D7_13_v3     VEuPathDB       exon    27474   28787   .       +       .       gene_id "PF3D7_1300100"; transcript_id "PF3D7_1300100.1"; gene_name "PF3D7_1300100"; transcript_name "PF3D7_1300100.1"; biotype "test";
Pf3D7_13_v3     VEuPathDB       CDS     21364   26538   .       +       0       Parent=PF3D7_1300100.1
Pf3D7_13_v3     VEuPathDB       CDS     27474   28787   .       +       0       Parent=PF3D7_1300100.1
Pf3D7_13_v3     VEuPathDB       gene    30605   31881   .       -       .       gene_id "PF3D7_1300200"; transcript_id "PF3D7_1300200.1"; gene_name "PF3D7_1300200"; transcript_name "PF3D7_1300200.1"; biotype "test";
Pf3D7_13_v3     VEuPathDB       transcript      30605   31881   .       -       .       gene_id "PF3D7_1300200"; transcript_id "PF3D7_1300200.1"; gene_name "PF3D7_1300200"; transcript_name "PF3D7_1300200.1"; biotype "test";
Pf3D7_13_v3     VEuPathDB       exon    30605   31597   .       -       .       gene_id "PF3D7_1300200"; transcript_id "PF3D7_1300200.1"; gene_name "PF3D7_1300200"; transcript_name "PF3D7_1300200.1"; biotype "test";
Pf3D7_13_v3     VEuPathDB       exon    31828   31881   .       -       .       gene_id "PF3D7_1300200"; transcript_id "PF3D7_1300200.1"; gene_name "PF3D7_1300200"; transcript_name "PF3D7_1300200.1"; biotype "test";
Pf3D7_13_v3     VEuPathDB       CDS     30605   31597   .       -       0       Parent=PF3D7_1300200.1
Pf3D7_13_v3     VEuPathDB       CDS     31828   31881   .       -       0       Parent=PF3D7_1300200.1
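
When adapting a non-Gencode annotation, most of the work is usually rewriting column 9 into the quoted key "value"; form shown in Example 2. The Python sketch below shows one way to build such lines. The helper names are hypothetical, and falling back to the gene ID when no gene name is available simply mirrors Example 2; it is an assumption, not a documented pigeon rule.

```python
def gtf_attributes(gene_id, transcript_id, gene_name=None):
    """Build a column-9 string with gene_id, transcript_id and gene_name."""
    # Assumption: reuse the gene ID as the name when none is known,
    # as Example 2 does.
    gene_name = gene_name or gene_id
    pairs = [
        ("gene_id", gene_id),
        ("transcript_id", transcript_id),
        ("gene_name", gene_name),
    ]
    return " ".join(f'{key} "{value}";' for key, value in pairs)


def gtf_line(chrom, source, feature, start, end, strand, attributes):
    """Join the nine GTF columns with tabs; score and frame are set to '.'."""
    return "\t".join(
        [chrom, source, feature, str(start), str(end), ".", strand, ".", attributes]
    )


attrs = gtf_attributes("PF3D7_1300100", "PF3D7_1300100.1")
print(gtf_line("Pf3D7_13_v3", "VEuPathDB", "transcript", 21364, 28787, "+", attrs))
```

After generating a file this way, run it through pigeon prepare to confirm that it is accepted.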

## Using pigeon prepare to check genomes and annotations

Example usages:

$ pigeon prepare annotation.gtf collapsed_isoforms.gff reference.fasta cage.bed

or

$ pigeon prepare reference_files.fofn