Python interface to access reference genome features (such as genes, transcripts, and exons) from Ensembl

openvax/pyensembl

PyEnsembl

PyEnsembl is a Python interface to Ensembl reference genome metadata such as exons and transcripts. PyEnsembl downloads GTF and FASTA files from the Ensembl FTP server and loads them into a local database. PyEnsembl can also work with custom reference data specified using user-supplied GTF and FASTA files.

Example Usage

from pyensembl import EnsemblRelease

# release 77 uses human reference genome GRCh38
data = EnsemblRelease(77)

# will return ['HLA-A']
gene_names = data.gene_names_at_locus(contig=6, position=29945884)

# get all exons associated with HLA-A
exon_ids = data.exon_ids_of_gene_name('HLA-A')

Installation

You can install PyEnsembl using pip:

pip install pyensembl

This should also install any required packages such as datacache.

Before using PyEnsembl, run the following command to download and install Ensembl data:

pyensembl install --release <list of Ensembl release numbers> --species <species-name>

For example, pyensembl install --release 75 76 --species human will download and install all human reference data from Ensembl releases 75 and 76.
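
If the install step needs to run from a setup script, the command line above could be assembled and invoked from Python. This is a minimal sketch that uses only the `--release` and `--species` options shown above; the helper names `build_install_command` and `install_releases` are hypothetical and not part of PyEnsembl.

```python
import subprocess


def build_install_command(releases, species):
    """Return the argument list for `pyensembl install` (hypothetical helper)."""
    if not releases:
        raise ValueError("at least one Ensembl release number is required")
    return (
        ["pyensembl", "install", "--release"]
        + [str(release) for release in releases]
        + ["--species", species]
    )


def install_releases(releases, species):
    """Run the install command; raises CalledProcessError if it exits non-zero."""
    subprocess.run(build_install_command(releases, species), check=True)
```

Calling `install_releases([75, 76], "human")` would then run the same command as the example above, assuming the `pyensembl` executable is on the PATH.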

Alternatively, you can create the EnsemblRelease object from inside a Python process and call ensembl_object.download() followed by ensembl_object.index().
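
The in-process route can be sketched as follows. The helper `prepare_releases` is hypothetical; it relies only on the `download()` and `index()` methods mentioned above and takes the class to instantiate as an argument, so it does not assume anything else about the PyEnsembl API.

```python
def prepare_releases(release_numbers, make_release):
    """Download and index each release, returning {release_number: object}.

    `make_release` is any callable that builds a release object from a
    release number, for example the EnsemblRelease class.
    """
    prepared = {}
    for number in release_numbers:
        release = make_release(number)
        release.download()  # fetch the GTF and FASTA files
        release.index()     # build the local database from them
        prepared[number] = release
    return prepared
```

With PyEnsembl installed, this might be called as `prepare_releases([75, 76], EnsemblRelease)` after `from pyensembl import EnsemblRelease`.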

Cache Location

By default, PyEnsembl caches files in a pyensembl sub-directory of the platform-specific cache folder. You can override this default by setting the environment variable PYENSEMBL_CACHE_DIR to your preferred cache location:

export PYENSEMBL_CACHE_DIR=/custom/cache/dir

or

import os

os.environ['PYENSEMBL_CACHE_DIR'] = '/custom/cache/dir'
# ... PyEnsembl API usage

Usage tips

List installed genomes

To see the genomes for which PyEnsembl has already downloaded and indexed metadata, you can run:

pyensembl list

Or equivalently do this in Python:

from pyensembl.shell import collect_all_installed_ensembl_releases
collect_all_installed_ensembl_releases()

Load genome in Python

Here's an example Python snippet that loads fly genome data from Ensembl release v100:

from pyensembl import EnsemblRelease
data = EnsemblRelease(release=100, species='drosophila_melanogaster')

Data structures

Gene

# `data` is the EnsemblRelease object loaded in the snippet above
gene = data.gene_by_id(gene_id='FBgn0011747')

Transcript

transcript = gene.transcripts[0]

Protein information

transcript.protein_id
transcript.protein_sequence
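
To collect protein information for every transcript of a gene, the attributes above can be combined in a loop. This is a sketch under one assumption: that a transcript without a protein product reports a falsy `protein_id` (such as None), so those transcripts are skipped. The helper name `protein_sequences_of_gene` is hypothetical.

```python
def protein_sequences_of_gene(gene):
    """Map protein ID -> protein sequence for the transcripts of `gene`.

    Assumption: transcripts with no protein product have a falsy protein_id.
    """
    sequences = {}
    for transcript in gene.transcripts:
        protein_id = transcript.protein_id
        if not protein_id:
            continue  # skip transcripts with no annotated protein
        sequences[protein_id] = transcript.protein_sequence
    return sequences
```

For the `gene` object above, `protein_sequences_of_gene(gene)` would return one entry per protein-coding transcript.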

Non-Ensembl Data

PyEnsembl also allows arbitrary genomes via the specification of local file paths or remote URLs to both Ensembl and non-Ensembl GTF and FASTA files. (Warning: GTF formats can vary, and handling of non-Ensembl data is still very much in development.)

For example:

from pyensembl import Genome
data = Genome(
    reference_name='GRCh38',
    annotation_name='my_genome_features',
    # annotation_version=None,
    gtf_path_or_url='/My/local/gtf/path_to_my_genome_features.gtf', # Path or URL of GTF file
    # transcript_fasta_paths_or_urls=None, # List of paths or URLs of FASTA files containing transcript sequences
    # protein_fasta_paths_or_urls=None, # List of paths or URLs of FASTA files containing protein sequences
    # cache_directory_path=None, # Where to place downloaded and cached files for this genome
)
# parse GTF and construct database of genomic features
data.index()
gene_names = data.gene_names_at_locus(contig=6, position=29945884)

API

The EnsemblRelease object has methods to let you access all possible combinations of the annotation features gene_name, gene_id, transcript_name, transcript_id, exon_id as well as the location of these genomic elements (contig, start position, end position, strand).

Genes

genes(contig=None, strand=None)
Returns a list of Gene objects, optionally restricted to a particular contig or strand.
genes_at_locus(contig, position, end=None, strand=None)
Returns a list of Gene objects overlapping a particular position on a contig. Pass end to extend the query to a range, and pass strand='+' or strand='-' to restrict results to the forward or backward strand.
gene_by_id(gene_id)
Return a Gene object for given Ensembl gene ID (e.g. "ENSG00000068793").
gene_names(contig=None, strand=None)
Returns all gene names in the annotation database, optionally restricted to a particular contig or strand.
genes_by_name(gene_name)
Gets all the unique genes with the given name (there might be multiple due to copies in the genome) and returns a list containing a Gene object for each distinct ID.
gene_by_protein_id(protein_id)
Find Gene associated with the given Ensembl protein ID (e.g. "ENSP00000350283")
gene_names_at_locus(contig, position, end=None, strand=None)
Names of genes overlapping with the given locus, optionally restricted by strand. (returns a list to account for overlapping genes)
gene_name_of_gene_id(gene_id)
Returns name of gene with given gene ID.
gene_name_of_transcript_id(transcript_id)
Returns name of gene associated with given transcript ID.
gene_name_of_transcript_name(transcript_name)
Returns name of gene associated with given transcript name.
gene_name_of_exon_id(exon_id)
Returns name of gene associated with given exon ID.
gene_ids(contig=None, strand=None)
Return all gene IDs in the annotation database, optionally restricted by chromosome name or strand.
gene_ids_of_gene_name(gene_name)
Returns all Ensembl gene IDs with the given name.
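
As an illustration of the strand parameter, the sketch below queries the same locus once per strand using gene_names_at_locus with the signature listed above. The helper name `gene_names_by_strand` is hypothetical; `genome` is any object exposing that method, such as an EnsemblRelease.

```python
def gene_names_by_strand(genome, contig, position, end=None):
    """Return {'+': [...], '-': [...]} with sorted gene names for each strand."""
    return {
        strand: sorted(
            genome.gene_names_at_locus(contig, position, end=end, strand=strand)
        )
        for strand in ("+", "-")
    }
```

Splitting the result by strand makes it easier to spot loci where genes on opposite strands overlap.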

Transcripts

transcripts(contig=None, strand=None)
Returns a list of Transcript objects for all transcript entries in the Ensembl database, optionally restricted to a particular contig or strand.
transcript_by_id(transcript_id)
Construct a Transcript object for given Ensembl transcript ID (e.g. "ENST00000369985")
transcripts_by_name(transcript_name)
Returns a list of Transcript objects for every transcript matching the given name.
transcript_names(contig=None, strand=None)
Returns all transcript names in the annotation database.
transcript_ids(contig=None, strand=None)
Returns all transcript IDs in the annotation database.
transcript_ids_of_gene_id(gene_id)
Return IDs of all transcripts associated with given gene ID.
transcript_ids_of_gene_name(gene_name)
Return IDs of all transcripts associated with given gene name.
transcript_ids_of_transcript_name(transcript_name)
Find all Ensembl transcript IDs with the given name.
transcript_ids_of_exon_id(exon_id)
Return IDs of all transcripts associated with given exon ID.
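
Because one gene name can correspond to several gene IDs, it may be useful to group transcript IDs by gene ID. The sketch below does this with gene_ids_of_gene_name and transcript_ids_of_gene_id from the lists above; the helper name `transcript_ids_by_gene_id` is hypothetical.

```python
def transcript_ids_by_gene_id(genome, gene_name):
    """Map each gene ID carrying `gene_name` to its list of transcript IDs."""
    return {
        gene_id: list(genome.transcript_ids_of_gene_id(gene_id))
        for gene_id in genome.gene_ids_of_gene_name(gene_name)
    }
```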

Exons

exon_ids(contig=None, strand=None)
Returns a list of exon IDs in the annotation database, optionally restricted by the given chromosome and strand.
exon_by_id(exon_id)
Construct an Exon object for given Ensembl exon ID (e.g. "ENSE00001209410")
exon_ids_of_gene_id(gene_id)
Returns a list of exon IDs associated with a given gene ID.
exon_ids_of_gene_name(gene_name)
Returns a list of exon IDs associated with a given gene name.
exon_ids_of_transcript_id(transcript_id)
Returns a list of exon IDs associated with a given transcript ID.
exon_ids_of_transcript_name(transcript_name)
Returns a list of exon IDs associated with a given transcript name.
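
The exon and transcript lookups can be chained. The sketch below lists the exon IDs of each transcript of a gene and then finds the exons present in every transcript. It uses only transcript_ids_of_gene_name and exon_ids_of_transcript_id from the lists above; both helper names are hypothetical.

```python
def exon_ids_by_transcript(genome, gene_name):
    """Map each transcript ID of `gene_name` to its list of exon IDs."""
    return {
        transcript_id: list(genome.exon_ids_of_transcript_id(transcript_id))
        for transcript_id in genome.transcript_ids_of_gene_name(gene_name)
    }


def exons_shared_by_all_transcripts(genome, gene_name):
    """Return the set of exon IDs that appear in every transcript of the gene."""
    per_transcript = exon_ids_by_transcript(genome, gene_name)
    if not per_transcript:
        return set()  # unknown gene name or no transcripts
    return set.intersection(*(set(ids) for ids in per_transcript.values()))
```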