Individual level PrediXcan: introduction, tutorials and manual

Introduction

In the following, we focus on the individual-level implementation of PrediXcan.

The method was originally implemented in this repository. However, maintenance and new features were hard to enact, so we re-implemented the method to integrate with latest MetaXcan features.

PrediXcan consists of two steps:

Predict gene expression (or whatever biology the models predict) in a cohort with available genotypes
Run associations to a trait measured in the cohort

The first step is implemented in Predict.py. The second step is implemented in PrediXcanAssociation.py. Both steps are wrapped in PrediXcan.py, although I find it more convenient to run each step at a time. Running one step at a time is easier to troubleshoot when running on a cohort for the first time.

The prediction models are trained and pre-compiled on specific data sets with their own human genome releases and variant definitions. We implemented a few rules to support variant matching from genotypes based on different variant definitions. In the following, mapping refers to the process of assigning a model variant to a genotype variant.

Originally, PrediXcan was applied to genes so we say "gene expression" a lot as it was the mechanism we initially studied. But conceptually, everything said here applies to any intermediate/molecular mechanism such as splicing or brain morphology. Whenever we say "gene", it generally could mean a splicing intron event, etc.

You can find a specification of command-line arguments below. We also show some examples of running predixcan.

Features

Handles multiple input genotype formats (vcf, bgen, text format)
For bgen, only those variants available to the model get loaded, saving a lot of time
Output can be stored in HDF5 for granular and efficient access.
Supports prediction models built upon different versions of the human genome.
- If your genotypes are in (say) hg19 and use one of the latest models defined on hg38, genotype variants can be lifted over behind the scenes from hg19 to hg38
If you use non-standard variants, you can provide a table mapping from your genotype variants to PrediXcan models.
It loads one variant at a time, so that the memory consumption is bounded by the predicted expression matrix (roughly genotype sample size * number of genes * float size). When using HDF5 storeage, this matrix is stored in disk so that memory is negligible.

The software assumes that no genotypes are missing. In vcf's, if an allele is denoted is missing, it is counted as zero dosage of that allele.

Command line overview

In the following, mapping refers to implemented rules to match genotype variants to model variants.

Predict.py:

--model_db_path: path to a sqlite file containing prediction models
--model_db_snp_key: optional. If provided, will load variant ids from an alternative column in the db. By default, PrediXcan uses rsids, and this works with Elastic Net models. For the more sophisticated MASHR models, --model_db_snp_key varID must be specified with this argument.
--liftover: Optional. Path to liftover chain. If provided, genomic coordinates will be mapped with liftover.
--zero_based_positions Optional flag. Lets PrediXcan note that genomic coordinates start at 0 (genome browser starts at 1).
--skip_palindromic: Optional flag. Palindromic/ambiguous snv (such as C/G or A/T) from the genotype will be ignored.
--bgen_genotypes: pattern of files with genotypes in bgen format (e.g. data/ukb/chr*.bgen)
- --bgen_use_rsid: Load rsid as variant id, otherwise load id. Rsids are convenient for some models where each variant is guaranteed to have an rsid. Not so much for other models.
- --force_colon: Some files have mixed separators in variant ids such as chr1:100_A_T. This converts on-the-fly to chr1:100:A:T.
--vcf_genotypes: pattern of vcf genotype files.
--vcf_mode:
- genotyped is meant for phased, genotyped vcfs that contain counts of each allele at each chromosome pair.
- imputed will load DS field as dosage. This is meant to work with imputed vcfs as generated by the Michigan Imputation Server.
--force_mapped_metadata: Optional. Mapping variants is inherently heuristic; uses the mapped variants definitions (coordinates and alleles) instead of the genotyped properites. i.e. a genotype variant might be chr1 100 C T and end up mapped to chr1_100_T_C_b38; this option tells PrediXcan to use T C instead of C T, and swaps allele dosages and frequencies appropriately, before matching to the model. This option is mostly meant for development purposes, as the PrediXcan method will match alleles of genotype and model variants anyway.
--text_genotypes: use the text format as specified below
--text_sample_ids: a file specifying samples. If specified alongside bgen or vcf, will use the file's IDs instead of the ones in the vcf or bgen.
--generate_sample_ids: Optional. Provide a number here to generate ids programmatically based on sample size.
--prediction_output: specify output (and output type) of predicted expression matrix (e.g. --prediction_output results/vcf_1000G_hg37_en_c/Whole_Blood__predict.h5 HDF5)
--prediction_summary_output: Optional. A separate file that will contain some additional information on the predictions (such as number of snps in the gene's models, number of snps used, etc).
--variant_mapping: Optional. Specify a way to map from genotype variant to model variants. This argument takes three values:
- A file with a table that states how to convert from genotype variants to model variants
- name of column with genotype variant ids
- name of column with model target variant ids
--on_the_fly_mapping: Optional. Specify a pattern to build a variant id from genotype variant properties. e.g. --on_the_fly_mapping METADATA "chr{}_{}_{}_{}_b38" will take the genotype variant's chromosome, position, alleles to build a variant id like chr1_123_A_G_b38. This will use the genotype properties, or if liftover is specified, the lifted coordinates.
--only_entries: Optional. This convenience argument specifies a whitelist of genes/splicing/etc to be computed, anything else will be disregarded.

PrediXcanAssociation.py:

--hdf5_expression_file: Load predicted expression matrix from an HDF5 file (as generated by Predict.py).
--expression_file: Alternatively, load predicted expression matrix from a text file (as generated by Predict.py).
--input_phenos_file: path to (text, tab separated file) file specifying phenotype values for each individual.
--input_phenos_column: name of column to be used as phenotype.
--covariates_file: path to (text, tab-separated) file specifying covariates for each individual.
--covariates: names of columns from the above file to use as covariates.
--output: path were to save the results.
--mode: can be linear (default), or logistic for dichotomic phenotypes.

PrediXcan.py takes the same arguments from PrediXcanAssociation.py and Predict.py. If no prediction output is specified, the predicted transcriptome matrix will not be stored. This is disrecommended and the matrix contains useful information.

All scripts support a --throw option: in case of error, the full software stack trace will be printed instead of a user-friendly error message.

Predict.py Arguments

Prediction models are SQLite databases with two tables, weights and extra. The weights table has columns gene rsid varID ref_allele eff_allele weight, where each variant has an ID with the format chr_pos_ref_eff_build. For example, the snp rs115095900 has variant ID chr10_101056744_G_C_b37. GTEx v8 model variants have the string chr in front, but some GTEx v7 models do not, so it's best to check the varID format beforehand by querying the weights table. By default, Predict.py predicts expression by matching the rsids in the genotype files to the rsids in the prediction model. However, for genotype files without rsids models, there are arguments that change it to varIDs.

Case 1: Same build genotype and model with rsids

The simplest case is when the genotype and prediction model variants are defined in the same build and the genotype files have rsids (Examples 1 and 2). Use the following arguments for VCF genotype files: --model_db_path, --vcf_genotypes, --vcf_mode, --prediction_output, --prediction_summary_output. For bgen, --model_db_path, --bgen_genotypes, --bgen_use_rsid, --prediction_output, --prediction_summary_output. And text dosage: --model_db_path, --text_genotypes, --text_sample_ids, --prediction_output, --prediction_summary_output.

Case 2: Same build genotype and model without rsids

Some VCFs have rsids, and others don't (Example 3). You can check by querying:

bcftools query -f '%ID\n' /Users/sabrinami/Desktop/1000G_test/ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz | head -3
rs367896724
rs540431307
rs555500075

bcftools query -f '%ID\n' /Users/sabrinami/Desktop/geuvadis_test/GEUVADIS.chr1.PH1PH2_465.IMPFRQFILT_BIALLELIC_PH.annotv2.genotypes.vcf.gz  | head -3
snp_1_10583
snp_1_10611
snp_1_13302

Or the model might have some variants without rsids, for example, the MASHR models. So the GEUVADIS genotypes and MASHR models will need to use varIDs to predict expression. The argument --on_the_fly_mapping METADATA builds an ID for each variant in the VCF, which can be matched to the varID in the prediction model. --model_db_snp_key varID specifies that the varID column, instead of rsid, should be used to match genotype variants. If both genotypes and models are in hg37, use arguments --model_db_path, --model_db_snp_key varID, --vcf_genotypes, --vcf_mode, --on_the_fly_mapping METADATA "{}_{}_{}_{}_b37", --prediction_output, --prediction_summary_output. The --on_the_fly_mapping depends on the varID format in the model, so for MASHR models, it should be "chr{}_{}_{}_{}_b38". The --on_the_fly_mapping argument is not necessary for text dosages, however, the --model_db_snp_key argument can be used depending on whether the text files have rsids or varIDs.

Case 3: Different build genotype and model

If genotypes and models are defined in different builds, there are two options. 1. Use rsids. Then the same arguments as in the Case 1 (Examples 1 and 2) will work. Or 2. Use varIDs, in the case of genotypes without rsids or MASHR models (Example 4). The --on_the_fly_mapping METADATA argument should still follow the prediction model varID format, including its build. The --liftover argument should be the liftover chain from the genotype build to the prediction model build. So running Predict.py with a VCF and model in different builds should have arguments --model_db_path, --model_db_snp_key varID, --vcf_genotypes, --vcf_mode, --on_the_fly_mapping METADATA, --liftover, --prediction_output, --prediction_summary_output.

Text dosage format

The dosage format consists of gzipped, tab separated text files without header. Each line looks like:

chromosome variant_id position allele1 allele2 MAF id1 ..... idn with:

chromosome: the chromosome identifier (i.e. 1)
variant_id: a unique string identifying the variant. It can be an rsid, it can be a string encoding properties (chr1_123_C_T_b38), or whatever.
position: position of variant in the chromosome
allele1: The non-effect allele (sometimes called "ancestral", or "ref" in Thousand Genomes)
allele2: effect allele (the one which dosage will be used to predict expression/splicing/etc)
MAF: allele frequency of allele 2 (unused at the moment)
id1 .. idn: each entry is the dosage (count/probabilities of allele 2 across the chromosomes)

It generally expects a samples file, which is a headerless, tab-separated file with columns: family id and individual id. If there are no family ids, duplicate the individual id column. Example:

HG00096 HG00096
HG00097 HG00097
HG00099 HG00099
HG00100 HG00100
...

Examples

You can try PrediXcan by downloading the following example data from Zenodo. The package contains minimalist example cases to show in the following tutorials. To unpack, run the following in a UNIX-like command line:

tar -xvpf predixcan_sample_data.tar

Prerequisites

In the following, we assume that you have downloaded the MetaXcan software software and all dependencies (such as Python 3.5 or later) as listed in MetaXcan's readme.

For convenience we assume that the path to MetaXcan is exposed to to your environment like so:

export $METAXCAN=/path/to/code/MetaXcan/software

Also, the folder with the sample data should be exposed like this:

export $DATA=/path/to/sample/data/untared

In the following, python3 stands for a valid python 3.5 (or greater) installation with all dependencies.

Example 1: VCF hg19-based genotype, on GTEx v8 Elastic Net Model

The easiest example is one where the genotypes and models use the same variants. Here we show how to run on Thousand Genomes genotypes from hg37 with the Elastic Net models:

printf "Predict expression\n\n"
python3 $METAXCAN/Predict.py \
--model_db_path $DATA/models/gtex_v8_en/en_Whole_Blood.db \
--vcf_genotypes $DATA/1000G_hg37/ALL.chr*.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz \
--vcf_mode genotyped \
--prediction_output $RESULTS/vcf_1000G_hg37_en/Whole_Blood__predict.txt \
--prediction_summary_output $RESULTS/vcf_1000G_hg37_en/Whole_Blood__summary.txt \
--verbosity 9 \
--throw

printf "association\n\n"
python3 $METAXCAN/PrediXcanAssociation.py \
--expression_file $RESULTS/vcf_1000G_hg37_en/Whole_Blood__predict.txt \
--input_phenos_file $DATA/1000G_hg37/random_pheno_1000G_hg37.txt \
--input_phenos_column pheno \
--output $RESULTS/vcf_1000G_hg37_en/Whole_Blood__association.txt \
--verbosity 9 \
--throw

Option: use the PrediXcan wrapper

You can run both steps in one command like so:

python3 $METAXCAN/PrediXcan.py \
--model_db_path $DATA/models/gtex_v8_en/en_Whole_Blood.db \
--vcf_genotypes $DATA/1000G_hg37/ALL.chr*.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz \
--vcf_mode genotyped \
--prediction_output $RESULTS/vcf_1000G_hg37_en_b/Whole_Blood__predict.txt \
--prediction_summary_output $RESULTS/vcf_1000G_hg37_en_b/Whole_Blood__summary.txt \
--input_phenos_file $DATA/1000G_hg37/random_pheno_1000G_hg37.txt \
--input_phenos_column pheno \
--output $RESULTS/vcf_1000G_hg37_en_b/Whole_Blood__association.txt \
--verbosity 9 \
--throw

Option: HDF5 storage

The following uses HDF5 for predicted expression. It is generally slower, as it stores the prediction matrix in disk, but it consumes a dramatically smaller amount of memory. This was implemented to run on large cohorts such as the UK Biobank

printf "Predict expression\n\n"
python3 $METAXCAN/Predict.py \
--model_db_path $DATA/models/gtex_v8_en/en_Whole_Blood.db \
--vcf_genotypes $DATA/1000G_hg37/ALL.chr*.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz \
--vcf_mode genotyped \
--prediction_output $RESULTS/vcf_1000G_hg37_en_c/Whole_Blood__predict.h5 HDF5 \
--prediction_summary_output $RESULTS/vcf_1000G_hg37_en_c/Whole_Blood__summary.txt \
--verbosity 9 \
--throw

printf "association\n\n"
python3 $METAXCAN/PrediXcanAssociation.py \
--hdf5_expression_file $RESULTS/vcf_1000G_hg37_en_c/Whole_Blood__predict.h5 \
--input_phenos_file $DATA/1000G_hg37/random_pheno_1000G_hg37.txt \
--input_phenos_column pheno \
--output $RESULTS/vcf_1000G_hg37_en_c/Whole_Blood__association.txt \
--verbosity 9 \
--throw

Example 2: bgen hg19-based genotype, on GTEx v8 Elastic Net Model

Then next example shows how to run with a bgen. The data in the example bgen is actually Thousand Genomes from hg37.

printf "Predict expression\n\n"
python3 $METAXCAN/Predict.py \
--model_db_path $DATA/models/gtex_v8_en/en_Whole_Blood.db \
--bgen_genotypes $DATA/1000G_hg37_bgen/chr22.bgen \
--bgen_use_rsid \
--prediction_output $RESULTS/bgen_1000G_hg37_en/Whole_Blood__predict.txt \
--prediction_summary_output $RESULTS/bgen_1000G_hg37_en/Whole_Blood__summary.txt \
--verbosity 9 \
--throw

printf "association\n\n"
python3 $METAXCAN/PrediXcanAssociation.py \
--expression_file $RESULTS/bgen_1000G_hg37_en/Whole_Blood__predict.txt \
--input_phenos_file $DATA/1000G_hg37/random_pheno_1000G_hg37.txt \
--input_phenos_column pheno \
--output $RESULTS/bgen_1000G_hg37_en/Whole_Blood__association.txt \
--verbosity 9 \
--throw

Example 3: VCF hg38-based genotype without rsids, on GTEx v8 Elastic Net Model

An interesting example consists of a genotype without variant ids whatsoever. However, using the genotype variants' coordinates, we can match to model variants.

#printf "Predict expression\n\n"
python3 $METAXCAN/Predict.py \
--model_db_path $DATA/models/gtex_v8_en/en_Whole_Blood.db \
--vcf_genotypes $DATA/1000G_hg38/ALL.chr22.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.vcf.gz \
--vcf_mode genotyped \
--variant_mapping $DATA/gtex_v8_eur_filtered_maf0.01_monoallelic_variants.txt.gz id rsid \
--on_the_fly_mapping METADATA "chr{}_{}_{}_{}_b38" \
--prediction_output $RESULTS/vcf_1000G_hg38_en/Whole_Blood__predict.txt \
--prediction_summary_output $RESULTS/vcf_1000G_hg38_en/Whole_Blood__summary.txt \
--verbosity 9 \
--throw


printf "association\n\n"
python3 $METAXCAN/PrediXcanAssociation.py \
--expression_file $RESULTS/vcf_1000G_hg38_en/Whole_Blood__predict.txt \
--input_phenos_file $DATA/1000G_hg38/random_pheno_1000G_hg38.txt \
--input_phenos_column pheno \
--output $RESULTS/vcf_1000G_hg38_en/Whole_Blood__association.txt \
--verbosity 9 \
--throw

Example 4: VCF hg19-based genotype, on GTEx v8 MASHR Model

The following example is trickier. It assumes a genotype based on hg19, and we want to apply GTEx v8 mashr models. This poses a problem because the MASHR models include many variants that lack an rsid, so a mapping rule must be used. Also, the MASHR models are defined on hg38 so that the genotype's coordinates do not match the models'. We therefore specify a lift over conversion and pattern to match from converted coordinates to model variants:

printf "Predict expression\n\n"
python3 $METAXCAN/Predict.py \
--model_db_path $DATA/models/gtex_v8_mashr/mashr_Whole_Blood.db \
--model_db_snp_key varID \
--vcf_genotypes $DATA/1000G_hg37/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz \
--vcf_mode genotyped \
--liftover $DATA/hg19ToHg38.over.chain.gz \
--on_the_fly_mapping METADATA "chr{}_{}_{}_{}_b38" \
--prediction_output $RESULTS/vcf_1000G_hg37_mashr/Whole_Blood__predict.txt \
--prediction_summary_output $RESULTS/vcf_1000G_hg37_mashr/Whole_Blood__summary.txt \
--verbosity 9 \
--throw

printf "association\n\n"
python3 $METAXCAN/PrediXcanAssociation.py \
--expression_file $RESULTS/vcf_1000G_hg37_mashr/Whole_Blood__predict.txt \
--input_phenos_file $DATA/1000G_hg37/random_pheno_1000G_hg37.txt \
--input_phenos_column pheno \
--output $RESULTS/vcf_1000G_hg37_mashr/Whole_Blood__association.txt \
--verbosity 9 \
--throw

Example 5: VCF hg38-based genotype (GTEx vcf), on GTEx v8 MASHR Model

export PRE=/home/data/test-PrediXcan-GTEx
export DATA=$PRE/data
export MODEL=$PRE/models
export RESULTS=$PRE/results/
export METAXCAN=$PRE/repos/MetaXcan-master/software
export VCFSMALL=$PRE/data/gtex-small-common-test.vcf.gz

printf "Predict expression\n\n"

python3   $METAXCAN/Predict.py \
--model_db_path $PRE/models/gtex_v8_mashr/mashr_Whole_Blood.db \
--model_db_snp_key varID \
--vcf_genotypes  $VCFSMALL \
--vcf_mode genotyped \
--prediction_output $RESULTS/Whole_Blood__predict.txt  \
--prediction_summary_output $RESULTS/Whole_Blood__summary.txt \
--verbosity 9 \
--throw \
--on_the_fly_mapping METADATA "{}_{}_{}_{}_b38"

** Notice that GTEx v8 vcf file has the prefix chr in the chromosome number, so the patter should be "{}{}{}{}b38" and not "chr{}{}{}_{}_b38" **

Provide feedback

Saved searches

Use saved searches to filter your results more quickly