Alvaro Barbeira edited this page Mar 2, 2020 · 1 revision

Transcriptome Model general topics

Transcriptome models (a.k.a. 'weight DB files' or 'prediction models') play a key role in MetaXcan's calculations. The summary-based methods such as S-PrediXcan and S-MulTiXcan also depend on pre-computed LD reference panels (a.k.a. 'covariance matrices'). We have released matching pairs of prediction models and LD references in PredictDB. Each such pair has:

  • a .db file containing prediction models in an sqlite3 database that can be queried programmatically.
  • a .txt.gz file containing the covariances between the variants within each model.
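
As an illustration of querying a model database programmatically, here is a minimal sketch using Python's built-in sqlite3 module. The table name `weights` and the columns `gene`, `rsid` and `weight` are assumptions made for this example, as is the helper `snps_for_gene`; inspect the schema of your own .db file first (for instance with `SELECT name FROM sqlite_master`). A toy in-memory database stands in for a real file so the snippet runs on its own.

```python
import sqlite3


def snps_for_gene(connection, gene):
    """Return (rsid, weight) pairs for one gene, ordered by rsid.

    Assumes a table called `weights` with columns `gene`, `rsid`, `weight`
    (an assumption for this sketch, not a guaranteed schema).
    """
    cursor = connection.execute(
        "SELECT rsid, weight FROM weights WHERE gene = ? ORDER BY rsid",
        (gene,),
    )
    return cursor.fetchall()


# Toy in-memory database standing in for a real .db file.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE weights (gene TEXT, rsid TEXT, weight REAL)")
connection.executemany(
    "INSERT INTO weights VALUES (?, ?, ?)",
    [
        ("gene_1", "snp_1", 0.1),
        ("gene_1", "snp_2", -0.2),
        ("gene_2", "snp_4", 0.3),
    ],
)

gene_1_snps = snps_for_gene(connection, "gene_1")
print(gene_1_snps)
```

To try this against a real file, replace `":memory:"` with the path to the .db file and drop the table-creation lines.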

For example, if you want to use MASHR models from Whole Blood tissue, you must use mashr_Whole_Blood.db models with mashr_Whole_Blood.txt.gz. The point of pairing them is that the LD reference should closely match the cohort on which the models were trained. If your GWAS summary statistics come from a different ancestry, the models will have decreased performance, so prefer models trained on the population closest to the GWAS study.

GTEx v8 models

In the initial PrediXcan and MetaXcan implementation, we relied on Elastic Net models for gene expression. We later added models for other mechanisms such as splicing variation. In GTEx v8, we introduced a new family of prediction models called MASHR that exhibited dramatically superior performance. These MASHR models use fine-mapping information, are defined on hg38, and include many variants that lack an rsid. To leverage these models on GWAS from older human genome releases, some pre-processing is typically needed. The choices boil down to:

  • Using Elastic Net models that are rsid-based. This is straightforward but has less power.
  • Using MASHR models that may require harmonization and imputation as explained in this tutorial.

Covariance Matrix Issues

When computing S-PrediXcan, you might get:

Non-contiguous SNP Entries

This error occurs when users generate their own models and LD references through custom methods:

Snp Covariance Entries for genes must be contiguous but [MY_GENE] was found in two different places in the file.

MetaXcan covariance files are assumed to contain all SNP entries for a given gene in a contiguous manner, as in:

gene_1 snp_1 ...
gene_1 snp_2 ...
gene_1 snp_3 ...
gene_2 snp_4 ...
gene_2 snp_5 ...
...

If this is not the case, then there might have been an error when generating the covariance file.
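
A quick way to locate the offending genes is to scan the file once and record any gene whose entries show up in more than one block. The following is a minimal sketch, assuming whitespace-separated lines whose first field is the gene identifier, as in the layout above; the function name `find_non_contiguous_genes` is hypothetical and not part of MetaXcan.

```python
from itertools import groupby


def find_non_contiguous_genes(lines):
    """Return genes whose entries appear in more than one block of lines."""
    # Ignore blank lines so that line.split()[0] is always defined.
    rows = (line for line in lines if line.strip())
    seen = set()
    offenders = []
    # groupby yields one group per run of consecutive lines sharing a gene.
    for gene, _group in groupby(rows, key=lambda line: line.split()[0]):
        if gene in seen and gene not in offenders:
            offenders.append(gene)
        seen.add(gene)
    return offenders


contiguous = ["gene_1 snp_1", "gene_1 snp_2", "gene_2 snp_4"]
broken = ["gene_1 snp_1", "gene_2 snp_4", "gene_1 snp_2"]

print(find_non_contiguous_genes(contiguous))
print(find_non_contiguous_genes(broken))
```

For a real covariance file, the lines could come from `gzip.open(path, "rt")`. If the file starts with a header line, skip it before calling the function.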

For example, SNPs for a given gene might have been present in the files of different chromosomes (for example, because of a SNP naming mismatch) when the covariance was built. You can manually remove the stray entries, but you may prefer to find out why SNPs from different chromosomes are grouped under one gene in the hypothetical Transcriptome Model of this example. MetaXcan currently assumes that only 'cis' SNPs are relevant; 'trans' locality is something we haven't finished analyzing yet.

SNP entries in a Covariance File that are not present in the Transcriptome Model

Transcriptome Models are taken to be the defining authority when grouping SNPs for a given gene. All sorts of errors will occur if a Covariance Matrix contains SNPs that are not present in the weights of a gene's transcriptome model. So, if you are building custom Covariance Matrices, check that every SNP covariance entry for a gene has a matching SNP entry in the model.

The converse is not necessary: not all Transcriptome Model SNPs must be present in the covariance matrices. SNP data may be missing from the covariance matrix (for example, if you wanted to remove rare variants).
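
The one-directional rule above can be checked with plain set operations. This is a minimal sketch, assuming you have already loaded the SNPs per gene from the model and from the covariance file into dictionaries; the function name and the dictionary layout are hypothetical.

```python
def covariance_snps_missing_from_model(model_snps, covariance_snps):
    """Map each gene to the covariance SNPs that its model does not contain.

    Both arguments map a gene identifier to an iterable of SNP identifiers.
    Model SNPs that are absent from the covariance are allowed, so they are
    not reported.
    """
    missing = {}
    for gene, snps in covariance_snps.items():
        extra = set(snps) - set(model_snps.get(gene, ()))
        if extra:
            missing[gene] = sorted(extra)
    return missing


model_snps = {"gene_1": ["snp_1", "snp_2", "snp_3"], "gene_2": ["snp_4"]}
covariance_snps = {"gene_1": ["snp_1", "snp_2"], "gene_2": ["snp_4", "snp_5"]}

problems = covariance_snps_missing_from_model(model_snps, covariance_snps)
print(problems)
```

Here gene_1 passes even though snp_3 is missing from the covariance, while gene_2 is reported because snp_5 has no entry in the model.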

General Issues

I get "SyntaxError: invalid syntax" when running MetaXcan

MetaXcan is implemented with Python 3.5. If you run it with Python 2, you are going to get syntax errors.

Results look really weird!

MetaXcan scripts won't run if the specified output is already present, so the results you are looking at may be left over from a previous run. If you run MetaXcan's scripts several times, you will have to either provide different paths for the output arguments, manually delete previous results, or pass the "--overwrite" option.

I get the following warning: DtypeWarning: Columns () have mixed types. Specify dtype option on import or set low_memory=False.

This happens when pandas, a dependency library, parses a data set (such as a GWAS) in which a column contains a mix of heterogeneous values such as strings and numbers. An example is a column that contains mostly numeric values but also some missing values coded as . or NA. It could also be caused by ill-formatted GWAS files. Another typical case is a chromosome column that contains both the numbers 1 to 22 and the letters X and Y.

Please check if this is the case, and if not, let us know and we'll look into it.
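
If you want to inspect a GWAS file yourself before running MetaXcan, the two cases described above can be reproduced and handled in pandas. This is a minimal sketch on a toy table: the column names `snp`, `chromosome` and `zscore` are assumptions for illustration, and it shows how you might load the file for your own inspection, not how MetaXcan parses it internally.

```python
import io

import pandas as pd

# Toy tab-separated GWAS with a "." missing value and a mixed chromosome column.
gwas_text = (
    "snp\tchromosome\tzscore\n"
    "snp_1\t1\t1.2\n"
    "snp_2\t22\t.\n"
    "snp_3\tX\t-0.7\n"
)

gwas = pd.read_csv(
    io.StringIO(gwas_text),
    sep="\t",
    dtype={"chromosome": str},  # keep 1..22, X and Y as text
    na_values=["."],            # treat "." as a missing value
)

print(gwas)
```

Declaring the chromosome column as text and naming the missing-value code up front means pandas does not have to guess the column types.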