B-GEX

Blood-based multi-tissue gene expression inference with Bayesian regression. The method is published in the Bioinformatics 2020. (https://doi.org/10.1093/bioinformatics/btaa239).

The inference models are tissue-specific, so we built an independent inference models for each tissue.

Preprocessing

Download data files from GTEx Portal and gencode:

GTEx_Analysis_2016-01-15_v7_RNASeQCv1.1.8_gene_tpm.gct
GTEx_v7_Annotations_SampleAttributesDS.txt
gencode.v30lift37.annotation.gtf

Collect pseudogenes into a file. Then remove pseudogene from gene_tpm

awk -F "\t" '$3 == "gene" && $9~/pseudogene/ {print}' gencode.v30lift37.annotation.gtf | cut -f9 | cut -d" " -f2,4 | sed 's/[";]//g' | sed 's/[\. ]/\t/g' | cut -f1,3 > gencode.v30lift37.pseudogene.txt

cut -f1 gencode.v30lift37.pseudogene.txt | grep -v -f - GTEx_Analysis_2016-01-15_v7_RNASeQCv1.1.8_gene_tpm.gct > GTEx_Analysis_2016-01-15_v7_RNASeQCv1.1.8_gene_tpm.gct.no_pseudogene

Main script

The rest of steps can be done with B-GEX_v2.ipynb on a linux x86_64 server. We have tested on 2 x Intel(R) Xeon(R) Gold 6144 CPU 3.50GHz, 128 Gb RAM.

How it works?

Feature selection

Feature is the blood gene expression profile. Target is the tissue gene expression profile.

First, we compute coefficient of variation of each features and rank them in descending order. We select those top 10% features to construct a low dimensional new subsets.

Second, we assume that one feature gene is important when the absolute of cosine similarity between vector. We use cosine similarity to reduce to [5,10,15,20,...] features per target gene.

Train and evaluate baseline LSR model with these feature sets. We found 10 features is an optimal feature set for most tissues.

Model training with optimal feature set

The optimal feature set used to make a linear regression model for each gene in each tissue.

The models’ performance is further analyzed at gene level with MAE, r(Pearsons correlation coefficient), RMSE, etc.

Interpretation of predict result

For convenience, we separate genes into predictable and non-predictable genes according to empirical quality check results.

Rule 1, if target gene’s MAE < 0.7 and r > 0.3, the gene is predictable.

Rule 2. If the ratio of predictable genes to total genes > 0.2, the tissue is predictable.

B-GEX outputs inferred gene expression values from blood gene expression profiles together with empirical quality check marks (MAE, RMSE, r, and predictability). Users can filter the output by these marks and should be cautious when the genes of interest fall into the non-predictable category.

Name		Name	Last commit message	Last commit date
Latest commit History 9 Commits
demo		demo
v1		v1
26tissues_samplesize.txt		26tissues_samplesize.txt
B-GEX_v2.ipynb		B-GEX_v2.ipynb
GTEx_v7_Annotations_SampleAttributesDS.txt		GTEx_v7_Annotations_SampleAttributesDS.txt
README.md		README.md
TISSUE_list.txt		TISSUE_list.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

demo

demo

v1

v1

26tissues_samplesize.txt

26tissues_samplesize.txt

B-GEX_v2.ipynb

B-GEX_v2.ipynb

GTEx_v7_Annotations_SampleAttributesDS.txt

GTEx_v7_Annotations_SampleAttributesDS.txt

README.md

README.md

TISSUE_list.txt

TISSUE_list.txt

Repository files navigation

B-GEX

Preprocessing

Main script

How it works?

Feature selection

Model training with optimal feature set

Interpretation of predict result

About

Releases

Packages

Languages

xuwenjian85/B-GEX

Folders and files

Latest commit

History

Repository files navigation

B-GEX

Preprocessing

Main script

How it works?

Feature selection

Model training with optimal feature set

Interpretation of predict result

About

Topics

Resources

Stars

Watchers

Forks

Languages