Skip to content

mulinlab/CAUSALdb-finemapping-pip

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

16 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Fine-mapping

Description

This is a pipeline of summary statistics based fine-mapping using three commonly-used tools.

Contents

Getting started

Clone this repo:

git clone https://github.com/mulinlab/CAUSALdb-finemapping-pip.git

set up conda environment and download fine-mapping tools:

cd bin
bash 00_set_up.sh
cd ..

build reference panel:

cd ref
python 01_prepare_reference.py
cd ..

❗ In PAINTOR's framework, they go through every VCF file when computing LD which is very time-consuming. So I processed the VCF files in 1000G FTP by splitting them into blocks and convert them into genotype matrix. Although the preparation will cost hours, it can speed up at 30 times. If you cannot wait for the reference preparation, you can get the ready data from 115...212:/f/jianhua/jianhua_pipeline/Fine-mapping/ref. Also, if you are skilled with python, you can figure out what else I have done with the original VCF files.

Usage

I wrote the whole pipeline in a python script and it quite simple after activating the conda environment.

I also wrote a brief help:

(base) ➜  conda activate finemap
(finemap) ➜  python fine_map_pipe.py -h
    
usage: Finemap using three tools, output variants in total credible sets

python fine_map_pipe.py -s 120575 SUM_STAT ./

positional arguments:
  input               input summary statistics
  output              directory of output file, default:working directory

optional arguments:
  -h, --help          show this help message and exit
  -p , --pop          population of input data,[EUR,EAS,SAS,AMR,AFR],
                      default=EUR
  -n , --maxcausal    the maximum number of allowed causal SNPs, default=1
  -s , --samplesize   sample size of input data
  -c , --cred         the cutoff of credible set, default=0.95
  -t , --thread       number of threads, default=10

So we can finemap the test data in input folder which is can be browsed in CAUSALdb using:

python fine_map_pipe.py -s 120575 ./input/GR018.txt.gz output

Input

Just like the test data, the input file should be a txt file or .txt.gz file containing below information:

Column Description
CHR chromosome, int, 1-22
BP Base-pair, int, hg19
rsID can be NA
MAF Minor allele frequency, (0, 0.5]
EA Effective allele, forward strand, capital
NEA Non-effective allele, forward strand, capital
BETA Effect size
SE Standard error
P P value, (0, 1)
Zscore Equal to BETA/SE
zless input/GR018.txt.gz| head -n2
CHR	BP	rsID	MAF	EA	NEA	BETA	SE	P	Zscore
1	762320	exm2268640		C	T	-2.1671804559878427	2.2192489884814903	0.3288	-0.9765377689642318

Output

The output is a txt file including summary statistics and posterior probability of variants in all credible sets.

The example output is:

head -n2 ./output/GR018_total_credible_set.txt
CHR	BP	rsID	MAF	EA	NEA	BETA	SE	P	Zscore	PAINTOR	CAVIARBF	FINEMAP	block_id	label
1	55505647	rs11591147	0.020899999999999967	T	G	-1.4101773015832226	0.2294923644583753	8.033e-10	-6.144767844069152	0.946675	0.9511994	0.950832	34	7

I appended five columns in the input file:

Column Description
PAINTOR Posterior probability computing by PAINTOR
CAVIARBF Posterior probability computing by CAVIARBF
FINEMAP Posterior probability computing by FINEMAP
block_id No. of block in /ref/blocks.txt
label Because the credible set defined by the three tools are different, I used a this column to annotate the type of credible set of the variants. And the rule is PAINTOR:1, CAVIARBF: 2, FINEMAP: 4. So label = 5 means this variant is in the credible set defined by PAINTOR and FINEMAP, but not in CAVIARBF's credible set.

Caution

  • The default maximum number of allowed causal SNPs is 1, I haven't tested more than 1 causal SNP for many times, so I'm not sure whether it can work in all condition.
  • We used the relatively independent LD blocks determined by ldtect which only include three population. And considering the preparation of reference panel, I used blocks of EUR population for all condition as it's the most popular population.

Reference

Citation

Wang J, Huang D, Zhou Y, Yao H, Liu H, Zhai S, Wu C, Zheng Z, Zhao K, Wang Z, Yi X, Zhang S, Liu X, Liu Z, Chen K, Yu Y, Sham PC, Li MJ. CAUSALdb: a database for disease/trait causal variants identified using summary statistics of genome-wide association studies. Nucleic Acids Res. 2020 Jan 8;48(D1):D807-D816.