Admixed Genotype Simulation

A Snakemake pipeline for simulating genotypes for admixed population. The aim is to have fully automatic pipeline for simulating genotypes for admixed populations.

Note for updated software

Please see https://kangchenghou.github.io/admix-kit/simulate-admix-genotype.html for a more intuitive and easy-to-use interface.

Requirements

Reference phased genotype panels: download from https://mathgen.stats.ox.ac.uk/impute/1000GP_Phase3.html
admix-simu software: download from https://github.com/williamslab/admix-simu
HAPGEN2: download from https://mathgen.stats.ox.ac.uk/genetics_software/hapgen/hapgen2.html

Steps

Snakemake rule extract_raw: extract ancestry reference population and perform basic quality control. We extract the ancestral population according to 1KG sample files. And filter for bi-allelic SNPs that have minor allele frequency larger than the specificied one in all of the ancestral populations.
Snakemake rule extend_hapgen: extend the ancestral population using HAPGEN2.
Snakemake rule simulate_admixture: Simulate admixture population from the extended ancestral population.

Run

First set up the configuration, specified in config.yaml properly. Then submit the script

qsub snakemake.sh

Helper scripts

Extract the list of SNPs from 1kg that's overlap with UKB

import numpy as np
import pandas as pd
chr_i = 2
kg_legend = pd.read_csv(f'/u/project/pasaniuc/pasaniucdata/admixture/1000G_haplotype/1000GP_Phase3/1000GP_Phase3_chr{chr_i}.legend.gz', delim_whitespace=True)
ukb_bim = pd.read_csv(f"/u/project/sgss/UKBB/data/cal/{chr_i}.bim", delim_whitespace=True, header=None)
ukb_index = kg_legend['position'].isin(ukb_bim[3].values)
duplicated = kg_legend['position'].duplicated()
biallelic = kg_legend['TYPE'] == "Biallelic_SNP"
np.savetxt(f"data/ukb_array.snp_list", kg_legend.loc[ukb_index & (~duplicated) & biallelic, 'id'].values, fmt='%s')

Extract a subsample of SNPs from 1kg

import numpy as np
import pandas as pd
chr_i = 2
n_snp = 1_000
kg_legend = pd.read_csv(f'/u/project/pasaniuc/pasaniucdata/admixture/1000G_haplotype/1000GP_Phase3/1000GP_Phase3_chr{chr_i}.legend.gz', delim_whitespace=True)
maf_threshold = 0.01

maf_mode = "AND"
pops = ["AFR", "EUR"]
# filter biallelic SNPs and SNPs with the given MAF threshold in ALL / ANY population
maf_filter_index = ((maf_threshold < kg_legend[pops]) &
                    (kg_legend[pops] < 1 - maf_threshold))
if maf_mode == 'AND':
    maf_filter_index = maf_filter_index.all(axis=1)
elif maf_mode == 'OR':
    maf_filter_index = maf_filter_index.any(axis=1)

biallelic = kg_legend['TYPE'] == "Biallelic_SNP"
filter_index = np.where(maf_filter_index & biallelic)[0]
filter_index = filter_index[np.linspace(0, len(filter_index) - 1, num=n_snp + 2)[1:-1].astype(int)]

np.savetxt(f"data/kg_1k.snp_list", kg_legend.loc[filter_index, 'id'].values, fmt='%s')

Extract five regions of dense SNPs

import numpy as np
import pandas as pd
chr_i = 22
kg_legend = pd.read_csv(f'/u/project/pasaniuc/pasaniucdata/admixture/1000G_haplotype/1000GP_Phase3/1000GP_Phase3_chr{chr_i}.legend.gz', delim_whitespace=True)
ukb_bim = pd.read_csv(f"/u/project/pasaniuc/pasaniucdata/UKBB_IMPUTED_LD_SUMSTATS/genotype/raw/chr{chr_i}.bim", delim_whitespace=True, header=None)
ukb_index = kg_legend['position'].isin(ukb_bim[3].values)
partition = pd.read_csv(f"/u/project/pasaniuc/pasaniucdata/UKBB_IMPUTED_LD_SUMSTATS/partition/fourier_ls-chr{chr_i}.bed", delim_whitespace=True)
block_start, block_stop = partition.loc[10, "start"], partition.loc[15, "stop"]
duplicated = kg_legend['position'].duplicated()
biallelic = kg_legend['TYPE'] == "Biallelic_SNP"
snp_list = kg_legend.loc[ukb_index & (~duplicated) & 
              biallelic & 
              (block_start <= kg_legend['position']) &
              (kg_legend['position'] < block_stop), 'id'].values
np.savetxt(f"data/ukb_imputed_blocks.txt", snp_list, fmt='%s')

Name		Name	Last commit message	Last commit date
Latest commit History 14 Commits
.gitignore		.gitignore
README.md		README.md
Snakefile		Snakefile
config.yaml		config.yaml
snakemake.sh		snakemake.sh
utils.py		utils.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

.gitignore

.gitignore

README.md

README.md

Snakefile

Snakefile

config.yaml

config.yaml

snakemake.sh

snakemake.sh

utils.py

utils.py

Repository files navigation

Admixed Genotype Simulation

Note for updated software

Requirements

Steps

Run

Helper scripts

About

Releases

Packages

Languages

bogdanlab/admixed_genotype_simulation

Folders and files

Latest commit

History

Repository files navigation

Admixed Genotype Simulation

Note for updated software

Requirements

Steps

Run

Helper scripts

About

Resources

Stars

Watchers

Forks

Languages