Haplomap

Haplotype-based computational genetic mapping

Installation

conda install -c bioconda haplomap

Usage

0. Variant calling

See variant calling in the workflow folder using GATK, BCFtools, svtools.

Note: You need to use Ensemble-vep to annotate your VCF files.

This code snap works for haplomap

bcftools view -f .,PASS ${vcf} | \
        vep --fasta ${reference} ${genome_build} \
        --format vcf --fork ${threads} --hgvs --force_overwrite \
        --uniprot --domains --symbol --regulatory --distance 1000 --biotype \
        --gene_phenotype MGI --check_existing  --pubmed --numbers \
        --offline --cache --variant_class \
        --gencode_basic --no_intergenic --individual all \
        -o ${output} --tab --compress_output gzip \

1. Construct variant panel

WARNING: If you use GATK pipeline or Structrual Variants, you have to filter and subset SVs or Indels first. Haplomap convert work best with variants from BCFtools output.

build/bin/haplomap convert --type snp \
                           -o ${HOME}/data/SNPS/chr18.txt ${HOME}/data/VCFs/chr18.vcf

# support stdin, but much slower
zcat ${HOME}/data/VCFs/chr18.vcf.gz | bin/haplomap convert -o ${HOME}/data/SNPS/chr18.txt

2. Construct annotation file from ensemble-vep results

build/bin/haplomap annotate -o ${HOME}/data/SNPS/chr18.annotation.txt \
                            --type snp \
                            --samples test_strains.txt \ # only annotate the selected strains
                            input.vep.txt

(Optional) If you'd like to use ANNOVAR (contributed by Boyoung Yoo @byoo1), see this

1. generate strain level gene annotation database (only run once), see here: scripts/gene_annotation
- this step generates 3 files for next step.
  - AA_by_strains_chr*.pkl
  - mm10_kgXref.txt
  - mm10_knownGene.txt
1. run annotateSNPs.py for each case (test_strains.txt) to get strain specific SNP annotation.

python scripts/annotateSNPs.py test_strains.txt chr18.txt \
                    AA_by_strains_chr18.pkl mm10_kgXref.txt mm10_knownGene.txt
                    genes_coding.txt genes_coding_transcript.txt

NOTE: Structural variants only support ensemble-vep inputs now !

3. Find haploblocks

build/bin/haplomap eblocks -a ${HOME}/data/SNPS/chr18.txt \
                     -g ${HOME}/data/chr18.annotation.txt \
                     -s ${HOME}/TMPDATA/test_strains.txt \
                     -o ${HOME}/TMPDATA/test.SNPs.hb.txt

4. Statistical testing with trait data

build/bin/haplomap ghmap -p ${HOME}/data/test_traits.txt \
                  -b ${HOME}/TMPDATA/test.SNPs.hb.txt \
                  -o ${HOME}/TMPDATA/test.final.output.txt

NOTE 1: By default. Output result are gene-summrized (see haplomap ghmap --help).
Recommend adding -a flag, which will output gene-oriented format results.

NOTE 2: strain order in (-p) should keep the same to the (-b). That's, eblocks (-s)

Input

see example folder for test cases.

eblocks:

Strain file (-s):
- Tree column txt file: "#Abbrev \t (Optional) \t Values "
- see test.strain.txt in the example folder
Allele file (-a): NIEHS compact format (use subcmd convert to convert vcf to niehs)
Gene Annotation (-g):
- format: " <SNP_{chr}_{postion}> <gene_name> < consequence> "
- see above to prepare this file

ghmap:

Trait file (-p):
- same as eblocks -s: "#Abbrev \t (Optional) \t Values "
- If multiple aninmal values for same strain, seperate them by comma. Example:
```
    129S1	18.2,19.1,14.3
    A_J	19.3,18.2
```
haploblocks (-b): eblocks output file
genetic distance matrix (-r): optional file, could obtain from plink.

How to get the genetic distance matrix

convert vcf to plink .tped, .tfam

haplomap convert  --plink \
                  -o ${HOME}/data/SNPS/chr1.snp.txt \
                  --type snp input.vcf

convert tped, tfam to .bed, .bim, .fam

plink --tfile chr1.snp --make-bed --out chr1

merge .bed files

plink --bfile chr1 --merge-list mergelist.txt --make-bed --out mouse_merged

note: the mergelist.txt in this format

chr2.bed chr2.bim chr2.fam
chr3.bed chr3.bim chr3.fam
...

Sample-distance and similarity matrices

plink --bfile mouse_merged # need .bim, .fam
      --make-rel square \ # Relationship/covariance
      --out mouse_grm \
      --threads 12

format to (ghmap -r) input

cut -f2 mouse_grm.rel.id | tr "\n" "\t" > mouse_grm.dist
echo "#$(cat mouse_grm.dist)" > mouse_grm.dist
cat mouse_grm.rel >> mouse_grm.dist

Output

eblocks:

SNP file (-p):

NO	Field	Explanation
0	Chrom	chromosome
1	Position	chromosome position (1-based)
2	SNP	SNP name
3	Pattern	Binary pattern of alleles, strain order is same to the input strain file
4	GeneName	Associated Gene
5	CodonFlag	Annotation

Haploblock file (-o):

NO	Field	Explanation
0	Chrom	chromosome idx
1	BlockIdx	Block id
2	BlockStart	SNP vector start index; row index (0-based) of eblock -p output
3	Size	SNP number of the HaploBlock
4	ChrBeg	Chromosome begin position (1-based)
5	ChrEnd	Chromosome end position
6	Pattern	Haplotype pattern, strain order is same to the input strain file
7	GeneName	Associated Gene
8	CodonFlag	Annotation

ghmap:

gene-oriented results file

NO	Field	Explanation
0	GeneName	Associated Gene
1	CodonFlag	-1: non-coding; 0: synonymouse; 1: missense; 2: splicing
2	Haplotype	Haplotype pattern, see header line for strain order
3	FStat/Pvalue	isCategorical ? Fstat : Pvalue
4	EffectSize	Genetic Effect ( Omega^2 )
5	FDR	Benjamini Hochberg. If categorical, skip
6	popPvalue	Pillai’s Trace Pvalue
7	popFDR	Pillai’s Trace FDR
8	Chr	Chromosome
9	ChrStart	Chromosome begin position (1-based)
10	ChrEnd	Chromosome end position
11	BlockIdx	HaploBlock's index; the second column of eblock -o output
12	BlockStart	SNPVector's index; row index (0-based) of eblock -p output
13	BlockSize	SNP number of HaploBlock
14	Expression	Gene expression Map, see header line

block-oriented result file

CodonFlag

i. SNPs

-1: Non-codon change
0: Synonymous (not important)
1: missense/nonsense
2: Splicing site change
3: Stop codon

ii. Indels and structral variants:

2: HIGH
1: MODERATE
0: LOW
-1: MODIFIER

See explanation here

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

Haplomap

Installation

Usage

0. Variant calling

1. Construct variant panel

2. Construct annotation file from ensemble-vep results

3. Find haploblocks

4. Statistical testing with trait data

Input

eblocks:

ghmap:

Output

Files

README.md

Latest commit

History

README.md

File metadata and controls

Haplomap

Installation

Usage

0. Variant calling

1. Construct variant panel

2. Construct annotation file from ensemble-vep results

3. Find haploblocks

4. Statistical testing with trait data

Input

eblocks:

ghmap:

Output