Skip to content

Latest commit

 

History

History
230 lines (180 loc) · 7.12 KB

README.md

File metadata and controls

230 lines (180 loc) · 7.12 KB

Haplomap

Haplotype-based computational genetic mapping

Installation

conda install -c bioconda haplomap

Haplomap

HBCGM

Usage

0. Variant calling

See variant calling in the workflow folder using GATK, BCFtools, svtools.

Note: You need to use Ensemble-vep to annotate your VCF files.

This code snap works for haplomap

bcftools view -f .,PASS ${vcf} | \
        vep --fasta ${reference} ${genome_build} \
        --format vcf --fork ${threads} --hgvs --force_overwrite \
        --uniprot --domains --symbol --regulatory --distance 1000 --biotype \
        --gene_phenotype MGI --check_existing  --pubmed --numbers \
        --offline --cache --variant_class \
        --gencode_basic --no_intergenic --individual all \
        -o ${output} --tab --compress_output gzip \

1. Construct variant panel

WARNING: If you use GATK pipeline or Structrual Variants, you have to filter and subset SVs or Indels first. Haplomap convert work best with variants from BCFtools output.

build/bin/haplomap convert --type snp \
                           -o ${HOME}/data/SNPS/chr18.txt ${HOME}/data/VCFs/chr18.vcf

# support stdin, but much slower
zcat ${HOME}/data/VCFs/chr18.vcf.gz | bin/haplomap convert -o ${HOME}/data/SNPS/chr18.txt

2. Construct annotation file from ensemble-vep results

build/bin/haplomap annotate -o ${HOME}/data/SNPS/chr18.annotation.txt \
                            --type snp \
                            --samples test_strains.txt \ # only annotate the selected strains
                            input.vep.txt

(Optional) If you'd like to use ANNOVAR (contributed by Boyoung Yoo @byoo1), see this

    1. generate strain level gene annotation database (only run once), see here: scripts/gene_annotation
    • this step generates 3 files for next step.
      • AA_by_strains_chr*.pkl
      • mm10_kgXref.txt
      • mm10_knownGene.txt
    1. run annotateSNPs.py for each case (test_strains.txt) to get strain specific SNP annotation.
python scripts/annotateSNPs.py test_strains.txt chr18.txt \
                    AA_by_strains_chr18.pkl mm10_kgXref.txt mm10_knownGene.txt
                    genes_coding.txt genes_coding_transcript.txt

NOTE: Structural variants only support ensemble-vep inputs now !

3. Find haploblocks

build/bin/haplomap eblocks -a ${HOME}/data/SNPS/chr18.txt \
                     -g ${HOME}/data/chr18.annotation.txt \
                     -s ${HOME}/TMPDATA/test_strains.txt \
                     -o ${HOME}/TMPDATA/test.SNPs.hb.txt

4. Statistical testing with trait data

build/bin/haplomap ghmap -p ${HOME}/data/test_traits.txt \
                  -b ${HOME}/TMPDATA/test.SNPs.hb.txt \
                  -o ${HOME}/TMPDATA/test.final.output.txt

NOTE 1: By default. Output result are gene-summrized (see haplomap ghmap --help).
Recommend adding -a flag, which will output gene-oriented format results.

NOTE 2: strain order in (-p) should keep the same to the (-b). That's, eblocks (-s)

Input

see example folder for test cases.

eblocks:

  • Strain file (-s):
    • Tree column txt file: "#Abbrev \t (Optional) \t Values "
    • see test.strain.txt in the example folder
  • Allele file (-a): NIEHS compact format (use subcmd convert to convert vcf to niehs)
  • Gene Annotation (-g):
    • format: " <SNP_{chr}_{postion}> <gene_name> < consequence> "
    • see above to prepare this file

ghmap:

  • Trait file (-p):
    • same as eblocks -s: "#Abbrev \t (Optional) \t Values "
    • If multiple aninmal values for same strain, seperate them by comma. Example:
        129S1	18.2,19.1,14.3
        A_J	19.3,18.2
    
  • haploblocks (-b): eblocks output file
  • genetic distance matrix (-r): optional file, could obtain from plink.

How to get the genetic distance matrix

  1. convert vcf to plink .tped, .tfam
haplomap convert  --plink \
                  -o ${HOME}/data/SNPS/chr1.snp.txt \
                  --type snp input.vcf
  1. convert tped, tfam to .bed, .bim, .fam
plink --tfile chr1.snp --make-bed --out chr1
  1. merge .bed files
plink --bfile chr1 --merge-list mergelist.txt --make-bed --out mouse_merged

note: the mergelist.txt in this format

chr2.bed chr2.bim chr2.fam
chr3.bed chr3.bim chr3.fam
...
  1. Sample-distance and similarity matrices
plink --bfile mouse_merged # need .bim, .fam
      --make-rel square \ # Relationship/covariance
      --out mouse_grm \
      --threads 12
  1. format to (ghmap -r) input
cut -f2 mouse_grm.rel.id | tr "\n" "\t" > mouse_grm.dist
echo "#$(cat mouse_grm.dist)" > mouse_grm.dist
cat mouse_grm.rel >> mouse_grm.dist

Output

  1. eblocks:
  • SNP file (-p):
NO Field Explanation
0 Chrom chromosome
1 Position chromosome position (1-based)
2 SNP SNP name
3 Pattern Binary pattern of alleles, strain order is same to the input strain file
4 GeneName Associated Gene
5 CodonFlag Annotation
  • Haploblock file (-o):
NO Field Explanation
0 Chrom chromosome idx
1 BlockIdx Block id
2 BlockStart SNP vector start index; row index (0-based) of eblock -p output
3 Size SNP number of the HaploBlock
4 ChrBeg Chromosome begin position (1-based)
5 ChrEnd Chromosome end position
6 Pattern Haplotype pattern, strain order is same to the input strain file
7 GeneName Associated Gene
8 CodonFlag Annotation
  1. ghmap:
  • gene-oriented results file
NO Field Explanation
0 GeneName Associated Gene
1 CodonFlag -1: non-coding; 0: synonymouse; 1: missense; 2: splicing
2 Haplotype Haplotype pattern, see header line for strain order
3 FStat/Pvalue isCategorical ? Fstat : Pvalue
4 EffectSize Genetic Effect ( Omega^2 )
5 FDR Benjamini Hochberg. If categorical, skip
6 popPvalue Pillai’s Trace Pvalue
7 popFDR Pillai’s Trace FDR
8 Chr Chromosome
9 ChrStart Chromosome begin position (1-based)
10 ChrEnd Chromosome end position
11 BlockIdx HaploBlock's index; the second column of eblock -o output
12 BlockStart SNPVector's index; row index (0-based) of eblock -p output
13 BlockSize SNP number of HaploBlock
14 Expression Gene expression Map, see header line
  • block-oriented result file

BlockIdx | BlockStart | blockSize | ChrIdx | ChrStart | ChrEnd | Pattern | Fstat/Pval | Effect | FDR | GeneExpMap | Gene | CodonFlag | ...

CodonFlag

i. SNPs

  • -1: Non-codon change
  • 0: Synonymous (not important)
  • 1: missense/nonsense
  • 2: Splicing site change
  • 3: Stop codon

ii. Indels and structral variants:

  • 2: HIGH
  • 1: MODERATE
  • 0: LOW
  • -1: MODIFIER

See explanation here