Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

VCF format issues with --write-vcf, - FORMAT field inconsistencies #175

Open
insilicool opened this issue Apr 18, 2023 · 1 comment
Open

Comments

@insilicool
Copy link

Hi,

I've been attempting to parse the vcfs generated by the --write-vcf option into R and encountering issues. I believe this has to do with the inconsistency in the number of fields in the FORMAT section.

e.g. with the CMRG truth set

module load mugqic_dev/python/2.7.13 &&
0.3.15/hap.py
--threads 3 --write-vcf --decompose --preprocess-truth
--engine vcfeval
HG002_GRCh37_CMRG_smallvar_v1.00.vcf.gz
HG002_GRCh37_CMRG_smallvar_v1.00.vcf.gz
-f HG002_GRCh37_CMRG_smallvar_v1.00.bed
-r Homo_sapiens.hs37d5.fa
--engine-vcfeval-template Homo_sapiens.hs37d5.SDF
-o comparisons_0.3.15/HG002-cmrg-happy_0.3.15

This results in variants looking like this:
bcftools query -f '%CHROM\t%POS\t%FORMAT\n' HG002-cmrg-happy_0.3.15.vcf.gz

Problematic calls:
21 47407748 GT:BD:BI:BVT:BLT:QQ 0/1:N:ti:SNP:het:. 0/1:N:ti:SNP:het:0
21 47407754 GT:BD:BI:BVT:BLT:QQ 0/1:N:i1_5:INDEL:het:. 0/1:N:i1_5:INDEL:het:0

Correctly formatted:
21 47408099 GT:BD:BK:QQ:BI:BVT:BLT 1/1:TP:gm:30:tv:SNP:homalt 1/1:TP:gm:30:tv:SNP:homalt
21 47408101 GT:BD:BK:QQ:BI:BVT:BLT 0/1:TP:gm:30:i16_plus:INDEL:het 0/1:TP:gm:30:i16_plus:INDEL:het

As you can see BK field is missing and QQ field is in a different location. Additionally, I don't see documentation explaining what a BD ="N" represents.

I first observed this in 0.3.12 but it persists in 0.3.15 as well.

Thanks for any help and insight you can provide.

  • Rob Eveleigh
@YuanTian1991
Copy link

My R code to parse the output:

library("data.table")
library("tidyverse")
library("reshape2")

message("Reading VCF...")
happy_vcf <- fread(happy_vcf_path, sep='\t', header = TRUE, skip = '#CHROM') %>%
    .[,INFO:=NULL]
  
  format_types <- unique(happy_vcf$FORMAT)
  
  truth <- list()
  query <- list()
  for(i in format_types) {
    message("Parsing ", i, "...")
    index <- which(happy_vcf$FORMAT == i)
    col_names <- str_split(i, ":")[[1]]
    
    truth[[i]] <- colsplit(happy_vcf[index, TRUTH], ":", names=col_names) %>% 
      setDT %>% .[,index:=index]
    
    query[[i]] <- colsplit(happy_vcf[index, QUERY], ":", names=col_names) %>% 
      setDT %>% .[,index:=index]
  }
  
  truth <- rbindlist(truth, fill = TRUE) %>% .[order(index), ]
  query <- rbindlist(query, fill = TRUE) %>% .[order(index), ]

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants