Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

seqGDS2VCF error after using seqMerge #68

Open
gustavahlberg opened this issue Jan 20, 2021 · 8 comments
Open

seqGDS2VCF error after using seqMerge #68

gustavahlberg opened this issue Jan 20, 2021 · 8 comments
Assignees
Labels

Comments

@gustavahlberg
Copy link

gustavahlberg commented Jan 20, 2021

After having merged some gds files I am getting an error with seqGDS2VCF due to seqSummary()
This stems from the .summary_filter() function were the attributes of the merged gds seems to somehow get twisted.

traceback()
4: stop(gettextf("arguments imply differing number of rows: %s",                                                                                                                                           
        paste(unique(nrows), collapse = ", ")), domain = NA)                                                                                                                                                
 3: data.frame(ID = id, Description = dp, stringsAsFactors = FALSE)                                                                                                                                         
 2: .summary_filter(gdsfile, check, verbose)                                                                                                                                                                
 1: seqSummary(gdsmerged, check = "none", verbose = FALSE)

Filter attribute in one of the original gds files

 $R.class                                                                                                                                                                                                   
 [1] "factor"                                                                                                                                                                                               
                                                                                                                                                                                                            
 $R.levels                                                                                                                                                                                                  
 [1] "PASS"        "MONOALLELIC"                                                                                                                                                                            
                                                                                                                                                                                                            
 $Description                                                                                                                                                                                               
 [1] "All filters passed"                                                                                                                                                                                   
 [2] "Site represents one ALT allele in a region with multiple variants that could not be unified into non-overlapping multi-allelic sites"                                                                 
                                                                                                                                                                                                            
 $md5                                                                                                                                                                                                       
 [1] "dde0f98a641ade2203cb5ab84d037576"  

Filter attribute in the merged gds

 $Description                                                                                                                                                                                               
  [1] "All filters passed"                                                                                                                                                                                  
  [2] "Site represents one ALT allele in a region with multiple variants that could not be unified into non-overlapping multi-allelic sites"                                                                
  [3] "All filters passed"                                                                                                                                                                                  
  [4] "Site represents one ALT allele in a region with multiple variants that could not be unified into non-overlapping multi-allelic sites"                                                                
  [5] "All filters passed"                                                                                                                                                                                  
  [6] "Site represents one ALT allele in a region with multiple variants that could not be unified into non-overlapping multi-allelic sites"                                                                
  [7] "All filters passed"                                                                                                                                                                                  
  [8] "Site represents one ALT allele in a region with multiple variants that could not be unified into non-overlapping multi-allelic sites"                                                                
  [9] "All filters passed"                                                                                                                                                                                  
 [10] "Site represents one ALT allele in a region with multiple variants that could not be unified into non-overlapping multi-allelic sites"                                                                
 [11] "All filters passed"                                                                                                                                                                                  
 [12] "Site represents one ALT allele in a region with multiple variants that could not be unified into non-overlapping multi-allelic sites"                                                                
 [13] "All filters passed"                                                                                                                                                                                  
 [14] "Site represents one ALT allele in a region with multiple variants that could not be unified into non-overlapping multi-allelic sites"                                                                
 [15] "All filters passed"                                                                                                                                                                                  
 [16] "Site represents one ALT allele in a region with multiple variants that could not be unified into non-overlapping multi-allelic sites"                                                                
 [17] "All filters passed"                                                                                                                                                                                  
 [18] "Site represents one ALT allele in a region with multiple variants that could not be unified into non-overlapping multi-allelic sites"                                                                
 [19] "All filters passed"                                                                                                                                                                                  
 [20] "Site represents one ALT allele in a region with multiple variants that could not be unified into non-overlapping multi-allelic sites"                                                                
 [21] "All filters passed"                                                                                                                                                                                  
 [22] "Site represents one ALT allele in a region with multiple variants that could not be unified into non-overlapping multi-allelic sites"                                                                
                                                                                                                                                                                                            
 $md5                                                                                                                                                                                                       
 [1] "754cf550034a5297723ac2e7ac510f9c"
sessionInfo()
 R version 4.0.0 (2020-04-24)                                                                                                                                                                               
 Platform: x86_64-pc-linux-gnu (64-bit)                                                                                                                                                                     
 Running under: CentOS Linux 7 (Core)                                                                                                                                                                       
                                                                                                                                                                                                            
 Matrix products: default                                                                                                                                                                                   
 BLAS/LAPACK: /services/tools/intel/perflibs/2019/compilers_and_libraries_2019.0.117/linux/mkl/lib/intel64_lin/libmkl_gf_lp64.so                                                                            
                                                                                                                                                                                                            
 locale:                                                                                                                                                                                                    
  [1] LC_CTYPE=en_US.utf-8       LC_NUMERIC=C                                                                                                                                                               
  [3] LC_TIME=en_US.utf-8        LC_COLLATE=en_US.utf-8                                                                                                                                                     
  [5] LC_MONETARY=en_US.utf-8    LC_MESSAGES=en_US.utf-8                                                                                                                                                    
  [7] LC_PAPER=en_US.utf-8       LC_NAME=C                                                                                                                                                                  
  [9] LC_ADDRESS=C               LC_TELEPHONE=C                                                                                                                                                             
 [11] LC_MEASUREMENT=en_US.utf-8 LC_IDENTIFICATION=C                                                                                                                                                        
                                                                                                                                                                                                            
 attached base packages:                                                                                                                                                                                    
 [1] stats     graphics  grDevices utils     datasets  methods   base                                                                                                                                       
                                                                                                                                                                                                            
 other attached packages:                                                                                                                                                                                   
 [1] GMMAT_1.3.1        SeqVarTools_1.28.0 SeqArray_1.30.0    gdsfmt_1.26.0                                                                                                                                 
                                                                                                                                                                                                            
 loaded via a namespace (and not attached):                                                                                                                                                                 
  [1] Rcpp_1.0.5             pillar_1.4.7           compiler_4.0.0                                                                                                                                          
  [4] bigparallelr_0.3.0     GenomeInfoDb_1.26.0    XVector_0.30.0                                                                                                                                          
  [7] bitops_1.0-6           iterators_1.0.13       tools_4.0.0                                                                                                                                             
 [10] zlibbioc_1.36.0        tibble_3.0.4           lifecycle_0.2.0                                                                                                                                         
 [13] nlme_3.1-147           lattice_0.20-41        mgcv_1.8-31                                                                                                                                             
 [16] logistf_1.24           pkgconfig_2.0.3        rlang_0.4.9                                                                                                                                             
 [19] Matrix_1.2-18          foreach_1.5.1          flock_0.7                                                                                                                                               
 [22] parallel_4.0.0         RhpcBLASctl_0.20-137   CompQuadForm_1.4.3                                                                                                                                      
 [25] GenomeInfoDbData_1.2.4 bigassertr_0.1.3       dplyr_1.0.2                                                                                                                                             
 [28] generics_0.1.0         vctrs_0.3.5            Biostrings_2.58.0                                                                                                                                       
 [31] S4Vectors_0.28.0       IRanges_2.24.0         tidyselect_1.1.0                                                                                                                                        
 [34] stats4_4.0.0           grid_4.0.0             data.table_1.13.4                                                                                                                                       
 [37] glue_1.4.2             mice_3.12.0            Biobase_2.50.0                                                                                                                                          
 [40] R6_2.5.0               GWASExactHW_1.01       BiocParallel_1.24.1                                                                                                                                     
 [43] tidyr_1.1.2            purrr_0.3.4            magrittr_2.0.1                                                                                                                                          
 [46] Rsamtools_2.6.0        backports_1.2.0        ellipsis_0.3.1                                                                                                                                          
 [49] codetools_0.2-16       operator.tools_1.6.3   formula.tools_1.7.1                                                                                                                                     
 [52] BiocGenerics_0.36.0    GenomicRanges_1.42.0   splines_4.0.0                                                                                                                                           
 [55] RCurl_1.98-1.2         doParallel_1.0.16      broom_0.7.2                                                                                                                                             
 [58] crayon_1.3.4                         
@zhengxwen zhengxwen self-assigned this Jan 22, 2021
@zhengxwen zhengxwen added the bug label Jan 22, 2021
@gustavahlberg
Copy link
Author

Hi Zheng
Will you be addressing this anytime soon?
Just need to know cause otherwise I will make a quick fix myself so I can move on.

Bw,
Gustav

@zhengxwen
Copy link
Owner

A quick fix could be:

f <- seqOpen("your_gds_file.gds", readonly=F)

node <- index.gdsn(f, "annotation/filter")
(s <- get.attr.gdsn(node)$Description)

put.attr.gdsn(node, "Description", unique(s))

seqClose(f)

@gustavahlberg
Copy link
Author

Ah yes. Thank you!

@zhengxwen
Copy link
Owner

In your case, did you merge multiple gds files with different variants? or multiple gds files with different samples?

@gustavahlberg
Copy link
Author

different variants. Actually merging the 200K UKB exome vcfs (gds).

@gustavahlberg
Copy link
Author

gustavahlberg commented Feb 1, 2021

seqGDS2VCF is also printing "." filter as NA which cause some downstream error since NA is not declared in the header. And I don't know how to rewrite the header in the gds without messing something up.
Also the contigs are converted to 'chrX' -> 'X' (which I prefer) but that is not changed in contigs listed in the header, which also cause some downstream errors.
I tried to fix using write.gdsn() for "description/vcf.contig/ID" but it doesn't work.

@gustavahlberg
Copy link
Author

Sorry, solved the last by just using add.gdsn instead of write.gdsn. And I added NA as a filterlevel in filter attribute which seem to work

@zhengxwen
Copy link
Owner

"seqGDS2VCF is also printing "." filter as NA."
It is fixed in the development version:
http://www.bioconductor.org/packages/devel/bioc/news/SeqArray/NEWS

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants