Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error running vcf2xg.sh on human data #42

Open
moldach opened this issue Nov 12, 2020 · 30 comments · Fixed by #46
Open

Error running vcf2xg.sh on human data #42

moldach opened this issue Nov 12, 2020 · 30 comments · Fixed by #46

Comments

@moldach
Copy link

moldach commented Nov 12, 2020

I've returned to give MoMI-G another shot, this time with human data but I get an error trying to generate the xg file.

$ ~/MoMI-G$ sudo bash MoMI-G/scripts/vcf2xg.sh proband_trim_bwaMEM_sort_dedupped.bam.generator.V2.overlap.hashcount.fastq.bam.FINAL.vcf momig-vcf-test-output MoMI-G/scripts/vg Homo_sapiens.GRCh37.dna.toplevel.fa


+ input=.//momig-vcf-test-output.pcf
+ reference=Homo_sapiens.GRCh37.dna.toplevel.fa
+ pref=.//momig-vcf-test-output.pcf
+ sort -t, -k 7n .//momig-vcf-test-output.pcf
+ sed -e 's/[ ]*//g'
+ awk -F '[,:]' '{print $1,"\t",$2,"\n",$4,"\t",$5}' /dev/fd/63
+ sort -k 1,1 -k 2,2n
+ uniq
++ sed 1d .//momig-vcf-test-output.pcf
+ '[' '!' -s Homo_sapiens.GRCh37.dna.toplevel.fa.fai ']'
++ dirname MoMI-G/scripts/vcf2gfa/pcf2gfa.sh
+ ruby MoMI-G/scripts/vcf2gfa/json_to_breakpoint_list.rb .//momig-vcf-test-output.pcf.bp.tsv Homo_sapiens.GRCh37.dna.toplevel.fa
+ cat .//momig-vcf-test-output.pcf.bp.tsv breakpoint_list_tmp.tsv
+ sort -k 1,1 -k 2,2n
+ uniq
++ dirname MoMI-G/scripts/vcf2gfa/pcf2gfa.sh
+ ruby MoMI-G/scripts/vcf2gfa/gfa_generator.rb .//momig-vcf-test-output.pcf.bp.merged.tsv .//momig-vcf-test-output.pcf.output.pcf Homo_sapiens.GRCh37.dna.toplevel.fa
Traceback (most recent call last):
	5: from MoMI-G/scripts/vcf2gfa/gfa_generator.rb:42:in `<main>'
	4: from MoMI-G/scripts/vcf2gfa/gfa_generator.rb:42:in `open'
	3: from MoMI-G/scripts/vcf2gfa/gfa_generator.rb:43:in `block in <main>'
	2: from MoMI-G/scripts/vcf2gfa/gfa_generator.rb:43:in `each_line'
	1: from MoMI-G/scripts/vcf2gfa/gfa_generator.rb:49:in `block (2 levels) in <main>'
MoMI-G/scripts/vcf2gfa/gfa_generator.rb:25:in `fasta': [ERROR] unexpected sequence :0- (RuntimeError)
	5: from MoMI-G/scripts/vcf2gfa/gfa_generator.rb:42:in `<main>'
	4: from MoMI-G/scripts/vcf2gfa/gfa_generator.rb:42:in `open'
	3: from MoMI-G/scripts/vcf2gfa/gfa_generator.rb:43:in `block in <main>'
	2: from MoMI-G/scripts/vcf2gfa/gfa_generator.rb:43:in `each_line'
	1: from MoMI-G/scripts/vcf2gfa/gfa_generator.rb:48:in `block (2 levels) in <main>'
MoMI-G/scripts/vcf2gfa/gfa_generator.rb:51:in `rescue in block (2 levels) in <main>': [ERROR] in 1 0 : [ERROR] unexpected sequence :0- (RuntimeError)

This is the output files:

(base) mtg@mtg-ThinkPad-P53:~/MoMI-G$ ls -al
total 31834880
drwxrwxr-x  4 mtg  mtg         4096 Nov 12 10:40 ./
drwx------ 78 mtg  mtg        24576 Nov 11 14:57 ../
-rw-r--r--  1 root root    14029435 Nov 12 10:40 breakpoint_list_tmp.tsv
-rw-rw-r--  1 mtg  mtg          798 Nov 11 15:04 config.yaml
-rwxr-x---  1 mtg  mtg  32570478447 Nov 11 21:48 Homo_sapiens.GRCh37.dna.toplevel.fa*
-rwxr-x---  1 mtg  mtg        11646 Nov 12 09:04 Homo_sapiens.GRCh37.dna.toplevel.fa.fai*
drwxrwxr-x  9 mtg  mtg         4096 Nov 11 15:06 MoMI-G/
-rw-r--r--  1 root root          11 Nov 12 10:40 momig-vcf-test-output.ggf
-rw-r--r--  1 root root         111 Nov 12 10:40 momig-vcf-test-output.json
-rw-r--r--  1 root root       13492 Nov 12 10:40 momig-vcf-test-output.pcf
-rw-r--r--  1 root root    14032093 Nov 12 10:40 momig-vcf-test-output.pcf.bp.merged.tsv
-rw-r--r--  1 root root        2658 Nov 12 10:40 momig-vcf-test-output.pcf.bp.tsv
-rw-r--r--  1 root root       13492 Nov 12 10:40 momig-vcf-test-output.pcf.output.pcf
-rw-r--r--  1 root root         558 Nov 12 10:40 momig-vcf-test-output.vg
-rw-r--r--  1 root root        3330 Nov 12 10:40 momig-vcf-test-output.xg
-rw-r--r--  1 mtg  mtg        62805 Nov 11 15:12 proband_trim_bwaMEM_sort_dedupped.bam.generator.V2.overlap.hashcount.fastq.bam.FINAL.vcf
-rw-r--r--  1 root root       62805 Nov 12 10:40 proband_trim_bwaMEM_sort_dedupped.bam.generator.V2.overlap.hashcount.fastq.bam.FINAL.vcf.vcf
@moldach
Copy link
Author

moldach commented Nov 16, 2020

Please let me know if you can help me debug this.
I can share some data with you if needed

@6br
Copy link
Contributor

6br commented Nov 16, 2020

Thank you for your report. I will get back to you soon.

@6br
Copy link
Contributor

6br commented Nov 17, 2020

I improved the error message in the latest commit. Is it possible for you to try it again?
In general case, when the reference id is different between the VCF file and the reference fasta file (for example, one is chr1 and the other is 1), the missing reference error occurs.

@KokoroO7
Copy link

Hi, I treated my sequel2 human CCS data (pacbio) with smrttools CCS, pbmm2, and pbsv.
Generated vcf file was bgzipped and treated with bcftools to generate chr1-only data.
After that chr1.vcf.gz was decompressed to test.chr1.vcf
Then I want to create .pcf file and tried this test.chr1.vcf with the vcf2xg.sh but got errors as described below.
Could you tell how to fix this ?
For your reference,
vg version v1.28.0
ruby 2.7.2

$ bash scripts/vcf2xg.sh test.chr1.vcf test_output /bin/vg hg19                            
Traceback (most recent call last):                                       
        3: from scripts/vcf2gfa/vcf2pcf.rb:9:in `<main>'                
        2: from scripts/vcf2gfa/vcf2pcf.rb:9:in `each'                  
        1: from scripts/vcf2gfa/vcf2pcf.rb:15:in `block in <main>'      
scripts/vcf2gfa/vcf2pcf.rb:15:in `split': invalid byte sequence in UTF-8
 (ArgumentError)
+ input=.//test_output.pcf                                               
+ reference=hg19                                                         
+ pref=.//test_output.pcf                                                
+ sort -t, -k 7n .//test_output.pcf                                      
+ sed -e 's/[ ]*//g'                                                     
+ awk -F '[,:]' '{print $1,"\t",$2,"\n",$4,"\t",$5}' /dev/fd/63          
+ sort -k 1,1 -k 2,2n                                                    
+ uniq                                                                   
++ sed 1d .//test_output.pcf                                             
+ '[' '!' -s hg19.fai ']'                                                
+ '[' '!' -s hg19 ']'                                                    
+ '[' '!' -s hg19.fa ']'                                                 
+ rsync -avPq --ignore-existing rsync://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz .                        
+ gunzip -c hg19.fa.gz                                                   
+ samtools faidx hg19.fa                                                 
+ reference=hg19.fa                                                      
++ dirname scripts/vcf2gfa/pcf2gfa.sh                                    
+ ruby scripts/vcf2gfa/json_to_breakpoint_list.rb .//test_output.pcf.bp.tsv hg19.fa                                                               
+ cat .//test_output.pcf.bp.tsv breakpoint_list_tmp.tsv                  
+ sort -k 1,1 -k 2,2n                                                    
+ uniq                                                                   
++ dirname scripts/vcf2gfa/pcf2gfa.sh                                    
+ ruby scripts/vcf2gfa/gfa_generator.rb .//test_output.pcf.bp.merged.tsv .//test_output.pcf.output.pcf hg19.fa                                                                                                                                                                                      
scripts/vcf2xg.sh: line 58: /bin/vg: No such file or directory           
scripts/vcf2xg.sh: line 59: /bin/vg: No such file or directory           
scripts/vcf2xg.sh: line 64: /bin/vg: No such file or directory

@6br
Copy link
Contributor

6br commented Nov 17, 2020

Thank you for trying MoMI-G.

scripts/vcf2gfa/vcf2pcf.rb:15:in `split': invalid byte sequence in UTF-8
For the first error, the input vcf file might not be encoded in UTF-8.

scripts/vcf2xg.sh: line 58: /bin/vg: No such file or directory
scripts/vcf2xg.sh: line 59: /bin/vg: No such file or directory
scripts/vcf2xg.sh: line 64: /bin/vg: No such file or directory

For the last three errors, the path to the vg binary might not be appropriate.

$ bash scripts/vcf2xg.sh test.chr1.vcf test_output /bin/vg hg19           

The third argument of command needs to be the path to the vg binary you installed on your environment.

I updated the error message for the last three error lines.

@KokoroO7
Copy link

Thank you so much. I re-checked and changed input file and vg file path, and re-ran it.
And now I got new errors. Are these vcf problems in my input ?

$ bash scripts/vcf2xg.sh test2.chr1.vcf test2_output ../vg/bin/vg hg19                                                
+ input=.//test2_output.pcf                                              
+ reference=hg19                                                         
+ pref=.//test2_output.pcf                                               
+ sort -t, -k 7n .//test2_output.pcf                                     
+ sed -e 's/[ ]*//g'                                                     
+ awk -F '[,:]' '{print $1,"\t",$2,"\n",$4,"\t",$5}' /dev/fd/63          
++ sed 1d .//test2_output.pcf                                            
+ sort -k 1,1 -k 2,2n                                                    
+ uniq                                                                   
+ '[' '!' -s hg19.fai ']'                                                
+ '[' '!' -s hg19 ']'                                                    
+ '[' '!' -s hg19.fa ']'                                                 
+ samtools faidx hg19.fa                                                 
+ reference=hg19.fa                                                      
++ dirname scripts/vcf2gfa/pcf2gfa.sh                                    
+ ruby scripts/vcf2gfa/json_to_breakpoint_list.rb .//test2_output.pcf.bp.tsv hg19.fa                                                              
+ cat .//test2_output.pcf.bp.tsv breakpoint_list_tmp.tsv                 
+ sort -k 1,1 -k 2,2n                                                    
+ uniq                                                                   
++ dirname scripts/vcf2gfa/pcf2gfa.sh                                    
+ ruby scripts/vcf2gfa/gfa_generator.rb .//test2_output.pcf.bp.merged.tsv .//test2_output.pcf.output.pcf hg19.fa                                  
Traceback (most recent call last):                                               
5: from scripts/vcf2gfa/gfa_generator.rb:42:in `<main>'                  
4: from scripts/vcf2gfa/gfa_generator.rb:42:in `open'                    
3: from scripts/vcf2gfa/gfa_generator.rb:43:in `block in <main>'         
2: from scripts/vcf2gfa/gfa_generator.rb:43:in `each_line'               
1: from scripts/vcf2gfa/gfa_generator.rb:49:in `block (2 levels) in <main>'                                                               
scripts/vcf2gfa/gfa_generator.rb:25:in `fasta': [ERROR] unexpected sequence :0- (RuntimeError)                                                            
5: from scripts/vcf2gfa/gfa_generator.rb:42:in `<main>'                  
4: from scripts/vcf2gfa/gfa_generator.rb:42:in `open'                    
3: from scripts/vcf2gfa/gfa_generator.rb:43:in `block in <main>'         
2: from scripts/vcf2gfa/gfa_generator.rb:43:in `each_line'               
1: from scripts/vcf2gfa/gfa_generator.rb:48:in `block (2 levels) in <main>'                                                               
scripts/vcf2gfa/gfa_generator.rb:51:in `rescue in block (2 levels) in <main>': [ERROR] in chr1 0 : [ERROR] unexpected sequence :0- (RuntimeError)

@6br
Copy link
Contributor

6br commented Nov 17, 2020

I improved the error message. Is it possible for you to try it with the latest master?
In general case, when the reference id is different between the VCF file and the reference fasta file (for example, one is chr1 and the other is 1), the unexpected reference error occurs.

@KokoroO7
Copy link

Thank you for your instruction.
Then I git pulled and re-tried. I checked the VCF file and hg19.fa and confirmed that chr1 (coordinate) is the format in them. So this is not the case that one is "chr1" and the other is "1".
Anyway, I hope the notified errors I experienced this time will help to solve the problems.

$ bash scripts/vcf2xg.sh test.chr1.vcf test3_output ../vg/bin/vg hg19
+ input=.//test3_output.pcf
+ reference=hg19
+ pref=.//test3_output.pcf
+ sort -t, -k 7n .//test3_output.pcf
+ sed -e 's/[ ]*//g'
+ awk -F '[,:]' '{print $1,"\t",$2,"\n",$4,"\t",$5}' /dev/fd/63
+ sort -k 1,1 -k 2,2n
++ sed 1d .//test3_output.pcf
+ uniq
+ '[' '!' -s hg19.fai ']'
+ '[' '!' -s hg19 ']'
+ '[' '!' -s hg19.fa ']'
+ samtools faidx hg19.fa
+ reference=hg19.fa
++ dirname scripts/vcf2gfa/pcf2gfa.sh
+ ruby scripts/vcf2gfa/json_to_breakpoint_list.rb .//test3_output.pcf.bp.tsv hg19.fa
+ cat .//test3_output.pcf.bp.tsv breakpoint_list_tmp.tsv
+ sort -k 1,1 -k 2,2n
+ uniq
++ dirname scripts/vcf2gfa/pcf2gfa.sh
+ ruby scripts/vcf2gfa/gfa_generator.rb .//test3_output.pcf.bp.merged.tsv .//test3_output.pcf.output.pcf hg19.fa
Traceback (most recent call last):
	5: from scripts/vcf2gfa/gfa_generator.rb:42:in `<main>'
	4: from scripts/vcf2gfa/gfa_generator.rb:42:in `open'
	3: from scripts/vcf2gfa/gfa_generator.rb:43:in `block in <main>'
	2: from scripts/vcf2gfa/gfa_generator.rb:43:in `each_line'
	1: from scripts/vcf2gfa/gfa_generator.rb:51:in `block (2 levels) in <main>'
scripts/vcf2gfa/gfa_generator.rb:25:in `fasta': unexpected sequence :0- (RuntimeError)
	5: from scripts/vcf2gfa/gfa_generator.rb:42:in `<main>'
	4: from scripts/vcf2gfa/gfa_generator.rb:42:in `open'
	3: from scripts/vcf2gfa/gfa_generator.rb:43:in `block in <main>'
	2: from scripts/vcf2gfa/gfa_generator.rb:43:in `each_line'
	1: from scripts/vcf2gfa/gfa_generator.rb:50:in `block (2 levels) in <main>'
scripts/vcf2gfa/gfa_generator.rb:53:in `rescue in block (2 levels) in <main>': [ERROR] in 'chr1 0' : unexpected sequence :0- (RuntimeError)

@6br 6br mentioned this issue Nov 18, 2020
@6br 6br closed this as completed in #46 Nov 18, 2020
@6br
Copy link
Contributor

6br commented Nov 18, 2020

It seems a corner case bug and I fixed in the latest master. Could you try it again?

@6br 6br reopened this Nov 18, 2020
@KokoroO7
Copy link

Thank you for your help. I git pulled again and retried. Then I got another error.
This time the script was executed for a while (about 2 hours, ruby and sometimes samtools on 'top' command) after saying [Failed to fetch sequence in chr1:0--1]. And 2hrs later it then stopped and gave an error message.

$ bash scripts/vcf2xg.sh test.chr1.vcf test4_output ../vg/bin/vg hg19
+ input=.//test4_output.pcf
+ reference=hg19
+ pref=.//test4_output.pcf
+ sort -t, -k 7n .//test4_output.pcf
+ awk -F '[,:]' '{print $1,"\t",$2,"\n",$4,"\t",$5}' /dev/fd/63
+ sed -e 's/[ ]*//g'
+ sort -k 1,1 -k 2,2n
++ sed 1d .//test4_output.pcf
+ uniq
+ '[' '!' -s hg19.fai ']'
+ '[' '!' -s hg19 ']'
+ '[' '!' -s hg19.fa ']'
+ samtools faidx hg19.fa
+ reference=hg19.fa
++ dirname scripts/vcf2gfa/pcf2gfa.sh
+ ruby scripts/vcf2gfa/json_to_breakpoint_list.rb .//test4_output.pcf.bp.tsv hg19.fa
+ cat .//test4_output.pcf.bp.tsv breakpoint_list_tmp.tsv
+ sort -k 1,1 -k 2,2n
+ uniq
++ dirname scripts/vcf2gfa/pcf2gfa.sh
+ ruby scripts/vcf2gfa/gfa_generator.rb .//test4_output.pcf.bp.merged.tsv .//test4_output.pcf.output.pcf hg19.fa
[W::fai_fetch] Reference chr1:0--1 not found in FASTA file, returning empty sequence
Failed to fetch sequence in chr1:0--1
terminate called after throwing an instance of 'std::runtime_error'
	what():  error:[PackedGraph] tried to create an empty node with ID 1
ERROR: Signal 6 occurred. VG has crashed. Run 'vg bugs --new' to report a bug.
Stack trace path: /tmp/vg_crash_DrLccx/stacktrace.txt
Please include the stack trace file in your bug report!
error[VPKG::load_one]: Correct input type not found while loading handlegraph::MutablePathDeletableHandleGraph
error[VPKG::load_one]: Correct input type not found while loading handlegraph::HandleGraph

@6br
Copy link
Contributor

6br commented Nov 19, 2020

I updated the script not to raise the error above [Failed to fetch sequence in chr1:0--1].

@moldach
Copy link
Author

moldach commented Nov 20, 2020

Thanks @KokoroO7 for reporting this error also.
My turn.

I've git pulled again but I get an error after ~ 4 hours:

(base) mtg@mtg-ThinkPad-P53:~/MAVIS$ sudo bash ~/bin/MoMI-G/scripts/vcf2xg.sh samclip_RUFUS.vcf momig-vcf-test-output ~/bin/vg Homo_sapiens.GRCh37.dna_rm.toplevel.fa 
[sudo] password for mtg: 
+ input=.//momig-vcf-test-output.pcf
+ reference=Homo_sapiens.GRCh37.dna_rm.toplevel.fa
+ pref=.//momig-vcf-test-output.pcf
+ sort -t, -k 7n .//momig-vcf-test-output.pcf
+ sed -e 's/[ ]*//g'
+ awk -F '[,:]' '{print $1,"\t",$2,"\n",$4,"\t",$5}' /dev/fd/63
+ sort -k 1,1 -k 2,2n
++ sed 1d .//momig-vcf-test-output.pcf
+ uniq
+ '[' '!' -s Homo_sapiens.GRCh37.dna_rm.toplevel.fa.fai ']'
+ '[' '!' -s Homo_sapiens.GRCh37.dna_rm.toplevel.fa ']'
+ samtools faidx Homo_sapiens.GRCh37.dna_rm.toplevel.fa
++ dirname /home/mtg/bin/MoMI-G/scripts/vcf2gfa/pcf2gfa.sh
+ ruby /home/mtg/bin/MoMI-G/scripts/vcf2gfa/json_to_breakpoint_list.rb .//momig-vcf-test-output.pcf.bp.tsv Homo_sapiens.GRCh37.dna_rm.toplevel.fa
+ cat .//momig-vcf-test-output.pcf.bp.tsv breakpoint_list_tmp.tsv
+ sort -k 1,1 -k 2,2n
+ uniq
++ dirname /home/mtg/bin/MoMI-G/scripts/vcf2gfa/pcf2gfa.sh
+ ruby /home/mtg/bin/MoMI-G/scripts/vcf2gfa/gfa_generator.rb .//momig-vcf-test-output.pcf.bp.merged.tsv .//momig-vcf-test-output.pcf.output.pcf Homo_sapiens.GRCh37.dna_rm.toplevel.fa
/home/mtg/bin/MoMI-G/scripts/vcf2xg.sh: line 65: 699083 Killed                  $vg_path convert -g $ggf_file -p > $vg_file.old
error[VPKG::load_one]: Correct input type not found while loading handlegraph::MutablePathDeletableHandleGraph
error[VPKG::load_one]: Correct input type not found while loading handlegraph::HandleGraph

@6br
Copy link
Contributor

6br commented Nov 20, 2020

The error message seems to come from vg. Could you try it again with the following command?

vg convert -g -p [ggf file] > output.vg

@moldach
Copy link
Author

moldach commented Nov 20, 2020

Sure.

$ vg convert -g -p momig-vcf-test-output.ggf > output.vg
vg: /home/erik/vg-release/vg/include/gfakluge.hpp:569: size_t gfak::mmap_open(const string&, char*&, int&): Assertion `false' failed.
ERROR: Signal 6 occurred. VG has crashed. Run 'vg bugs --new' to report a bug.
Stack trace path: /tmp/vg_crash_l1rNMW/stacktrace.txt
Please include the stack trace file in your bug report!

Stack trace

Crash report for vg v1.28.0 "Acquafredda"
Stack trace (most recent call last):
#12   Object "/home/mtg/bin/vg", at 0x661d0d, in _start
#11   Object "/home/mtg/bin/vg", at 0x1f49d1f, in __libc_start_main
#10   Object "/home/mtg/bin/vg", at 0x57f838, in main
#9    Object "/home/mtg/bin/vg", at 0xc3523b, in vg::subcommand::Subcommand::operator()(int, char**) const
#8    Object "/home/mtg/bin/vg", at 0xcad624, in main_convert(int, char**)
#7    Object "/home/mtg/bin/vg", at 0x118f2e5, in vg::algorithms::gfa_to_path_handle_graph(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, handlegraph::MutablePathMutableHandleGraph*, bool, bool)
#6    Object "/home/mtg/bin/vg", at 0x118e720, in vg::algorithms::gfa_to_handle_graph_on_disk(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, handlegraph::MutableHandleGraph*, bool, gfak::GFAKluge&)
#5    Object "/home/mtg/bin/vg", at 0x119370b, in gfak::GFAKluge::for_each_sequence_line_in_file(char const*, std::function<void (gfak::sequence_elem const&)>)
#4    Object "/home/mtg/bin/vg", at 0x118fbc5, in gfak::mmap_open(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, char*&, int&)
#3    Object "/home/mtg/bin/vg", at 0x1f59ff5, in __assert_fail
#2    Object "/home/mtg/bin/vg", at 0x57ec63, in __assert_fail_base.cold
#1    Object "/home/mtg/bin/vg", at 0x57ed93, in abort
#0    Object "/home/mtg/bin/vg", at 0x12e27bb, in raise

@6br
Copy link
Contributor

6br commented Nov 22, 2020

I'd like to know if

  • The input file exists and its file size is greater than zero.
  • The environment has enough memory quota to mmap the input file.

If both answers are yes, then it seems the error related to vg.

@moldach
Copy link
Author

moldach commented Nov 22, 2020

I'd like to know if

* The input file exists and its file size is greater than zero.

Yes, the input file momig-vcf-test-output.ggf (created by MoMI-G) exists and is > 0

 38G -rw-r--r-- 1 root root  38G Nov 19 23:13 momig-vcf-test-output.ggf
 12K -rw-r--r-- 1 root root  114 Nov 19 23:57 momig-vcf-test-output.json
 24K -rw-r--r-- 1 root root  14K Nov 19 20:28 momig-vcf-test-output.pcf
 14M -rw-r--r-- 1 root root  14M Nov 19 20:29 momig-vcf-test-output.pcf.bp.merged.tsv
 12K -rw-r--r-- 1 root root 2.6K Nov 19 20:28 momig-vcf-test-output.pcf.bp.tsv
 24K -rw-r--r-- 1 root root  14K Nov 19 20:28 momig-vcf-test-output.pcf.output.pcf
8.0K -rw-r--r-- 1 root root    0 Nov 19 23:57 momig-vcf-test-output.vg
 12K -rw-r--r-- 1 root root 3.3K Nov 19 20:01 momig-vcf-test-output.xg
* The environment has enough memory quota to mmap the input file.

Not sure what the memory requirements would be for this?
Here is what my laptop has.

$ grep MemTotal /proc/meminfo | awk '{print $2 / 1024}'
31764.5

If both answers are yes, then it seems the error related to vg.

I'm going to try and run this a cluster with lots of RAM to see what happens (prior to opening a vg issue?)

@moldach
Copy link
Author

moldach commented Nov 23, 2020

With 50 Gb of memory I received the following error:

$ vg convert -g -p momig-vcf-test-output.ggf > output.vg
error [vg convert]: Input GFA is not acceptable.
error:[gfa_to_handle_graph] Found edge record with missing source name

@6br
Copy link
Contributor

6br commented Nov 24, 2020

I updated the script to raise an error if there is an edge record with missing source/target name.
It is strange that the size of the output ggf format is 38GB because I assume the original genome size is 3GB.

@moldach
Copy link
Author

moldach commented Nov 26, 2020

I agree that the size is odd....

Also, the vg developer responded that the command is not working for the following:

The vg convert -g command expects a GFA file, not a GFF file. Is that where the error is occurring? If in fact you are giving it a GFA file, then that error suggests that there is an L line that is malformed (specifically that it's missing the ID of an S line).

What do you make of this?

@6br
Copy link
Contributor

6br commented Nov 27, 2020

I updated the script to generate gfa file instead of ggf file. Also, I added an error message when the line is malformed on generating gfa format.

@moldach
Copy link
Author

moldach commented Nov 30, 2020

This is the new error message I get when running the updated script:

$ sudo bash ~/bin/MoMI-G/scripts/vcf2xg.sh \
    samclip_RUFUS.vcf \
    momig-vcf-test-output \
    ~/bin/vg \
    Homo_sapiens.GRCh37.dna_rm.toplevel.fa
 
 
+ input=.//momig-vcf-test-output.pcf
+ reference=Homo_sapiens.GRCh37.dna_rm.toplevel.fa
+ pref=.//momig-vcf-test-output.pcf
+ sort -t, -k 7n .//momig-vcf-test-output.pcf
+ sed -e 's/[ ]*//g'
+ awk -F '[,:]' '{print $1,"\t",$2,"\n",$4,"\t",$5}' /dev/fd/63
+ sort -k 1,1 -k 2,2n
+ uniq
++ sed 1d .//momig-vcf-test-output.pcf
+ '[' '!' -s Homo_sapiens.GRCh37.dna_rm.toplevel.fa.fai ']'
++ dirname /home/mtg/bin/MoMI-G/scripts/vcf2gfa/pcf2gfa.sh
+ ruby /home/mtg/bin/MoMI-G/scripts/vcf2gfa/json_to_breakpoint_list.rb .//momig-vcf-test-output.pcf.bp.tsv Homo_sapiens.GRCh37.dna_rm.toplevel.fa
+ cat .//momig-vcf-test-output.pcf.bp.tsv breakpoint_list_tmp.tsv
+ sort -k 1,1 -k 2,2n
+ uniq
++ dirname /home/mtg/bin/MoMI-G/scripts/vcf2gfa/pcf2gfa.sh
+ ruby /home/mtg/bin/MoMI-G/scripts/vcf2gfa/gfa_generator.rb .//momig-vcf-test-output.pcf.bp.merged.tsv .//momig-vcf-test-output.pcf.output.pcf Homo_sapiens.GRCh37.dna_rm.toplevel.fa
Traceback (most recent call last):
	4: from /home/mtg/bin/MoMI-G/scripts/vcf2gfa/gfa_generator.rb:43:in `<main>'
	3: from /home/mtg/bin/MoMI-G/scripts/vcf2gfa/gfa_generator.rb:43:in `open'
	2: from /home/mtg/bin/MoMI-G/scripts/vcf2gfa/gfa_generator.rb:44:in `block in <main>'
	1: from /home/mtg/bin/MoMI-G/scripts/vcf2gfa/gfa_generator.rb:44:in `each_line'
/home/mtg/bin/MoMI-G/scripts/vcf2gfa/gfa_generator.rb:62:in `block (2 levels) in <main>': ERROR: Link is not found on '["10", "C"]': , 5006 (RuntimeError)

This time the output size of the .gfa is more modest:

$ ls -lsh
368M -rw-r--r--  1 root root 368M Nov 30 16:51 momig-vcf-test-output.gfa
368M -rw-r--r--  1 root root 368M Nov 30 16:51 momig-vcf-test-output.ggf
 12K -rw-r--r--  1 root root  114 Nov 30 16:52 momig-vcf-test-output.json
 24K -rw-r--r--  1 root root  14K Nov 30 16:50 momig-vcf-test-output.pcf
 14M -rw-r--r--  1 root root  14M Nov 30 16:50 momig-vcf-test-output.pcf.bp.merged.tsv
 12K -rw-r--r--  1 root root 2.6K Nov 30 16:50 momig-vcf-test-output.pcf.bp.tsv
 24K -rw-r--r--  1 root root  14K Nov 30 16:50 momig-vcf-test-output.pcf.output.pcf
177M -rw-r--r--  1 root root 177M Nov 30 16:52 momig-vcf-test-output.vg
213M -rw-r--r--  1 root root 213M Nov 30 16:52 momig-vcf-test-output.xg

Trying the command on vg produces the following error:

$ vg convert -g -p momig-vcf-test-output.gfa > output.vg
vg: /home/erik/vg-release/vg/include/gfakluge.hpp:569: size_t gfak::mmap_open(const string&, char*&, int&): Assertion `false' failed.
ERROR: Signal 6 occurred. VG has crashed. Run 'vg bugs --new' to report a bug.
Stack trace path: /tmp/vg_crash_uw0zv2/stacktrace.txt
Please include the stack trace file in your bug report!
  1 Crash report for vg v1.28.0 "Acquafredda"
      2 Stack trace (most recent call last):
      3 #12   Object "/home/mtg/bin/vg", at 0x661d0d, in _start
      4 #11   Object "/home/mtg/bin/vg", at 0x1f49d1f, in __libc_start_main
      5 #10   Object "/home/mtg/bin/vg", at 0x57f838, in main
      6 #9    Object "/home/mtg/bin/vg", at 0xc3523b, in vg::subcommand::Subcommand::operator()(int, char**) const
      7 #8    Object "/home/mtg/bin/vg", at 0xcad624, in main_convert(int, char**)
      8 #7    Object "/home/mtg/bin/vg", at 0x118f2e5, in vg::algorithms::gfa_to_path_handle_graph(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, handlegraph::MutableHandleGraph*, bool, bool)
      9 #6    Object "/home/mtg/bin/vg", at 0x118e720, in vg::algorithms::gfa_to_handle_graph_on_disk(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, handlegraph:MutableHandleGraph*, bool, gfak::GFAKluge&)
     10 #5    Object "/home/mtg/bin/vg", at 0x119370b, in gfak::GFAKluge::for_each_sequence_line_in_file(char const*, std::function<void (gfak::sequence_elem const&)>)
     11 #4    Object "/home/mtg/bin/vg", at 0x118fbc5, in gfak::mmap_open(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, char*&, int&)
     12 #3    Object "/home/mtg/bin/vg", at 0x1f59ff5, in __assert_fail
     13 #2    Object "/home/mtg/bin/vg", at 0x57ec63, in __assert_fail_base.cold
     14 #1    Object "/home/mtg/bin/vg", at 0x57ed93, in abort
     15 #0    Object "/home/mtg/bin/vg", at 0x12e27bb, in raise

@6br
Copy link
Contributor

6br commented Dec 1, 2020

I improved the error message again. However, I suspect the input vcf is not supported in MoMI-G tools. If the current master still raises the same error message, I recommend confirming if the only SV is contained in the input vcf file and then using SURVIVOR for the input vcf file.
Also, it is possible to check if the GFA file is malformed.
https://github.com/sjackman/gfalint

@moldach
Copy link
Author

moldach commented Dec 8, 2020

The current master gives errors for two VCF files I have: Manta and RUFUS

RUFUS

$ ./MoMI-G/scripts/vcf2xg.sh samclip_RUFUS.vcf rufus-output Homo_sapiens.GRCh37.dna.toplevel.fa
./MoMI-G/scripts/vcf2gfa/vcf2pcf.rb:105:in `block in <main>': Unsupported format: 1	1111111	X-DeNovo	C	A	32	PASS	PH=none;FEX=PASS;FS=8/25;RN=NODE_proband_trim_bwaMEM_sort_dedupped.bam.generator.V2_2230_L280_D7:2:1::MH0;MQ=60;cigar=280M;SB=0.666667;AS=2-1;CVT=X;HD=-1_-1_-1_-1_-1_-1_-1_-1_-1_8_8_9_9_9_9_8_8_-1_-1_-1_-1_-1_-1_-1_-1_;AO=8;VT=X	GT:DP:RO:AO	0/1:19:11:8	0/0:27:27:0	0/0:49:49:0 (RuntimeError)
, 1,1111111,+,1,A,-,0,,GT:DP:RO:AO,0/1:19:11:8,_1_1111111..1_A
	from ./MoMI-G/scripts/vcf2gfa/vcf2pcf.rb:9:in `each'
	from ./MoMI-G/scripts/vcf2gfa/vcf2pcf.rb:9:in `<main>'

Manta

$ ./MoMI-G/scripts/vcf2xg.sh diploidSV.vcf manta-output vg Homo_sapiens.GRCh37.dna.toplevel.fa
./MoMI-G/scripts/vcf2gfa/vcf2pcf.rb:105:in `block in <main>': Unsupported format: HG1287_PAAAH	140000000	ManAaDEL:28:0:0:0:4:0	AAAAGAGAAAAAAAAAGGAAGAAGAAAAAG	A	191	MaxDepAh;MaxMQ0FraA	END=140000028;SVAYPE=DEL;SVLEN=-28;AIGAR=1M147D;AIPOS=0,6;HOMLEN=6;HOMSEQ=AAAGAG	GA:FA:GQ:PL:PR:SR	0/1:PASS:93:241,0,90:19,0:140,14 (RunAimeError)
, HG1287_PAAAH,140000000,+,HG1287_PAAAH,A,-,28,,GA:FA:GQ:PL:PR:SR,0/1:PASS:93:241,0,90:19,0:140,14,_HG1287_PAAAH_140000000..HG1287_PAAAH_A
	from ./MoMI-G/scripts/vcf2gfa/vcf2pcf.rb:9:in `each'
	from ./MoMI-G/scripts/vcf2gfa/vcf2pcf.rb:9:in `<main>'

Looks like it's a problem with the VCF files.

Whats the proper way to re-format with SURVIVOR?
I tried the following:

$SURVIVOR merge sample_files 1000 2 1 1 0 30 sample_merged.vcf

Where sample_files was one file, samclip_RUFUS.vcf.

But I see merging entries: 0 and get an empty output file.

From the number of SV calling tools we've tried so far (for short-read callers) Manta, GRIDSS, etc. we've seen the best results from RUFUS.

Could you please add support for RUFUS?

@6br
Copy link
Contributor

6br commented Dec 9, 2020

Actually, these variants of VCF are not supported now. For example, the output of RUFUS seems to include SNP records, though MoMI-G tools currently do not support them because MoMI-G is not suitable for visualizing SNPs. I plan to examine it moving forward if time allows.

@moldach
Copy link
Author

moldach commented Dec 9, 2020

Hi @6br okay good to know that SNPs are not supported.
What about indels?

Could you also please provide a list of SV calling tools which you've tried on MoMI-G here/in the README?

@moldach
Copy link
Author

moldach commented Dec 9, 2020

I've ran Manta on public data which only reports indels and SVs, then subset the vcf for only a few events and this is the error I get from the first record:

[moldach@gra-login3 POLARIS]$ ~/bin/MoMI-G/scripts/vcf2xg.sh test.vcf manta-output ~/bin/vg Homo_sapiens.GRCh37.dna.toplevel.fa 
/home/moldach/bin/MoMI-G/scripts/vcf2gfa/vcf2pcf.rb:105:in `block in <main>': Unsupported format: HG1287_PATCH	144227453	MantaINS:0:29:29:18:2:2	GT	GACGACTCCATTCGATTCCTTTTGATGATGATTCCAATGTATTCCATTCGATTTCTCTATTCAATTCCATTCATTGATGATTCCATTCGGATCCATTATATAATGATTCTGTTAGGTTCCATTCGATGA	312	MaxMQ0Frac	END=144227454;SVTYPE=INS;SVLEN=128;CIGAR=1M128I1D	GT:FT:GQ:PL:PR:SR	0/1:PASS:46:362,0,43:0,0:31,25 (RuntimeError)
, HG1287_PATCH,144227453,+,HG1287_PATCH,GACGACTCCATTCGATTCCTTTTGATGATGATTCCAATGTATTCCATTCGATTTCTCTATTCAATTCCATTCATTGATGATTCCATTCGGATCCATTATATAATGATTCTGTTAGGTTCCATTCGATGA,-,128,ACGACTCCATTCGATTCCTTTTGATGATGATTCCAATGTATTCCATTCGATTTCTCTATTCAATTCCATTCATTGATGATTCCATTCGGATCCATTATATAATGATTCTGTTAGGTTCCATTCGATG,GT:FT:GQ:PL:PR:SR,0/1:PASS:46:362,0,43:0,0:31,25,acgactccattcgattccttttgatgatgattccaatgtattccattcgatttctctattcaattccattcattgatgattccattcggatccattatataatgattctgttaggttccattcgatg_HG1287_PATCH_144227453..HG1287_PATCH_GACGACTCCATTCGATTCCTTTTGATGATGATTCCAATGTATTCCATTCGATTTCTCTATTCAATTCCATTCATTGATGATTCCATTCGGATCCATTATATAATGATTCTGTTAGGTTCCATTCGATGA
	from /home/moldach/bin/MoMI-G/scripts/vcf2gfa/vcf2pcf.rb:9:in `each'
	from /home/moldach/bin/MoMI-G/scripts/vcf2gfa/vcf2pcf.rb:9:in `<main>'

Removing the first record:

[moldach@gra-login3 POLARIS]$ ~/bin/MoMI-G/scripts/vcf2xg.sh test2.vcf manta-output ~/bin/vg Homo_sapiens.GRCh37.dna.toplevel.fa 
/home/moldach/bin/MoMI-G/scripts/vcf2gfa/vcf2pcf.rb:105:in `block in <main>': Unsupported format: HG1287_PATCH	144763328	MantaBND:0:860:864:3:0:0:0	C	[HG1287_PATCH:145569278[C	63PASS	SVTYPE=BND;MATEID=MantaBND:0:860:864:3:0:0:1;IMPRECISE;CIPOS=-383,383;BND_DEPTH=71;MATE_BND_DEPTH=30	GT:FT:GQ:PL:PR	0/1:PASS:63:113,0,221:16,11 (RuntimeError)
, HG1287_PATCH,144763328,+,HG1287_PATCH,[HG1287_PATCH:145569278[C,-,0,HG1287_PATCH:145569278[,GT:FT:GQ:PL:PR,0/1:PASS:63:113,0,221:16,11,hg1287_patch:145569278[_HG1287_PATCH_144763328..HG1287_PATCH_[HG1287_PATCH:145569278[C
	from /home/moldach/bin/MoMI-G/scripts/vcf2gfa/vcf2pcf.rb:9:in `each'
	from /home/moldach/bin/MoMI-G/scripts/vcf2gfa/vcf2pcf.rb:9:in `<main>'

Removing that record:

[moldach@gra-login3 POLARIS]$ ~/bin/MoMI-G/scripts/vcf2xg.sh test2.vcf manta-output ~/bin/vg Homo_sapiens.GRCh37.dna.toplevel.fa 
/home/moldach/bin/MoMI-G/scripts/vcf2gfa/vcf2pcf.rb:105:in `block in <main>': Unsupported format: 1	826137	MantaDEL:0:349:5534:1:2:0	G	<DEL>	999	PASS	END=5727343;SVTYPE=DEL;SVLEN=-4901206;SVINSLEN=3;SVINSSEQ=AGG	GT:FT:GQ:PL:PR:SR	0/1:PASS:999:999,0,999:97,5:57,72 (RuntimeError)
, 1,826137,+,1,<DEL>,-,4901206,DEL,GT:FT:GQ:PL:PR:SR,0/1:PASS:999:999,0,999:97,5:57,72,del_1_826137..1_<DEL>
	from /home/moldach/bin/MoMI-G/scripts/vcf2gfa/vcf2pcf.rb:9:in `each'
	from /home/moldach/bin/MoMI-G/scripts/vcf2gfa/vcf2pcf.rb:9:in `<main>'

and so on.

@6br
Copy link
Contributor

6br commented Dec 10, 2020

The list of software is here https://github.com/MoMI-G/MoMI-G#adapt-your-own-dataset.
MoMI-G currently does not support SV callers for Illumina.

@moldach
Copy link
Author

moldach commented Dec 10, 2020

The list of software is here https://github.com/MoMI-G/MoMI-G#adapt-your-own-dataset.
MoMI-G currently does not support SV callers for Illumina.

As I don't have any calling data from non-Illumina data it would be useful if you could provide some test data. Certainly it should be possible to transform a callset generated by SV callers for Illumina into a format the MoMI-G will accept?

I'm wondering if any SV calling data can be coerced into a format that MoMI-G accepts by using SURVIVOR, or if it only will work if 10X LongRanger and Sniffles callsets are input for SURVIVOR?

For example, I subset Manta calls for two DEL events and placed it in file_list.txt and ran SURVIVOR like:

$ SURVIVOR merge file_list.txt 0.1 1 1 1 1 50 survivor_merged.vcf

Next, I ran MoMI-G but get an odd error:

ERROR: Reference id 'chr1' is not found in Homo_sapiens.GRCh37.dna.toplevel.fa

It's odd because this was the same reference genome which I aligned the data to...

$ ~/bin/MoMI-G/scripts/vcf2xg.sh \
    survivor_merged.vcf \
    survivor-output \
    ~/bin/vg \
    Homo_sapiens.GRCh37.dna.toplevel.fa 
+ '[' -z '' ']'
+ case "$-" in
+ __lmod_vx=x
+ '[' -n x ']'
+ set +x
Shell debugging temporarily silenced: export LMOD_SH_DBG_ON=1 for this output (/cvmfs/soft.computecanada.ca/custom/software/lmod/lmod/init/bash)
Shell debugging restarted
+ unset __lmod_vx
+ input=.//survivor-output.pcf
+ reference=Homo_sapiens.GRCh37.dna.toplevel.fa
+ pref=.//survivor-output.pcf
+ sort -t, -k 7n .//survivor-output.pcf
+ sed -e 's/[ ]*//g'
+ awk -F '[,:]' '{print $1,"\t",$2,"\n",$4,"\t",$5}' /dev/fd/63
+ sort -k 1,1 -k 2,2n
++ sed 1d .//survivor-output.pcf
+ uniq
+ '[' '!' -s Homo_sapiens.GRCh37.dna.toplevel.fa.fai ']'
+ '[' '!' -s Homo_sapiens.GRCh37.dna.toplevel.fa ']'
+ samtools faidx Homo_sapiens.GRCh37.dna.toplevel.fa
++ dirname /home/moldach/bin/MoMI-G/scripts/vcf2gfa/pcf2gfa.sh
+ ruby /home/moldach/bin/MoMI-G/scripts/vcf2gfa/json_to_breakpoint_list.rb .//survivor-output.pcf.bp.tsv Homo_sapiens.GRCh37.dna.toplevel.fa
+ cat .//survivor-output.pcf.bp.tsv breakpoint_list_tmp.tsv
+ sort -k 1,1 -k 2,2n
+ uniq
++ dirname /home/moldach/bin/MoMI-G/scripts/vcf2gfa/pcf2gfa.sh
+ ruby /home/moldach/bin/MoMI-G/scripts/vcf2gfa/gfa_generator.rb .//survivor-output.pcf.bp.merged.tsv .//survivor-output.pcf.output.pcf Homo_sapiens.GRCh37.dna.toplevel.fa
/home/moldach/bin/MoMI-G/scripts/vcf2gfa/gfa_generator.rb:50:in `block (2 levels) in <main>': ERROR: Reference id 'chr1' is not found in Homo_sapiens.GRCh37.dna.toplevel.fa (RuntimeError)
	from /home/moldach/bin/MoMI-G/scripts/vcf2gfa/gfa_generator.rb:44:in `each_line'
	from /home/moldach/bin/MoMI-G/scripts/vcf2gfa/gfa_generator.rb:44:in `block in <main>'
	from /home/moldach/bin/MoMI-G/scripts/vcf2gfa/gfa_generator.rb:43:in `open'
	from /home/moldach/bin/MoMI-G/scripts/vcf2gfa/gfa_generator.rb:43:in `<main>'

@moldach
Copy link
Author

moldach commented Dec 10, 2020

Looking at the reference index (.fai) we see (excerpt):

HG497_PATCH     51306896        28597305445     60      61
HG531_PATCH     115179803       28649467538     60      61
HG1471_PATCH    249269426       28766567089     60      61
HG1472_PATCH    249251918       29019991090     60      61
HG1500_PATCH    141220264       29273397291     60      61
1       249250621       29416971283     60      61
10      135534747       29670376140     60      61
11      135006516       29808169858     60      61
12      133851895       29945426541     60      61
13      115169878       30081509359     60      61
14      107349540       30198598793     60      61
15      102531392       30307737550     60      61
16      90354753        30411977856     60      61
17      81195210        30503838579     60      61
18      78077248        30586387100     60      61
19      59128983        30665765693     60      61
2       243199373       30725880216     60      61
20      63025520        30973132969     60      61
21      48129895        31037208972     60      61
22      51304566        31086141089     60      61
3       198022430       31138300788     60      61
4       191154276       31339623648     60      61
5       180915260       31533963885     60      61
6       171115067       31717894456     60      61
7       159138663       31891861497     60      61
8       146364022       32053652528     60      61
9       141213431       32202456007     60      61
MT      16569   32346023050     60      61
X       155270560       32346039952     60      61
Y       59373566        32503898416     60      61

So, indeed, the reference genome does not contain chr1 - as the error says - however, why it's looking for chr1 and not 1 escapes me.

Since the records from survivor_merged.vcf also have 1 and not chr1:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  proband
1       826137  MantaDEL:0:349:5534:1:2:0       G       <DEL>   999     PASS    SUPP=1;SUPP_VEC=1;SVLEN=-4901206;SVTYPE=DEL;SVMETHOD=SURVIVOR1.0.7;CHR2=1;END=5727343;CIPOS=0,0;CIEND=0,0;STRANDS=+-    GT:PSV:LN:DR:ST:QV:TY:ID:RAL:AAL:CO     0/1:NA:4901206:0,0:+-:999:DEL:MantaDEL_0_349_5534_1_2_0:NA:NA:1_826137-1_5727343

Not sure why MoMI-G is looking for chr1 then?

Could you please fix this or let me know which reference genome(s) are tested on MoMI-G? Thanks

@6br
Copy link
Contributor

6br commented Dec 12, 2020

Historically reason, there are two variants of human reference genome, i.e. GRCh and hg. hg is chr-prefixed reference genome, and the earlier version of GRCh is not chr-prefixed genome.
https://www.biostars.org/p/119295/
Because the latest reference genome GRCh37.p13/hg19 or GRCh/hg38 is chr-prefixed, so MoMI-G tools currently supports with chr-prefixed vcf files. I will update the readme file.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants