Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

segmentation fault at gvcf2bed #154

Open
meredith705 opened this issue Mar 28, 2022 · 7 comments
Open

segmentation fault at gvcf2bed #154

meredith705 opened this issue Mar 28, 2022 · 7 comments

Comments

@meredith705
Copy link

Hello,

I'm running the hap.py v0.3.12 docker using the Genome In A Bottle HG002 vs GRCh38 truth set using the files below and getting a segmentation fault error in the gvcf2bed program.

The GIAB bottle truth set is here https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/AshkenazimTrio/HG002_NA24385_son/NISTv4.2.1/GRCh38/

The vcf I'm using as truth is:
HG002_GRCh38_1_22_v4.2.1_benchmark_phased_MHCassembly_StrandSeqANDTrio.vcf from their supplementary files here

High confidence bed file:
HG002_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed file from the same directory as above

The GRCh38 reference is:
https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/references/GRCh38/GCA_000001405.15_GRCh38_no_alt_analysis_set_maskedGRC_exclusions_v2.fasta.gz

The query vcf I'm using is created by running dipcall on two haploid assemblies of chr11 ( a small test assembly ) using the same GRCh38 assembly as a reference: paternal.dip.vcf.gz

Here is my command:

sudo docker run -it -u `id -u $USER`:`id -g $USER` -v `pwd`:/data jmcdani20/hap.py:v0.3.12 /opt/hap.py/bin/hap.py \
/data/hg38/GIAB_HG002_GRCh38_1_22_v4.2.1_phased.vcf \
/data/wdl_scripts/paternal.dip.vcf.gz \
-f /data/hg38/HG002_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed \
-r /data/hg38/GCA_000001405.15_GRCh38_no_alt_analysis_set_maskedGRC_exclusions_v2.fasta.gz \
-o /data/happy/chr11 \
--pass-only --no-roc --no-json --engine=vcfeval \
--threads=10

I get a segmentation fault error after about 2 minutes of run time.

2022-03-28 18:17:44,433 WARNING  No reference file found at default locations. You can set the environment variable 'HGREF' or 'HG19' to point to a suitable Fasta file.
Hap.py 
[W] overlapping records at chr6:29747433 for sample 0
[W] Variants that overlap on the reference allele: 5
[I] Total VCF records:         1759115
[I] Non-reference VCF records: 1759115
Segmentation fault
2022-03-28 18:19:58,760 ERROR    Command 'gvcf2bed /tmp/truth.ppMZDdGL.vcf.gz -r /data/hg38/GCA_000001405.15_GRCh38_no_alt_analysis_set_maskedGRC_exclusions_v2.fasta.gz -o /tmp/tmpdH4L04.bed -T /data/hg38/HG002_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed' returned non-zero exit status 139
2022-03-28 18:19:58,763 ERROR    Traceback (most recent call last):
2022-03-28 18:19:58,763 ERROR      File "/opt/hap.py/bin/hap.py", line 529, in <module>
2022-03-28 18:19:58,764 ERROR        main()
2022-03-28 18:19:58,764 ERROR      File "/opt/hap.py/bin/hap.py", line 314, in main
2022-03-28 18:19:58,765 ERROR        conf_temp = Haplo.gvcf2bed.gvcf2bed(args.vcf1, args.ref, args.fp_bedfile, args.scratch_prefix)
2022-03-28 18:19:58,765 ERROR      File "/opt/hap.py/lib/python27/Haplo/gvcf2bed.py", line 39, in gvcf2bed
2022-03-28 18:19:58,765 ERROR        subprocess.check_call(cmdline, shell=True)
2022-03-28 18:19:58,765 ERROR      File "/usr/lib/python2.7/subprocess.py", line 190, in check_call
2022-03-28 18:19:58,776 ERROR        raise CalledProcessError(retcode, cmd)
2022-03-28 18:19:58,776 ERROR    CalledProcessError: Command 'gvcf2bed /tmp/truth.ppMZDdGL.vcf.gz -r /data/hg38/GCA_000001405.15_GRCh38_no_alt_analysis_set_maskedGRC_exclusions_v2.fasta.gz -o /tmp/tmpdH4L04.bed -T /data/hg38/HG002_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed' returned non-zero exit status 139

I really appreciate your help figuring out this issue, please let me know what other information I can provide.

Thank you in advance,
Melissa

@KatharineME
Copy link

I am also getting this error. But what is it about the fasta file thats causing this error? Any more information that can help debug this?

Thanks.

@KatharineME
Copy link

Did you figure this out @meredith705? Looks like we are doing something very similar.

@meredith705
Copy link
Author

@KatharineME @YuanfengZhang I have tried using a different version of GRCh38 and have still gotten the same error.

The other version is https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/references/GRCh38/GCA_000001405.15_GRCh38_no_alt_analysis_set.fasta.gz

@KatharineME
Copy link

KatharineME commented May 26, 2022

Hi All, I got hap.py working without changing the reference genome. I did the following. I'm assuming that you are using rtg as the vcf comparison engine:

  1. Create genome sdf with rtg
rtg format -o path_to_sdf path_to_decompressed_reference_genome`
  1. Get the pkrusche/hap.py docker container with latest tag

  2. Run the hap.py docker container like so

docker run --interactive --detach --tty --user root -v local_dir_with_all_needed_files:dir_in_container pkrusche/hap.py bash
  1. Calling hap.py like so. Notice that the reference genome passed is decompressed. Remember that all paths in the command are paths in the container.
docker exec --interactive docker_container_id bash -c "/opt/hap.py/bin/hap.py path_to_truth_vcf path_to_query_vcf -f path_to_confident_regions_bed -r path_to_decompressed_reference_genome -o output_dir --engine-vcfeval-path path_to_rtg_folder --engine-vcfeval-template path_to_genome_sdf" 

If this doesn't work for you, I recommend trying to run the examples from the readme in the docker container.

@barslmn
Copy link

barslmn commented Dec 30, 2022

I had the same issue. Decompressing the reference fastq.gz and reindexing it with samtools faidx solved it for me. My other steps are here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants
@KatharineME @meredith705 @barslmn and others