Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PreProcessing creates un-ordered VCF #163

Open
MattWellie opened this issue Aug 15, 2022 · 1 comment
Open

PreProcessing creates un-ordered VCF #163

MattWellie opened this issue Aug 15, 2022 · 1 comment

Comments

@MattWellie
Copy link

Error message:

Exception: Command line bcftools index -t /tmp/tmpIg13Dm.vcf.gz got return code 255.STDOUT: STDERR: [E::hts_idx_push] unsorted positions on sequence #5: 38598232 followed by 38598227index: failed to create index for "/tmp/tmpIg13Dm.vcf.gz"
Error running BCFTOOLS; please check your file for compatibility issues issues using vcfcheck

This is caused by running the following command

hap.py {truth_vcf.gz} {input_vcf.bgz} -r {path_to_fasta} -R {path_to_bed} -o {path_to_output} --engine-vcfeval-path={path_to_rtg} --threads 10 --engine-vcfeval-template {path_to_sdf} --engine=vcfeval --preprocess-truth

The pre-processing of the truth VCF (with tons of logging lines as noted in #129), but fails during the processing of the input VCF. The sites in question (chr5:38598232, chr5:38598227) are correctly ordered in the query VCF before processing, and are 2 deletions. It seems like at some point during the pre-processing the sites are becoming disordered, possibly as a result of the ref alleles overlapping. Without including any additional VCF content, the 2 lines as they appear in the sorted, indexed, input VCF:

#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	SAMPLE
chr5	38598227	rs68118711	TTTTTTATTTATTTATTTATTTTTTTTTTTTTTC	T
chr5	38598232	rs373607072;rs71963527	TATTTATTTA	T

If this is a result of decomposition, could this be resolved by adding in a bcftools sort call prior to the index?

This issue appears to be the same as #109, but a fix was proposed for that issue on a branch that no longer exists, and the issue is closed.

@MattWellie MattWellie changed the title ProProcessing creates un-ordered VCF PreProcessing creates un-ordered VCF Aug 15, 2022
@MattWellie
Copy link
Author

I've solved this problem myself by updating the bcftools version within the image

The issue is that the BCFtools 1.4.1 bundled as an external tool here seems to have a dumb BED file filtering process, where each region in the BED file is processed in turn, and variants are appended to the output. Relevant regions from my BED file are:

chr5	38598232	38598244
chr5	38598259	38602497

In this example both variants overlap with the first interval, and the first (sorted by POS) ends in the second interval so it is being added twice: (example salvaged from the temp VCF generated by bcftools.py's preProcessVCF)

chr5	38598227	rs68118711	TTTTTTATTTATTTATTTATTTTTTTTTTTTTTC	T	...
chr5	38598232	rs373607072;rs71963527	TATTTATTTA	T	...
chr5	38598227	rs68118711	TTTTTTATTTATTTATTTATTTTTTTTTTTTTTC	T	...

PLEASE PLEASE PLEASE update the bundled tool binaries in this repository, the SAMtools/BCFtools are 5 years old now

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant