Releases: EBIvariation/vcf-validator

v0.9.6

14 Feb 21:04
  • Fix for the Mac OS-X build

v0.9.5

18 Jul 12:49
  • Missing data is valid even if multiple values are expected
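
As an illustration of this rule, here is a minimal sketch of a cardinality check that accepts a single missing value (`.`) regardless of how many values the header declares. The function name and signature are hypothetical and are not taken from the validator's code.

```python
def has_valid_cardinality(raw_value: str, expected_count: int) -> bool:
    """Check a comma-separated VCF field value against its declared Number.

    A lone "." (missing data) is accepted whatever expected_count is.
    """
    if raw_value == ".":
        return True
    return len(raw_value.split(",")) == expected_count
```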

New flag "--require-evidence" and improved validation of strings and integers

08 Apr 14:11

This release includes 2 changes:

  • Added a new flag --require-evidence to check the presence of genotypes, allele frequencies or allele counts.
  • Fixed a bug where number parsing and validation were not as strict as expected.
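
The kind of check that --require-evidence enables can be sketched as follows. This is only an illustration: it assumes that allele frequencies and counts are given by the AF and AC INFO keys and that genotypes are announced by GT in the FORMAT column. The validator's actual rules may differ, and the function name is hypothetical.

```python
def has_evidence(record_line: str) -> bool:
    """Return True if a VCF data line carries genotypes, AF or AC.

    Assumes a tab-separated line with at least the 8 fixed columns.
    """
    fields = record_line.rstrip("\n").split("\t")
    info = fields[7] if len(fields) > 7 else "."
    # INFO entries look like KEY=VALUE or bare flags, separated by ";"
    info_keys = {entry.split("=", 1)[0] for entry in info.split(";")}
    if "AF" in info_keys or "AC" in info_keys:
        return True
    # Genotypes need a FORMAT column (9th) listing GT and at least one sample
    if len(fields) > 9:
        return "GT" in fields[8].split(":")
    return False
```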

Bgzip and Ubuntu 18 (locale) fixes

06 Feb 14:22

This is a patch release that includes just 2 important fixes:

  • There was an error about locales when running in Ubuntu 18: #184
  • Bgzipped VCFs had a small chance of being read incompletely.

We recommend that everyone use this version instead of the previous ones.

Experimental additions to Assembly Checker

16 Aug 12:22
9d958b7

Note that everything except these new features is as stable as in the previous release, v0.9.1. Using the latest version is recommended.

This release adds 2 new experimental features to the assembly checker.

The 2 new features were not present in v0.9.1, and their behaviour might change in the future.

1) Checking a VCF against a FASTA file that uses a different chromosome naming system.

For instance, your VCF uses chromosome numbers:

#CHROM	POS...
1	100 ...

but you have a FASTA with chromosome accessions:

>CM000001.3 chromosome 1
ATCG...

Now you can use the -a parameter to provide the path to a file with the mapping. The file structure expected is that of NCBI's assembly reports such as ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/002/285/GCA_000002285.2_CanFam3.1/GCA_000002285.2_CanFam3.1_assembly_report.txt

For each chromosome, the assembly checker will try to find in the FASTA any synonym under the columns "Sequence-Name", "GenBank-Accn", "RefSeq-Accn" and "UCSC-style-name".
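
The synonym lookup described above can be sketched like this. It is a rough illustration, not the assembly checker's implementation: it assumes the report is tab-separated, that the column names sit on a comment line starting with "# Sequence-Name" (as in NCBI assembly reports), and that "na" marks an absent value. Both function names are hypothetical.

```python
SYNONYM_COLUMNS = ("Sequence-Name", "GenBank-Accn", "RefSeq-Accn", "UCSC-style-name")


def load_synonyms(report_text: str) -> dict[str, set[str]]:
    """Map every contig name in the report to the set of all its synonyms."""
    header = None
    synonyms: dict[str, set[str]] = {}
    for line in report_text.splitlines():
        if line.startswith("#"):
            # The column names are on the comment line right before the data
            if line.startswith("# Sequence-Name"):
                header = line[2:].split("\t")
            continue
        if header is None or not line.strip():
            continue
        row = dict(zip(header, line.split("\t")))
        names = {row[c] for c in SYNONYM_COLUMNS if row.get(c) and row[c] != "na"}
        for name in names:
            synonyms[name] = names
    return synonyms


def find_in_fasta(vcf_contig, synonyms, fasta_contigs):
    """Return the synonym of vcf_contig present in the FASTA, or None."""
    for candidate in synonyms.get(vcf_contig, {vcf_contig}):
        if candidate in fasta_contigs:
            return candidate
    return None
```

With the example above, a VCF contig named 1 would resolve to CM000001.3 as long as the report lists both names on the same row.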

2) Remote sequence retrieval.

If no FASTA file is provided, EBI-ENA will be queried to download the sequence of each chromosome used in the VCF to check every reference allele.

Duplicate sample detection and warnings for unused parameters

01 Oct 13:13

This small release contains only minor fixes and the following improvements:

If the header line in a VCF file contains several samples with the same name, it is now flagged as an error, as recently clarified in the VCF specification.

Warnings are now logged if there are unused parameters in the command used to run any of the tools. Thanks @srbcheema1 for the contributions!
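
For reference, the duplicate sample check amounts to something like the sketch below. It assumes a tab-separated #CHROM header line in which sample names start at the 10th column. The function name is hypothetical and this is not the validator's own code.

```python
from collections import Counter


def duplicate_samples(header_line: str) -> list[str]:
    """Return the sample names that appear more than once in a #CHROM line."""
    columns = header_line.rstrip("\n").split("\t")
    samples = columns[9:]  # 8 fixed columns + FORMAT come before the samples
    counts = Counter(samples)
    return [name for name, occurrences in counts.items() if occurrences > 1]
```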

New reference checker tool and Windows support

12 Sep 19:30

A new tool has been added to the suite! This one checks that the REF column in a VCF matches the sequence contained in a FASTA file, and reports any mismatches in a summary or plain text file, in a similar fashion to the VCF validator reporting. A new report type that only outputs the valid lines is also included in this tool.
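
The core comparison can be pictured with the following sketch. It is a simplified illustration that loads the whole FASTA into memory and compares case-insensitively; both choices are assumptions, the function names are hypothetical, and the actual tool may work differently.

```python
def read_fasta(text: str) -> dict[str, str]:
    """Parse FASTA text into {contig name: sequence}.

    The contig name is the first word after ">".
    """
    parts: dict[str, list[str]] = {}
    name = None
    for line in text.splitlines():
        if line.startswith(">"):
            name = line[1:].split()[0]
            parts[name] = []
        elif name is not None:
            parts[name].append(line.strip())
    return {contig: "".join(chunks) for contig, chunks in parts.items()}


def ref_matches(sequences: dict[str, str], chrom: str, pos: int, ref: str) -> bool:
    """Check that REF equals the FASTA sequence at the 1-based position pos."""
    sequence = sequences.get(chrom)
    if sequence is None:
        return False
    start = pos - 1  # VCF positions are 1-based
    return sequence[start:start + len(ref)].upper() == ref.upper()
```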

We have also added support for Windows, making the suite compatible with the 3 major operating systems. Please be aware that you will need to decompress your files before validating them on Windows due to a known issue.

You can find the binaries for all versions, ready for direct download, attached to these notes.

New MacOS version and built-in support for compressed files

20 Jun 15:29
5270a11

MacOS users can now run the validation suite in their favorite OS, without needing Docker or admin permissions. Just copy the executable in the link onto your machine and run it in exactly the same way as in Linux. Please let us know if you find any compatibility issues by creating a bug report.

The validator can also read files compressed in multiple formats without needing a pipe. You can find instructions in the updated README file.

Thanks to @srbcheema1 for these contributions!

gVCF support, ploidy fixes and usability improvements

14 Mar 09:29
8caf422

The validator can now check fields specific to the gVCF extension. This includes <*> alternate alleles and how they relate to the END INFO field and sample genotypes.

Following some user reports (#101, #102) of incorrect counts being expected for FORMAT fields with Number=G, we confirmed with the specification that their cardinality depends on the ploidy of each sample genotype and not on the ALT column. The issue should be solved now, but if you find any problems please open a new ticket!
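
To make the dependency on ploidy concrete, here is a small sketch of the expected number of values for a Number=G field, following the combinatorial definition in the VCF specification: the count of unordered genotypes for a given number of alleles and ploidy. The function names are hypothetical, and reading the ploidy off the GT string is an assumption made for the example.

```python
import re
from math import comb


def ploidy_of(genotype: str) -> int:
    """Ploidy of a sample, read from its GT value such as "0/1" or "0|1|1"."""
    return len(re.split(r"[/|]", genotype))


def expected_genotype_count(num_alt_alleles: int, ploidy: int) -> int:
    """Number of values expected in a Number=G field for one sample."""
    alleles = num_alt_alleles + 1  # reference allele plus the ALT alleles
    # Unordered selections of `ploidy` alleles, repetition allowed
    return comb(alleles + ploidy - 1, ploidy)
```

For a site with one ALT allele, a diploid sample expects 3 values and a haploid sample expects 2, even though the ALT column is the same for both.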

This version also introduces some usability improvements. The biggest is a summary report in addition to the existing text and database outputs. This is human-readable and lists each type of error detected, the number of times it occurred, and the first line where it was observed.
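
The aggregation behind such a summary can be sketched as below. This is an illustration of the idea only: the input shape (pairs of line number and message) and the function name are assumptions, not the validator's internal data model.

```python
def summarize(errors):
    """Group errors by message.

    errors: iterable of (line_number, message) pairs.
    Returns {message: (occurrences, first_line)}.
    """
    summary = {}
    for line_number, message in errors:
        if message in summary:
            count, first_line = summary[message]
            summary[message] = (count + 1, min(first_line, line_number))
        else:
            summary[message] = (1, line_number)
    return summary
```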

The --version option now reports which version of the validator you are running. Please note that in vcf-validator 0.4 or earlier this option was used to specify which version of the specification the input file should match.

And finally, the validator now warns the user if the input is compressed, instead of reporting a confusing list of errors.
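
One common way to detect compressed input is to look at the first bytes of the file, as in the sketch below. The formats listed (gzip and bzip2) and the function name are assumptions made for the example; these notes do not say how the validator performs the check.

```python
# Leading "magic" bytes of two common compression formats
MAGIC_NUMBERS = {
    b"\x1f\x8b": "gzip",  # also used by bgzip
    b"BZh": "bzip2",
}


def detect_compression(first_bytes: bytes):
    """Return the compression format suggested by the leading bytes, or None."""
    for magic, name in MAGIC_NUMBERS.items():
        if first_bytes.startswith(magic):
            return name
    return None
```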

You can download the Linux binaries using the links, and also visit this page if you are interested in the full list of changes.

Improved structural variation support

11 Sep 09:04

It has been a really productive summer thanks to @Anishka0107, the Google Summer of Code student who has improved the support for structural variants in the validator and the debugulator 😃

She has added new metadata validations to ensure that INFO and FORMAT fields match the header definition, and that said header matches the VCF specification itself. These validations apply not only to short variants but also to structural variation tags, which hadn't been fully supported until now!

She also expanded the checks (added in the previous version) that guarantee no duplicate values in the ID and FORMAT columns of a single line, so that they also cover the FILTER and INFO columns. The debugulator can now automatically fix these duplicates, as well as the values assigned to some INFO tags (see #78 for more details).
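
As a rough picture of the duplicate checks, the sketch below looks for repeated values in the ID, FILTER, INFO and FORMAT columns of one data line. It assumes a well-formed, tab-separated line with at least the 8 fixed columns, and it compares INFO entries by key. The function names are hypothetical and the validator's own checks are implemented differently.

```python
def find_duplicates(values):
    """Return the values seen more than once, in order of first repetition."""
    seen, duplicates = set(), []
    for value in values:
        if value in seen and value not in duplicates:
            duplicates.append(value)
        seen.add(value)
    return duplicates


def duplicated_fields(record_line: str) -> dict[str, list[str]]:
    """Map column name to its duplicated values, for columns that have any."""
    fields = record_line.rstrip("\n").split("\t")
    columns = {
        "ID": fields[2].split(";"),
        "FILTER": fields[6].split(";"),
        # INFO entries are KEY=VALUE or flags; duplicates are judged by key
        "INFO": [entry.split("=", 1)[0] for entry in fields[7].split(";")],
    }
    if len(fields) > 8:
        columns["FORMAT"] = fields[8].split(":")
    result = {}
    for name, values in columns.items():
        duplicates = find_duplicates(values)
        if duplicates:
            result[name] = duplicates
    return result
```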

The last phase of GSoC was more focused on the purely technical aspects of the project: cleaning up the code, improving the documentation and slightly simplifying the grammar that detects syntax errors.

Please download the Linux binaries using the links below, and visit this page if you are interested in the full list of changes.