genomiC++ 🧬

A bunch of C++ programs for working with bioinformatics files. It uses only the C++ standard library and is in continuous alpha, as I create and improve the different programs when needed.

Its compilation and functionality have been verified on the following operating systems:

  • macOS 🍏
  • Linux 🐧

Download and Compilation 💾

>>> git clone https://github.com/alexcoppe/genomicpp.git
>>> cd genomicpp
>>> make

After compilation, move the generated executables to a directory listed in the $PATH variable. You can identify these directories by using the echo $PATH command.

Programs available 👨‍💻

vcf_to_table

This program converts an uncompressed VCF file to a tab-separated values (TSV) file. It also works with VCFs annotated by SnpEff and ANNOVAR. It takes two arguments: the VCF file and a text file specifying the desired fields. Refer to the table below for guidance on creating this file.

When given a SnpEff-annotated VCF, the tool currently writes each transcript reported by SnpEff on a separate row.

Starting character  What you get
None                get a field from the VCF
:                   get a subfield from the INFO field added by SnpEff
;                   get a specific subfield from the INFO field
|                   get a specific subfield from the genotype fields
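
The prefix rules in the table above can be sketched in C++ as follows. This is a minimal illustration and not the code used by vcf_to_table: the names FieldKind, FieldSpec and parse_field_spec are hypothetical.

```cpp
#include <string>

// Hypothetical categories, one per row of the prefix table.
enum class FieldKind { VcfField, SnpEffSubfield, InfoSubfield, GenotypeSubfield };

struct FieldSpec {
    FieldKind kind;
    std::string name;  // the entry without its leading prefix character
};

// Classify one line of the fields file by its first character.
FieldSpec parse_field_spec(const std::string& line) {
    if (line.empty()) return {FieldKind::VcfField, ""};
    switch (line[0]) {
        case ':': return {FieldKind::SnpEffSubfield, line.substr(1)};
        case ';': return {FieldKind::InfoSubfield, line.substr(1)};
        case '|': return {FieldKind::GenotypeSubfield, line.substr(1)};
        default:  return {FieldKind::VcfField, line};
    }
}
```

Under these assumptions, parse_field_spec(":hgvs_c") yields a SnpEff subfield named hgvs_c, while parse_field_spec("position") yields a plain VCF field.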

Example of a text file specifying the desired fields and subfields:

:hgvs_c
position
;gnomAD_genome_AMR
|AD

Launching the program with the above text file:

vcf_to_table a_vcf_file_path.vcf wanted_fields.txt

Output:

n.-3702C>T      157370625       0.0020  14,1    31,5
n.*1931C>T      157370625       0.0020  14,1    31,5
n.-3707C>T      157370630       0       15,1    33,4
...

Currently, the program only supports VCF files with one or two genotype (sample) columns.
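
For readers unfamiliar with genotype subfields: in a VCF record, the FORMAT column lists colon-separated keys (for example GT:AD:DP) and each sample column lists the values in the same order. The sketch below shows one way to look up a key such as AD. It is an illustration only, not the vcf_to_table implementation; the function names and the choice of "." for a missing key are assumptions.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Split text on a single-character delimiter.
std::vector<std::string> split(const std::string& text, char delimiter) {
    std::vector<std::string> parts;
    std::string part;
    std::istringstream stream(text);
    while (std::getline(stream, part, delimiter)) parts.push_back(part);
    return parts;
}

// Return the value stored under `key` in one sample column,
// or "." (assumed here as the missing-value marker) if the key is absent.
std::string genotype_subfield(const std::string& format,
                              const std::string& sample,
                              const std::string& key) {
    const std::vector<std::string> keys = split(format, ':');
    const std::vector<std::string> values = split(sample, ':');
    for (std::size_t i = 0; i < keys.size() && i < values.size(); ++i) {
        if (keys[i] == key) return values[i];
    }
    return ".";
}
```

With a FORMAT of GT:AD:DP and a sample column of 0/1:14,1:15, asking for AD returns 14,1, which is the shape of the last two columns in the output above.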

The table below displays all the sub-fields added by SnpEff along with the corresponding sub-field names used in vcf_to_table (listed in the first column).

Subfield by vcf_to_table  Subfield by SnpEff                        Explanation
:allele                   Allele (or ALT)                           The alternative allele
:annotation               Annotation (a.k.a. effect)                Annotated using Sequence Ontology terms
:putative_impact          Putative_impact                           A simple estimation of putative impact / deleteriousness: {HIGH, MODERATE, LOW, MODIFIER}
:gene_name                Gene Name                                 Common gene name (HGNC)
:gene_id                  Gene ID                                   Gene ID
:feature_type             Feature type                              Which type of feature is in the next field
:feature_id               Feature ID                                Depends on the annotation
:transcript_biotype       Transcript biotype                        At minimum, a description of whether the transcript is {"Coding", "Noncoding"}. Whenever possible, ENSEMBL biotypes are used
:rank                     Rank / total                              Exon or intron rank / total number of exons or introns
:hgvs_c                   HGVS.c                                    Variant using HGVS notation (DNA level)
:hgvs_p                   HGVS.p                                    If the variant is coding, this field describes the variant using HGVS notation (protein level)
:cdna_position            cDNA_position / cDNA_len                  Position in cDNA and transcript's cDNA length (one based)
:cds_position             CDS_position / CDS_len                    Position and number of coding bases (one based, including START and STOP codons)
:protein_position         Protein_position / Protein_len            Position and number of AA (one based, including START, but not STOP)
:distance_to_feature      Distance to feature                       All items in this field are optional; see the SnpEff documentation for details
:errors                   Errors, Warnings or Information messages  Errors, warnings or informative messages that can affect annotation accuracy
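
SnpEff stores these sub-fields as one '|'-separated string per annotation, in the order listed in the table. The sketch below shows how a name could be mapped to its position. It is a hedged illustration, not the vcf_to_table source: kAnnNames, split_keep_empty and ann_subfield are hypothetical names, and the names are written without the leading ':' prefix.

```cpp
#include <array>
#include <cstddef>
#include <string>
#include <vector>

// Sub-field names in the order of the table above (without the ':' prefix).
const std::array<const char*, 16> kAnnNames = {{
    "allele", "annotation", "putative_impact", "gene_name", "gene_id",
    "feature_type", "feature_id", "transcript_biotype", "rank", "hgvs_c",
    "hgvs_p", "cdna_position", "cds_position", "protein_position",
    "distance_to_feature", "errors"}};

// Split on a delimiter, keeping empty sub-fields (including a trailing one).
std::vector<std::string> split_keep_empty(const std::string& text, char delimiter) {
    std::vector<std::string> parts;
    std::size_t start = 0;
    while (true) {
        const std::size_t end = text.find(delimiter, start);
        if (end == std::string::npos) {
            parts.push_back(text.substr(start));
            return parts;
        }
        parts.push_back(text.substr(start, end - start));
        start = end + 1;
    }
}

// Return the named sub-field of a single annotation, or "" if unknown or absent.
std::string ann_subfield(const std::string& annotation, const std::string& name) {
    const std::vector<std::string> values = split_keep_empty(annotation, '|');
    for (std::size_t i = 0; i < kAnnNames.size() && i < values.size(); ++i) {
        if (name == kAnnNames[i]) return values[i];
    }
    return "";
}
```

Empty sub-fields are kept on purpose: SnpEff leaves positions blank when a value does not apply, so dropping them would shift every later sub-field.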

get_pass_variants

This program filters a VCF file annotated by SnpEff, retaining only the variants marked as 'PASS' in the FILTER field.

Option  What does it do
-h      Show help

Example:

>>> get_pass_variants /path_to_vcf_file/variants.vcf
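
The filtering rule can be sketched as follows: FILTER is the seventh tab-separated column of a VCF data line. This is a minimal illustration and not the get_pass_variants source. The function names are hypothetical, and keeping the header lines (those starting with '#') in the output is an assumption.

```cpp
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// True when the 7th tab-separated column (FILTER) of a data line is exactly "PASS".
bool is_pass(const std::string& line) {
    std::istringstream columns(line);
    std::string column;
    for (int i = 0; i < 7; ++i) {
        if (!std::getline(columns, column, '\t')) return false;  // fewer than 7 columns
    }
    return column == "PASS";
}

// Copy header lines and PASS records from `in` to `out`, skipping empty lines.
void copy_pass_variants(std::istream& in, std::ostream& out) {
    std::string line;
    while (std::getline(in, line)) {
        if (line.empty()) continue;
        if (line[0] == '#' || is_pass(line)) out << line << '\n';
    }
}
```

Passing an std::ifstream opened on the VCF file as `in` and std::cout as `out` would print the retained lines to standard output.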
