Skip to content

A Directed Acyclic Word Graph built for searching mass spectrometry data agains a protein database

Notifications You must be signed in to change notification settings

zmcgrath96/mass_DAWG

Repository files navigation

Mass Directed Acyclic Word Graph (DAWG) build and test

A word graph made for the sole purpose of identify mass spectra data

Background

What is mass spectrometry data?

If you hadn't heard of mass spectrometry, I doubt you will get much out of this datastructure, but feel free to stick around and learn!

Mass spectrometry data is data that is generated from a mass spectrometer (duh). More specifically, this data structure is meant for identify peptides (short sequences of amino acids) from the list of numbers (floats) from a mass spectrometer run. Amino acids are described as a sequence masses.

The primary use of this data structure, to be even MORE specific, is by using singly and doubly charged masses to identify a peptide. The output, at its most basic form, is a list of floating point numbers. It might look something like the following

[99.023, 140.743, 209.887, 288.115, 402.778]

This sequence of floating point numbers describe a particular sequence of amino acids. The way this works is as follows:

Say our sequence of amino acids is: MALW (don't shoot me I know the masses don't make sense). The above sequence of amino acids describe any arrangement of ion types and charges. For the sake of this example, we will limit our scope to what're called b and y ions with possible charges of 1 and 2. The b ions describe amino acids in a left-to-right fashion, and y right-to-left. So for our example sequence, we can break it down to something like:

  b1  b2  b3
M | A | L | W
  y3  y2  y1

So M is described by both a b ion and a y ion. The difference is that b is the mass of M and y is really the mass of ALW. But you can see how these ions complement eachother.

To dive a bit deeper, charges, as the name suggests, are the actual charges of the molecules and amino acid chains from the mass spectrometer. The most common of which are 1 and 2. So we could have up to (in this case) 4 different combinations of ions (b+, b++, y+, y++) that describe each junction.

What is a DAWG?

A Directed Acyclic Word Graph (DAWG) is an extension of a prefix tree Prefix trees are a way to store a lot of sequencial data (typically strings) in an easy to search and somewhat compressed form. DAWGs are a graph extension of these. They do the same sort of thing, but in a more compressed manner. Nodes with the same value are merged together to reduce the space usage. The wikipedia page for these data structures can be found here.

Great, but wheres the connection?

Mass spectrometry data, at its core, is sequential data. There are two mainstream ways to identify peptides (amino acid sequences) from mass spec data: (1) a database serach, (2) de novo sequencing. The application this data structure is geared towards is database searches.

Protein databases are typically saved in .fasta files, which contain a little bit of header data and then the sequence. Entries look like the following:

>sp|someIdentifier|someName some extra info
ABCDEFGHIJKLMNOPQRSTUVWXYZABCDEFGHIJKLMNOPQRSTUVWX
YZABCDEFGHIJKLMNOPQRSTUVWXYZ...

At a very high level, database searches take a spectrum (the list of numbers) like the one shown above and go over each entry and each position in the database, generate a theoretical spectrum from some subsequence of the protein, and compare the two to see how well they match.

So for our database, if we wanted to try to identify the sequence, we would essentially go through each position of each protein and generate a spectrum. So we would start with ABCD..., then move onto BCDE... and so on and so forth.

Going through a database like this without some insane optimizations for each spectrum would take ages. So thats where the DAWG comes in.

If we limit our searches to some upper bound of peptide length (say 20), then we can fairly quickly iterate through the entire database, find all sequences who's length is < 21, generate theoretical spectra for these sequences, and put them in a data structure to quickly search.

Example


Our set of sequences would be (for now, just b ions, so left to right):

ABCDEFGHIJKLMNOPQRST
BCDEFGHIJKLMNOPQRSTU
ABCDEFGHIJKLMNOPQRST
...
XYZ
YZ
Z

Notice that after Z becomes the end letter, our sequences shrink on the left side. Since we are only looking at b ions in this example, we need to get all of the left starting sequences. If we had stuck to ONLY doing peptides of length 20, we would never find Z, YZ, XYZ, etc. in our searches.

Once we have our set of sequences, we can generate theoretical spectra for this finite set.

NOTE: at this point in time, only singly charged and doubly charged masses are allowed into the DAWG.

This would look something like

Sequence: ABCD...
singly charged masses: [100.123, 167.9877, 299.773, ...]
doubly charged masses: [50.0625, 83.4588, 149.9932, ...]

We would then add these two charged seqeuences to the DAWG tagged at each step with the corresponding subsequence. So our DAWG at this point would look like

root --> {100.123, 50.0625, [A]} --> {167.9877, 83.4588, [AB]} 
                                                ||
                                                ||
                                                \/
                                     {299.773, 149.9932, [ABC]} 

Now if we had another sequence AXC enter the set with the mass sequences

singly: [100.123, 189.9877, 299.773, ...]
doubly: [50.0625, 94.998, 149.9932, ...]

we could add it to the graph, and our graph would then look like

root --> {100.123, 50.0625, [A]} --> {167.9877, 83.4588, [AB]} 
                ||                              ||
                ||                              ||
                \/                              \/
        {189.9877, 94.998, [AX]} ---> {299.773, 149.9932, [ABC, AXY]} 

Our graph is now connected, unlike a tree would have been (AXY would have become its own branch). Its important to note that we now have a bit of ambiguity in which path we took to get to ABC, AXY, but the important step is we've reduced the set of peptides that we care about from what seemed infinite to 2, so some post analysis of these sequences could narrow this down.


Mass DAWG

This module is written in C++ 11 for its speed and ease of use to write. In order to compile this, you must have the g++ compiler.

Installation and usage

To install, run the following:

$>git clone https://github.com/zmcgrath96/mass_DAWG.git

To build and run tests, change to the mass_DAWG directory, compile the bash script build_and_test_cc.sh then run ./build_and_test_cc.sh

$> cd mass_DAWG
$mass_DAWG> chmod u+x build_and_test.sh
$mass_DAWG> ./build_and_test.sh

To just build and NOT run tests, follow those steps but do so for the file build_cc.sh rather than build_and_test_cc.sh

To include in your workflow, simply import MassDawg.hpp in your file.

UPDATE: As of August 4th 2020, sequences can be inserted out of order

Example

Taken from the main.cpp file in the src directory

Example main.cpp file

#include <iostream>
#include <vector>

#include "MassDawg.hpp"

using namespace std;


int main() {
    // create a mass dawg
    MassDawg * md = new MassDawg();

    // add a sequence
    vector<float> singly = {100.1, 200.2, 300.3, 400.4};
    vector<float> doubly = {200.2, 400.4, 600.6, 800.8};
    string sequence = "ABCD";

    cout << "inserting ABCD into graph";
    md->insert(singly, doubly, sequence);

    // show the graph
    md->show();

    // add a new series to it. Graph should reconverge on "D"
    singly = {100.1, 200.4, 300.6, 400.4};
    doubly = {200.2, 400.4, 600.6, 800.8};
    sequence = "AXYD";

    cout << "inserting AXYD into graph";
    md->insert(singly, doubly, sequence);

    // finising the graph does a final round
    // of merging nodes with the same value
    // for this example, it'll turn the nodes with masses (400.4, 800.8)
    // into 1 node with kmers "ABCD", "AXYD"
    md->finish();

    md->show();

    // search for a seqeunce missing 2 masses
    cout << "searching for a sequence with no gaps\n";
    vector<float> searching = {600.6, 800.8};
    vector<string> results = md->fuzzySearch(searching, 2, 10);
    cout << "Number of results: " + to_string(results.size()) + "\n";

    for (int i = 0; i < results.size(); i++){
        cout << results[i] + "\n"; 
    }

    // clean up 
    delete md;
}

Example output

inserting ABCD into graph
root
  |---> kmers: {A} 	 masses: 100.099998, 200.199997
    |---> kmers: {AB} 	 masses: 200.199997, 400.399994
      |---> kmers: {ABC} 	 masses: 300.299988, 600.599976
        |---> kmers: {ABCD} 	 masses: 400.399994, 800.799988
inserting AXYD into graph
root
  |---> kmers: {A} 	 masses: 100.099998, 200.199997
    |---> kmers: {AB} 	 masses: 200.199997, 400.399994
      |---> kmers: {ABC} 	 masses: 300.299988, 600.599976
        |---> kmers: {ABCD, AXYD} 	 masses: 400.399994, 800.799988
    |---> kmers: {AX} 	 masses: 200.399994, 400.399994
      |---> kmers: {AXY} 	 masses: 300.600006, 600.599976
        |---> kmers: {ABCD, AXYD} 	 masses: 400.399994, 800.799988

searching for a sequence with gaps
Number of results: 2
ABCD
AXYD

Exposed MassDawg functions (API)

  • void show(): Print the graph to the console as a tree (merged nodes have their kmers put into a list)
  • void insert(vector singlySequence, vector doublySequence, string kmer): Insert a pair of singly charged and doubly charged masses into the dawg associated with the kmer (all 3 parameters MUST be the same length)
  • vector fuzzySearch(vector sequence, int gapAllowance, int ppmTol): Search the graph for a sequence of floats allowing for up to gapAllowance missed masses. ppmTol is the allowed tolerance for a mass to fall within (ppm = parts per million) *vector search(vector sequence, int ppmTol): Search the graph for a sequence of floats with no missed masses. ppmTol is the allowed tolerance for a mass to fall within (ppm = parts per million)
  • void finish(): Go through the graph one final time to merge all remaining nodes that have not been checked for duplicates.

Python bindings

More information on how to use the python version of this module can be found here

About

A Directed Acyclic Word Graph built for searching mass spectrometry data agains a protein database

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages