Potential useful information to review and add to the readthedocs page - from learning unit 7 #370

Open
timosachsenberg opened this issue Mar 9, 2023 · 3 comments

Comments

@timosachsenberg
Contributor

Workflow
The search engine plays the central role in the peptide identification process. As illustrated below, it works in three steps. First, the provided protein sequences are digested in silico according to the configured database settings. Second, a theoretical mass list is generated and compared with the experimental mass list. Finally, the list of matched masses is passed on to statistical significance analysis.
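
The in silico digestion step can be sketched as follows. This is a minimal illustration, not the implementation of any particular search engine: the function name `digest_trypsin` is hypothetical, and the cleavage rule (after K or R, unless followed by P) is a common simplification of trypsin specificity.

```python
def digest_trypsin(protein: str, max_missed_cleavages: int = 0) -> list[str]:
    """Return peptides from a simplified tryptic digestion of `protein`."""
    if not protein:
        return []
    # Cleavage sites: after K or R, unless the next residue is P (assumed rule).
    sites = [0]
    for i in range(len(protein) - 1):
        if protein[i] in "KR" and protein[i + 1] != "P":
            sites.append(i + 1)
    sites.append(len(protein))

    peptides = []
    last = len(sites) - 1
    for start in range(last):
        # Joining neighbouring fragments models missed cleavages.
        for end in range(start + 1, min(start + 1 + max_missed_cleavages, last) + 1):
            peptides.append(protein[sites[start]:sites[end]])
    return peptides


print(digest_trypsin("AKRPGKCDE", max_missed_cleavages=1))
# ['AK', 'AKRPGK', 'RPGK', 'RPGKCDE', 'CDE']
```

Allowing one missed cleavage adds the concatenated peptides `AKRPGK` and `RPGKCDE` to the fully cleaved ones, which enlarges the theoretical mass list.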

Search parameters
The search and the presentation of results are controlled by several search parameters. Some prominent examples are:
The enzyme used for digestion.
The modifications to consider: fixed or variable modifications (introduced in detail in LU7B).
How masses are specified: with or without (positive or negative) charge, and as monoisotopic or average mass.
The maximum number of missed cleavage sites for a peptide.
The tolerance for comparing protein masses.
The tolerance for comparing peptide masses. This value depends on the expected accuracy of the MS instrument; since no mass spectrometer has perfect accuracy, this parameter is always specified.
And many other options.
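
To make the parameters concrete, here is a small sketch of how they might be bundled and how a mass tolerance could be applied. The class `SearchParameters`, its field names, the default values, and the choice between "ppm" and "Da" units are all assumptions made for illustration; they do not mirror the configuration of any specific search engine.

```python
from dataclasses import dataclass, field


@dataclass
class SearchParameters:
    # All names and defaults here are hypothetical.
    enzyme: str = "trypsin"
    fixed_modifications: list[str] = field(default_factory=list)
    variable_modifications: list[str] = field(default_factory=list)
    monoisotopic: bool = True          # monoisotopic vs. average masses
    max_missed_cleavages: int = 1
    peptide_tolerance: float = 10.0
    tolerance_unit: str = "ppm"        # "ppm" (relative) or "Da" (absolute)


def within_tolerance(params: SearchParameters, experimental: float, theoretical: float) -> bool:
    """Check whether two masses agree within the configured tolerance."""
    delta = abs(experimental - theoretical)
    if params.tolerance_unit == "ppm":
        return delta / theoretical * 1e6 <= params.peptide_tolerance
    return delta <= params.peptide_tolerance


params = SearchParameters()
print(within_tolerance(params, 1000.005, 1000.0))  # about 5 ppm apart
print(within_tolerance(params, 1000.02, 1000.0))   # about 20 ppm apart
```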

Organization of the database
To retrieve the protein sequences of interest quickly (filtering), the database can be reorganized or supplemented with an explicit set of index tables. Each protein mass in such an index table then has pointers to the protein sequences with this mass. Another way to increase speed is to save the results of the in silico digestion. The theoretical masses obtained are then sorted, and each mass is given indices that point to the sequences in which it occurs, together with some peptide information (modifications etc.).
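
A sorted peptide mass index of the kind described above could look like the sketch below. The function names, the input layout (a dict from protein ID to its digested peptides), and the reduced residue mass table are assumptions for illustration only.

```python
from collections import defaultdict

# Toy subset of monoisotopic residue masses (Da); a real table covers all amino acids.
RESIDUE_MASS = {"G": 57.02146, "A": 71.03711, "S": 87.03203, "K": 128.09496, "R": 156.10111}
WATER = 18.010565


def peptide_mass(peptide: str) -> float:
    """Neutral monoisotopic mass: residue masses plus one water."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def build_mass_index(digests: dict[str, list[str]]) -> list[tuple[float, str, list[str]]]:
    """Return (mass, peptide, protein IDs) entries sorted by mass."""
    occurrences = defaultdict(list)
    for protein_id, peptides in digests.items():
        for peptide in peptides:
            if protein_id not in occurrences[peptide]:
                occurrences[peptide].append(protein_id)
    return sorted((peptide_mass(pep), pep, ids) for pep, ids in occurrences.items())


index = build_mass_index({"P1": ["GAK", "SR"], "P2": ["GAK"]})
for mass, peptide, proteins in index:
    print(f"{mass:.4f} {peptide} {proteins}")
```

Because the entries are sorted by mass, a later lookup for a given precursor mass only needs a binary search instead of a scan over the whole database.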

Search engines
Specifically, the peptide identification process consists of the following steps:
1. From the database, extract all sequences that fit the precursor mass of the MS2 spectrum within a given error tolerance.
2. For each of these candidates, generate a theoretical spectrum.
3. Align / compare all theoretical spectra to the experimental spectrum.
4. Score the alignments and rank the candidates by score.
5. Assume the top-ranked candidate is the correct PSM (peptide-spectrum match).

image

Extract all candidates (search space)
In this stage, an experimental spectrum $S$ is given and we want to identify the correct sequence for $S$ from a given protein database.
First, the search space for $S$ at a given mass tolerance $d$ is defined. Let $m_{prec}$ be the mass of the precursor ion of spectrum $S$. From the database, extract all peptide sequences with mass $m_{cand}$ such that
$|m_{prec} - m_{cand}| \le d$.

This set of candidates is defined as the search space for spectrum $S$ and denoted $\Omega_S$.
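
The candidate extraction above amounts to a range query on a sorted mass list. A minimal sketch, assuming the peptide masses are already sorted and stored in parallel with the peptide sequences (the function name `search_space` and the example values are hypothetical):

```python
from bisect import bisect_left, bisect_right


def search_space(sorted_masses: list[float], peptides: list[str], m_prec: float, d: float) -> list[str]:
    """Return all peptides whose mass m_cand satisfies |m_prec - m_cand| <= d."""
    lo = bisect_left(sorted_masses, m_prec - d)
    hi = bisect_right(sorted_masses, m_prec + d)
    return peptides[lo:hi]


masses = [500.0, 800.25, 800.30, 800.40, 1200.0]
names = ["pepA", "pepB", "pepC", "pepD", "pepE"]  # placeholder peptide labels
print(search_space(masses, names, m_prec=800.3, d=0.06))
# ['pepB', 'pepC']
```

A larger tolerance `d` widens the window and therefore enlarges the search space, which is why the tolerance should reflect the real accuracy of the instrument.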

Generate theoretical spectra
There are two options for generating theoretical spectra. The first option is to compute only the fragment ion masses expected in the MS2 spectrum of a candidate; the second additionally tries to model fragment ion intensities. Note that with the first option, the generated theoretical spectrum $T$ usually has uniform intensity information.
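
As an illustration of a mass-only theoretical spectrum with uniform intensities, the sketch below generates singly charged b and y ions for a candidate peptide. Restricting the spectrum to b and y ions at charge 1, the function name, and the intensity value 1.0 are assumptions made for this example.

```python
# Monoisotopic residue masses (Da).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565
PROTON = 1.007276


def theoretical_spectrum(peptide: str) -> list[tuple[float, float]]:
    """Return sorted (m/z, intensity) peaks for singly charged b and y ions."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    peaks = []
    for i in range(1, len(masses)):
        b_mz = sum(masses[:i]) + PROTON          # prefix fragment
        y_mz = sum(masses[i:]) + WATER + PROTON  # suffix fragment
        peaks.append((b_mz, 1.0))                # uniform intensity
        peaks.append((y_mz, 1.0))
    return sorted(peaks)


for mz, intensity in theoretical_spectrum("GAK"):
    print(f"{mz:.4f} {intensity}")
```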

Comparison to experimental spectra

image

The main task is to compare two lists of masses. The straightforward approach is to sort both lists by mass and traverse them in parallel. Some aspects that have to be taken into account are as follows:
• An experimental mass may match more than one theoretical peptide mass within the given threshold.
• A theoretical mass may match more than one experimental mass.
• A theoretical mass may match both an unmodified peptide and a second modified peptide.
• Both a concatenated theoretical peptide (missed cleavages) and one of its parts may find matches.
• Some of the experimental masses may come from noise.
• Different peptides can have similar masses, due to permutations of the amino acids.

Thus for each experimental mass there can be a number of false matches (matches to other peptides than the correct one), and this number depends on the accuracy of the measurements.
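
The parallel comparison of two sorted mass lists can be sketched as below. The function name `match_masses` and the absolute tolerance in Da are assumptions; the sketch deliberately reports every pair within the tolerance, so one experimental mass can produce several matches, as described in the list above.

```python
def match_masses(experimental: list[float], theoretical: list[float], tol: float) -> list[tuple[float, float]]:
    """Return all (experimental, theoretical) pairs that differ by at most tol."""
    exp = sorted(experimental)
    theo = sorted(theoretical)
    matches = []
    start = 0
    for e in exp:
        # Skip theoretical masses that are too small to match e or any later mass.
        while start < len(theo) and theo[start] < e - tol:
            start += 1
        j = start
        # Collect every theoretical mass inside the tolerance window of e.
        while j < len(theo) and theo[j] <= e + tol:
            matches.append((e, theo[j]))
            j += 1
    return matches


print(match_masses([100.0, 200.0, 300.5], [100.02, 199.99, 200.01, 250.0], tol=0.05))
# [(100.0, 100.02), (200.0, 199.99), (200.0, 200.01)]
```

Here the experimental mass 200.0 matches two theoretical masses, while 300.5 matches none and could, for example, stem from noise.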

Scoring of peptide candidates
There are numerous tools for comparing the theoretical spectra of candidate peptides with experimental spectra. The main difference between search engines is the implementation of their scoring schemes, which results in differences in runtime and performance. Conceptually, however, all search engine algorithms are based on fragment ion comparison.
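
To show what a fragment-ion-based score can look like, the sketch below uses a shared peak count: the number of theoretical fragment masses that have an experimental peak within the tolerance. This is a deliberately simple stand-in and not the scoring scheme of any named search engine; all function names and example values are hypothetical.

```python
from bisect import bisect_left


def shared_peak_count(experimental_mz: list[float], theoretical_mz: list[float], tol: float) -> int:
    """Count theoretical peaks that have an experimental peak within tol."""
    exp = sorted(experimental_mz)
    count = 0
    for t in theoretical_mz:
        i = bisect_left(exp, t - tol)  # first experimental peak >= t - tol
        if i < len(exp) and exp[i] <= t + tol:
            count += 1
    return count


def rank_candidates(experimental_mz, candidates: dict[str, list[float]], tol: float):
    """Return (peptide, score) pairs, best score first."""
    scored = [(shared_peak_count(experimental_mz, mz, tol), pep) for pep, mz in candidates.items()]
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [(pep, score) for score, pep in scored]


spectrum = [100.0, 200.0, 300.0]
candidates = {"candA": [100.01, 200.02, 400.0], "candB": [100.01, 150.0, 400.0]}
print(rank_candidates(spectrum, candidates, tol=0.05))
# [('candA', 2), ('candB', 1)]
```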

@timosachsenberg
Contributor Author

Target-decoy concept
Description
The idea of the target-decoy strategy is simple: estimate the number of false positives by simulation. Here we introduce the validation measures based on the target-decoy strategy in detail: FDR, q-value, and posterior error probability (local FDR). All of these rely on the score distribution of negative (incorrect) PSMs.

Calculation of FDR using target decoy strategy
In the field of MS/MS-based proteomics, the methods for estimating FDR can be broadly grouped into two categories. The discussion here starts with a simple approach based on the use of the target-decoy database search strategy [1].
[1] Elias JE, Gygi SP. Target-decoy search strategy for increased confidence in large-scale protein identifications by mass spectrometry. Nat Meth 2007;4:207–14.
In the first step, the spectra are searched against a database that contains both target and decoy sequences. The basic assumption is that matches to decoy peptide sequences (decoy PSMs) and false matches to sequences from the target database follow the same distribution. In the second step, PSMs are filtered using various score cut-offs (e.g. a certain X! Tandem E-value or MASCOT Ion Score cut-off), and the corresponding FDR for each cut-off is estimated. There are two ways to calculate the FDR from target-decoy search results:
• Käll et al. suggest $FDR = \frac{\#decoy}{\#target}$.
• Zhang et al. suggest $FDR = \frac{2 \cdot \#decoy}{\#target + \#decoy}$.
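
Both estimates can be computed directly from the counts of target and decoy PSMs that pass a score cut-off. The sketch below assumes that higher scores are better and that each PSM is given as a `(score, is_decoy)` pair; the function name and the example data are hypothetical.

```python
def fdr_estimates(psms: list[tuple[float, bool]], cutoff: float) -> tuple[float, float]:
    """Return (decoy/target, 2*decoy/(target+decoy)) for PSMs with score >= cutoff."""
    targets = sum(1 for score, is_decoy in psms if score >= cutoff and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms if score >= cutoff and is_decoy)
    fdr_ratio = decoys / targets if targets else 0.0
    fdr_doubled = 2 * decoys / (targets + decoys) if targets + decoys else 0.0
    return fdr_ratio, fdr_doubled


# Made-up example: 9 target PSMs and 1 decoy PSM pass the cut-off of 20.
psms = [(50.0 - i, False) for i in range(9)] + [(30.0, True), (10.0, True), (5.0, False)]
ratio, doubled = fdr_estimates(psms, cutoff=20.0)
print(f"{ratio:.3f} {doubled:.3f}")
# 0.111 0.200
```

The two formulas differ in what they count as the accepted set: the first relates decoys to target PSMs only, the second relates twice the decoy count to all accepted PSMs including the decoys.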

False discovery rate (FDR)
The FDR is the expected proportion of incorrect predictions among a selected set of predictions. For our MS problem, this can be interpreted as the fraction of incorrect PSMs within the set of PSMs that score above a certain threshold.

image

Example: In the following table, there are 3 false discoveries among 13 PSMs. The 10 true positives are therefore identified at an FDR of 3/13 ≈ 23%.

image
Example: Setting an FDR threshold of 1% means that we accept a list of PSMs in which 99% of the PSMs are expected to be true discoveries and 1% are expected to be false.

@timosachsenberg
Contributor Author

q-value
The q-value can be understood as the minimal FDR level at which a PSM can be accepted.
The q-value of a PSM with score $x$ is $q(x) = \min_{x' \le x} FDR(x')$, where $FDR(x')$ is the FDR obtained with the score cut-off $x'$ (assuming that higher scores are better).
Example: A q-value of 0.01 for a PSM means that 1% is the minimal FDR threshold at which this PSM appears in the accepted list when all possible FDR thresholds are tried.
Example: For the dataset from the previous section, the corresponding q-values are listed below.

image
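
The q-value definition can be turned into a short computation: estimate the FDR at every rank of the score-sorted PSM list, then take a running minimum from the worst-scoring PSM upwards. The sketch assumes the PSMs are already sorted from best to worst score and uses the decoy/target ratio as the FDR estimate; the function names and the example labels are hypothetical and do not reproduce the table in the image.

```python
def fdr_by_rank(is_decoy_sorted: list[bool]) -> list[float]:
    """FDR estimate (decoys / targets) when accepting the top i+1 PSMs."""
    fdrs = []
    targets = decoys = 0
    for is_decoy in is_decoy_sorted:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        fdrs.append(decoys / max(targets, 1))  # avoid division by zero
    return fdrs


def q_values(fdrs: list[float]) -> list[float]:
    """q-value per PSM: the minimal FDR over all cut-offs that still include it."""
    q = list(fdrs)
    for i in range(len(q) - 2, -1, -1):
        q[i] = min(q[i], q[i + 1])
    return q


labels = [False, False, True, False, False]  # best score first; True marks a decoy
fdrs = fdr_by_rank(labels)
print(q_values(fdrs))
# [0.0, 0.0, 0.25, 0.25, 0.25]
```

Unlike the raw FDR, which rises to 0.5 at the third PSM and then falls again, the q-values never decrease as the score gets worse.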

@timosachsenberg
Contributor Author

Fragment ion mass
Fragment ion masses can be calculated using the table shown below. $M$ corresponds to the sum of residue (= dehydrated amino acid) masses. $M_{N-term}$ and $M_{C-term}$ are the masses of the neutral N- and C-terminal groups (usually H and OH).

image

Calculation of fragment ion masses

Example: A peptide with neutral mass $M_{N-term} + M_{prefix} + M_{suffix} + M_{C-term}$ breaks after the prefix position and yields either the $b_{prefix}$ or the $y_{suffix}$ ion of this sister ion pair. According to the table, the neutral mass of the $b_{prefix}$ ion is $M_{N-term} + M_{prefix} - H$ and the neutral mass of the $y_{suffix}$ ion is $M_{C-term} + M_{suffix} + H$.
To calculate the mass of the fragment ion, we add the charge times the proton mass to the neutral mass. The m/z of the fragment ion is this mass divided by the charge.
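
The calculation above can be written out as a short function. This sketch takes the b and y formulas exactly as stated in the example, assumes the usual terminal groups H and OH, and uses a reduced residue mass table; the function name `fragment_mz` and the example peptide are hypothetical.

```python
H = 1.007825        # hydrogen atom, monoisotopic mass (Da)
OH = 17.002740      # hydroxyl group, monoisotopic mass (Da)
PROTON = 1.007276   # proton mass (Da)

# Toy subset of monoisotopic residue masses (Da).
RESIDUE_MASS = {"G": 57.02146, "A": 71.03711, "K": 128.09496}


def fragment_mz(prefix: str, suffix: str, charge: int = 1,
                m_nterm: float = H, m_cterm: float = OH) -> tuple[float, float]:
    """Return (b ion m/z, y ion m/z) for a break between prefix and suffix."""
    m_prefix = sum(RESIDUE_MASS[aa] for aa in prefix)
    m_suffix = sum(RESIDUE_MASS[aa] for aa in suffix)
    b_neutral = m_nterm + m_prefix - H   # neutral b ion mass
    y_neutral = m_cterm + m_suffix + H   # neutral y ion mass
    b_mz = (b_neutral + charge * PROTON) / charge
    y_mz = (y_neutral + charge * PROTON) / charge
    return b_mz, y_mz


b1, y2 = fragment_mz("G", "AK", charge=1)
print(f"b1 = {b1:.4f}, y2 = {y2:.4f}")
# b1 = 58.0287, y2 = 218.1499
```

For a doubly charged fragment, two proton masses are added before dividing by 2, so the y2 ion of the same peptide would appear at roughly half the m/z value.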
