Potential useful information to review and add to the readthedocs page - from learning unit 9 #371

Open
timosachsenberg opened this issue Mar 9, 2023 · 3 comments

Comments

@timosachsenberg
Contributor

Peptide FDRs do not correspond to protein FDRs.
Large-scale studies often accumulate dozens or hundreds of LC-MS runs. Repeated measurements lead to an accumulation of false positive identifications: as a rule of thumb, the protein FDR increases linearly with the number of repeat measurements. One existing solution is to estimate the protein FDR with a target-decoy approach, in the same fashion as peptide FDRs. This approach is called MAYU (the name is not an acronym).

@timosachsenberg
Contributor Author

Protein inference problem
Peptide identification methods only match peptides to spectra and output a ranked list of peptide-spectrum matches (PSMs) with corresponding scores. False discovery rates can then be computed and used to filter out low-confidence PSMs at a chosen threshold. Each PSM above the threshold contributes a match of a spectrum to a peptide and also a match of a peptide to a protein. Because protein digestion destroys the connectivity between peptides and proteins, the identified peptides have to be assembled back into proteins: this is the protein inference problem. Note that the peptides are not necessarily unique: different spectra might be assigned the same peptide, and different proteins might contain the same peptide.
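The filtering step described above can be sketched with a plain target-decoy estimate. This is a minimal illustration, not the implementation of any particular tool: the function name `psm_qvalues`, the `(score, is_decoy)` input layout, the "higher score is better" convention, and the simple decoys/targets estimator are all assumptions made for the example.

```python
def psm_qvalues(psms):
    """Estimate q-values for PSMs with a simple target-decoy approach.

    psms: list of (score, is_decoy) tuples; higher score = better match
    (an assumption of this sketch).
    Returns a list of (score, is_decoy, q_value), best score first.
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)

    # FDR estimate at each score threshold: #decoys / #targets accepted so far.
    targets = decoys = 0
    fdrs = []
    for _score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        fdrs.append(decoys / max(targets, 1))

    # q-value: the smallest FDR at which this PSM would still be accepted,
    # i.e. the running minimum of the FDR taken from the bottom of the list.
    qvalues = []
    running_min = float("inf")
    for fdr in reversed(fdrs):
        running_min = min(running_min, fdr)
        qvalues.append(running_min)
    qvalues.reverse()

    return [(s, d, q) for (s, d), q in zip(ranked, qvalues)]


# Toy input: four target PSMs and one decoy PSM.
example = [(10.0, False), (9.0, False), (8.0, True), (7.0, False), (6.0, False)]
for score, is_decoy, q in psm_qvalues(example):
    print(score, "decoy" if is_decoy else "target", q)
```

PSMs with a q-value at or below the chosen threshold (for example 0.01) would then be kept for protein inference.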
Peptide uniqueness
Non-unique peptide sequences can stem from different proteins, e.g. homologous proteins, alternative splice variants, or redundant database entries. They make it difficult to infer exactly which protein is present in the sample. Uniqueness becomes more likely for longer peptide sequences (although the number of peptides of length >40 is already very low).

image
Nesvizhskii A I , Aebersold R Mol Cell Proteomics 2005;4:1419-1440.

Parsimony-based inference
In the presence of shared peptides (i.e., peptides whose sequence is present in multiple protein sequences), the task of computing protein confidence scores becomes more complicated. In early studies, some groups reported all proteins identified with at least one non-shared peptide, whereas others reported everything or selected one representative protein among isoforms or homologs. Many tools now present the results as protein groups. This approach is partly based on the parsimony principle, or Occam's razor (“entities must not be multiplied beyond necessity”), which suggests finding the smallest number of proteins (protein groups) that can explain all observed peptides [1]. If all peptides mapping to one protein family can be explained by a single protein, it is quite likely that only this protein is present (although this need not be the case).

In this approach, protein database entries that are indistinguishable given the sequences of identified peptides are collapsed into a single protein group. Other scenarios include subset proteins, i.e. proteins that share all of their peptides with another protein that is identified by at least one non-shared peptide, as well as more complicated cases. Such a nomenclature provides a more consistent and concise format for representing the results.

[1] Nesvizhskii AI, Aebersold R. Interpretation of shotgun proteomic data — the protein inference problem. Mol Cell Proteomics 2005;4:1419–40.
Protein Ambiguity Groups

In creating a protein summary list that accurately represents the data, various peptide grouping scenarios have to be considered. They are schematically illustrated in Fig. 1.
The diagram in Fig. 1a describes a case of two distinct proteins, A and B, each identified by distinct peptides only, i.e. peptides corresponding to that one protein and no other proteins (peptides 1 and 2 are unique to protein A, and peptides 3 and 4 are unique to protein B).
Fig. 1b shows a case of two differentiable proteins, which are identified by at least one distinct peptide (peptide 1 is unique to A, and peptide 4 is unique to protein B) but also by one or more shared peptides (peptides 2 and 3 are shared between the two proteins).
A different scenario is shown in Fig. 1c where all peptides are shared between proteins A and B. These two proteins are indistinguishable given the sequences of the identified peptides.
Fig. 1d and 1e each show a situation where all identified peptides corresponding to protein B are shared and can be accounted for by another protein A, or by a combination of several other proteins (proteins A and C in Fig. 1e), that is certain to be in the sample because it is identified by at least one distinct peptide.
A special case is shown in Fig. 1f, where all identified peptides are shared by a group of proteins. The presence of protein A in the sample is sufficient to explain all observed peptides (B and C are subset protein identifications). Although protein A is the most likely candidate, its presence in the sample is not required to explain the data, because it is identified by shared peptides only. In the absence of protein A, a combination of proteins B and C would account for all four peptides.

image
Fig 1: Basic peptide grouping scenarios. (Nesvizhskii A I , Aebersold R Mol Cell Proteomics 2005;4:1419-1440.)

Given a set of observed peptides, the following scenarios therefore exist for different proteins:
Distinct proteins do not share peptides
Differentiable proteins can be distinguished by at least one distinct peptide
Indistinguishable proteins share all peptides
Subset proteins contain only peptides that are also contained in one other protein
Subsumable proteins contain only peptides that are also contained in a combination of other proteins
The set of proteins sharing one or multiple peptides is often referred to as a protein ambiguity group.
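The categories listed above can be assigned mechanically from a protein-to-peptide map. The following is a simplified sketch: the function name `classify_proteins`, the dictionary layout, and the peptide numbering are assumptions for illustration. It also does not check whether the covering proteins are themselves conclusively identified, so in a situation like Fig. 1f the superset protein (which has shared peptides only) is labelled "subsumable" by this simple rule.

```python
def classify_proteins(prot2pep):
    """Label each protein by how its identified peptides are shared.

    prot2pep: dict mapping protein name -> non-empty set of identified peptides.
    Returns a dict mapping protein name -> category label.
    """
    labels = {}
    for prot, peps in prot2pep.items():
        others = [s for p, s in prot2pep.items() if p != prot]
        # Peptides of this protein that also occur in at least one other protein.
        shared = {pep for pep in peps if any(pep in s for s in others)}

        if not shared:
            labels[prot] = "distinct"            # no shared peptides at all
        elif shared != peps:
            labels[prot] = "differentiable"      # shared + at least one unique peptide
        elif any(s == peps for s in others):
            labels[prot] = "indistinguishable"   # identical peptide set elsewhere
        elif any(peps < s for s in others):
            labels[prot] = "subset"              # fully contained in one other protein
        else:
            labels[prot] = "subsumable"          # covered only by several proteins together
    return labels


# Peptide layouts following the descriptions of Fig. 1a and 1b, plus a
# hypothetical three-protein layout in the spirit of Fig. 1e.
print(classify_proteins({"A": {1, 2}, "B": {3, 4}}))
print(classify_proteins({"A": {1, 2, 3}, "B": {2, 3, 4}}))
print(classify_proteins({"A": {1, 2}, "B": {2, 3}, "C": {3, 4}}))
```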
Parsimony-Based Inference
The nomenclature described above, coupled with the Occam’s razor constraint, would provide a minimal list of proteins sufficient to explain all observed peptides. Such a minimal list would contain all distinct and differentiable proteins, e.g. proteins A and B in Fig. 1a and 1b, and proteins A and C in Fig. 1e but no subsumable or subset proteins, e.g. only protein A would be included in the list in the cases shown in Fig. 1d and 1f. In the case of indistinguishable protein identifications, Fig. 1c, it would be most accurate to collapse all such identifications into a single entry in the protein summary report as there is often no basis to eliminate any of them.
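One way to derive such a minimal list is to treat it as a set-cover problem. The sketch below is an assumption-laden illustration rather than the method of any specific tool: it collapses indistinguishable proteins into groups, keeps every group that has a peptide found in no other group, and then covers the remaining peptides greedily. Greedy set cover is a heuristic and is not guaranteed to find the true minimum in every case. The name `parsimonious_protein_list` and the input layout are hypothetical.

```python
from collections import Counter


def parsimonious_protein_list(prot2pep):
    """Return a small list of protein groups explaining all observed peptides.

    prot2pep: dict mapping protein name -> set of identified peptides.
    Returns a list of tuples; each tuple holds the names of one protein group.
    """
    # 1. Collapse indistinguishable proteins (identical peptide sets) into groups.
    by_peptides = {}
    for prot, peps in prot2pep.items():
        by_peptides.setdefault(frozenset(peps), []).append(prot)
    groups = {tuple(sorted(prots)): set(peps) for peps, prots in by_peptides.items()}

    # 2. Groups with a peptide seen in no other group are required.
    counts = Counter(pep for peps in groups.values() for pep in peps)
    chosen = [g for g in sorted(groups) if any(counts[p] == 1 for p in groups[g])]

    # 3. Greedily cover whatever peptides are still unexplained.
    covered = set().union(*(groups[g] for g in chosen))
    uncovered = set(counts) - covered
    while uncovered:
        # Ties are broken by sorted group name to keep the result deterministic.
        best = max(sorted(groups), key=lambda g: len(groups[g] & uncovered))
        chosen.append(best)
        uncovered -= groups[best]
    return chosen


# Hypothetical layouts in the spirit of Fig. 1c, 1d, 1e and 1f.
print(parsimonious_protein_list({"A": {1, 2}, "B": {1, 2}}))
print(parsimonious_protein_list({"A": {1, 2, 3}, "B": {2, 3}}))
print(parsimonious_protein_list({"A": {1, 2}, "B": {2, 3}, "C": {3, 4}}))
print(parsimonious_protein_list({"A": {1, 2, 3, 4}, "B": {1, 2}, "C": {3, 4}}))
```

Collapsing indistinguishable proteins before the cover step keeps all of their names in the report, as the text recommends.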

Presenting the results of large-scale shotgun experiments as such minimal lists of protein identifications has advantages (chiefly simplification) but also limitations. For example, a researcher interested in a particular gene might want to see all related protein isoforms annotated in the protein sequence database that are implicated by at least one peptide identified in the experiment. Moreover, a strict implementation of the Occam’s razor approach can be misleading when applied to complex protein families. Therefore, the most advantageous presentation would include the following:
a minimal list with indistinguishable proteins collapsed into a single entry (but showing all protein names) and with all members of protein groups listed
a means to view the proteins that are implicated by at least one peptide but cannot be called conclusively identified.
A simplified illustration of such a format of presentation is shown in Fig. 2.
In Fig. 2, peptides are apportioned among all their corresponding proteins, and the minimal list of proteins is derived. Proteins that are impossible to differentiate on the basis of identified peptides are collapsed into a single entry (F and G) or presented as a group (H, I, and J). Shared peptides are marked with an asterisk. Proteins that cannot be conclusively identified are shown at the end of the list but do not contribute toward the protein count.

image
Fig 2: A simplified example of a protein summary list. (Nesvizhskii A I , Aebersold R Mol Cell Proteomics 2005;4:1419-1440.)

@timosachsenberg
Contributor Author

Significance
What is the meaning of a PSM for a protein identification?
The FDR is calculated on the PSM level, and a 1% FDR means that about one in 100 accepted PSMs is expected to be an incorrect match. This does not mean that the FDR on the protein level is also 1%. In large-scale studies in particular, protein FDRs are much higher than peptide FDRs. The mapping of correct PSMs to proteins is an abundance-driven process: more abundant proteins are identified by a higher number of unique peptides and PSMs. For example, in a typical shotgun proteome profiling experiment of a fairly complex organism having 20,000 genes (proteins), a typical outcome would be the identification of ~1000 proteins from a number of correct PSMs (filtered at a low FDR) that is an order of magnitude higher [1]. Thus, correct PSMs tend to group into a relatively small number of proteins compared to the size of the proteome of the organism. In contrast, incorrect PSMs result from semi-random matching to any of the entries (20,000 in this example) in the sequence database. The matching is only semi-random because proteins differ in sequence length and because of the homology problem (as shown in Fig. 1).

[1] Nesvizhskii, Alexey I. "A survey of computational methods and error rate estimation procedures for peptide and protein identification in shotgun proteomics." Journal of proteomics 73.11 (2010): 2092-2123.

Generally speaking, the number of correctly identified proteins does not increase significantly with the number of spectra: the same proteins are identified again and again, so additional correct PSMs do not add new proteins. The number of false positive proteins, however, increases with the number of PSMs, because incorrect PSMs hit random proteins and therefore, at least initially, mostly yield novel false positives.
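The effect can be made concrete with a back-of-the-envelope calculation. This is a toy model under loudly stated assumptions, not a validated estimator: the number of correctly identified proteins stays fixed at 1000, incorrect PSMs hit the 20,000 database entries uniformly at random (ignoring the length and homology effects mentioned above), and an incorrect PSM only creates a false protein if it lands on an entry that is not among the correct ones. The function name and parameters are hypothetical.

```python
def expected_protein_fdr(n_psms, psm_fdr=0.01, n_true_proteins=1000, db_size=20000):
    """Expected protein-level FDR in a toy uniform-random-matching model."""
    # Expected number of incorrect PSMs among those passing the PSM-level filter.
    n_false_psms = psm_fdr * n_psms

    # Probability that a given database entry is hit by at least one incorrect PSM.
    p_hit = 1.0 - (1.0 - 1.0 / db_size) ** n_false_psms

    # Only entries that are not truly present turn into false protein identifications.
    n_false_proteins = (db_size - n_true_proteins) * p_hit

    return n_false_proteins / (n_false_proteins + n_true_proteins)


# The PSM-level FDR is held at 1% while the number of accepted PSMs grows.
for n_psms in (10_000, 50_000, 100_000):
    print(n_psms, round(expected_protein_fdr(n_psms), 3))
```

In this model the protein-level FDR is already several times larger than the 1% PSM-level FDR at 10,000 PSMs, and it keeps growing as more PSMs are accumulated while the list of correct proteins stays the same.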

@timosachsenberg
Contributor Author

One hit wonders
In many cases, proteins are identified through a single PSM: if at least one identified PSM maps to a protein, the protein is accepted as an identification. These ‘one hit wonders’ have long been considered problematic, because a single false PSM can lead to a wrongly identified protein. In fact, the so-called ‘Paris guidelines’ for data deposition in proteomics recommend reporting only identifications for which at least two peptides have been identified. This became known as the ‘two peptide rule’. Simply discarding a large fraction of the PSMs, however, does not adequately address the problem. This issue is discussed further in [1] and [2].
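For illustration, the two peptide rule amounts to a simple filter on the number of distinct peptides per protein. The sketch below assumes a hypothetical input of `(peptide, protein)` pairs for PSMs that already passed the PSM-level FDR filter; the function name is made up for the example. It also returns the dropped proteins, which makes visible how much information the rule throws away.

```python
from collections import defaultdict


def apply_two_peptide_rule(psms, min_peptides=2):
    """Split proteins into kept and dropped sets by distinct peptide count.

    psms: iterable of (peptide_sequence, protein_name) pairs.
    Returns (kept, dropped) as two sets of protein names.
    """
    peptides_per_protein = defaultdict(set)
    for peptide, protein in psms:
        # Repeated PSMs of the same peptide count only once.
        peptides_per_protein[protein].add(peptide)

    kept = {prot for prot, peps in peptides_per_protein.items()
            if len(peps) >= min_peptides}
    dropped = set(peptides_per_protein) - kept
    return kept, dropped


# P2 is supported by two PSMs but only one distinct peptide, so it is dropped.
example = [("PEPA", "P1"), ("PEPB", "P1"), ("PEPC", "P2"), ("PEPC", "P2")]
kept, dropped = apply_two_peptide_rule(example)
print("kept:", kept, "dropped:", dropped)
```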
