spectra-cluster - A MS/MS spectrum clustering Java API

https://spectra-cluster.github.io provides a complete overview of all the tools we offer for spectrum clustering and of the spectra-cluster algorithm.

Introduction

The spectra-cluster Java API is the central collection of algorithms used to develop and run the PRIDE Cluster project. The library was built to quickly test different combinations of clustering approaches and therefore contains several alternative implementations of its building blocks, such as similarity metrics for MS/MS spectrum clustering.

It is currently used in two applications:

spectra-cluster-hadoop: The Hadoop implementation of the re-developed PRIDE Cluster algorithm
spectra-cluster-cli: A (still in beta) stand-alone implementation of the PRIDE Cluster algorithm.

spectra-cluster is an open-source (Apache 2 licensed) library. It offers the following features out of the box:

A collection of both classic and new algorithms for measuring spectra similarities.
A set of engines for clustering spectra together.
A set of normalizers for normalising spectral peaks.
A set of filters and functions for pre-processing spectra, such as removing noisy peaks.
A set of cleanly defined data models and interfaces that represent spectra, peptide spectrum matches, and clusters.
Functions for reading in spectra and writing out clustering results.
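
To make the idea of a pre-processing filter more concrete, here is a minimal, library-independent sketch of the kind of filter described in the "Running the library" section: it retains the peaks that account for a given fraction of the total ion current, but never fewer than a minimum number of peaks. The class FractionTicFilterSketch, its Peak type, and the choice to pick peaks from most to least intense are assumptions made for illustration. This is not the library's FractionTICPeakFunction implementation.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class FractionTicFilterSketch {

    /** Hypothetical minimal peak type: an m/z value and an intensity. */
    public static final class Peak {
        public final double mz;
        public final double intensity;

        public Peak(double mz, double intensity) {
            this.mz = mz;
            this.intensity = intensity;
        }
    }

    /**
     * Keeps the most intense peaks until they account for at least
     * 'fraction' of the total ion current and at least 'minPeaks'
     * peaks are kept (or all peaks, if the spectrum is smaller).
     */
    public static List<Peak> filter(List<Peak> peaks, double fraction, int minPeaks) {
        double totalIonCurrent = 0.0;
        for (Peak p : peaks) {
            totalIonCurrent += p.intensity;
        }

        // work on a copy sorted by descending intensity
        List<Peak> byIntensity = new ArrayList<>(peaks);
        byIntensity.sort(Comparator.comparingDouble((Peak p) -> p.intensity).reversed());

        List<Peak> kept = new ArrayList<>();
        double keptCurrent = 0.0;
        for (Peak p : byIntensity) {
            boolean enoughCurrent = keptCurrent >= fraction * totalIonCurrent;
            boolean enoughPeaks = kept.size() >= minPeaks;
            if (enoughCurrent && enoughPeaks) {
                break;
            }
            kept.add(p);
            keptCurrent += p.intensity;
        }

        // return the retained peaks in ascending m/z order
        kept.sort(Comparator.comparingDouble((Peak p) -> p.mz));
        return kept;
    }
}
```

Calling FractionTicFilterSketch.filter(peaks, 0.5, 20) would mirror the parameters used in the usage example of this README; spectra with fewer than 20 peaks are returned unchanged apart from the sorting.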

Changelog

1.1.0

Moved to Java 1.8
Changed default consensus spectrum builder to a binned version of the GreedyConsensusSpectrum builder
Added features to estimate the number of comparisons directly from the data
Optimised the MGF parser
Added predicates that make it possible to cluster only identified and / or unidentified spectra
Added support for additional MGF parameters and encode these in the .clustering file using JSON strings
Added feature to output similarity scores at the time a spectrum is added to a cluster

1.0.10

Added new function to remove contaminant ions (RemoveContaminantsPeaksFunction). Currently, this function removes all commonly observed immonium ions.
Added a new function to remove all peaks outside a given m/z range (RemoveWindowsPeaksFunction). By default, all peaks below 200 m/z are ignored.

1.0.9

Adapted the RemoveImpossibleHighPeaksFunction and the RemovePrecursorPeaksFunction classes to work with spectra where the charge state is unknown (i.e. < 1). In these cases, the original spectrum is returned unchanged.

1.0.8

Fixed bug in the function removing precursor peaks
Added the mass of the complete TMT tag to the functions removing reporter peaks

Getting started

Installation

You will need to have Maven installed in order to build and use the spectra-cluster library.

Add the following snippets to your Maven pom file:

<!-- spectra-cluster dependency -->
<dependency>
    <groupId>uk.ac.ebi.pride.spectracluster</groupId>
    <artifactId>spectra-cluster</artifactId>
    <version>${current.version}</version>
</dependency>

 <!-- EBI repo -->
 <repository>
    <id>pst-release</id>
    <url>http://www.ebi.ac.uk/Tools/maven/repos/content/repositories/pst-release</url>
 </repository>

 <!-- EBI SNAPSHOT repo -->
 <snapshotRepository>
    <id>pst-snapshots</id>
    <url>http://www.ebi.ac.uk/Tools/maven/repos/content/repositories/pst-snapshots</url>
 </snapshotRepository>

Running the library

The clustering process itself is done by a clustering engine. The following example uses the implementations used for PRIDE Cluster.

float WINDOW_SIZE = 4.0F;
float FRAGMENT_TOLERANCE = 0.5F;
double CLUSTERING_PRECISION = 0.01;

/**
 * This creates an incremental clustering engine that
 * uses the CombinedFisherIntensityTest with a fragment
 * ion tolerance of 0.5 m/z as the similarity metric. The
 * ClusterComparator is only used for sorting the clusters
 * during the clustering process. The WINDOW_SIZE of 4.0 m/z
 * means that as soon as a new cluster is added, any cluster
 * whose average precursor m/z is more than 4.0 m/z below
 * that of the newly added cluster is automatically returned
 * during the clustering process. The CLUSTERING_PRECISION is
 * the defined accuracy for the clustering process (benchmarked
 * on the PRIDE Cluster test dataset). Finally, the
 * FractionTICPeakFunction is a peak filter function that is
 * applied to every spectrum before comparison (in this case
 * it keeps the peaks that represent 50% of the total ion
 * current, but at least 20 peaks). For consensus spectrum
 * building, the complete unfiltered spectrum is used.
 */
IIncrementalClusteringEngine clusteringEngine = new GreedyIncrementalClusteringEngine(
    new CombinedFisherIntensityTest(FRAGMENT_TOLERANCE),
    ClusterComparator.INSTANCE,
    WINDOW_SIZE,
    CLUSTERING_PRECISION,
    new FractionTICPeakFunction(0.5f, 20));

// during clustering the clusters must be sorted
// according to precursor m/z. Otherwise an
// exception is thrown
for (ICluster clusterToAdd : clusterIterable) {
    // clusters are simply added through the 'addClusterIncremental'
    // function. Clusters that have a lower precursor m/z
    // than the added cluster (based on the set window size)
    // are returned.
    Collection<ICluster> removedClusters = clusteringEngine.addClusterIncremental(clusterToAdd);

    if (!removedClusters.isEmpty()) {
        // use some method to save the removed and thereby
        // "final" clusters
        writeOutClusters(removedClusters);
    }
}

// after all spectra were clustered, save the finally
// remaining clusters still stored in the clustering 
// engine
Collection<ICluster> clusters = clusteringEngine.getClusters();
writeOutClusters(clusters);
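
The window behaviour described in the comments above can be illustrated without the library. The following is a simplified sketch under stated assumptions: it tracks plain precursor m/z values instead of ICluster objects, performs no similarity comparison or merging, and the class SlidingWindowSketch with its methods add and remaining is hypothetical. It only shows why the input must be sorted and when entries leave the window.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class SlidingWindowSketch {
    private final double windowSize;
    // precursor m/z values currently inside the window, in ascending order
    private final Deque<Double> window = new ArrayDeque<>();

    public SlidingWindowSketch(double windowSize) {
        this.windowSize = windowSize;
    }

    /**
     * Adds a precursor m/z and returns every stored entry that lies
     * more than windowSize below it. Input must arrive in ascending
     * m/z order; otherwise an exception is thrown.
     */
    public List<Double> add(double precursorMz) {
        if (!window.isEmpty() && precursorMz < window.peekLast()) {
            throw new IllegalStateException("input must be sorted by precursor m/z");
        }

        // entries that fell out of the window can no longer change
        List<Double> released = new ArrayList<>();
        while (!window.isEmpty() && precursorMz - window.peekFirst() > windowSize) {
            released.add(window.pollFirst());
        }

        window.addLast(precursorMz);
        return released;
    }

    /** Returns the entries still held once the input is exhausted. */
    public List<Double> remaining() {
        return new ArrayList<>(window);
    }
}
```

Because the input is sorted, an entry that has dropped out of the window can never be reached by a later entry, which is why it can be treated as final and written out immediately. The entries returned by remaining() correspond to the clusters fetched with getClusters() at the end of the example above.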

Getting help

If you have questions or need additional help, please contact the PRIDE help desk at the EBI.

email: pride-support@ebi.ac.uk

Feedback

Please give us your feedback, including error reports, suggestions for improvements, and new feature requests. You can do so by opening a new issue in our issues section.

How to cite

Please cite this library using one of the following publications:

Griss J, et al. Recognizing millions of consistently unidentified spectra across hundreds of shotgun proteomics datasets. Nature methods. 2016; doi:10.1038/nmeth.3902
Griss J, Foster JM, Hermjakob H, Vizcaíno JA. PRIDE Cluster: building the consensus of proteomics data. Nature methods. 2013;10(2):95-96. doi:10.1038/nmeth.2343

Contribute

We welcome all contributions submitted as pull requests.

License

This project is available under the Apache 2 open source software (OSS) license.
