Structural Motif Search

Biochemical and biological functions of proteins arise from both the overall fold of the polypeptide chain and, typically, structural motifs: small numbers of amino acids that form a catalytic center or a binding site and that may be remote from one another in the amino acid sequence. Detecting such structural motifs can provide valuable insights into the function(s) of previously uncharacterized proteins.

Technically, this remains an extremely challenging problem because of the size of the Protein Data Bank (PDB) archive. We have developed a new approach that uses an inverted index strategy capable of analyzing >200,000 PDB structures with unmatched speed. The efficiency of our inverted index method depends critically on identifying the small number of structures containing the query motif and ignoring most of the structures that are irrelevant. Our approach enables real-time retrieval and superposition of structural motifs, either extracted from a reference structure or uploaded by the user.
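
The idea behind an inverted index can be illustrated with a toy example. The sketch below is not the library's implementation: the class name, the string descriptors, and the structure ids are all hypothetical. It only shows how looking up each query descriptor and intersecting the resulting sets narrows the search to the few structures that can contain the motif.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Illustrative sketch only: a toy inverted index, not the library's implementation.
final class ToyInvertedIndex {
    // maps a descriptor (hypothetical string key) to the ids of structures containing it
    private final Map<String, Set<String>> postings = new HashMap<>();

    void add(String structureId, String descriptor) {
        postings.computeIfAbsent(descriptor, k -> new HashSet<>()).add(structureId);
    }

    // returns the structures that contain every descriptor of the query
    Set<String> candidates(List<String> queryDescriptors) {
        Set<String> result = null;
        for (String descriptor : queryDescriptors) {
            Set<String> hits = postings.getOrDefault(descriptor, Set.of());
            if (result == null) {
                result = new TreeSet<>(hits);
            } else {
                result.retainAll(hits);
            }
            if (result.isEmpty()) {
                break; // no structure can match anymore
            }
        }
        return result == null ? new TreeSet<>() : result;
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        index.add("s1", "HIS-ASP");
        index.add("s1", "ASP-SER");
        index.add("s2", "HIS-ASP");
        index.add("s3", "GLY-GLY");
        // only s1 contains both descriptors
        System.out.println(index.candidates(List.of("HIS-ASP", "ASP-SER"))); // prints [s1]
    }
}
```

Structures that lack any one of the query descriptors are never touched after the lookup, which is why most of the archive can be ignored.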

See it in Action

Structural motif searching is available as part of the RCSB Advanced Search and RCSB Mol* plugin. Help documentation is available.

Performance

The tables below report current benchmark times for searching 208,702 PDB structures and 1,068,577 AlphaFold/RoseTTAFold predictions as of August 16, 2023, measured on an instance with 6 cores and 64 GB of memory. All structure data is held in memory; inverted index data is read from an SSD.

Allowing only Experimental/Archived Structures

| Motif | Definition | Found Assemblies | 'Paths' Time [ms] | 'Score' Time [ms] | Total Time [ms] |
|---|---|---|---|---|---|
| Serine Protease | 4cha - His:B-42, Asp:B-87, Ser:C-47 | 5,309 | 618 | 22 | 673 |
| Aminopeptidase | 1lap - Lys:A-250, Asp:A-255, Asp:A-273, Asp:A-332, Glu:A-334 | 91 | 158 | 1 | 181 |
| Zinc Fingers | 1g2f - Cys:F-7, His:F-25, His:F-29 | 739 | 135 | 3 | 160 |
| Enolase Superfamily | 2mnr - Lys:A-162, Asp:A-193, Glu:A-219, Glu:A-245, His:A-295 | 192 | 253 | 2 | 275 |
| Enolase Superfamily (exchanges) | 2mnr - Lys/His:A-162, Asp:A-193, Glu:A-219, Glu/Asp/Asn:A-245, His/Lys:A-295 | 210 | 2,996 | 14 | 3,032 |
| RNA G-Quadruplex | 3ibk - G:A-4, G:A-10, G:B-4, G:B-10 | 85 | 2,364 | 236 | 2,622 |

Including Computed Structure Models

| Motif | Definition | Found Assemblies | 'Paths' Time [ms] | 'Score' Time [ms] | Total Time [ms] |
|---|---|---|---|---|---|
| Serine Protease | 4cha - His:B-42, Asp:B-87, Ser:C-47 | 10,254 | 1,710 | 125 | 1,988 |
| Aminopeptidase | 1lap - Lys:A-250, Asp:A-255, Asp:A-273, Asp:A-332, Glu:A-334 | 647 | 352 | 7 | 389 |
| Zinc Fingers | 1g2f - Cys:F-7, His:F-25, His:F-29 | 9,442 | 492 | 92 | 686 |
| Enolase Superfamily | 2mnr - Lys:A-162, Asp:A-193, Glu:A-219, Glu:A-245, His:A-295 | 328 | 659 | 5 | 689 |
| Enolase Superfamily (exchanges) | 2mnr - Lys/His:A-162, Asp:A-193, Glu:A-219, Glu/Asp/Asn:A-245, His/Lys:A-295 | 350 | 5,246 | 25 | 5,296 |
| RNA G-Quadruplex | 3ibk - G:A-4, G:A-10, G:B-4, G:B-10 | 85 | 2,453 | 253 | 2,742 |

Each query retrieves all assemblies that contain hits with an RMSD <2 Å. 'Paths' refers to the time spent on inverted index operations, which identify all candidate structures that contain the motif. 'Score' refers to the time spent on aligning candidate structures to the query and computing RMSD values.

For computed structure models, unreliable regions with pLDDT <70 are ignored.

Features

  • nucleotide support
  • inter-chain & assembly support
  • position-specific exchanges
  • modified residues
  • support for computed structure models, such as those from AlphaFold
  • detect motifs in a structure of interest

Getting Started with a Dependency

strucmotif-search is distributed via Maven and supports Java 11+. To get started, add the following dependency to your pom.xml:

<dependency>
  <groupId>org.rcsb</groupId>
  <artifactId>strucmotif-search</artifactId>
  <version>0.19.6</version>
</dependency>

Getting Started by Cloning

An alternative way to use the library is cloning this repository and building the corresponding Maven modules.

Search for Similar Structures by a Single Motif

The Strucmotif class provides a fluent API to process structural motif queries.

Strucmotif.searchForStructures()
        // several ways can be used to define the query motif - e.g., specify a PDB entry id
        .defineByPdbIdAndSelection("4cha",
                // and a collection of sequence positions to extract residues to use as motif
                List.of(new LabelSelection("B", "1", 42), // HIS
                        new LabelSelection("B", "1", 87), // ASP
                        new LabelSelection("C", "1", 47))) // SER
        .rmsdCutoff(1.0)
        .buildParameters()
        .buildContext()
        .run()
        .getHits()
        .stream()
        .map(hit -> hit.structureIdentifier() + "_" + 
                    hit.assemblyIdentifier() + " @ " + 
                    hit.labelSelections() + " - RMSD: " +
                    hit.rmsd())
        .forEach(System.out::println);

Detect if a Structure Contains Motifs of Interest

This process can also be reversed to detect whether a structure of unknown function contains characteristic motifs.

// acquire a collection of motifs to screen for
Set<EnrichedMotifDefinition> motifs = Strucmotif.getMotifDefinitionRegistry().getEnrichedMotifDefinitions();

Strucmotif.detectMotifs()
        .defineByPdbIdAndAssemblyId("2mnr", "1")
        .withMotifs(motifs)
        .rmsdCutoff(1.0)
        .buildParameters()
        .buildContext()
        .run()
        .getHits()
        .stream()
        .map(hit -> hit.motifIdentifier() + " @ " +
                    hit.labelSelections() + " - RMSD: " +
                    hit.rmsd())
        .forEach(System.out::println);

Configuration

| Property | Action | Default Value/Behavior |
|---|---|---|
| ccd-url | URL to the chemical component dictionary | wwPDB |
| decimal-places-score | Number of decimal places reported for scores | 2 |
| decimal-places-matrix | Number of decimal places reported in transformation matrices | 3 |
| in-memory-strategy | Preload structure data for increased performance? | off |
| loading-chunk-size | Batch size when holding structure data in memory | 200,000 |
| max-results | Maximum number of results that will be returned | 50,000 |
| max-motif-size | Maximum number of residues that may define a motif | 10 |
| per-query-threads | Number of worker threads per query | available processors |
| query-timeout | Interrupt queries after n milliseconds | none |
| root-path | Path where data files will be written | /opt/data/ |

Configure the library by placing an application.properties file on the classpath. All properties specific to this project must carry the prefix "strucmotif." (for example, strucmotif.max-results).
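
A minimal sketch of what such a file might look like, using only property names from the table above together with the required prefix. The values are illustrative placeholders, not recommendations.

```properties
# application.properties (illustrative values only)
strucmotif.root-path=/path/to/data/
strucmotif.max-results=1000
strucmotif.decimal-places-score=3
strucmotif.per-query-threads=4
# interrupt queries after 10 seconds (value is given in milliseconds)
strucmotif.query-timeout=10000
```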

Index Structure Data and Run Updates

You will need to process your corpus of structure data before using the service. This will create an optimized version of all structure files and add them to an inverted index that allows efficient searching.

Details can be found in: UPDATE.md

Implementation Details

Addressing Residues

Two address schemes exist. LabelSelection is a high-level, object-based way of referencing individual residues. It uses a combination of mmCIF properties, namely label_asym_id, struct_oper_id, and label_seq_id:

LabelSelection ref = new LabelSelection("A", "1", 123);

Internally, residues are accessed via primitive 32-bit unsigned integers. These values don't follow any particular layout; rather, all encountered residues are addressed by their index. Chain boundaries are ignored. Operations required for assemblies are honored as they occur in the source file and merely increment the counter. Additional work is done to preserve information on chains and assemblies: chain and operator names as well as boundaries are stored in memory and can be used to reconstruct LabelSelection instances if needed.
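
One way to picture how a flat residue index can be mapped back to its chain is a sorted array of chain start positions combined with a binary search. The sketch below rests on that assumption; the class and field names are hypothetical and do not mirror the library's internal data structures. Operator names could be tracked the same way.

```java
import java.util.Arrays;

// Illustrative sketch: resolve a flat residue index back to its chain.
// Names and layout are assumptions, not the library's internals.
final class ResidueIndexLookup {
    private final String[] chainIds;  // chain name for each chain, in file order
    private final int[] chainStarts;  // flat index of each chain's first residue, ascending

    ResidueIndexLookup(String[] chainIds, int[] chainStarts) {
        this.chainIds = chainIds.clone();
        this.chainStarts = chainStarts.clone();
    }

    // position of the chain that contains the given flat residue index
    private int chainPosition(int residueIndex) {
        if (residueIndex < 0) {
            throw new IllegalArgumentException("negative residue index");
        }
        int pos = Arrays.binarySearch(chainStarts, residueIndex);
        // binarySearch returns -(insertionPoint) - 1 if the key is absent,
        // so the containing chain is the one just before the insertion point
        return pos >= 0 ? pos : -pos - 2;
    }

    String chainOf(int residueIndex) {
        return chainIds[chainPosition(residueIndex)];
    }

    // zero-based offset of the residue within its chain
    int offsetInChain(int residueIndex) {
        return residueIndex - chainStarts[chainPosition(residueIndex)];
    }

    public static void main(String[] args) {
        ResidueIndexLookup lookup = new ResidueIndexLookup(
                new String[] {"A", "B", "C"},
                new int[] {0, 100, 250});
        System.out.println(lookup.chainOf(120) + " / " + lookup.offsetInChain(120)); // prints B / 20
    }
}
```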

Residue pairs are identified by pairs of these int values. A pair can be stored as a single long value by concatenating the first and the second value.
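
Packing two int values into one long can be sketched as follows. The class and method names are hypothetical and chosen for illustration; the library may organize this differently.

```java
// Illustrative sketch: store a pair of 32-bit residue indices in one 64-bit long.
final class ResiduePairPacking {
    // the first index occupies the upper 32 bits, the second the lower 32 bits
    static long pack(int first, int second) {
        // mask the second value so that its sign bit does not spill into the upper half
        return ((long) first << 32) | (second & 0xFFFFFFFFL);
    }

    static int first(long packed) {
        return (int) (packed >>> 32);
    }

    static int second(long packed) {
        return (int) packed;
    }

    public static void main(String[] args) {
        long packed = pack(42, 87);
        System.out.println(first(packed) + ", " + second(packed)); // prints 42, 87
    }
}
```

A primitive long avoids allocating a pair object per entry, which matters when very large numbers of residue pairs are handled.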

Residue Pair Descriptor

Residue pair descriptors capture the label_comp_id of both interacting residues, their backbone distance, their side-chain distance, and the angle defined between the two residues.
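
The geometric part of such a descriptor can be sketched with plain vector arithmetic. The example below assumes that the backbone distance is measured between Cα atoms, the side-chain distance between Cβ atoms, and the angle between the two Cα→Cβ vectors. These atom choices are assumptions made for illustration, as are all names and coordinates; the text above does not specify them.

```java
// Illustrative sketch: raw geometric values for a residue pair.
// Atom choices (CA for backbone, CB for side chain) are assumptions.
final class ResiduePairGeometry {
    static double distance(double[] p, double[] q) {
        double dx = p[0] - q[0];
        double dy = p[1] - q[1];
        double dz = p[2] - q[2];
        return Math.sqrt(dx * dx + dy * dy + dz * dz);
    }

    // angle in degrees between the vectors (from1 -> to1) and (from2 -> to2)
    static double angleDegrees(double[] from1, double[] to1, double[] from2, double[] to2) {
        double[] u = {to1[0] - from1[0], to1[1] - from1[1], to1[2] - from1[2]};
        double[] v = {to2[0] - from2[0], to2[1] - from2[1], to2[2] - from2[2]};
        double dot = u[0] * v[0] + u[1] * v[1] + u[2] * v[2];
        double norms = Math.sqrt(u[0] * u[0] + u[1] * u[1] + u[2] * u[2])
                * Math.sqrt(v[0] * v[0] + v[1] * v[1] + v[2] * v[2]);
        // clamp to guard against rounding errors slightly outside [-1, 1]
        double cos = Math.max(-1.0, Math.min(1.0, dot / norms));
        return Math.toDegrees(Math.acos(cos));
    }

    public static void main(String[] args) {
        // made-up coordinates for two residues
        double[] ca1 = {0.0, 0.0, 0.0};
        double[] cb1 = {0.0, 0.0, 1.5};
        double[] ca2 = {6.0, 0.0, 0.0};
        double[] cb2 = {6.0, 1.5, 0.0};
        System.out.println("backbone distance:   " + distance(ca1, ca2));
        System.out.println("side-chain distance: " + distance(cb1, cb2));
        System.out.println("angle:               " + angleDegrees(ca1, cb1, ca2, cb2));
    }
}
```

In a descriptor, such raw values would then be discretized into a limited number of states so that they can serve as index keys.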

The set of possible descriptors is the Cartesian product of ResidueType (A, 36 states, 6 bits) x ResidueType (B, 36 states, 6 bits) x DistanceType (C, 32 states, 5 bits) x DistanceType (D, 32 states, 5 bits) x AngleType (E, 10 states, 4 bits). Each descriptor is stored in an unsigned 32-bit integer. The 32-bit descriptors use their 4th most significant bit to store metadata (M) that tracks whether the identifier is flipped. In the layouts below, X denotes unused bits.

XXXMAAAA AABBBBBB XXCCCCCD DDDDEEEE

A second flavor exists that only tracks DistanceType x DistanceType x AngleType and can be held in an unsigned 16-bit short.

XXCCCCCD DDDDEEEE

Convenience functions to work with these descriptors are provided in the ResiduePairDescriptor class.
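
To make the bit layouts above concrete, here is a minimal sketch that encodes and decodes the fields with shifts and masks. It follows the layouts as printed above, but the class and method names are hypothetical and do not reproduce the ResiduePairDescriptor API.

```java
// Illustrative sketch of the layout XXXMAAAA AABBBBBB XXCCCCCD DDDDEEEE.
// Not the library's ResiduePairDescriptor; names are made up for this example.
final class DescriptorLayoutSketch {
    static int encode(int a, int b, int c, int d, int e, boolean flipped) {
        return (flipped ? 1 << 28 : 0)  // M: bit 28
                | (a & 0x3F) << 22      // A: 6 bits, bits 27-22
                | (b & 0x3F) << 16      // B: 6 bits, bits 21-16
                | (c & 0x1F) << 9       // C: 5 bits, bits 13-9
                | (d & 0x1F) << 4       // D: 5 bits, bits 8-4
                | (e & 0x0F);           // E: 4 bits, bits 3-0
    }

    static boolean isFlipped(int v) { return (v >>> 28 & 1) == 1; }
    static int residueTypeA(int v) { return v >>> 22 & 0x3F; }
    static int residueTypeB(int v) { return v >>> 16 & 0x3F; }
    static int distanceC(int v) { return v >>> 9 & 0x1F; }
    static int distanceD(int v) { return v >>> 4 & 0x1F; }
    static int angleE(int v) { return v & 0x0F; }

    // second flavor: keep only the lower 14 bits (XXCCCCCD DDDDEEEE)
    static short toShort(int v) {
        return (short) (v & 0x3FFF);
    }

    public static void main(String[] args) {
        int descriptor = encode(35, 1, 31, 0, 9, true);
        System.out.println(Integer.toBinaryString(descriptor));
        System.out.println(residueTypeA(descriptor) + " " + residueTypeB(descriptor) + " "
                + distanceC(descriptor) + " " + distanceD(descriptor) + " " + angleE(descriptor));
    }
}
```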

Related Projects

Publication

Bittrich S, Burley SK, Rose AS (2020) Real-time structural motif searching in proteins using an inverted index strategy. PLoS Comput Biol 16(12): e1008502. https://doi.org/10.1371/journal.pcbi.1008502
