John May edited this page Feb 3, 2016 · 15 revisions

New Fingerprint (nfp)

Improved API for the representation, storage, scoring, and indexing of fingerprints.

Major Goals

  • Unified abstraction for representing both binary and frequency fingerprints
  • Read/Write FPS format
  • Efficient index implementation(s) that allow users to build an index and search small datasets (<2GB: ~15 million 1024-bit FPs).
  • Efficient 'feature encoding' utilities that allow hashing of selected atom/bond type info from different features. This will allow users to combine/build their own implementations whilst simplifying the defaults. Examples of features include:
    • Path
    • Radial
    • Tree
    • Ring
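
The feature-encoding goal can be illustrated with a small self-contained sketch: a path is reduced to a sequence of atom and bond type codes, hashed, and folded into a fixed-width bit set. The class `PathHashSketch`, its methods, and the integer type codes are hypothetical and not part of the nfp API; `java.util.BitSet` stands in for a binary fingerprint.

```java
import java.util.BitSet;

// Hypothetical sketch, not the nfp API: hashes a typed path into a folded bit set.
class PathHashSketch {

    // Combine alternating atom/bond type codes into one 32-bit hash
    // using a simple polynomial rolling hash.
    static int hashPath(int[] atomTypes, int[] bondTypes) {
        int h = 17;
        for (int i = 0; i < atomTypes.length; i++) {
            h = 31 * h + atomTypes[i];
            if (i < bondTypes.length)
                h = 31 * h + bondTypes[i];
        }
        return h;
    }

    // Fold a hash into the range [0, numBits) and set that bit.
    static void encode(BitSet fp, int numBits, int hash) {
        fp.set(Math.floorMod(hash, numBits));
    }

    public static void main(String[] args) {
        BitSet fp = new BitSet(1024);
        // C-C=O: atomic numbers as atom types, bond orders as bond types
        encode(fp, 1024, hashPath(new int[]{6, 6, 8}, new int[]{1, 2}));
        // C-N
        encode(fp, 1024, hashPath(new int[]{6, 7}, new int[]{1}));
        System.out.println("bits set: " + fp.cardinality());
    }
}
```

Choosing which values go into the atom and bond type arrays is what selects the information being hashed. This sketch hashes a path in one direction only; a practical encoder would likely need a direction-independent form so that the same path always sets the same bit.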

Minor Goals

  • Fingerprint Naming/Versioning
  • Adapters to use old IFingerprinter implementations whilst in development

Fingerprint Representation

Fingerprint storage

Fingerprints can be read from and written to the FPS format. The format encodes binary fingerprints in base 16 (hexadecimal); each record is the hex string followed by a tab and a title.

#FPS1
#num_bits=256
#software=RDKit/2009Q3_1
#type=RDKit-Fingerprint/1 minPath=1 maxPath=7 fpSize=256 nBitsPerHash=4 useHs=True
#source=/Users/dalke/databases/Compound_00000001_00025000.sdf.gz
#date=2010-01-27T02:22:26
fffeffbfb7fffedff7beefdbddf7ffffabff76cf6df7fcf6f7fffebf7d7ffd6f        1
fffeffbfb7fffedff7beefdbddf7ffffabff76cf6df7fcf6f7fffebf7d7ffd6f        2
ffffbfdfffffffffbfeffffffffffffffffffffffff77efffffffebfffffffef        3
00c02010002610000080800041100002084000440d100000c055048801224400        4
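
To make the record layout concrete, here is a minimal sketch that decodes one FPS record into a bit set and a title. `FpsRecordSketch` is a hypothetical helper, not part of nfp. It assumes the fingerprint and title are separated by a single tab, and that bit 0 is the least significant bit of the first byte, which is the ordering `BitSet.valueOf` uses.

```java
import java.util.BitSet;

// Hypothetical helper, not part of nfp: decodes one FPS record "<hex>\t<title>".
class FpsRecordSketch {
    final BitSet bits;
    final String title;

    FpsRecordSketch(BitSet bits, String title) {
        this.bits = bits;
        this.title = title;
    }

    static FpsRecordSketch parse(String line) {
        int tab = line.indexOf('\t');
        if (tab < 0)
            throw new IllegalArgumentException("missing tab separator: " + line);
        String hex = line.substring(0, tab);
        if (hex.length() % 2 != 0)
            throw new IllegalArgumentException("odd number of hex digits");
        // two hex digits per byte
        byte[] bytes = new byte[hex.length() / 2];
        for (int i = 0; i < bytes.length; i++)
            bytes[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        // BitSet.valueOf treats bit 0 as the least significant bit of the first byte
        return new FpsRecordSketch(BitSet.valueOf(bytes), line.substring(tab + 1));
    }

    public static void main(String[] args) {
        FpsRecordSketch rec = parse(
            "00c02010002610000080800041100002084000440d100000c055048801224400\t4");
        System.out.println(rec.title + " has " + rec.bits.cardinality() + " bits set");
    }
}
```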

Round-tripping an FPS file (read each fingerprint and write it back out):

try (FpsInput  in  = new FpsInput("input.fps");
     FpsOutput out = new FpsOutput("output.fps")) {
  
  out.writeHeader(in.getHeader()); // copy header
  
  Fp fp = new BinaryFp(in.getFpLen());

  while (in.read(fp)) {
    out.write(fp);
  }
}

The FPS header contains key-value pairs. The following constants can be used to set the values:

  • FpsInput.HeaderNumBits
  • FpsInput.HeaderAromaticity
  • FpsInput.HeaderType
  • FpsInput.HeaderDate
  • FpsInput.HeaderSoftware
  • FpsInput.HeaderSource

Example of creating a header.

Map<String,String> header = new LinkedHashMap<>();
header.put(FpsInput.HeaderSource, "chembl_20.smi");
header.put(FpsInput.HeaderSoftware, "CDK");
header.put(FpsInput.HeaderNumBits, "1024");
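
As a rough illustration of how such a map corresponds to the header lines in the FPS sample, the sketch below renders entries as `#key=value` lines after the `#FPS1` version line. `FpsHeaderSketch` is hypothetical and not the nfp writer. It uses literal keys, and it is an assumption that the `FpsInput.Header*` constants resolve to these strings.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch, not the nfp API: renders header entries as FPS "#key=value" lines.
class FpsHeaderSketch {

    static String format(Map<String, String> header) {
        StringBuilder sb = new StringBuilder("#FPS1\n"); // version line comes first
        for (Map.Entry<String, String> e : header.entrySet())
            sb.append('#').append(e.getKey()).append('=').append(e.getValue()).append('\n');
        return sb.toString();
    }

    public static void main(String[] args) {
        // Literal keys; assumed (not confirmed) to match the FpsInput.Header* constants.
        Map<String, String> header = new LinkedHashMap<>();
        header.put("source", "chembl_20.smi");
        header.put("software", "CDK");
        header.put("num_bits", "1024");
        System.out.print(format(header));
    }
}
```

A `LinkedHashMap` keeps insertion order, so the header lines are written in the order the entries were added.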

Encoding Features

IAtomContainer mol = ...;                 // molecule to encode
BinaryFp       fp  = new BinaryFp(1024);

FpEncoder encoder = new FpEncoder(mol);

// atype/btype select the atom/bond type info to hash; lo/hi presumably bound the feature size
encoder.encodePath(fp, lo, hi, atype, btype);
encoder.encodeTree(fp, lo, hi, atype, btype);
encoder.encodeRing(fp, lo, hi, atype, btype);
encoder.encodeCirc(fp, lo, hi, atype, btype);

Similarity Index

FpSimIdx idx = ...;
BinaryFp qry = new BinaryFp(1024);
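
The `FpSimIdx` API is not specified above. As a baseline for what an index has to compute, the following sketch runs a brute-force Tanimoto threshold search over `java.util.BitSet` fingerprints. `TanimotoScanSketch` and its methods are hypothetical stand-ins, not the nfp implementation.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Hypothetical stand-in for an index: brute-force Tanimoto search over BitSets.
class TanimotoScanSketch {

    // Tanimoto = |A & B| / |A | B|; defined here as 0 when both are empty.
    static double tanimoto(BitSet a, BitSet b) {
        BitSet and = (BitSet) a.clone();
        and.and(b);
        int common = and.cardinality();
        int union = a.cardinality() + b.cardinality() - common;
        return union == 0 ? 0.0 : (double) common / union;
    }

    // Return the positions of all fingerprints with similarity >= threshold.
    static List<Integer> search(List<BitSet> fps, BitSet query, double threshold) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < fps.size(); i++)
            if (tanimoto(fps.get(i), query) >= threshold)
                hits.add(i);
        return hits;
    }

    // Convenience constructor for a bit set with the given bits on.
    static BitSet of(int... bits) {
        BitSet fp = new BitSet();
        for (int b : bits)
            fp.set(b);
        return fp;
    }

    public static void main(String[] args) {
        List<BitSet> fps = new ArrayList<>();
        fps.add(of(1, 2, 3));
        fps.add(of(2, 3, 4));
        fps.add(of(10, 11));
        BitSet qry = of(1, 2, 3);
        System.out.println(search(fps, qry, 0.5)); // prints [0, 1]
    }
}
```

A linear scan touches every fingerprint. An index can skip candidates whose bit count rules them out, since a Tanimoto score of at least t is only possible when the candidate's bit count lies between t·|Q| and |Q|/t for a query with |Q| bits set.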