Locality Sensitive Hashing for very large scale clustering #644

tornikeo · 2024-05-13T18:59:06Z

Is your feature request related to a problem? Please describe.
I'm working with the recent DreaMS paper, and I find the LSH approach to cluster mass-spectra on the scale of 1e+15 quite useful. The original idea of the this clustering approach is given in this paper.

Describe the solution you'd like
A new similarity score, something like:

class RandomProjection(BaseSimilarity):
    def __init__(self, n_elems: int, n_hyperplanes: int, seed=42):
        np.random.seed(seed)
        self.H = np.random.randn(n_hyperplanes, n_elems)

    def matrix(self, r: np.array, q: np.array):
        N, R = r.shape # N = Number of peaks, R = Number of spectra
        r_proj = (self.H @ r) >= 0 # [n_hyp, N] x [N,R] -> boolean  [n_hyp, R]
        q_proj = (self.H @ q) >= 0 #  [n_hyp, N] x [N,Q] -> boolean  [n_hyp, Q]
        
        r_hash = ... # int64 [R] Convert each boolean col in r_proj, into a single hash number
        q_hash = ... # int64 [Q]
        sparse_similarity = ... # Hash equality means similarity, Build`SparseStack` from r_hash and q_hash here...
        return sparse_similarity

It should be used like:

lsh = RandomProjection(128, 1024, 42)
lsh.matrix(r_large, q_large) # -> outputs sparsestack filled with binary similarities (0, or 1).

This score will scale with O(R+Q), instead of O(R * Q), making it by far the fastest at large scales.

Any good starting points?
This file in DreaMS is 90% of what we need.

Scientific reference

From DreaMS:

The text was updated successfully, but these errors were encountered:

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Locality Sensitive Hashing for very large scale clustering #644

Locality Sensitive Hashing for very large scale clustering #644

tornikeo commented May 13, 2024 •

edited

Locality Sensitive Hashing for very large scale clustering #644

Locality Sensitive Hashing for very large scale clustering #644

Comments

tornikeo commented May 13, 2024 • edited

tornikeo commented May 13, 2024 •

edited