Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Locality Sensitive Hashing for very large scale clustering #644

Open
tornikeo opened this issue May 13, 2024 · 0 comments
Open

Locality Sensitive Hashing for very large scale clustering #644

tornikeo opened this issue May 13, 2024 · 0 comments

Comments

@tornikeo
Copy link
Contributor

tornikeo commented May 13, 2024

Is your feature request related to a problem? Please describe.
I'm working with the recent DreaMS paper, and I find the LSH approach to cluster mass-spectra on the scale of 1e+15 quite useful. The original idea of the this clustering approach is given in this paper.

Describe the solution you'd like
A new similarity score, something like:

class RandomProjection(BaseSimilarity):
    def __init__(self, n_elems: int, n_hyperplanes: int, seed=42):
        np.random.seed(seed)
        self.H = np.random.randn(n_hyperplanes, n_elems)

    def matrix(self, r: np.array, q: np.array):
        N, R = r.shape # N = Number of peaks, R = Number of spectra
        r_proj = (self.H @ r) >= 0 # [n_hyp, N] x [N,R] -> boolean  [n_hyp, R]
        q_proj = (self.H @ q) >= 0 #  [n_hyp, N] x [N,Q] -> boolean  [n_hyp, Q]
        
        r_hash = ... # int64 [R] Convert each boolean col in r_proj, into a single hash number
        q_hash = ... # int64 [Q]
        sparse_similarity = ... # Hash equality means similarity, Build`SparseStack` from r_hash and q_hash here...
        return sparse_similarity

It should be used like:

lsh = RandomProjection(128, 1024, 42)
lsh.matrix(r_large, q_large) # -> outputs sparsestack filled with binary similarities (0, or 1).

This score will scale with O(R+Q), instead of O(R * Q), making it by far the fastest at large scales.

Any good starting points?
This file in DreaMS is 90% of what we need.

Scientific reference

From DreaMS:

image

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant