
Locality Sensitive Hashing / Approximate Nearest Neighbours Techniques #6

Open
GeorgePearse opened this issue Nov 2, 2021 · 2 comments


GeorgePearse commented Nov 2, 2021

Does anyone know of any similar implementations that use something like Faiss to improve the performance of the nearest-neighbour step of the calculation? If not, is it something that would be reasonable to add to this library?

Something like

```python
import faiss
import numpy as np

class FaissKNeighbors:
    def __init__(self, k=5):
        self.index = None
        self.y = None
        self.k = k

    def fit(self, X, y):
        # A flat index performs exact (brute-force) L2 search
        self.index = faiss.IndexFlatL2(X.shape[1])
        self.index.add(X.astype(np.float32))
        self.y = y

    def predict(self, X):
        # Faiss expects float32; k is the second positional argument to search
        distances, indices = self.index.search(X.astype(np.float32), self.k)
        votes = self.y[indices]
        # Majority vote among the k nearest neighbours of each query
        predictions = np.array([np.argmax(np.bincount(v)) for v in votes])
        return predictions
```

The snippet above, adapted from https://gist.github.com/j-adamczyk/74ee808ffd53cd8545a49f185a908584#file-knn_with_faiss-py, provides a scikit-learn-like interface to Faiss and should almost be able to act as a drop-in replacement.

Since approximate indexes trade exactness for speed, this should only be added as an option, e.g. nn_backend='faiss' or nn_backend='sklearn'.
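For a sanity check of the proposed interface, an exact brute-force baseline with the same fit/predict shape can be sketched in plain NumPy (the class name and toy data below are made up for illustration, not part of the library):

```python
import numpy as np

class BruteForceKNeighbors:
    """Exact k-NN majority-vote classifier with a fit/predict interface."""

    def __init__(self, k=5):
        self.X = None
        self.y = None
        self.k = k

    def fit(self, X, y):
        self.X = np.asarray(X, dtype=np.float32)
        self.y = np.asarray(y)
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=np.float32)
        # Pairwise squared L2 distances, shape (n_queries, n_train)
        d2 = ((X[:, None, :] - self.X[None, :, :]) ** 2).sum(axis=-1)
        # Indices of the k nearest training points per query
        indices = np.argsort(d2, axis=1)[:, :self.k]
        votes = self.y[indices]
        # Majority vote per query
        return np.array([np.bincount(v).argmax() for v in votes])

X_train = np.array([[0.0], [0.1], [5.0], [5.1]])
y_train = np.array([0, 0, 1, 1])
clf = BruteForceKNeighbors(k=2).fit(X_train, y_train)
print(clf.predict(np.array([[0.05], [5.05]])))  # [0 1]
```

A Faiss-backed class returning the same predictions on exact indexes would confirm it really is a drop-in replacement.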

johny-c (Owner) commented Nov 7, 2021

This is interesting. Currently you can control the neighbor_params passed to the NearestNeighbors object, such as the algorithm (e.g. kd_tree or ball_tree) and n_jobs, which can parallelize things. Have you tried these out? It would be good to see how much you would gain with Faiss against a well-configured NearestNeighbors. My guess is this will make more of a difference for very large datasets.
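For reference, configuring those options directly on scikit-learn's NearestNeighbors might look like the following (the data is synthetic; parameter names follow scikit-learn's documented API):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 16)).astype(np.float32)

# Exact neighbours using a KD-tree, parallelized over all CPU cores
nn = NearestNeighbors(n_neighbors=5, algorithm='kd_tree', n_jobs=-1)
nn.fit(X)
distances, indices = nn.kneighbors(X)

print(distances.shape, indices.shape)  # (1000, 5) (1000, 5)
```

These are the same keyword arguments that would be forwarded through neighbor_params, so timing this configuration against a Faiss index would give a fair baseline.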

GeorgePearse (Author) commented

I hadn't carried out much experimentation with the built-in optimizations. It's not so relevant to my work for the moment, but I will try to find time to do so.
