IVFPQIndex ? #13

Open
apiszcz opened this issue Aug 3, 2021 · 6 comments
Labels
enhancement New feature or request

Comments

@apiszcz

apiszcz commented Aug 3, 2021

I'm not sure whether this is a problem on my side (possible).
I am testing that all of the index types work (SSG indexing ran for more than 24 hours, so I aborted it).
The failing case is IVFPQIndex.

I am doing the same thing with all index types, and there are no issues except in this case.

    from horapy import IVFPQIndex
    from tqdm import tqdm

    index = IVFPQIndex(dimension, "usize")
    for i, d in tqdm(enumerate(hd[hashtype].tolist())):
        index.add(d, i)

FYI:
print(i,d)
15746524 [248, 225, 188, 223, 199, 174, 144, 146]

Error:

lib\site-packages\horapy\__init__.py in build(self, metrics)
     36 
     37     def build(self, metrics=""):
---> 38         self.ann_idx.build(metrics)
     39 
     40     def add(self, vs, idx=None):

PanicException: attempt to calculate the remainder with a divisor of zero
@salamer
Contributor

salamer commented Aug 3, 2021

SSGIndex is slow to build, though its search performance and accuracy are excellent.

SSGIndex first builds a KNN graph and then prunes it. The complexity is about O(n^2), which is why the build operation takes so much time.

We will implement https://arxiv.org/abs/1609.07228 and https://www.cs.princeton.edu/cass/papers/www11.pdf to reduce the complexity to about O(n^1.14), which will make the build significantly faster.

We plan to release these features in version 0.2.0 or 0.3.0.
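
For readers wondering where the O(n^2) comes from, here is a minimal sketch of a brute-force KNN graph build. It is an illustration only, not hora's implementation; the function name `brute_force_knn_graph` is made up for this example. Every point is compared against every other point, so the number of distance computations grows as n * (n - 1).

```python
import heapq
import math


def brute_force_knn_graph(points, k):
    """Return, for each point index, the indices of its k nearest neighbors.

    Hypothetical helper for illustration; not part of hora or horapy.
    """
    graph = []
    for i, p in enumerate(points):
        # Point i is compared against all other points: n - 1 distances.
        candidates = (
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        # Keep only the k closest candidates (ties are broken by index).
        nearest = heapq.nsmallest(k, candidates)
        graph.append([j for _, j in nearest])
    return graph


points = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0), (6.0, 0.0)]
print(brute_force_knn_graph(points, k=1))
```

With around 16 million vectors, the outer and inner loops together would need on the order of 10^14 distance computations, which is consistent with a build that does not finish in a day.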

@salamer
Contributor

salamer commented Aug 3, 2021

Could you give more information about your IVFPQIndex issue?

salamer added the enhancement, question, and bug labels and removed the question and bug labels on Aug 3, 2021
@tang3848366
Collaborator

tang3848366 commented Aug 3, 2021

The default n_kmeans_center is 256.
When the input data size is too small, the k-means clustering ends up with too many centers for the data.
There is also not enough data for a single k-means cluster.
We will fix this bug later.
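
Until that fix lands, a caller could run a sanity check along these lines before building. This is a sketch under assumptions: `check_kmeans_centers` is a hypothetical helper, not a horapy function, and it only compares the number of vectors with the default of 256 centers mentioned above.

```python
def check_kmeans_centers(n_samples, n_kmeans_center=256):
    """Return the average number of vectors per k-means center.

    Hypothetical pre-build check; not part of horapy. Raises ValueError
    when there are fewer vectors than requested centers, because some
    centers would then necessarily stay empty.
    """
    if n_kmeans_center <= 0:
        raise ValueError("n_kmeans_center must be positive")
    if n_samples < n_kmeans_center:
        raise ValueError(
            f"{n_samples} vectors cannot fill {n_kmeans_center} centers"
        )
    return n_samples / n_kmeans_center


print(check_kmeans_centers(1024))  # average vectors per center
```

The check says nothing about how the library distributes vectors internally; it only flags the obvious case where the data set is smaller than the number of centers.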

@apiszcz
Author

apiszcz commented Aug 3, 2021

Thank you.
The data set size is near 16 million. Speculation is there are probably 500 centers.
I can try some variations of the input parameters.
Is there a way to create a resulting cluster/partition graph such as shown in the annoy github page?

image

@salamer
Contributor

salamer commented Aug 3, 2021

> Thank you.
> The data set size is near 16 million. Speculation is there are probably 500 centers.
> I can try some variations of the input parameters.
> Is there a way to create a resulting cluster/partition graph such as shown in the annoy github page?
>
> image

Good point! We have an API (temporarily not in the Python library) that displays statistics about the index that was built, and we will return that information. We will also release a PCA utility that you can use to reduce the dimension to 2. The rest of the drawing work may need the support of other libraries.
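
In the meantime, the dimension reduction step can be sketched with NumPy alone. This is not the PCA utility mentioned above, which is not released yet; `pca_2d` is a hypothetical name, and the sketch assumes NumPy is installed and that the input has at least two rows and two columns.

```python
import numpy as np


def pca_2d(vectors):
    """Project vectors onto their first two principal components.

    Hypothetical helper for illustration; not part of horapy.
    """
    x = np.asarray(vectors, dtype=np.float64)
    # PCA works on mean-centered data.
    centered = x - x.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing
    # singular value, so the first two rows span the best 2-D subspace.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T


sample = [[248, 225, 188, 223], [10, 20, 30, 40], [100, 90, 80, 70]]
print(pca_2d(sample).shape)
```

The resulting two columns can be passed to any plotting library as x and y coordinates. For millions of vectors it is usually more practical to project a random sample rather than the full data set.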

@apiszcz
Author

apiszcz commented Aug 3, 2021

Understanding the index structure and clustering would be helpful. Data output that would be compatible with https://opentsne.readthedocs.io/en/latest/ or similar tools would be great.


3 participants