HnswDocumentIndex treats document IDs as string, they can be str, int, ID #1850

Open
4 of 6 tasks
oytuntez opened this issue Feb 7, 2024 · 4 comments

Comments

oytuntez commented Feb 7, 2024

Initial Checks

  • I have read and followed the docs and still think this is a bug

Description

I noticed this behavior when I wanted to access multiple documents in the index:

@requests(on='/find')
def find(self, docs: DocList[QuoteFile], **_) -> DocList[QuoteFile]:
    return self._cache_di[docs.id]

And when I issue POST /find with body {"data":[{"id":"300055"}]}, this code yields:

  File "/Users/oytuntez/motaword/jina-documents/venv/lib/py…", line 544, in _get_docs_sqlite_doc_id
    hashed_ids = tuple(self._to_hashed_id(id_) for id_ in doc_ids)
  File "/Users/oytuntez/motaword/jina-documents/venv/lib/py…", line 544, in <genexpr>
    hashed_ids = tuple(self._to_hashed_id(id_) for id_ in doc_ids)
  File "/Users/oytuntez/motaword/jina-documents/venv/lib/py…", line 445, in _to_hashed_id
    return int(hashlib.sha256(doc_id.encode('utf-8')).hexdigest… 16) % 10**18
AttributeError: 'int' object has no attribute 'encode'

Upon investigation, I saw that most of HnswDocumentIndex treats IDs as str. However, it is my understanding that IDs can be int, see this type definition:

class ID(str, AbstractType):
    """
    Represent an unique ID
    """

    @classmethod
    def _docarray_validate(
        cls: Type[T],
        value: Union[str, int, UUID],
...

I think ID values should be cast to str where necessary (as would be the case in _to_hashed_id).
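A minimal sketch of that cast, written as a standalone function rather than the actual HnswDocumentIndex method. The name to_hashed_id and its type hints are assumptions for illustration; the hashing expression mirrors the one in the traceback above. This is not a confirmed patch to the library.

```python
import hashlib
from typing import Union
from uuid import UUID


def to_hashed_id(doc_id: Union[str, int, UUID]) -> int:
    """Hash a document ID to an integer below 10**18.

    Hypothetical standalone helper: the ID is converted to str first,
    so int and UUID values no longer fail on .encode().
    """
    doc_id_str = str(doc_id)
    return int(hashlib.sha256(doc_id_str.encode('utf-8')).hexdigest(), 16) % 10**18


# An int ID and its string form hash to the same value after the cast.
same_hash = to_hashed_id(300055) == to_hashed_id("300055")
print(same_hash)
```

With the cast in place, lookups by 300055 and by "300055" resolve to the same hashed ID, which seems to be the behavior one would expect if IDs are stored as strings.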

Example Code

No response

Python, DocArray & OS Version

Python 3.8.12
docarray==0.40.0

Affected Components

@JoanFM
Member

JoanFM commented Feb 8, 2024

Hey @oytuntez,

Thanks for reporting this issue. Would it be possible for you to share a code snippet that shows the issue using DocArray code only, avoiding other dependencies as much as possible?

Thanks

@ai-naymul
Contributor

Hi @JoanFM,
Hope you are doing well. I looked into this issue. I think the _to_hashed_id method in the HnswDocumentIndex class expects a string input for hashing but receives an integer, which is why the error occurs. Here is the method that causes the error:

[image: error docarray]

Here is my proposed solution:

[image: error solved]

Let me know your thoughts on that.
@oytuntez @JoanFM

@JoanFM
Member

JoanFM commented May 4, 2024

Can you please provide an isolated code snippet to reproduce the issue?

@ai-naymul
Contributor

Can you please provide an isolated code snippet to reproduce the issue?

Hey, I tried to simulate the scenario where an integer ID is passed to a method that expects a string ID and tries to hash it. Here it is:

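import hashlib

def _to_hashed_id(doc_id):
    # Same expression as in the traceback; it assumes doc_id is a str
    return int(hashlib.sha256(doc_id.encode('utf-8')).hexdigest(), 16) % 10**18

doc_ids = [300055]  # This should be a string, not an integer
# Raises AttributeError: 'int' object has no attribute 'encode'
hashed_ids = tuple(_to_hashed_id(id_) for id_ in doc_ids)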

It would be great if @oytuntez could share the actual reproducible code!
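Until then, the simulation above can be turned into a self-checking sketch that catches the error instead of crashing, so the int and str cases can be compared side by side. The helper try_hash is a hypothetical name introduced only for this illustration, and the snippet still does not exercise DocArray itself.

```python
import hashlib


def _to_hashed_id(doc_id):
    # Same expression as in the traceback; it assumes doc_id is a str
    return int(hashlib.sha256(doc_id.encode('utf-8')).hexdigest(), 16) % 10**18


def try_hash(doc_id):
    """Return the hashed ID, or a description of the AttributeError raised."""
    try:
        return _to_hashed_id(doc_id)
    except AttributeError as exc:
        return f"AttributeError: {exc}"


int_result = try_hash(300055)    # int ID: hashing fails
str_result = try_hash("300055")  # str ID: hashing succeeds

print(int_result)  # AttributeError: 'int' object has no attribute 'encode'
print(str_result)
```

This only confirms that the hashing step fails for int input. It does not show how an int reaches _to_hashed_id when the request body carries the ID as a string, which is what a DocArray-only reproduction would need to demonstrate.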
