HnswDocumentIndex treats document IDs as string, they can be str, int, ID #1850

oytuntez · 2024-02-07T23:23:17Z

Initial Checks

I have read and followed the docs and still think this is a bug

Description

I noticed this behavior when I wanted to access multiple documents in the index:

    @requests(on='/find')
    def find(self, docs: DocList[QuoteFile], **_) -> DocList[QuoteFile]:
        return self._cache_di[docs.id]

And when I issue POST /find with body {"data":[{"id":"300055"}]}, this code yields:

  File "/Users/oytuntez/motaword/jina-documents/venv/lib/py…", line 544, in _get_docs_sqlite_doc_id
    hashed_ids = tuple(self._to_hashed_id(id_) for id_ in doc_ids)
  File "/Users/oytuntez/motaword/jina-documents/venv/lib/py…", line 544, in <genexpr>
    hashed_ids = tuple(self._to_hashed_id(id_) for id_ in doc_ids)
  File "/Users/oytuntez/motaword/jina-documents/venv/lib/py…", line 445, in _to_hashed_id
    return int(hashlib.sha256(doc_id.encode('utf-8')).hexdigest… 16) % 10**18
AttributeError: 'int' object has no attribute 'encode'

Upon investigation, I saw that most of HnswDocumentIndex treats IDs as str. However, it is my understanding that IDs can be int, see this type definition:

class ID(str, AbstractType):
    """
    Represent an unique ID
    """

    @classmethod
    def _docarray_validate(
        cls: Type[T],
        value: Union[str, int, UUID],
...

I think ID values should be cast to str where necessary (as would be the case in _to_hashed_id).
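
To illustrate the suggestion above, here is a minimal sketch of casting the ID to str before hashing. It is a standalone function rather than the actual HnswDocumentIndex method, and the name to_hashed_id and its type hints are assumptions made for illustration; only the hashing expression is taken from the traceback and the reproduction in this thread.

```python
import hashlib
from typing import Union
from uuid import UUID


def to_hashed_id(doc_id: Union[str, int, UUID]) -> int:
    """Hash a document ID to an integer below 10**18 (hypothetical helper)."""
    # Cast first: int and UUID values have no .encode(), but their str form does.
    doc_id_str = str(doc_id)
    digest = hashlib.sha256(doc_id_str.encode('utf-8')).hexdigest()
    return int(digest, 16) % 10**18


# The int and str forms of the same ID now hash to the same value.
assert to_hashed_id(300055) == to_hashed_id("300055")
```

With a cast like this, an ID passed as 300055 and one passed as "300055" would resolve to the same hashed key. Whether that is the desired behavior for the index is a design decision for the maintainers.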

Example Code

No response

Python, DocArray & OS Version

Python 3.8.12
docarray==0.40.0

Affected Components

JoanFM · 2024-02-08T07:36:20Z

Hey @oytuntez,

Thanks for reporting the issue. Would it be possible for you to share a code snippet that shows the issue using DocArray code only, avoiding other dependencies as much as possible?

Thanks

ai-naymul · 2024-05-04T05:35:23Z

Hi @JoanFM,
I hope you are doing well. I looked into this issue. I think the _to_hashed_id method in the HnswDocumentIndex class expects a string input for hashing but receives an integer, which is why the error occurs. Here is the method that causes the error:

Here is my proposed solution:

Let me know your thoughts on that.
@oytuntez @JoanFM

JoanFM · 2024-05-04T08:54:12Z

Can you please provide an isolated code snippet to reproduce the issue?

ai-naymul · 2024-05-05T17:06:56Z

Hey, I tried to simulate the scenario where an integer ID is passed to a method that expects a string ID and tries to hash it. Here it is:

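import hashlib

def _to_hashed_id(doc_id):
    # Same hashing expression as in the traceback; assumes doc_id is a str
    return int(hashlib.sha256(doc_id.encode('utf-8')).hexdigest(), 16) % 10**18

doc_ids = [300055]  # This should be a string, not an integer
# Raises AttributeError: 'int' object has no attribute 'encode'
hashed_ids = tuple(_to_hashed_id(id_) for id_ in doc_ids)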

It would be great if @oytuntez could share the actual reproducible code!
