Generalize document handler extraction #297

Open · pmeier opened this issue Jan 24, 2024 · 0 comments

pmeier (Member) commented Jan 24, 2024
When we started out with the document handlers, we assumed that a Page is a good unit to base the extraction on:

class Page(BaseModel):
    """Dataclass for pages of a document

    Attributes:
        text: Text included in the page.
        number: Page number.
    """

    text: str
    number: Optional[int] = None

@abc.abstractmethod
def extract_pages(self, document: Document) -> Iterator[Page]:
    """Extract pages from a document.

    Args:
        document: Document to extract pages from.

    Returns:
        Extracted pages.
    """
    ...

However, right from the beginning this was already an issue for plain text documents like .txt and, after #270, also .md, since they don't have a concept of pages. In addition, we were disillusioned by the fact that not even .docx documents can provide page information (#281 (comment)). There is some hope for .pptx (#296 (comment)).
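To make the mismatch concrete, a plain-text handler essentially has to treat the whole file as a single "page". The handler below is a hypothetical sketch, and the document.read() call returning raw bytes is an assumption about the Document API:

# Hypothetical sketch, not an existing handler: a .txt file has no pages,
# so the best we can do is emit the whole content as one Page without a number.
class TxtDocumentHandler(DocumentHandler):
    def extract_pages(self, document: Document) -> Iterator[Page]:
        # Assumption: document.read() returns the raw bytes of the file.
        yield Page(text=document.read().decode(), number=None)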

In any case, it is clear now that, besides .pdf, most other popular formats do not support page information. Let's have a look at how this information is used next.


As seen above, the Page object is just a thin dataclass. It is processed by the builtin source storages by transforming the objects into chunks:

@dataclasses.dataclass
class Chunk:
    text: str
    page_numbers: Optional[list[int]]
    num_tokens: int

def _chunk_pages(
    self, pages: Iterable[Page], *, chunk_size: int, chunk_overlap: int
) -> Iterator[Chunk]:
    for window in _windowed_ragged(
        (
            (tokens, page.number)
            for page in pages
            for tokens in self._tokenizer.encode(page.text)
        ),
        n=chunk_size,
        step=chunk_size - chunk_overlap,
    ):
        tokens, page_numbers = zip(*window)
        yield Chunk(
            text=self._tokenizer.decode(tokens),  # type: ignore[arg-type]
            page_numbers=list(filter(lambda n: n is not None, page_numbers))
            or None,
            num_tokens=len(tokens),
        )

After that, the text is chunked differently than before, and the only thing left from the original structure is the list of page numbers. They are stored alongside the chunk in the source storage as metadata and will re-appear on the retrieved Sources as source.location.
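For context, _windowed_ragged is the internal sliding-window helper used above. Its actual implementation is not shown here; based on the call site, it behaves roughly like the following sketch (an assumption, not the real code):

from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")

# Assumed behavior of _windowed_ragged, inferred from its call site above:
# slide a window of n items over the iterable, advancing by `step`, and keep
# the final, possibly shorter ("ragged") window instead of dropping it.
def _windowed_ragged(iterable: Iterable[T], *, n: int, step: int) -> Iterator[list[T]]:
    items = list(iterable)
    for start in range(0, len(items), step):
        window = items[start : start + n]
        if window:
            yield window
        if start + n >= len(items):
            break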

class Source(pydantic.BaseModel):
    """Data class for sources stored inside a source storage.

    Attributes:
        id: Unique ID of the source.
        document: Document this source belongs to.
        location: Location of the source inside the document.
        content: Content of the source.
        num_tokens: Number of tokens of the content.
    """

    model_config = pydantic.ConfigDict(arbitrary_types_allowed=True)

    id: str
    document: Document
    location: str
    content: str
    num_tokens: int

Source(
    id=result["id"],
    document=document_map[result["metadata"]["document_id"]],
    location=result["metadata"]["page_numbers"],
    content=result["document"],
    num_tokens=result["metadata"]["num_tokens"],
)

Note that here we already deviated from the Page abstraction and used the more generic location. However, we still only put page numbers in.

Ultimately, this location can help users track down the sources in the original document that were used in an answer. Even after #264, i.e. with the sources in the API / UI containing the full text alongside the location information, I still believe this is quite useful.


Ok, where do we go from here? My initial thought was to generalize the Page object to Text and, instead of having the page number on it, give it a more generic location field similar to the one on Source. Instead of just a string, this could be of type class Location(abc.ABC). Potential subclasses could be class Page(Location), class Paragraph(Location), or class Line(Location). This should give us more flexibility for the supported data formats.
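A minimal sketch of what this could look like (names and fields are illustrative only, not a committed design):

import abc
import dataclasses
from typing import Optional

import pydantic


class Location(abc.ABC):
    """Where a piece of extracted text sits inside the original document."""


# Note: this Page(Location) would replace the current Page(BaseModel) above.
@dataclasses.dataclass
class Page(Location):
    number: int


@dataclasses.dataclass
class Paragraph(Location):
    index: int


@dataclasses.dataclass
class Line(Location):
    number: int


class Text(pydantic.BaseModel):
    """Generalized extraction unit that could replace the current Page."""

    model_config = pydantic.ConfigDict(arbitrary_types_allowed=True)

    text: str
    location: Optional[Location] = None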

After some more thought, one thing I dislike about the approach above is that we still have an "input object" Text and a processed object Chunk. And besides where in the pipeline the object appears, they are practically the same. One can think of the original extraction from the document as the first chunking and our later processing into chunks of a constant number of tokens as re-chunking. Thus, what I think makes the most sense here is to make Chunk the object we extract from the document. We can still use the location approach I described in the paragraph above. And in the source storages we could have something like

class Chunker(abc.ABC):
    @abc.abstractmethod
    def __call__(self, chunks: Iterator[Chunk]) -> Iterator[Chunk]:
        ...
    
class TokenChunker(Chunker):
    ...

This Chunker could actually become a standalone component separated from the SourceStorage (@nenb IIRC we have discussed this at some point in the past, but I can't find it. If you do, please link it to this thread). However, this would not be required in the first step, as the change can also stand on its own as is.
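As a rough illustration (not a committed design), the existing token-window logic from _chunk_pages above could move more or less verbatim into such a component. The tokenizer argument, the default sizes, and the reuse of _windowed_ragged are assumptions made for the sketch:

class TokenChunker(Chunker):
    """Re-chunks incoming chunks into windows of a fixed number of tokens.

    Sketch only: assumes a tokenizer with encode/decode methods and reuses
    the _windowed_ragged helper shown earlier.
    """

    def __init__(
        self, tokenizer, *, chunk_size: int = 500, chunk_overlap: int = 100
    ) -> None:
        self._tokenizer = tokenizer
        self._chunk_size = chunk_size
        self._chunk_overlap = chunk_overlap

    def __call__(self, chunks: Iterator[Chunk]) -> Iterator[Chunk]:
        for window in _windowed_ragged(
            (
                (token, chunk.page_numbers)
                for chunk in chunks
                for token in self._tokenizer.encode(chunk.text)
            ),
            n=self._chunk_size,
            step=self._chunk_size - self._chunk_overlap,
        ):
            tokens, page_numbers = zip(*window)
            yield Chunk(
                text=self._tokenizer.decode(list(tokens)),
                # merge the page numbers of all source chunks in this window
                page_numbers=sorted({n for ns in page_numbers if ns for n in ns})
                or None,
                num_tokens=len(tokens),
            )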

@pmeier pmeier added the type: RFD ⚖️ Decision making label Jan 24, 2024
@pmeier pmeier added this to the 0.3.0 milestone Mar 18, 2024