Generalize document handler extraction #297

Open · pmeier opened this issue Jan 24, 2024 · 0 comments

pmeier (Member) commented Jan 24, 2024
When we started out with the document handlers, we assumed that a Page is a good unit to base the extraction on:

class Page(BaseModel):
    """Dataclass for pages of a document

    Attributes:
        text: Text included in the page.
        number: Page number.
    """

    text: str
    number: Optional[int] = None

@abc.abstractmethod
def extract_pages(self, document: Document) -> Iterator[Page]:
    """Extract pages from a document.

    Args:
        document: Document to extract pages from.

    Returns:
        Extracted pages.
    """
    ...

However, right from the beginning this was already an issue for plain text documents like .txt and, after #270, also .md, since they don't have a concept of pages. In addition, we were disillusioned by the fact that not even .docx documents can provide page information (#281 (comment)). There is some hope for .pptx (#296 (comment)).
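To make the mismatch concrete, a plain-text handler essentially has to treat the whole file as a single "page". The handler below is a hypothetical sketch, and the document.read() call returning raw bytes is an assumption about the Document API:

# Hypothetical sketch, not an existing handler: a .txt file has no pages,
# so the best we can do is emit the whole content as one Page without a number.
class TxtDocumentHandler(DocumentHandler):
    def extract_pages(self, document: Document) -> Iterator[Page]:
        # Assumption: document.read() returns the raw bytes of the file.
        yield Page(text=document.read().decode(), number=None)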

In any case, it is clear now that, besides .pdf, most other popular formats do not support page information. Let's have a look at how this information is used next.


As seen above, the Page object is just a thin dataclass. It is processed by the builtin source storages by transforming the objects into chunks:

@dataclasses.dataclass
class Chunk:
    text: str
    page_numbers: Optional[list[int]]
    num_tokens: int

def _chunk_pages(
    self, pages: Iterable[Page], *, chunk_size: int, chunk_overlap: int
) -> Iterator[Chunk]:
    for window in _windowed_ragged(
        (
            (tokens, page.number)
            for page in pages
            for tokens in self._tokenizer.encode(page.text)
        ),
        n=chunk_size,
        step=chunk_size - chunk_overlap,
    ):
        tokens, page_numbers = zip(*window)
        yield Chunk(
            text=self._tokenizer.decode(tokens),  # type: ignore[arg-type]
            page_numbers=list(filter(lambda n: n is not None, page_numbers))
            or None,
            num_tokens=len(tokens),
        )

After that, the text is chunked differently than before, and the only thing left from the original structure is the list of page numbers. They are stored alongside the chunk in the source storage as metadata and will re-appear on the retrieved Sources as source.location.
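For context, _windowed_ragged is the internal sliding-window helper used above. Its actual implementation is not shown here; based on the call site, it behaves roughly like the following sketch (an assumption, not the real code):

from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")

# Assumed behavior of _windowed_ragged, inferred from its call site above:
# slide a window of n items over the iterable, advancing by `step`, and keep
# the final, possibly shorter ("ragged") window instead of dropping it.
def _windowed_ragged(iterable: Iterable[T], *, n: int, step: int) -> Iterator[list[T]]:
    items = list(iterable)
    for start in range(0, len(items), step):
        window = items[start : start + n]
        if window:
            yield window
        if start + n >= len(items):
            break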

class Source(pydantic.BaseModel):
    """Data class for sources stored inside a source storage.

    Attributes:
        id: Unique ID of the source.
        document: Document this source belongs to.
        location: Location of the source inside the document.
        content: Content of the source.
        num_tokens: Number of tokens of the content.
    """

    model_config = pydantic.ConfigDict(arbitrary_types_allowed=True)

    id: str
    document: Document
    location: str
    content: str
    num_tokens: int

Source(
    id=result["id"],
    document=document_map[result["metadata"]["document_id"]],
    location=result["metadata"]["page_numbers"],
    content=result["document"],
    num_tokens=result["metadata"]["num_tokens"],
)

Note that here we already deviated from the Page abstraction and used the more generic location. However, we still only put page numbers in.

Ultimately, this location can help users track down the sources in the original document that were used in an answer. Even after #264, i.e. with the sources in the API / UI containing the full text alongside the location information, I still believe this is quite useful.


Ok, where do we go from here? My initial thought was to generalize the Page object to Text and, instead of having the page number on it, give it a more generic location field similar to the one on Source. Instead of just a string, this could be of type class Location(abc.ABC). Potential subclasses could be class Page(Location), class Paragraph(Location), or class Line(Location). This should give us more flexibility for the supported data formats.
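A minimal sketch of what this could look like (names and fields are illustrative only, not a committed design):

import abc
import dataclasses
from typing import Optional

import pydantic


class Location(abc.ABC):
    """Where a piece of extracted text sits inside the original document."""


# Note: this Page(Location) would replace the current Page(BaseModel) above.
@dataclasses.dataclass
class Page(Location):
    number: int


@dataclasses.dataclass
class Paragraph(Location):
    index: int


@dataclasses.dataclass
class Line(Location):
    number: int


class Text(pydantic.BaseModel):
    """Generalized extraction unit that could replace the current Page."""

    model_config = pydantic.ConfigDict(arbitrary_types_allowed=True)

    text: str
    location: Optional[Location] = None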

After some more thought, one thing I dislike about the approach above is that we still have an "input object" Text and a processed object Chunk. And besides where in the pipeline the object appears, they are practically the same. One can think of the original extraction from the document as the first chunking and our later processing into chunks of a constant number of tokens as re-chunking. Thus, what I think makes the most sense here is to make Chunk the object we extract from the document. We can still use the location approach I described in the paragraph above. And in the source storages we could have something like

class Chunker(abc.ABC):
    @abc.abstractmethod
    def __call__(self, chunks: Iterator[Chunk]) -> Iterator[Chunk]:
        ...
    
class TokenChunker(Chunker):
    ...

This Chunker could actually become a standalone component separated from the SourceStorage (@nenb IIRC we have discussed this at some point in the past, but I can't find it. If you do, please link it to this thread). However, this would not be required in the first step, as the change can also stand on its own as is.
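As a rough illustration (not a committed design), the existing token-window logic from _chunk_pages above could move more or less verbatim into such a component. The tokenizer argument, the default sizes, and the reuse of _windowed_ragged are assumptions made for the sketch:

class TokenChunker(Chunker):
    """Re-chunks incoming chunks into windows of a fixed number of tokens.

    Sketch only: assumes a tokenizer with encode/decode methods and reuses
    the _windowed_ragged helper shown earlier.
    """

    def __init__(
        self, tokenizer, *, chunk_size: int = 500, chunk_overlap: int = 100
    ) -> None:
        self._tokenizer = tokenizer
        self._chunk_size = chunk_size
        self._chunk_overlap = chunk_overlap

    def __call__(self, chunks: Iterator[Chunk]) -> Iterator[Chunk]:
        for window in _windowed_ragged(
            (
                (token, chunk.page_numbers)
                for chunk in chunks
                for token in self._tokenizer.encode(chunk.text)
            ),
            n=self._chunk_size,
            step=self._chunk_size - self._chunk_overlap,
        ):
            tokens, page_numbers = zip(*window)
            yield Chunk(
                text=self._tokenizer.decode(list(tokens)),
                # merge the page numbers of all source chunks in this window
                page_numbers=sorted({n for ns in page_numbers if ns for n in ns})
                or None,
                num_tokens=len(tokens),
            )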

@pmeier pmeier added the type: RFD ⚖️ Decision making label Jan 24, 2024
@pmeier pmeier added this to the 0.3.0 milestone Mar 18, 2024