
"managed" Ragna #256

Open
pmeier opened this issue Dec 19, 2023 · 20 comments

Comments

@pmeier
Member

pmeier commented Dec 19, 2023

It's been roughly two months since the initial release of Ragna. While the overall feedback is positive, we had feature requests (I'll link stuff below) for a "managed" version of Ragna. By "managed" I don't mean managed by the Ragna team as a paid service, but rather managed by an admin team inside an organization for regular users inside the same organization. If someone has a better name for this, please come forward.

Let's have a look at how Ragna currently works and why this is the case, what exactly is requested, and what ultimately needs to change to make these requests a reality.

Status quo

Ragna's original goal was to make the research use case, i.e. evaluating different RAG components, easy. We made the following assumptions:

  • The user not only has access to the documents they want to test, but also has them on disk.
  • The user potentially wants to (and in turn is allowed to) tweak everything.

With that we made the chat the "atomic unit" of Ragna. Each chat can be about different documents and all components can be configured freely.

What is missing?

As explained above, the use case that we built Ragna for requires a large degree of agency from the user. In an actual production setting, the assumptions we made above might not hold:

  • The user might not have the documents on disk, but they are stored in some remote "database" (potentially an actual database, a wiki, ...)
  • The user might not have any clue what all the parameters mean and they "just want to get my questions answered"

Thus, Ragna currently falls flat for this use case, or at least makes it difficult. This is evident from the bunch of feature requests and comments we received. I'll try to link everything here. If I missed something, feel free to follow up on this comment.

Basically, this uncouples the Chat.prepare and Chat.answer methods. The former will be executed by the administrator and the latter by the users.
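The split described here can be sketched as follows. This is a minimal, self-contained toy (all names and the in-memory "storage" are hypothetical stand-ins, not the actual Ragna implementation): prepare ingests the documents (the admin step), answer only retrieves and generates (the user step).

```python
from dataclasses import dataclass, field


@dataclass
class Chat:
    documents: list[str]
    storage: dict[str, str] = field(default_factory=dict)
    _prepared: bool = False

    def prepare(self) -> None:
        # Admin step: embed and store the documents. The dict is a stand-in
        # for a real SourceStorage.
        for i, doc in enumerate(self.documents):
            self.storage[f"doc-{i}"] = doc
        self._prepared = True

    def answer(self, prompt: str) -> str:
        # User step: requires a prepared chat.
        if not self._prepared:
            raise RuntimeError("call Chat.prepare() first")
        return f"answer to {prompt!r} based on {len(self.storage)} documents"


chat = Chat(documents=["a.txt", "b.txt"])
chat.prepare()            # executed by the administrator
print(chat.answer("hi"))  # executed by the user
```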

Cc @peachkeel, @nenb, @NetDevAutomate who contributed the requests linked above.

What needs to change?

There are plenty of changes needed. I'll put my thoughts below without any claim that the list is exhaustive. There might be other changes needed that we only find once we start proper planning. The list below is meant as a starting point.

Python / REST API

  • While the Assistant class can stay as it is, we need to make some changes to the SourceStorage class. SourceStorage.store is used in the preparation stage, while SourceStorage.retrieve is used in the answer stage. At the very least, we need to switch away from storing each chat in an individual collection / table / etc to doing this on a corpus level. Furthermore, we also need a way to select a subset of documents from a corpus for example through tagging.
  • A new chat class ManagedChat (name TBD) is needed. Instead of documents, this takes a corpus and optionally tags for a sub-selection of documents. The new ManagedChat has no notion of preparation as the corpus is already prepared.
  • We need to design some interface to let an administrator perform the preparation stage. Meaning, we need an easy way to ingest documents from an existing "database", chunk and embed them, and ultimately store them in a SourceStorage.
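The admin-side ingestion described in the last bullet could look roughly like this. Everything here is a hypothetical sketch, not existing Ragna API: `ingest_corpus`, the callables, and the toy stand-ins are all assumptions for illustration.

```python
from typing import Callable, Iterable


def chunk(text: str, size: int = 20) -> list[str]:
    # Naive fixed-size chunker, standing in for a real chunking strategy.
    return [text[i : i + size] for i in range(0, len(text), size)]


def ingest_corpus(
    fetch_documents: Callable[[], Iterable[tuple[str, str]]],  # (name, text)
    embed: Callable[[str], list[float]],
    store: Callable[[str, str, list[float]], None],  # (name, chunk, embedding)
) -> int:
    """Pull documents from a remote 'database', chunk and embed them,
    and hand them to a source storage. Returns the number of stored chunks."""
    n = 0
    for name, text in fetch_documents():
        for piece in chunk(text):
            store(name, piece, embed(piece))
            n += 1
    return n


# Toy stand-ins for a wiki export, an embedding model, and a vector store.
stored: list[tuple[str, str, list[float]]] = []
n = ingest_corpus(
    fetch_documents=lambda: [("wiki/Home", "hello world " * 5)],
    embed=lambda s: [float(len(s))],
    store=lambda name, piece, emb: stored.append((name, piece, emb)),
)
print(n, len(stored))
```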

Web UI

  • The major changes happen in the "New Chat" modal, as it has to be redesigned completely. Apart from the assistant selection and its parameters (and maybe even this gets disabled completely if the administrators don't want the users to change it), all other parts need to be replaced by corpus and tag selection.
  • Apart from this modal the UI can mostly stay the same. Maybe we also want to remove the chat info button to avoid "leaking" administrator infos to the users, but no strong opinion here.
  • We need to decide how both use cases, i.e. the experimentation and managed RAG, can coexist in the UI. Potential solutions might be to force one use case through the config, i.e. they are not accessible at the same time through the same UI. We could also add a "tab" in the header bar to make them both accessible.

How do we move forward?

This is a massive feature request since it breaks some core assumptions that we made. We agree that this is useful and would be beneficial to support. That being said, we (Ragna core team funded by Quansight) currently lack the resources to tackle this holistically. Given that there are so many design decisions to be made, I think this will be almost impossible by relying on OSS contributions alone. At the very least we need to have some intense sprint planning before we can ask the community for help.

If you have a corporate interest in this feature being added to Ragna, please reach out to Quansight, either by contact form or email.

@pmeier
Member Author

pmeier commented Dec 19, 2023

Shower thought: what if we create a new Corpus class that can be passed as Rag().chat(documents=Corpus(), ...)? Internally, if someone passes documents as it is done now, we just create a new Corpus for them. With that, we could move the prepared flag to the corpus rather than the chat:

self._prepared = False

Meaning, if a user supplies an already prepared corpus, they can just start chatting. On the SourceStorage class we could also shift away from having collections / tables / ... per chat to per corpus. That way, the same class could be used for both use cases.

Unless I'm missing something big, this could actually work and potentially fix all the issues that we saw above regarding the Python / REST API. If only my day had more than 24 hours ...
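The shower thought above can be sketched as follows (hypothetical names, not the actual implementation): the prepared flag lives on the corpus, so a pre-prepared corpus lets the user chat right away, while a plain document list gets wrapped in a fresh, unprepared Corpus.

```python
from dataclasses import dataclass, field


@dataclass
class Corpus:
    documents: list[str] = field(default_factory=list)
    _prepared: bool = False

    def prepare(self) -> None:
        # Embed and store into the per-corpus collection/table (elided here).
        self._prepared = True


def chat(documents):
    # Internally wrap a raw document list into a new Corpus, as proposed.
    return documents if isinstance(documents, Corpus) else Corpus(list(documents))


admin_corpus = Corpus(["handbook.pdf"])
admin_corpus.prepare()
assert chat(admin_corpus)._prepared     # prepared: the user can chat immediately
assert not chat(["new.txt"])._prepared  # raw documents: prepare() still needed
```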

@nenb
Contributor

nenb commented Dec 19, 2023

TL;DR I'm happy to continue using the issue tracker for contributing. I'm also happy to accept that not all issues will lead to a new feature being merged into the codebase. Perhaps we can see what a 'managed' Ragna would look like via a bunch of exploratory PRs, and keep the project momentum going?

Given that there are so many design decisions to be made, I think this will be almost impossible by relying on OSS contributions alone. At the very least we need to have some intense sprint planning before we can ask the community for help.

I would be keen to keep the momentum going that the ragna team have already built up. Other projects in the area are moving fast (eg langchain), and my impression is that this is important for young projects like ragna to be a success. I understand that there is a trade-off here with 'accuracy', but my personal vote is that it is worth it during the early stages of a project. I'd be afraid that sprint planning might halt some of the momentum + OSS connection that the ragna team has built. But that's just a personal preference of course.

Shower thought: what if we create a new Corpus class that can be passed as Rag().chat(documents=Corpus(), ...). ...

I'm happy to contribute here! Perhaps the ragna team could open some relevant sub-issues (likely they will be imperfect, and little more than a basic signpost of a feature), and I would be happy to start opening some more PRs. Worst-case scenario, the ragna team decides that the PR is no longer practical/desirable and we can abandon it. But even this is a helpful datapoint about how ragna should evolve!

@peachkeel
Contributor

peachkeel commented Dec 19, 2023

When I proposed #246, I wanted to stay away from a first-class Corpus in the code. However, after writing #255 and reading @pmeier's shower thoughts, I think that the inclusion of such a class is inevitable. I also think that the notion of corpora can be included without making the system so complex that it loses its original charm as a lightweight research tool.

Writing in a sort of constructor-type pseudocode, and starting with the original atomic unit of Ragna, the Chat, I think the following refactoring might capture some of what we're looking for:

Chat(Assistant, IndexedCorpus)

IndexedCorpus(SourceStorage, Corpus)

Corpus(list[Document])

# NOTE: Simple UI view is equivalent to specifying leaves of tree.

Chat(Assistant, (SourceStorage, (list[Document])))

Here, the IndexedCorpus kind of becomes the storage-focused counterpart of the Chat and allows a level of indirection that gives the overall system more flexibility. A user can upload files to define a Corpus. A Corpus can be indexed using different SourceStorage strategies, each yielding an IndexedCorpus. An IndexedCorpus can, in a read-only fashion, be used to back Chats with different Assistants. All of this seems to flow pretty well. Of course, I too might be missing something. Feedback is welcome!
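The constructor pseudocode above can be written out as dataclasses (a hypothetical sketch; the string fields stand in for real SourceStorage and Assistant instances). The key property is that one IndexedCorpus can, read-only, back multiple Chats:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Corpus:
    documents: tuple[str, ...]


@dataclass(frozen=True)
class IndexedCorpus:
    source_storage: str  # stand-in for a SourceStorage instance
    corpus: Corpus


@dataclass(frozen=True)
class Chat:
    assistant: str  # stand-in for an Assistant instance
    indexed_corpus: IndexedCorpus


# One corpus, indexed once, backing chats with different assistants:
corpus = Corpus(("a.txt", "b.txt"))
index = IndexedCorpus("Chroma", corpus)
chats = [Chat("assistant-a", index), Chat("assistant-b", index)]
```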

@nenb
Contributor

nenb commented Jan 10, 2024

I've made a start on this over here.

My first attempt is to just extend the Rag class with a corpus method. This method instantiates a Corpus instance, in a similar way to how the current code uses a chat method to instantiate a Chat instance. The benefit of doing this is that it decouples the prepare and answer logic. I've also added a kwarg corpus_name for identifying a corpus.

I'll come back to working on this later in the week. Feedback very welcome!

@peachkeel
Contributor

peachkeel commented Feb 1, 2024

Most of my work has been hacking around the edges of Ragna rather than hacking on Ragna's code base directly. In doing so, I've written two scripts (both of which I'm happy to share with the community if there is interest):

usage: delete.py [-h] UUID [UUID ...]

Delete a chat in Ragna.

positional arguments:
  UUID        existing chat to delete

options:
  -h, --help  show this help message and exit

--- AND ---

usage: clone.py [-h] --src-id UUID --dest-name NAME --dest-assistant STRING --dest-source-storage
                STRING

Create a new chat from an existing chat in Ragna.

options:
  -h, --help            show this help message and exit
  --src-id UUID         existing chat to use as a source of documents
  --dest-name NAME      name of the new chat
  --dest-assistant STRING
                        assistant (LLM) for the new chat
  --dest-source-storage STRING
                        source storage (vector database) for the new chat

After writing these scripts, I realized that they're all I need for "managed" Ragna.

Could we promote these functions into the UI? That is, what if we put a delete button and a clone button under each listed chat? Clicking the clone button would pop up the chat creation dialog, which would be prepopulated with all the information from the chat being cloned. A user could then modify the information to create a new chat based off of the old chat.

Does this seem like a more doable plan for getting to "managed" Ragna than adding a Corpus class?

[NOTE: I will stipulate that the Corpus class solution is a better overall solution; however, perhaps "cloning" could be an interim step in that direction.]

@pmeier
Member Author

pmeier commented Feb 2, 2024

@peachkeel

Re delete:

This was originally discussed in #62. We ended up merging #67, which added a delete endpoint to the API, which is likely what you are using in your script:

@app.delete("/chats/{id}")
async def delete_chat(user: UserDependency, id: uuid.UUID) -> None:
    with get_session() as session:
        database.delete_chat(session, user=user, id=id)

We always had the plan to add the delete button to the UI as well, but never got around to actually implementing it. Let me open an issue for this to track it. Edit: See #304.

Re clone:

I agree with you that cloning a chat is the inferior solution compared to using a Corpus or a similar holistic approach. I haven't updated this issue, but there is actually some movement on the implementation. Demand for this is growing and I'll likely have some time starting next week to tackle this. Thus, I think we should avoid merging a workaround solution if a proper fix is on the horizon.

In the meantime, I think the scripts you wrote can still be beneficial for the community. Do you want to start a "Show and tell" discussion for them?

@peachkeel
Contributor

In the meantime, I think the scripts you wrote can still be beneficial for the community. Do you want to start a "Show and tell" discussion for them?

Posting to "Show and tell" might be a little unwieldy as I have four scripts and one common code file. I could do a pull request to add them to:

https://github.com/Quansight/ragna/tree/main/scripts

I do think they would be pretty useful and I'm happy to contribute them.

Let me know what you think.

@pmeier
Member Author

pmeier commented Feb 2, 2024

Sure, a PR works as well. I never meant for you to post all the code in the discussion itself, but rather create a discussion and link the code from there. Like you said, these scripts can be useful for the community and I don't want them "buried" in this thread here.

@pmeier
Member Author

pmeier commented Feb 8, 2024

I've spent a good deal of time thinking about this over the past few weeks and I think I found a proper solution that I'll propose here. This is going to be a long post, but there is no way around it. If the proposal is accepted, we can break it up into smaller issues for tracking.

I've ditched the corpus abstraction and thus my idea from #256 (comment) in favor of a simpler but just as powerful workflow. The overall idea goes as follows:

  • Switch the source storages to one table / collection / index / ... per embedding model (there currently is only one, but that will change with #191)
  • Store document metadata alongside the embedding in the source storage
  • Use native functionality of the source storage to filter documents based on their metadata when retrieving sources

Let's dive deep into the proposal now.

Python API

To be able to store additional document metadata, e.g. tags (#246), inside the source storage, we first need a way to put it on the Document object itself. Luckily, this is already there:

class Document(RequirementsMixin, abc.ABC):
    """Abstract base class for all documents."""

    def __init__(
        self,
        *,
        id: Optional[uuid.UUID] = None,
        name: str,
        metadata: dict[str, Any],
        handler: Optional[DocumentHandler] = None,
    ):

The constructor is usually not called directly, but rather through

@classmethod
def from_path(

Meaning, passing additional metadata is as easy as

LocalDocument.from_path("/path/to/my/document.txt", metadata={"tag": "important"})

To store them alongside the embedding inside the SourceStorage, we can just unpack this metadata dictionary and merge it with the metadata we already store, e.g.

metadatas.append(
    {
        "document_id": str(document.id),
        "page_numbers": self._page_numbers_to_str(chunk.page_numbers),
        "num_tokens": chunk.num_tokens,
    }
)

With that in place, we can use the source storage's native functionality to filter on this metadata.

However, while all components have similar support for metadata filtering, they all use a different "dialect". Meaning, we need to provide a layer of abstraction on top of it. This should be a relatively straightforward tree-like object that supports a few basic operators like "and", "or", and "equals". We can always add more later. Instead of documents, SourceStorage.retrieve will be passed this new metadata filter object. The SourceStorage is responsible for translating it into its own "dialect" and performing the filtering with it.
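A minimal sketch of such a tree-like filter object, with one translator into the Chroma dialect. The names mirror the examples later in this thread, but this is illustrative, not the actual implementation:

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class MetadataFilter:
    op: str                                       # "and", "or", or "eq"
    children: list = field(default_factory=list)  # for "and" / "or"
    key: Optional[str] = None                     # for "eq"
    value: Any = None                             # for "eq"

    @classmethod
    def eq(cls, key: str, value: Any) -> "MetadataFilter":
        return cls("eq", key=key, value=value)

    @classmethod
    def or_(cls, children: list) -> "MetadataFilter":
        return cls("or", children=children)

    @classmethod
    def and_(cls, children: list) -> "MetadataFilter":
        return cls("and", children=children)


def to_chroma(f: MetadataFilter) -> dict:
    # Each builtin SourceStorage would translate the abstract tree into its
    # own dialect; this translator targets Chroma's `where` format.
    if f.op == "eq":
        return {f.key: {"$eq": f.value}}
    return {f"${f.op}": [to_chroma(child) for child in f.children]}


ids = ["525bbe13", "39f8868a"]
f = MetadataFilter.or_([MetadataFilter.eq("id", id) for id in ids])
print(to_chroma(f))
# {'$or': [{'id': {'$eq': '525bbe13'}}, {'id': {'$eq': '39f8868a'}}]}
```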

With this we have the functionality down, but there are still two open questions:

  1. Although "managed" Ragna is a valid and important use case, the same applies to our current default. How can both use cases coexist?
  2. When the issue above is solved, how can we wrap everything into a nice interface to not degrade the UX that we currently have?

The relevant part of the current interface is the documents parameter of the chat creation, in both the Python and REST API.

It cannot stay like this if we want to support "managed" Ragna. To overcome this, I suggest replacing the documents parameter with an inputs one and removing it from the keyword-only parameters. It should do everything documents does now, but also allow one additional input: the new metadata filter object. Based on the type we will have two branches:

  • If we get a metadata filter object, we simply set _prepared = True. That way we can start sending prompts right away since we have everything we need to call SourceStorage.retrieve and Assistant.answer.
  • If we get anything else, we trigger our current behavior. After we have parsed the input and now have Document objects, we can build a metadata filter object from them. We filter for any embedding for which the document ID matches an ID of our input documents. In pseudocode this would look like lambda id: or([equal(id, document.id) for document in documents]). In this case we also leave _prepared = False. Meaning, the user has to call Chat.prepare or use the context manager before they can start chatting.
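The two branches above can be sketched like this. The filter representation here is a toy (nested tuples), and names like `resolve_inputs` are assumptions for illustration, not the proposed API:

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Document:
    name: str
    id: uuid.UUID = field(default_factory=uuid.uuid4)


def is_metadata_filter(obj) -> bool:
    return isinstance(obj, tuple) and bool(obj) and obj[0] in {"and", "or", "eq"}


def resolve_inputs(inputs):
    """Return (metadata_filter, prepared)."""
    if is_metadata_filter(inputs):
        # Branch 1: filter over an already-prepared corpus -> nothing to
        # prepare, prompts can be sent right away.
        return inputs, True
    # Branch 2: current behavior -> parse the documents, derive a filter that
    # matches exactly their IDs, and still require Chat.prepare().
    documents = [d if isinstance(d, Document) else Document(d) for d in inputs]
    metadata_filter = ("or", [("eq", "document_id", str(d.id)) for d in documents])
    return metadata_filter, False


mf, prepared = resolve_inputs(("eq", "tag", "important"))
assert prepared
mf, prepared = resolve_inputs(["a.txt", "b.txt"])
assert not prepared and mf[0] == "or" and len(mf[1]) == 2
```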

REST API

The REST API closely follows the Python API and thus we don't need to change a lot here:

  • Instead of letting Document.get_upload_info define the document metadata completely, it needs to only update metadata sent by the user. This is the same as LocalDocument.from_path does and roughly what was requested in #246 (Add tags to documents).
  • Instead of doing type shenanigans with JSON inputs, the "create chat"-endpoint should get two distinct inputs for documents or a metadata filter. We just need to make sure that exactly one of the two is passed and otherwise return an error.
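The "exactly one of the two" check could look like the plain function below. In the actual REST API this would presumably be request-model validation (e.g. in Pydantic); the function name is an assumption:

```python
def validate_create_chat(documents=None, metadata_filter=None):
    """Ensure exactly one of 'documents' and 'metadata_filter' is passed,
    returning whichever one was given."""
    if (documents is None) == (metadata_filter is None):
        raise ValueError(
            "exactly one of 'documents' and 'metadata_filter' must be passed"
        )
    return documents if documents is not None else metadata_filter


print(validate_create_chat(documents=["a.txt"]))
```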

Web UI

This section relies on #313. TL;DR: the "new chat" modal should be cleanly split into two "phases", namely "preparation" and "interrogation" (happy for other naming suggestions). For the remainder, I'll assume that we have a good UI design and will only deal with the background functionality here.

Right now the functionality of these two phases is hardcoded:

  • For the preparation stage, the user can select a source storage and the chunking parameters as well as upload documents
  • For the interrogation stage, the user can select an assistant and the retrieval and generation parameters

Instead of hardcoding this behavior, which clashes with the ideas presented above, we could create two new abstract objects, e.g. NewChatPreparationConfig and NewChatInterrogationConfig, that need to be set in the configuration. Each of them would have two abstract methods:

  • __panel__: The return value of this will just be dropped into the modal and thus enables the ultimate flexibility of what information or controls are exposed to the user
  • get_params (name TBD): This method is called when the "Start Conversation" button is clicked and should return the parameters to pass to the "create chat"-endpoint.

With this we can retain our current behavior by providing default implementations for the two objects. At the same time this leaves admins with the ability to specify the workflow they want for their users.
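One of the two abstract objects could be sketched as below. The class and method names come from the proposal; the bodies are placeholders, since the real modal code would live in Panel:

```python
import abc


class NewChatPreparationConfig(abc.ABC):
    @abc.abstractmethod
    def __panel__(self):
        """Return the component dropped into the 'preparation' phase of the modal."""

    @abc.abstractmethod
    def get_params(self) -> dict:
        """Params for the 'create chat'-endpoint when 'Start Conversation' is clicked."""


class DefaultPreparationConfig(NewChatPreparationConfig):
    # Default implementation retaining the current behavior: source storage
    # selection plus document upload (placeholder strings stand in for the
    # actual Panel components).
    def __panel__(self):
        return "<source storage selector + document upload>"

    def get_params(self) -> dict:
        return {"source_storage": "Chroma", "documents": []}


print(DefaultPreparationConfig().get_params())
```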

Feedback

Although I have carefully considered all known use cases, I might have missed something or simply made a mistake. If you have a use case that would break the design above, or you notice an inconsistency in it, feel free to speak up. Implementing the proposal will be a major effort. I would hate to only find out about blockers when we are already deep in the weeds.

@peachkeel
Contributor

@pmeier,

Thanks for thinking so thoroughly about the problem and posting such a detailed response.

I'm going to have to read it a few times to fully understand everything and make sure what you suggest aligns with our codebase. We've actually written a fair bit of code to the existing interfaces, so I'll be honest that I'm a little nervous about the proposed changes.

I hope you will hold off on implementation until we can provide you with some feedback.

Cheers!

@pmeier
Member Author

pmeier commented Feb 9, 2024

We've actually written a fair bit of code to the existing interfaces, so I'll be honest that I'm a little nervous about the proposed changes.

I'd love to see or hear about all workarounds you have in place to see if this is something Ragna would benefit from in general. If you cannot post this publicly, I'm available at pmeier at quansight dot com

The only breaking change in my proposal is how the source storage stores and retrieves data. Right now, the builtin SourceStorages

  • create a new collection / table / index / ... for each chat based on its ID
  • store the name and ID of the document but none of its metadata
  • don't filter anything when retrieving sources since the collection / table / index already only contains the documents of the specific chat

I don't know how you have set up your SourceStorage. The only hint I have is your discussion about a Vectara plugin. If that is still relevant, it seems you are doing the same as I described above.

What we need for this proposal

  • Only one collection / table / index
  • Storage of the full document metadata
  • Ability to translate an abstract metadata filter object into the native dialect and use that as a document filter when retrieving
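A toy SourceStorage meeting the three requirements above might look like this: a single table, full per-chunk document metadata, and retrieval driven by a translated filter. The filter here is a simple `{key: value}` predicate dict standing in for the abstract metadata filter object; all names are illustrative.

```python
class InMemorySourceStorage:
    def __init__(self):
        self._table = []  # one table for everything; chats are distinguished by metadata

    def store(self, chunks):
        # chunks: iterable of (text, metadata) pairs. The metadata keeps the
        # full document metadata (id, name, tags, ...), not just name and id.
        self._table.extend(chunks)

    def retrieve(self, metadata_filter, prompt):
        # "Translate" the abstract filter into this backend's native
        # filtering: here, a dict of required key/value pairs.
        def matches(meta):
            return all(meta.get(k) == v for k, v in metadata_filter.items())

        return [text for text, meta in self._table if matches(meta)]


storage = InMemorySourceStorage()
storage.store([
    ("chunk one", {"document_id": "1", "tag": "important"}),
    ("chunk two", {"document_id": "2", "tag": "misc"}),
])
print(storage.retrieve({"tag": "important"}, prompt="..."))
# ['chunk one']
```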

I hope you will hold off on implementation until we can provide you with some feedback.

There is no rush yet. We have a deadline for early April, but we are not going to start on this before the 0.2.0 release is out. So you have roughly 1-2 weeks.

@nenb
Contributor

nenb commented Feb 12, 2024

@pmeier Thanks for the write-up. 😍 I have some questions on a couple of the topics:

Switch the source storages to one table / collection / index / ... per embedding model (there currently is only one, but that will change with #191)

Does this mean that it's not possible to create multiple tables for the same embedding model at all? If I have multiple pre-existing tables that I would like to chat over in separate conversations, is this no longer possible? I think this is quite restrictive if so (but perhaps I have misunderstood).

However, while all components have similar support for metadata filtering they all use a different "dialect". Meaning we need to provide a layer of abstraction on top of it. This should be relatively straight forward tree-like object that supports a few basic operators like "and", "or", and "equals". We can always add more later.

This sounds like a lot of work (including continuous maintenance). It also seems possible that we will not be able to support the full functionality of every source_storage, which might prevent uptake of ragna. Do we really need this abstraction? What is preventing us from just passing a search string to the relevant source_storage that is then responsible for handling this functionality?

If we get a metadata filter object, we simply set _prepared = True. That way we can start sending prompts right away since we have everything we need to call SourceStorage.retrieve and Assistant.answer.

I'm getting into the weeds a bit here (and perhaps I have not understood correctly), but I don't find this behaviour intuitive. Isn't it also possible that I would like to pass a metadata filter object and 'prepare' the data at the same time? I know you have mentioned something about this in a later paragraph, but I don't think it covers all possibilities - what if I add some documents, and want to use a filter over a larger superset of documents already existing in a table?

@pmeier
Member Author

pmeier commented Feb 12, 2024

Does this mean that it's not possible to create multiple tables for the same embedding model at all?

Yes, but only for the builtin source storages. For custom ones, users can do whatever they want. If they have a use case for multiple tables, their custom SourceStorage can of course support this.

If I have multiple pre-existing tables that I would like to chat over in separate conversations, is this no longer possible?

This is currently also not possible for the builtin source storages and thus no regression.


What is preventing us from just passing a search string to the relevant source_storage that is then responsible for handling this functionality?

I'm happy to be enlightened here, but I don't think this is easy or doable. How would you define a "search string" for the following two scenarios:

  1. Only retrieve documents whose IDs are part of a list of IDs
  2. Only retrieve documents which have one or more tags from a list of tags

Here is an example for the first scenario with a MetadataFilter object I've already built (roughly 100 LoC including convenience methods) while playing around with the idea:

ids = [...]
filter = MetadataFilter.or_([MetadataFilter.eq("id", id) for id in ids])
print(filter)
or(
  eq('id', '525bbe13-a058-47ce-a60a-d47caa204095'),
  eq('id', '39f8868a-ae6a-4ad5-8b52-6199f18d9dd1'),
  eq('id', 'd492b405-7cd6-4ba6-a509-89d33aa36578'),
)

And here is the filter translated into the Chroma dialect:

{'$or': [{'id': {'$eq': '525bbe13-a058-47ce-a60a-d47caa204095'}}, {'id': {'$eq': '39f8868a-ae6a-4ad5-8b52-6199f18d9dd1'}}, {'id': {'$eq': 'd492b405-7cd6-4ba6-a509-89d33aa36578'}}]}

It also seems possible that we will not be able to support the full-functionality of every source_storage, which might prevent uptake of ragna.

Yes, certainly. We need to decide at some point if we are ok to add more advanced operators to the metadata filter that only some source storages support. I would start with a minimal set as proposed above and only expand in case the use case arises.

However, this is exactly the same for a "search string". Unless I misunderstand what you are saying, a "search string" would have exactly the same features and limitations as a MetadataFilter object, with the downside that the downstream source storage also needs to parse it before it can be translated and used.


Isn't it also possible that I would like to pass a metadata filter object and 'prepare' the data at the same time? [...] what if I add some documents, and want to use a filter over a larger superset of documents already exisiting in a table?

Hmm, I'm a little on the fence about this. We could potentially also support your use case by allowing both a documents and a metadata_filter input for the Chat. The only thing that would change from the proposal above is when both are passed. In that case, we don't assume that the Chat is prepared, but use the passed metadata_filter rather than constructing one from the documents.

The reason I'm not sure is because this is mixing two use cases. Adding documents to an existing index is an admin task, while filtering / chatting is a user task. I'll think about this more.

@pmeier
Member Author

pmeier commented Feb 12, 2024

Here is my sample implementation of a MetadataFilter: https://gist.github.com/pmeier/38ee90be6c30ecdf9bbec086a0dabafe

@nenb
Contributor

nenb commented Feb 12, 2024

I'm happy to be enlightened here, but I don't think this is easy or doable. How would you define a "search string" for the following two scenarios:
Only retrieve documents which IDs are part of a list of IDs
Only retrieve documents which have one or more tags of a list of tags

Why can't we just pass a string that contains the query according to the source_storage's own dialect? E.g. for chromadb above, the user would just pass:
"{'$or': [{'id': {'$eq': '525bbe13-a058-47ce-a60a-d47caa204095'}}, {'id': {'$eq': '39f8868a-ae6a-4ad5-8b52-6199f18d9dd1'}}, {'id': {'$eq': 'd492b405-7cd6-4ba6-a509-89d33aa36578'}}]}"

(Sorry, not grokking this...)

@pmeier
Member Author

pmeier commented Feb 12, 2024

Two reasons:

  1. You would tie the input to Chat to a specific SourceStorage. Meaning, you cannot simply switch the SourceStorage and keep everything else as is. Tasks like comparing multiple setups (#18) will get a lot harder. Plus, this would completely eliminate the possibility for us to ever provide a builtin metadata_filter from the UI, because we cannot know the dialect of the source storage.
  2. Chroma does not accept a string as filter, but rather a filter object. Meaning, we would need to parse the string back into the object, rather than receiving it in an abstract form in the first place.

@pmeier
Member Author

pmeier commented Feb 12, 2024

Summarizing an offline discussion I had with @nenb:

His concerns regarding the MetadataFilter object are mostly related to the fact that we add another object to Ragna that we have to maintain, as well as the translations into the respective dialects for the builtin source storages. This of course takes time and effort that we could also spend on other things.

I agree that these are valid concerns. However, not providing an abstraction here would mean that the chat gets less "portable", i.e. we now have dependent parameters where they are currently independent. Plus, and this is the strongest counterargument for me, we would push learning the correct filter dialect for the source storage onto the user. That is probably ok for custom source storages, but quite a bad UX for the builtin ones.

Unless the maintenance cost turns out to be unjustifiable, I value UX higher than DevX. For me, maintaining a fairly standard tree implementation plus a few translators is preferable over telling users to do it.

@peachkeel
Contributor

peachkeel commented Feb 17, 2024

I've read this proposal over and reflected on it. I guess my concern is that I don't know if every SourceStorage I want to implement has metadata filtering built into it. This concern is especially true for newer or academic libraries, which, perversely, are often what I might want to test in Ragna to understand if they're worth exploring more in production.

Beyond that, I doubt I'll have the time or inclination to update old SourceStorage classes I've written to the new interface. That makes me a little sad as they'll essentially become orphaned code.

I certainly don't want to stand in the way of the project advancing, and I do think that "managed" Ragna is a great feature. My instinct is to say that the proposed solution is a little too "fragile" for my tastes.

Another thing to think about is the ramifications of commingling documents in the same underlying index. Right now, I can pretty easily find and delete a large index backing a deleted chat. It seems like in this model the single collection / table / index grows unbounded. Is that correct?

Anyway, I'm happy to discuss more either online or offline.

@pmeier
Member Author

pmeier commented Feb 20, 2024

I guess my concern is that I don't know if every SourceStorage I want to implement has metadata filtering built into it. This concern is especially true for newer or academic libraries, which, perversely, are often what I might want to test in Ragna to understand if they're worth exploring more in production.

That is a valid concern. This coincides with another use case I haven't touched on above: what if I don't want to do any filtering, but rather want to use the full index? Let's enumerate the four possible cases on how to create a new chat assuming we can pass both documents and metadata_filter (see last point in #256 (comment)):

  1. documents is None and metadata_filter is None:

    We don't perform any metadata filtering and retrieve sources from the full index.

  2. documents is not None and metadata_filter is None:

    We store the documents and set the metadata_filter to the IDs of the documents.

  3. documents is None and metadata_filter is not None:

    We retrieve the sources with the given metadata_filter

  4. documents is not None and metadata_filter is not None:

    Not decided yet, given that this is a hybrid use case. Two options:

    1. Raise an error.
    2. Store the documents and use the passed metadata_filter rather than creating one from the documents. See the last point in #256 (comment).

I've used None as a sentinel for the default above, but we probably need a proper sentinel here, since None is a meaningful value in its own right.

To come back to the use case of the SourceStorage not supporting metadata filtering, we could just create the chat with Rag.chat(documents=..., metadata_filter=None).

Of course the implementation of the SourceStorage could also simply ignore the metadata_filter input.
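The four cases and the sentinel idea above can be sketched as follows. This is a hypothetical illustration only; `chat` and its parameters mirror the proposal, not a shipped Ragna API, and the return strings just label the branches:

```python
# A sketch of the sentinel dispatch for the four cases above.
_UNSET = object()  # distinct from None, which is a meaningful value here


def chat(*, documents=_UNSET, metadata_filter=_UNSET):
    if documents is _UNSET and metadata_filter is _UNSET:
        # Case 1: no filtering, retrieve from the full index.
        return "full index"
    if documents is not _UNSET and metadata_filter is _UNSET:
        # Case 2: store documents and filter on their IDs.
        return "filter on document IDs"
    if documents is _UNSET:
        # Case 3: retrieve with the given metadata_filter.
        return "use given filter"
    # Case 4 (hybrid): store documents, but honor the explicit filter;
    # passing metadata_filter=None here means "store, but don't filter".
    return "store and use passed filter"
```

Note that with a proper sentinel, `metadata_filter=None` stays available as an explicit "no filtering" value for the hybrid case.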

Beyond that, I doubt I'll have the time or inclination to update old SourceStorage classes I've written to the new interface. That makes me a little sad as they'll essentially become orphaned code.

I feel you, but unfortunately this is the harsh reality of beta products ... The only thing that breaks existing SourceStorages is the usage of documents in SourceStorage.retrieve:

document_map = {str(document.id): document for document in documents}

For our builtin SourceStorages we only use it to map the retrieved chunks back to a document. This in turn is then placed on the Source:

document: Document

We probably need to refactor this to store only the document ID and metadata there rather than the actual Document object, since only the former can be extracted from the stored chunks. Plus, the actual Document object is not used afterwards anyway; only its attributes are.

So indeed this is a breaking change. The only upside is that we can achieve the same behavior as before with a few changes. Whether or not this is worth it in your case is not for me to judge.
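The refactor mentioned above could look roughly like this. The field names are illustrative, not a committed design; the point is that everything on Source can be recovered from the stored chunks:

```python
# A rough sketch of a refactored Source that carries only the document's
# ID and name (metadata) instead of the full Document object.
import dataclasses
import uuid


@dataclasses.dataclass
class Source:
    id: str
    document_id: uuid.UUID  # was: document: Document
    document_name: str      # taken from the document's metadata
    location: str
    content: str
    num_tokens: int
```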

I certainly don't want to stand in the way of the project advancing, and I do think that "managed" Ragna is a great feature. My instinct is to say that the proposed solution is a little too "fragile" for my tastes.

Could you elaborate on what is "too fragile" here? Do you mean anything besides the fact that some SourceStorages might not support metadata filtering?

Does the point that @nenb raised in #256 (comment) that our custom MetadataFilter object might be insufficient for some SourceStorages also come into play? If so, do you think just using a string as a filter that has to be crafted by the user is a better solution? See #256 (comment) for my thoughts on this.
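To illustrate the trade-off: a structured filter object only works if each SourceStorage can translate it into its own query syntax. Neither `EqFilter` nor `to_chroma_where` below is an actual Ragna API; the Chroma-style `where` dict is just one example target backend:

```python
# A minimal structured filter plus a backend-specific translation.
import dataclasses


@dataclasses.dataclass
class EqFilter:
    key: str
    value: str


def to_chroma_where(filters: list[EqFilter]) -> dict:
    # Chroma accepts {"key": {"$eq": value}} clauses, combined with "$and".
    clauses = [{f.key: {"$eq": f.value}} for f in filters]
    return clauses[0] if len(clauses) == 1 else {"$and": clauses}
```

A backend for which no such translation exists would instead have to take a raw string crafted by the user, trading safety and portability for flexibility.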

Another thing to think about is the ramifications of commingling documents in the same underlying index. Right now, I can pretty easily find and delete a large index backing a deleted chat. It seems like in this model the single collection / table / index grows unbounded. Is that correct?

Yes, that is correct. My only argument here is that "managed" Ragna basically creates two roles: admin and user. Users would lose the ability to store new documents and thus the index can only grow by admin action.

Also note that for your custom implementations, you don't need to follow the paradigm of one index only. You can happily create indices based on the chat_id as before. Although I'm not sure how you'd want to apply metadata filtering in that case.

Anyway, I'm happy to discuss more either online or offline.

I would love to. I think getting some insight into how you have set up your SourceStorages would be very valuable feedback for Ragna in general, and for this proposal in particular. If I can somehow access them, let me know. If you want to get in touch without posting any contact information online, please use my email address that I posted in #256 (comment).

@peachkeel
Contributor

peachkeel commented Feb 21, 2024

I'm going to keep commenting in public, but if I feel that I need to provide more depth, I'll contact you, @pmeier, and we can talk offline.

First, let me address this part of your reply above:

Could you elaborate on what is "too fragile" here. Do you mean anything but the fact that some SourceStorages might not support metadata filtering?

I think, technically, I mean that the proposed solution has a "code smell" known as Control Coupling. If we're fully aware of this coupling and its ramifications, and it is still the best option, I suppose I can't object to its implementation as I'm not writing the code. However, my gut instinct is that this kind of coupling is going to introduce unforeseen maintenance and modification costs.

Second, let me expound upon this part of your reply:

I think getting some insights in how you have set up your SourceStorages would be very valuable feedback for Ragna in general, but for this proposal in particular.

I've implemented "managed" Ragna through my scripts and specialty classes that kind of act as soft links (see example below).

More specifically, say I have a corpus of .txt files derived from Wikipedia articles and a SourceStorage called Vectara. My goal is to index these .txt files using Vectara and then create several different chats such that I don't need to reindex again. I can achieve this goal by first creating an initial chat using the UI. Let's say this initial chat has a chat_id of f5eeb599-46f8-4e9e-8753-3a1d2b80b26f.

With this chat_id in hand, I can create the following class:

import uuid

from ragna.core import Document, Source
from ragna.source_storages import Vectara  # import path assumed


class WikipediaVectara(Vectara):

    def store(
        self,
        documents: list[Document],
        *,
        chat_id: uuid.UUID,
        chunk_size: int = 500,
        chunk_overlap: int = 250,
    ) -> None:
        # No-op: the corpus was already indexed under the initial chat.
        pass

    def retrieve(
        self,
        documents: list[Document],
        prompt: str,
        *,
        chat_id: uuid.UUID,
        chunk_size: int = 500,
        num_tokens: int = 4000,
    ) -> list[Source]:
        # Ignore the current chat's ID and always read from the index
        # built for the initial chat.
        return super().retrieve(
            documents,
            prompt,
            chat_id=uuid.UUID("f5eeb599-46f8-4e9e-8753-3a1d2b80b26f"),
            chunk_size=chunk_size,
            num_tokens=num_tokens,
        )

Now, I can use my clone.py script to make a cloned conversation based off of the initial chat:

./clone.py \
  --src-id "f5eeb599-46f8-4e9e-8753-3a1d2b80b26f" \
  --dest-name "New Chat" \
  --dest-assistant "Bedrock/anthropic.claude-v2" \
  --dest-source-storage "WikipediaVectara"

Maybe there is some way we can operationalize my evolved solution, as I know it works and it is non-breaking with respect to existing interfaces.
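One possible way to operationalize this "soft link" pattern without writing a new subclass per corpus is a factory that pins any SourceStorage to the index of an existing chat. `pin_to_chat` is a hypothetical helper, not an existing Ragna API:

```python
# A factory producing a SourceStorage subclass pinned to an existing chat.
import uuid


def pin_to_chat(source_storage_cls, chat_id):
    """Return a subclass whose store() is a no-op and whose retrieve()
    always reads from the index built for the pinned chat."""
    pinned = chat_id if isinstance(chat_id, uuid.UUID) else uuid.UUID(chat_id)

    class Pinned(source_storage_cls):
        def store(self, documents, *, chat_id, **params):
            # The corpus was already indexed under the pinned chat.
            pass

        def retrieve(self, documents, prompt, *, chat_id, **params):
            # Ignore the current chat's ID in favor of the pinned one.
            return super().retrieve(documents, prompt, chat_id=pinned, **params)

    Pinned.__name__ = f"Pinned{source_storage_cls.__name__}"
    return Pinned
```

With this, the WikipediaVectara class above reduces to `pin_to_chat(Vectara, "f5eeb599-46f8-4e9e-8753-3a1d2b80b26f")`.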
