
[Question]: Approaches to searching documents requiring different kinds of embeddings (multimodal too) #13594

Open
coffepowered opened this issue May 20, 2024 · 3 comments
Labels
question Further information is requested

Comments

@coffepowered

Question Validation

  • I have searched both the documentation and discord for an answer.

Question

Hello,
I am reflecting on how to tackle the problem of searching over a knowledge base that may require different embedding modalities. To make this more concrete, my users typically upload:

  • business documents/reports (what the system was conceived to ingest)
  • documents containing code or SQL (I get bad performance on these!)
  • documents containing images
  • documents containing tables

I recall that llama_index has some approaches for images and tables, but is there a generally preferred pattern for this?
In general, I see some options to handle this:

  1. having a super-strong multimodal embedding (one able to interpret images, SQL, multilingual text, and whatever else). I am not aware of such an embedding model
  2. maintaining segregated indexes for each modality, then routing queries depending on the type of question
  3. a mix of 1 and 2.

Do you see more patterns to apply? Which would be the "llama-index" way of approaching the problem, if any?

@coffepowered added the question label May 20, 2024

dosubot bot commented May 20, 2024

To effectively search over a knowledge base with documents requiring different kinds of embeddings, including multimodal ones, LlamaIndex offers a robust approach that aligns well with the options you're considering. Here's a concise overview tailored to your needs:

  1. Multimodal Embeddings: LlamaIndex provides MultiModalEmbedding for handling various content types, such as text, images, and code. For images it uses CLIP embeddings by default, and for documents containing code or SQL you can plug in a code-aware embedding model, which improves retrieval accuracy on those files.

  2. Separate Indexes for Each Modality: The MultiModalVectorStoreIndex class maintains a separate vector store per modality. This keeps each kind of embedding in its own store, so every type of content is indexed and queried with the model that suits it (a minimal sketch follows this list).

  3. Combination of Approaches: By combining multimodal embeddings with separate indexes for each modality, LlamaIndex provides a comprehensive solution for indexing and querying a diverse knowledge base. This hybrid approach ensures that all types of documents, whether they contain text, images, code, or SQL, are effectively processed and made searchable.
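For concreteness, here is a minimal sketch of points 1 and 2 together: one MultiModalVectorStoreIndex backed by separate collections for text and images. The Qdrant backend, folder path, and collection names are illustrative assumptions (any supported vector store works), and the llama-index-embeddings-clip and llama-index-vector-stores-qdrant packages must be installed.

```python
# Minimal sketch: a multimodal index with one vector store per modality.
# Assumes: pip install llama-index llama-index-embeddings-clip \
#     llama-index-vector-stores-qdrant qdrant-client
import qdrant_client
from llama_index.core import SimpleDirectoryReader, StorageContext
from llama_index.core.indices import MultiModalVectorStoreIndex
from llama_index.vector_stores.qdrant import QdrantVectorStore

# Two separate collections: text and image embeddings never mix.
client = qdrant_client.QdrantClient(path="qdrant_mm_db")
text_store = QdrantVectorStore(client=client, collection_name="text_collection")
image_store = QdrantVectorStore(client=client, collection_name="image_collection")
storage_context = StorageContext.from_defaults(
    vector_store=text_store, image_store=image_store
)

# SimpleDirectoryReader loads both text files and images from the folder.
documents = SimpleDirectoryReader("./mixed_docs/").load_data()

# Text nodes go through the configured text embedding; image nodes
# go through CLIP (the default image embed model of this index).
index = MultiModalVectorStoreIndex.from_documents(
    documents, storage_context=storage_context
)

# Retrieval fans out to both stores and merges the results.
retriever = index.as_retriever(similarity_top_k=3, image_similarity_top_k=3)
nodes = retriever.retrieve("quarterly revenue trend")
```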

To implement these strategies, choose a suitable embedding model for each content type, integrate a language model for query processing, and build customizable ingestion pipelines (sketched below) so that large, diverse datasets are parsed and indexed appropriately.
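As a sketch of the ingestion-pipeline point: a dedicated pipeline for code-heavy documents, with a syntax-aware splitter and a per-modality embedding model. The splitter language, file name, and embedding model name are illustrative assumptions; CodeSplitter additionally requires the tree-sitter and tree-sitter-languages packages.

```python
# Sketch: a dedicated ingestion pipeline for code/SQL documents.
# Assumes: pip install llama-index llama-index-embeddings-huggingface \
#     tree-sitter tree-sitter-languages
from llama_index.core import Document
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import CodeSplitter
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Split on syntactic boundaries instead of raw character counts,
# then attach embeddings from a model chosen for this modality.
code_pipeline = IngestionPipeline(
    transformations=[
        CodeSplitter(language="python"),
        HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5"),
    ]
)

# "etl_job.py" is a placeholder for one of the uploaded code documents.
nodes = code_pipeline.run(
    documents=[Document(text=open("etl_job.py").read())]
)
# `nodes` now carry embeddings and can be inserted into a code-only index.
```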

For more detailed guidance and examples, you might want to explore the source code and documentation provided by LlamaIndex, which offers insights into the implementation of these strategies.


@logan-markewich
Collaborator

Basically it's option 2: create indexes for each type. The multimodal index is one specific use case of this, for images and text, but the general approach makes sense, since each modality requires a specific embedding model.
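For anyone landing here later, a minimal sketch of that approach: one VectorStoreIndex per document type, each with its own embedding model, fronted by a RouterQueryEngine that picks an index per question. The folder layout, embedding model names, and tool descriptions are illustrative assumptions, and as_query_engine() expects an LLM to be configured (OpenAI by default).

```python
# Sketch: one index per modality, routed by an LLM selector.
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.query_engine import RouterQueryEngine
from llama_index.core.selectors import LLMSingleSelector
from llama_index.core.tools import QueryEngineTool
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Assumption: documents are pre-sorted into folders by type, and each
# index gets an embedding model suited to its modality (model names
# here are examples, not recommendations).
text_index = VectorStoreIndex.from_documents(
    SimpleDirectoryReader("./reports/").load_data(),
    embed_model=HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5"),
)
code_index = VectorStoreIndex.from_documents(
    SimpleDirectoryReader("./code_docs/").load_data(),
    embed_model=HuggingFaceEmbedding(
        model_name="flax-sentence-embeddings/st-codesearch-distilroberta-base"
    ),
)

# Descriptions drive routing: the selector matches the question to a tool.
tools = [
    QueryEngineTool.from_defaults(
        query_engine=text_index.as_query_engine(),
        description="Business documents and reports.",
    ),
    QueryEngineTool.from_defaults(
        query_engine=code_index.as_query_engine(),
        description="Documents containing code or SQL.",
    ),
]

# The router asks an LLM which index fits each incoming question.
router = RouterQueryEngine(
    selector=LLMSingleSelector.from_defaults(),
    query_engine_tools=tools,
)
response = router.query("How does the nightly ETL job deduplicate rows?")
```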

@coffepowered
Author

The response from dosubot is useful, but nothing beats a human. Thank you @logan-markewich!
