[Core] Omni-Modal Embedding, Vector Index and Retriever #13551

Open
wants to merge 20 commits into main

Conversation


@DarkLight1337 commented May 17, 2024

Description

This PR lays the groundwork for extending multi-modal support to other modalities (such as audio). The main components of this PR are:

  • Modality: Encapsulates the embedding-agnostic information about each modality.
    • e.g. how to read the BaseNodes and QueryBundles that belong to that modality.
  • OmniModalEmbedding: Base class for Embedding component that supports any modality, not just text and image.
    • Concrete subclasses should define the supported document_modalities and query_modalities, and implement _get_embedding (and related methods) for those modalities accordingly.
  • OmniModalEmbeddingBundle: Composite of OmniModalEmbedding where multiple embedding models can be combined.
    • To avoid ambiguity, only one model per document modality is allowed.
    • There can be multiple models per query modality (each covering a different document modality).
  • OmniModalVectorStoreIndex: Index component that stores documents using OmniModalEmbeddingBundle. It is meant to be a drop-in replacement for MultiModalVectorStoreIndex.
    • There is no need to specify the document modality when storing BaseNodes. The modality is inferred automatically based on the class type.
    • Note: To load a persisted index, use OmniModalVectorStoreIndex.load_from_storage instead of llama_index.core.load_index_from_storage, since we do not serialize the details of each modality.
  • OmniModalVectorIndexRetriever: Retriever component that queries documents using OmniModalEmbeddingBundle.
    • You can set top-k for each document modality.
    • You must specify the query modality when passing in the QueryBundle. (This may change in the future to detect the modality automatically, similar to how document nodes are handled.)
    • You can specify one or more document modalities to retrieve from, which may differ from the query modality. (See the usage sketch after this list.)
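
A rough usage sketch of how these components are meant to fit together. The class names come from this PR, but the import paths, constructor signatures, the hypothetical MyTextEmbedding/MyImageEmbedding models, and the "text"/"image" modality keys are illustrative assumptions rather than the finalized API:

# Illustrative sketch only: constructor signatures and modality keys are assumed.
from llama_index.core import SimpleDirectoryReader
from llama_index.core.schema import QueryBundle

# Hypothetical concrete OmniModalEmbedding subclasses, each declaring the
# document_modalities and query_modalities it supports.
text_embed = MyTextEmbedding()    # text documents, text queries
image_embed = MyImageEmbedding()  # image documents, text and image queries

# At most one model per document modality; several models may accept the same
# query modality as long as they cover different document modalities.
embed_model = OmniModalEmbeddingBundle([text_embed, image_embed])

# The document modality of each node is inferred from its class, so no
# per-node modality argument is needed when building the index.
documents = SimpleDirectoryReader("./mixed_data/").load_data()
index = OmniModalVectorStoreIndex.from_documents(documents, embed_model=embed_model)

# Top-k can be set per document modality; the query modality must be given
# explicitly when retrieving.
retriever = OmniModalVectorIndexRetriever(
    index,
    similarity_top_k={"text": 3, "image": 5},
)
nodes = retriever.retrieve(QueryBundle(query_str="a red sports car"), query_modality="text")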

As you may have guessed, I took inspiration from the recently released GPT-4o (where "o" stands for "omni") when naming these components, to distinguish them from the existing MultiModal* components for text-image retrieval. I am open to other naming suggestions.

Type of Change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

This change intentionally leaves the existing code untouched at the expense of some code duplication. Future PRs may work on the following:

  • Replace the existing BaseEmbedding class with OmniModalEmbedding (since it's more general).
  • Integrate the logic for extracting document/query data based on Modality into the existing BaseNode and QueryBundle classes. That way, Modality can be replaced with a string key that can be serialized and deserialized easily.
  • Some existing enums/constants need to be refactored to enable downstream developers to define their own modalities.

How Has This Been Tested?

Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce. Please also list any relevant details for your test configuration.

  • Added new unit/integration tests

I have added basic unit tests for the internals of OmniModalEmbedding and OmniModalEmbeddingBundle.

It appears that the original multi-modal index (#8709) and retriever (#8787) don't have any unit tests. I am not sure what would be the best approach for testing their functionality. Perhaps @hatianzhang would have some ideas?

  • Added new notebook (that tests end-to-end)

To demonstrate the compatibility between OmniModalVectorStoreIndex and MultiModalVectorStoreIndex, I have created omni_modal_retrieval.ipynb which is basically the same as multi_modal_retrieval.ipynb except that MultiModal* components are replaced with OmniModal* ones.

Future PRs can work on adding new modality types. In particular, audio and video support would complement GPT-4o well (unfortunately we probably can't use GPT-4o directly to generate embeddings).

  • I stared at the code and made sure it makes sense

Suggested Checklist:

  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added Google Colab support for the newly added notebooks.
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I ran make format; make lint to appease the lint gods

I will update the code documentation once the details are finalized.

@dosubot dosubot bot added the size:XXL This PR changes 1000+ lines, ignoring generated files. label May 17, 2024

embeddings (List[List[float]]): List of embeddings.

"""

-chunks: List[str]
+chunks: Sequence[object]
DarkLight1337 (Author):

This change is required to satisfy the type checker when logging chunks for non-text data. I don't think this would break anything.

DarkLight1337 (Author):

I found that logging the data as objects directly can be very slow depending on the modality. I'm reverting this back to List[str] and will instead stringify the data objects before logging them.

Edit: Seems that it's mostly slow because Pydantic is validating the embedding list. Still, using str(data) would allow a more user-friendly display than just storing the object directly.
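
A minimal sketch of the reverted approach (the helper name is hypothetical, not code from this PR):

from typing import List, Sequence

def _stringify_chunks_for_logging(data_items: Sequence[object]) -> List[str]:
    # Convert arbitrary modality data (images, audio, ...) to strings before
    # logging, so the logged chunks stay List[str] and str(data) gives a more
    # user-friendly display than storing the raw objects in the event payload.
    return [str(item) for item in data_items]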

from llama_index.core.schema import NodeWithScore


class OmniModalQueryEngine(BaseQueryEngine, Generic[KD, KQ]):
DarkLight1337 (Author):

Mostly a copy of MultiModalQueryEngine. Once we have OmniModalLLM, we can generalize this class to other modalities as well.


return await super()._aretrieve_from_object(obj, query_bundle, score)

def _handle_recursive_retrieval(
DarkLight1337 (Author):

Unlike in MultiModalVectorIndexRetriever, composite nodes are nominally supported for non-text modalities, but this feature has yet to be tested.

@logan-markewich (Collaborator) commented:

These are some pretty core changes. Thanks for taking a stab at this; we will have to spend some time digging into the structure here and ensuring it fits with existing multimodal plans.

callback_manager = callback_manager_from_settings_or_context(Settings, None)

# Distinguish from the case where an empty sequence is provided.
if transformations is None:
DarkLight1337 (Author):

Should we also apply this change to the BaseIndex? In my opinion, it's unexpected behaviour that passing transformations=[] fails to actually override the default settings.
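
For reference, the pattern in question, written out in isolation (the helper name and the use of plain strings are illustrative, not the actual llama_index code):

from typing import Optional, Sequence

def _resolve_transformations(
    transformations: Optional[Sequence[str]],
    default_transformations: Sequence[str],
) -> Sequence[str]:
    # Fall back to the defaults only when the caller passed nothing at all.
    # An explicit empty sequence ([]) is respected and disables the defaults
    # instead of being silently replaced by the Settings-derived pipeline.
    if transformations is None:
        return default_transformations
    return transformations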

@@ -48,7 +48,7 @@ def __init__(
index_struct: Optional[IS] = None,
storage_context: Optional[StorageContext] = None,
callback_manager: Optional[CallbackManager] = None,
-transformations: Optional[List[TransformComponent]] = None,
+transformations: Optional[Sequence[TransformComponent]] = None,
DarkLight1337 (Author):

BaseIndex does not require transformations specifically to be a list. Existing subclasses that assume it is a list remain unaffected (in terms of type safety) as long as they annotate transformations as a list in their own initializer.
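
A minimal sketch of the typing argument (the classes here are stand-ins, with str in place of TransformComponent to keep the example self-contained):

from typing import List, Optional, Sequence

class FakeBaseIndex:
    # The base class only reads the transformations, so the wider read-only
    # Sequence type is enough here.
    def __init__(self, transformations: Optional[Sequence[str]] = None) -> None:
        self._transformations = list(transformations) if transformations is not None else []

class FakeListIndex(FakeBaseIndex):
    # A subclass that needs list-specific behaviour can still annotate the
    # parameter as List in its own initializer: its callers must then pass a
    # list, and forwarding it to the base is fine because a list is a Sequence.
    def __init__(self, transformations: Optional[List[str]] = None) -> None:
        transformations = transformations if transformations is not None else []
        transformations.append("extra-step")  # list mutation stays type-safe
        super().__init__(transformations)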
