[Core] Omni-Modal Embedding, Vector Index and Retriever #13551

Open
wants to merge 20 commits into main

Conversation


@DarkLight1337 commented May 17, 2024

Description

This PR lays the groundwork for extending multi-modal support to other modalities (such as audio). The main components of this PR are:

  • Modality: Encapsulates the embedding-agnostic information about each modality.
    • e.g. how to read the BaseNodes and QueryBundles that belong to that modality.
  • OmniModalEmbedding: Base class for Embedding component that supports any modality, not just text and image.
    • Concrete subclasses should define the supported document_modalities and query_modalities, and implement _get_embedding (and related methods) for those modalities accordingly.
  • OmniModalEmbeddingBundle: Composite of OmniModalEmbedding where multiple embedding models can be combined.
    • To avoid ambiguity, only one model per document modality is allowed.
    • There can be multiple models per query modality (each covering a different document modality).
  • OmniModalVectorStoreIndex: Index component that stores documents using OmniModalEmbeddingBundle. It is meant to be a drop-in replacement for MultiModalVectorStoreIndex.
    • There is no need to specify the document modality when storing BaseNodes. The modality is inferred automatically based on the class type.
    • Note: To load a persisted index, use OmniModalVectorStoreIndex.load_from_storage instead of llama_index.core.load_index_from_storage, since we do not serialize the details of each modality.
  • OmniModalVectorIndexRetriever: Retriever component that queries documents using OmniModalEmbeddingBundle.
    • You can set top-k for each document modality.
    • You must specify the query modality when passing in the QueryBundle. (This may change in the future to detect the modality automatically, similar to how document nodes are handled.)
    • You can specify one or more document modalities to retrieve from, which may differ from the query modality. (See the usage sketch after this list.)
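
A rough usage sketch of how these components are meant to fit together. The class names come from this PR, but the import paths, constructor signatures, the hypothetical MyTextEmbedding/MyImageEmbedding models, and the "text"/"image" modality keys are illustrative assumptions rather than the finalized API:

# Illustrative sketch only: constructor signatures and modality keys are assumed.
from llama_index.core import SimpleDirectoryReader
from llama_index.core.schema import QueryBundle

# Hypothetical concrete OmniModalEmbedding subclasses, each declaring the
# document_modalities and query_modalities it supports.
text_embed = MyTextEmbedding()    # text documents, text queries
image_embed = MyImageEmbedding()  # image documents, text and image queries

# At most one model per document modality; several models may accept the same
# query modality as long as they cover different document modalities.
embed_model = OmniModalEmbeddingBundle([text_embed, image_embed])

# The document modality of each node is inferred from its class, so no
# per-node modality argument is needed when building the index.
documents = SimpleDirectoryReader("./mixed_data/").load_data()
index = OmniModalVectorStoreIndex.from_documents(documents, embed_model=embed_model)

# Top-k can be set per document modality; the query modality must be given
# explicitly when retrieving.
retriever = OmniModalVectorIndexRetriever(
    index,
    similarity_top_k={"text": 3, "image": 5},
)
nodes = retriever.retrieve(QueryBundle(query_str="a red sports car"), query_modality="text")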

As you may have guessed, I took inspiration from the recently released GPT-4o (where "o" stands for "omni") when naming these components, to distinguish them from the existing MultiModal* components for text-image retrieval. I am open to other naming suggestions.

Type of Change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

This change intentionally leaves the existing code untouched at the expense of some code duplication. Future PRs may work on the following:

  • Replace the existing BaseEmbedding class with OmniModalEmbedding (since it's more general).
  • Integrate the logic for extracting document/query data based on Modality into the existing BaseNode and QueryBundle classes. That way, Modality can be replaced with a string key that can be serialized and deserialized easily.
  • Some existing enums/constants need to be refactored to enable downstream developers to define their own modalities.

How Has This Been Tested?

Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce. Please also list any relevant details for your test configuration.

  • Added new unit/integration tests

I have added basic unit tests for the internals of OmniModalEmbedding and OmniModalEmbeddingBundle.

It appears that the original multi-modal index (#8709) and retriever (#8787) don't have any unit tests. I am not sure what would be the best approach for testing their functionality. Perhaps @hatianzhang would have some ideas?

  • Added new notebook (that tests end-to-end)

To demonstrate the compatibility between OmniModalVectorStoreIndex and MultiModalVectorStoreIndex, I have created omni_modal_retrieval.ipynb which is basically the same as multi_modal_retrieval.ipynb except that MultiModal* components are replaced with OmniModal* ones.

Future PRs can work on adding new modality types. In particular, audio and video support would complement GPT-4o well (unfortunately we probably can't use GPT-4o directly to generate embeddings).

  • I stared at the code and made sure it makes sense

Suggested Checklist:

  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added Google Colab support for the newly added notebooks.
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I ran make format; make lint to appease the lint gods

I will update the code documentation once the details are finalized.

@dosubot dosubot bot added the size:XXL This PR changes 1000+ lines, ignoring generated files. label May 17, 2024

embeddings (List[List[float]]): List of embeddings.

"""

-chunks: List[str]
+chunks: Sequence[object]
DarkLight1337 (Author):

This change is required to satisfy the type checker when logging chunks for non-text data. I don't think this would break anything.

DarkLight1337 (Author):

I found that logging the data as objects directly can be very slow depending on the modality. I'm reverting this back to List[str] and will instead stringify the data objects before logging them.

Edit: Seems that it's mostly slow because Pydantic is validating the embedding list. Still, using str(data) would allow a more user-friendly display than just storing the object directly.
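
A minimal sketch of the reverted approach (the helper name is hypothetical, not code from this PR):

from typing import List, Sequence

def _stringify_chunks_for_logging(data_items: Sequence[object]) -> List[str]:
    # Convert arbitrary modality data (images, audio, ...) to strings before
    # logging, so the logged chunks stay List[str] and str(data) gives a more
    # user-friendly display than storing the raw objects in the event payload.
    return [str(item) for item in data_items]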

from llama_index.core.schema import NodeWithScore


class OmniModalQueryEngine(BaseQueryEngine, Generic[KD, KQ]):
DarkLight1337 (Author):

Mostly a copy of MultiModalQueryEngine. Once we have OmniModalLLM, we can generalize this class to other modalities as well.


return await super()._aretrieve_from_object(obj, query_bundle, score)

def _handle_recursive_retrieval(
DarkLight1337 (Author):

Unlike in MultiModalVectorIndexRetriever, composite nodes are nominally supported for non-text modalities, but this feature has yet to be tested.

@logan-markewich (Collaborator) commented:

These are some pretty core changes. Thanks for taking a stab at this; we will have to spend some time digging into the structure here and ensuring it fits with existing multimodal plans.

callback_manager = callback_manager_from_settings_or_context(Settings, None)

# Distinguish from the case where an empty sequence is provided.
if transformations is None:
DarkLight1337 (Author):

Should we also apply this change to the BaseIndex? In my opinion, it's unexpected behaviour that passing transformations=[] fails to actually override the default settings.
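
For reference, the pattern in question, written out in isolation (the helper name and the use of plain strings are illustrative, not the actual llama_index code):

from typing import Optional, Sequence

def _resolve_transformations(
    transformations: Optional[Sequence[str]],
    default_transformations: Sequence[str],
) -> Sequence[str]:
    # Fall back to the defaults only when the caller passed nothing at all.
    # An explicit empty sequence ([]) is respected and disables the defaults
    # instead of being silently replaced by the Settings-derived pipeline.
    if transformations is None:
        return default_transformations
    return transformations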

@@ -48,7 +48,7 @@ def __init__(
index_struct: Optional[IS] = None,
storage_context: Optional[StorageContext] = None,
callback_manager: Optional[CallbackManager] = None,
-transformations: Optional[List[TransformComponent]] = None,
+transformations: Optional[Sequence[TransformComponent]] = None,
DarkLight1337 (Author):

BaseIndex does not require transformations specifically to be a list. Existing subclasses that assume it is a list remain unaffected (in terms of type safety) as long as they annotate transformations as a list in their own initializer.
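
A minimal sketch of the typing argument (the classes here are stand-ins, with str in place of TransformComponent to keep the example self-contained):

from typing import List, Optional, Sequence

class FakeBaseIndex:
    # The base class only reads the transformations, so the wider read-only
    # Sequence type is enough here.
    def __init__(self, transformations: Optional[Sequence[str]] = None) -> None:
        self._transformations = list(transformations) if transformations is not None else []

class FakeListIndex(FakeBaseIndex):
    # A subclass that needs list-specific behaviour can still annotate the
    # parameter as List in its own initializer: its callers must then pass a
    # list, and forwarding it to the base is fine because a list is a Sequence.
    def __init__(self, transformations: Optional[List[str]] = None) -> None:
        transformations = transformations if transformations is not None else []
        transformations.append("extra-step")  # list mutation stays type-safe
        super().__init__(transformations)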
