feat: add vertexai embeddings #2693

Merged
merged 38 commits into from Mar 28, 2024
Commits
381ccac
build(deps): bump version for security patches
MthwRobinson Mar 15, 2024
18d3817
Merge branch 'main' into deps/security-bump
MthwRobinson Mar 15, 2024
27c1cab
pin unstructured client
MthwRobinson Mar 15, 2024
04e2058
Merge branch 'deps/security-bump' of github.com:Unstructured-IO/unstr…
MthwRobinson Mar 15, 2024
a2a1649
add vertexai with testing on pinecone
ahmetmeleq Mar 26, 2024
ce53a12
Merge branch 'main' into ahmet/vertex-embed
ahmetmeleq Mar 26, 2024
d472223
make tidy
ahmetmeleq Mar 26, 2024
e706f54
Merge branch 'ahmet/vertex-embed' of https://github.com/Unstructured-…
ahmetmeleq Mar 26, 2024
0ac052c
fix import on unit test
ahmetmeleq Mar 26, 2024
f5a6cdc
Merge branch 'deps/security-bump' into ahmet/vertex-embed
ahmetmeleq Mar 26, 2024
e09156f
requirements update, pip compile for embed modules
ahmetmeleq Mar 26, 2024
43bcdfc
add vertexai integration test
ahmetmeleq Mar 26, 2024
66f1a2c
add octoai test
ahmetmeleq Mar 26, 2024
dbaaf6f
dependency updates for embedding modules
ahmetmeleq Mar 26, 2024
7b2368b
shellcheck
ahmetmeleq Mar 26, 2024
3f7c20c
shfmt
ahmetmeleq Mar 26, 2024
9208ea8
Revert "Merge branch 'deps/security-bump' into ahmet/vertex-embed"
ahmetmeleq Mar 26, 2024
5ec18de
fix typo, fix extra name
ahmetmeleq Mar 26, 2024
827eba3
update test-ingest-src
ahmetmeleq Mar 27, 2024
658de79
change extra name for octoai
ahmetmeleq Mar 27, 2024
95a619e
parametrized api-key
ahmetmeleq Mar 27, 2024
86cc766
Merge branch 'main' into ahmet/vertex-embed
ahmetmeleq Mar 27, 2024
21c08d8
version
ahmetmeleq Mar 27, 2024
c81d3b8
update docs based on parametrized api_key
ahmetmeleq Mar 27, 2024
dffde65
testing to invalidate github cache
ahmetmeleq Mar 27, 2024
975b8f0
debugging cache
ahmetmeleq Mar 27, 2024
cd1cb7a
try healing the cache via a save without load
ahmetmeleq Mar 28, 2024
6a007cc
re-enable loads
ahmetmeleq Mar 28, 2024
3805622
add api_key to mock test
ahmetmeleq Mar 28, 2024
78c69a1
add credentials cleanup
ahmetmeleq Mar 28, 2024
2866f42
save creds to tmp rather than manual cleanup
ahmetmeleq Mar 28, 2024
dd47e56
heal cache
ahmetmeleq Mar 28, 2024
3e489ff
Revert "heal cache"
ahmetmeleq Mar 28, 2024
5b278bf
working example in examples with comments updated
ahmetmeleq Mar 28, 2024
7587fd3
working example in docs with comments updated
ahmetmeleq Mar 28, 2024
4392700
vectara fix
ahmetmeleq Mar 28, 2024
0cc52d7
Merge branch 'main' into ahmet/vertex-embed
ahmetmeleq Mar 28, 2024
b639fc1
version
ahmetmeleq Mar 28, 2024
3 changes: 2 additions & 1 deletion CHANGELOG.md
@@ -1,11 +1,12 @@
## 0.13.0-dev13

### Enhancements

* **Add `.metadata.is_continuation` to text-split chunks.** `.metadata.is_continuation=True` is added to second-and-later chunks formed by text-splitting an oversized `Table` element but not to their counterpart `Text` element splits. Add this indicator for `CompositeElement` to allow text-split continuation chunks to be identified for downstream processes that may wish to skip intentionally redundant metadata values in continuation chunks.
* **Add `compound_structure_acc` metric to table eval.** Add a new property to `unstructured.metrics.table_eval.TableEvaluation`: `composite_structure_acc`, which is computed from the element-level row and column index and content accuracy scores.
* **Add `.metadata.orig_elements` to chunks.** `.metadata.orig_elements: list[Element]` is added to chunks during the chunking process (when requested) to allow access to information from the elements each chunk was formed from. This is useful for example to recover metadata fields that cannot be consolidated to a single value for a chunk, like `page_number`, `coordinates`, and `image_base64`.
* **Add `--include_orig_elements` option to Ingest CLI.** By default, when chunking, the original elements used to form each chunk are added to `chunk.metadata.orig_elements` for each chunk. The `include_orig_elements` parameter allows the user to turn off this behavior to produce a smaller payload when they don't need this metadata.
* **Add Google VertexAI embedder.** Adds VertexAI embeddings to support embedding via Google Vertex AI.
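The `.metadata.is_continuation` entry above describes a marker that downstream code can use to skip intentionally redundant metadata on continuation chunks. A minimal, self-contained sketch of that pattern (the `Chunk` and `ChunkMetadata` classes here are hypothetical stand-ins for illustration, not the library's `CompositeElement`/`ElementMetadata` types):

```python
from dataclasses import dataclass, field
from typing import Optional


# Hypothetical stand-ins for illustration only.
@dataclass
class ChunkMetadata:
    is_continuation: bool = False
    page_number: Optional[int] = None


@dataclass
class Chunk:
    text: str
    metadata: ChunkMetadata = field(default_factory=ChunkMetadata)


def serialize_chunks(chunks):
    """Drop intentionally redundant metadata from continuation chunks,
    as the changelog entry suggests downstream processes may wish to do."""
    rows = []
    for chunk in chunks:
        row = {"text": chunk.text}
        if not chunk.metadata.is_continuation:
            # Only the first split of an oversized element carries metadata.
            row["page_number"] = chunk.metadata.page_number
        rows.append(row)
    return rows


chunks = [
    Chunk("first half of an oversized table", ChunkMetadata(page_number=3)),
    Chunk("second half", ChunkMetadata(is_continuation=True, page_number=3)),
]
rows = serialize_chunks(chunks)
```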

### Features

Expand Down
53 changes: 53 additions & 0 deletions docs/source/core/embedding.rst
Expand Up @@ -171,6 +171,59 @@ To obtain an api key, visit: https://octo.ai/docs/getting-started/how-to-create-
    query = "This is the query"
    query_embedding = embedding_encoder.embed_query(query=query)
    for e in elements:
        print(e.embeddings, e)
    print(query_embedding, query)
    print(embedding_encoder.is_unit_vector(), embedding_encoder.num_of_dimensions())

``VertexAIEmbeddingEncoder``
----------------------------

The ``VertexAIEmbeddingEncoder`` class connects to the GCP VertexAI to obtain embeddings for pieces of text.

``embed_documents`` will receive a list of Elements, and return an updated list which
includes the ``embeddings`` attribute for each Element.

``embed_query`` will receive a query as a string, and return a list of floats which is the
embedding vector for the given query string.

``num_of_dimensions`` is a metadata property that denotes the number of dimensions in any
embedding vector obtained via this class.

``is_unit_vector`` is a metadata property that denotes if embedding vectors obtained via
this class are unit vectors.

The following code block shows an example of how to use ``VertexAIEmbeddingEncoder``. You will
see the updated elements list (with the ``embeddings`` attribute included for each element),
the embedding vector for the query string, and some metadata properties about the embedding model.

To use Vertex AI PaLM you will need to do one of the following:

- Pass the full JSON content of your GCP VertexAI application credentials to
  ``VertexAIEmbeddingConfig`` as the ``api_key`` parameter. (This will create a file in the
  ``/tmp`` directory with the content of the JSON, and set the ``GOOGLE_APPLICATION_CREDENTIALS``
  environment variable to the **path** of the created file.)
- Store the path to a manually created service account JSON file in the
  ``GOOGLE_APPLICATION_CREDENTIALS`` environment variable. (For more information:
  https://python.langchain.com/docs/integrations/text_embedding/google_vertex_ai_palm)
- Have the credentials configured for your environment (gcloud, workload identity, etc.).
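The mechanics of the first option can be sketched with plain standard-library code. This is a minimal illustration of the behavior described above, not the library's actual implementation; the function name is a hypothetical helper:

```python
import json
import os
import tempfile


def install_gcp_credentials(creds_json: str) -> str:
    """Write service-account JSON content to a temporary file and point
    GOOGLE_APPLICATION_CREDENTIALS at its path, mirroring what passing
    api_key to VertexAIEmbeddingConfig is described to do."""
    with tempfile.NamedTemporaryFile(
        mode="w", suffix=".json", delete=False
    ) as f:
        f.write(creds_json)
    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = f.name
    return f.name


creds_path = install_gcp_credentials(json.dumps({"type": "service_account"}))
```

Google client libraries then discover the credentials through the environment variable, which is why all three options above end up equivalent from the encoder's point of view.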

.. code:: python

    import os

    from unstructured.documents.elements import Text
    from unstructured.embed.vertexai import VertexAIEmbeddingConfig, VertexAIEmbeddingEncoder

    embedding_encoder = VertexAIEmbeddingEncoder(
        config=VertexAIEmbeddingConfig(api_key=os.environ["VERTEXAI_GCP_APP_CREDS_JSON_CONTENT"])
    )
    elements = embedding_encoder.embed_documents(
        elements=[Text("This is sentence 1"), Text("This is sentence 2")],
    )
    query = "This is the query"
    query_embedding = embedding_encoder.embed_query(query=query)
    for e in elements:
        print(e.embeddings, e)
    print(query_embedding, query)
    print(embedding_encoder.is_unit_vector(), embedding_encoder.num_of_dimensions())
30 changes: 30 additions & 0 deletions examples/embed/example_vertexai.py
@@ -0,0 +1,30 @@
import os

from unstructured.documents.elements import Text
from unstructured.embed.vertexai import VertexAIEmbeddingConfig, VertexAIEmbeddingEncoder

# To use Vertex AI PaLM you will need to:
# - either, pass the full json content of your GCP VertexAI application credentials to the
# VertexAIEmbeddingConfig as the api_key parameter. (This will create a file in the ``/tmp``
# directory with the content of the json, and set the GOOGLE_APPLICATION_CREDENTIALS environment
# variable to the **path** of the created file.)
# - or, you'll need to store the path to a manually created service account JSON file as the
# GOOGLE_APPLICATION_CREDENTIALS environment variable. (For more information:
# https://python.langchain.com/docs/integrations/text_embedding/google_vertex_ai_palm)
# - or, you'll need to have the credentials configured for your environment (gcloud,
# workload identity, etc…)

embedding_encoder = VertexAIEmbeddingEncoder(
config=VertexAIEmbeddingConfig(api_key=os.environ["VERTEXAI_GCP_APP_CREDS_JSON_CONTENT"])
)

elements = embedding_encoder.embed_documents(
elements=[Text("This is sentence 1"), Text("This is sentence 2")],
)

query = "This is the query"
query_embedding = embedding_encoder.embed_query(query=query)

for e in elements:
    print(e.embeddings, e)
print(query_embedding, query)
print(embedding_encoder.is_unit_vector(), embedding_encoder.num_of_dimensions())
4 changes: 4 additions & 0 deletions requirements/ingest/embed-octoai.in
@@ -0,0 +1,4 @@
-c ../constraints.in
-c ../base.txt
openai
tiktoken
72 changes: 72 additions & 0 deletions requirements/ingest/embed-octoai.txt
@@ -0,0 +1,72 @@
#
# This file is autogenerated by pip-compile with Python 3.9
# by the following command:
#
# pip-compile --output-file=ingest/embed-octoai.txt ingest/embed-octoai.in
#
anyio==3.7.1
# via
# -c ingest/../constraints.in
# httpx
# openai
certifi==2024.2.2
# via
# -c ingest/../base.txt
# -c ingest/../constraints.in
# httpcore
# httpx
# requests
charset-normalizer==3.3.2
# via
# -c ingest/../base.txt
# requests
distro==1.9.0
# via openai
exceptiongroup==1.2.0
# via anyio
h11==0.14.0
# via httpcore
httpcore==1.0.4
# via httpx
httpx==0.27.0
# via openai
idna==3.6
# via
# -c ingest/../base.txt
# anyio
# httpx
# requests
openai==1.14.3
# via -r ingest/embed-octoai.in
pydantic==1.10.14
# via
# -c ingest/../constraints.in
# openai
regex==2023.12.25
# via
# -c ingest/../base.txt
# tiktoken
requests==2.31.0
# via
# -c ingest/../base.txt
# tiktoken
sniffio==1.3.1
# via
# anyio
# httpx
# openai
tiktoken==0.6.0
# via -r ingest/embed-octoai.in
tqdm==4.66.2
# via
# -c ingest/../base.txt
# openai
typing-extensions==4.10.0
# via
# -c ingest/../base.txt
# openai
# pydantic
urllib3==2.2.1
# via
# -c ingest/../base.txt
# requests
5 changes: 5 additions & 0 deletions requirements/ingest/embed-vertexai.in
@@ -0,0 +1,5 @@
-c ../constraints.in
-c ../base.txt
langchain
langchain-community
langchain-google-vertexai