feature CORE-3985: add Clarifai destination connector #2633

Merged
merged 17 commits on Mar 21, 2024
1 change: 1 addition & 0 deletions .github/workflows/ci.yml
@@ -398,6 +398,7 @@ jobs:
VECTARA_CUSTOMER_ID: ${{secrets.VECTARA_CUSTOMER_ID}}
ASTRA_DB_TOKEN: ${{secrets.ASTRA_DB_TOKEN}}
ASTRA_DB_ENDPOINT: ${{secrets.ASTRA_DB_ENDPOINT}}
CLARIFAI_API_KEY: ${{secrets.CLARIFAI_API_KEY}}
TABLE_OCR: "tesseract"
OCR_AGENT: "unstructured.partition.utils.ocr_models.tesseract_ocr.OCRAgentTesseract"
CI: "true"
4 changes: 3 additions & 1 deletion CHANGELOG.md
@@ -1,4 +1,4 @@
## 0.12.7-dev8
## 0.12.7-dev9

### Enhancements

@@ -8,6 +8,7 @@
### Features

* **Chunking populates `.metadata.orig_elements` for each chunk.** This behavior allows the text and metadata of the elements combined to make each chunk to be accessed. This can be important for example to recover metadata such as `.coordinates` that cannot be consolidated across elements and so is dropped from chunks. This option is controlled by the `include_orig_elements` parameter to `partition_*()` or to the chunking functions. This option defaults to `True` so original-elements are preserved by default. This behavior is not yet supported via the REST APIs or SDKs but will be in a closely subsequent PR to other `unstructured` repositories. The original elements will also not serialize or deserialize yet; this will also be added in a closely subsequent PR.
* **Add Clarifai destination connector.** Adds support for writing partitioned and chunked documents into Clarifai.

### Fixes

@@ -24,6 +25,7 @@
* **Redefine `table_level_acc` metric for table evaluation.** `table_level_acc` now is an average of individual predicted table's accuracy. A predicted table's accuracy is defined as the sequence matching ratio between itself and its corresponding ground truth table.

### Features

* **Added Unstructured Platform Documentation** The Unstructured Platform is currently in beta. The documentation provides how-to guides for setting up workflow automation, job scheduling, and configuring source and destination connectors.

### Fixes
4 changes: 4 additions & 0 deletions Makefile
@@ -251,6 +251,10 @@ install-ingest-databricks-volumes:
install-ingest-astra:
python3 -m pip install -r requirements/ingest/astra.txt

.PHONY: install-ingest-clarifai
install-ingest-clarifai:
python3 -m pip install -r requirements/ingest/clarifai.txt

.PHONY: install-embed-huggingface
install-embed-huggingface:
python3 -m pip install -r requirements/ingest/embed-huggingface.txt
1 change: 1 addition & 0 deletions docs/source/ingest/destination_connectors.rst
@@ -13,6 +13,7 @@ in our community `Slack. <https://short.unstructured.io/pzw05l7>`_
destination_connectors/azure_cognitive_search
destination_connectors/box
destination_connectors/chroma
destination_connectors/clarifai
destination_connectors/databricks_volumes
destination_connectors/delta_table
destination_connectors/dropbox
34 changes: 34 additions & 0 deletions docs/source/ingest/destination_connectors/clarifai.rst
@@ -0,0 +1,34 @@
Clarifai
===========

Batch process all your records using ``unstructured-ingest`` to store unstructured outputs locally on your filesystem and upload them to Clarifai apps.

First, install the Clarifai dependencies:

.. code:: shell

   pip install "unstructured[clarifai]"

Create a Clarifai app with a base workflow. For more details, see `create a Clarifai app <https://docs.clarifai.com/clarifai-basics/applications/create-an-application/>`_.

Run Locally
-----------
The upstream connector can be any of the supported connectors; for convenience, the samples below use the local upstream connector.

.. tabs::

   .. tab:: Shell

      .. literalinclude:: ./code/bash/clarifai.sh
         :language: bash

   .. tab:: Python

      .. literalinclude:: ./code/python/clarifai.py
         :language: python

For a full list of the options the CLI accepts, run ``unstructured-ingest <upstream connector> clarifai --help``.

NOTE: If you are running this locally, you will need all the appropriate extras and dependencies for the file types of the documents contained in your data storage platform. You can find more information in the `installation guide <https://unstructured-io.github.io/unstructured/installing.html>`_.
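To make the note above concrete, the sketch below computes which ``unstructured`` extras a set of input files would need before a local run. The extension-to-extra mapping is a small illustrative assumption, not an official list; consult the installation guide for the authoritative extras.

```python
import os

# Hypothetical mapping from file extension to the `unstructured` extra
# that handles it locally (illustrative subset only).
EXTENSION_TO_EXTRA = {
    ".pdf": "pdf",
    ".docx": "docx",
    ".pptx": "pptx",
    ".txt": None,  # plain text needs no extra
}


def required_extras(paths):
    """Return the set of extras needed to partition the given files."""
    extras = set()
    for path in paths:
        extra = EXTENSION_TO_EXTRA.get(os.path.splitext(path)[1].lower())
        if extra:
            extras.add(extra)
    return extras
```

You could then run, for example, ``pip install "unstructured[pdf,clarifai]"`` when ``required_extras`` reports that PDFs are present.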


15 changes: 15 additions & 0 deletions docs/source/ingest/destination_connectors/code/bash/clarifai.sh
@@ -0,0 +1,15 @@
#!/usr/bin/env bash

unstructured-ingest \
  local \
  --input-path example-docs/book-war-and-peace-1225p.txt \
  --output-dir local-output-to-clarifai \
  --strategy fast \
  --chunk-elements \
  --num-processes 2 \
  --verbose \
  clarifai \
  --app-id "<your clarifai app name>" \
  --user-id "<your clarifai user id>" \
  --api-key "<your clarifai PAT key>" \
  --batch-size 100
48 changes: 48 additions & 0 deletions docs/source/ingest/destination_connectors/code/python/clarifai.py
@@ -0,0 +1,48 @@
from unstructured.ingest.connector.clarifai import (
    ClarifaiAccessConfig,
    ClarifaiWriteConfig,
    SimpleClarifaiConfig,
)
from unstructured.ingest.connector.local import SimpleLocalConfig
from unstructured.ingest.interfaces import (
    ChunkingConfig,
    PartitionConfig,
    ProcessorConfig,
    ReadConfig,
)
from unstructured.ingest.runner import LocalRunner
from unstructured.ingest.runner.writers.base_writer import Writer
from unstructured.ingest.runner.writers.clarifai import (
    ClarifaiWriter,
)


def get_writer() -> Writer:
    return ClarifaiWriter(
        connector_config=SimpleClarifaiConfig(
            access_config=ClarifaiAccessConfig(api_key="CLARIFAI_PAT"),
            app_id="CLARIFAI_APP",
            user_id="CLARIFAI_USER_ID",
        ),
        write_config=ClarifaiWriteConfig(),
    )


if __name__ == "__main__":
    writer = get_writer()
    runner = LocalRunner(
        processor_config=ProcessorConfig(
            verbose=True,
            output_dir="local-output-to-clarifai-app",
            num_processes=2,
        ),
        connector_config=SimpleLocalConfig(
            input_path="example-docs/book-war-and-peace-1225p.txt",
        ),
        read_config=ReadConfig(),
        partition_config=PartitionConfig(),
        chunking_config=ChunkingConfig(chunk_elements=True),
        writer=writer,
        writer_kwargs={},
    )
    runner.run()
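The ``--batch-size 100`` flag shown in the shell example caps how many records are sent to Clarifai per upload request. Conceptually, the writer splits its inputs into fixed-size batches, which can be sketched like this (a simplified illustration, not the actual connector code):

```python
from typing import Iterator, List, Sequence


def batched(items: Sequence, batch_size: int) -> Iterator[List]:
    """Yield successive batches of at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    for start in range(0, len(items), batch_size):
        # Slicing past the end is safe: the last batch is simply shorter.
        yield list(items[start : start + batch_size])
```

With 250 chunked elements and ``batch_size=100``, this produces two full batches of 100 followed by one batch of 50, each of which would be uploaded in a single request.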
2 changes: 1 addition & 1 deletion docs/source/introduction/key_concepts.rst
@@ -66,7 +66,7 @@ A RAG workflow can be broken down into the following steps:

4. **Embedding**: After chunking, you must convert the text into a numerical representation (vector embedding) that an LLM can understand. To use the various embedding models using Unstructured tools, please refer to `this page <https://unstructured-io.github.io/unstructured/core/embedding.html>`__.

5. **Vector Database**: The next step is to choose a location for storing your chunked embeddings. There are many options for your vector database (ChromaDB, Milvus, Pinecone, Qdrant, Weaviate, and more). For complete list of Unstructured ``Destination Connectors``, please visit `this page <https://unstructured-io.github.io/unstructured/ingest/destination_connectors.html>`__.
5. **Vector Database**: The next step is to choose a location for storing your chunked embeddings. There are many options for your vector database (AstraDB, ChromaDB, Clarifai, Milvus, Pinecone, Qdrant, Weaviate, and more). For a complete list of Unstructured ``Destination Connectors``, please visit `this page <https://unstructured-io.github.io/unstructured/ingest/destination_connectors.html>`__.

6. **User Prompt**: Take the user prompt and grab the most relevant chunks of information in the vector database via similarity search.

20 changes: 20 additions & 0 deletions examples/ingest/clarifai/ingest.sh
@@ -0,0 +1,20 @@
#!/usr/bin/env bash

# Uploads the structured output of the files within the given path to a clarifai app.

SCRIPT_DIR=$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" &>/dev/null && pwd)
cd "$SCRIPT_DIR"/../../.. || exit 1

PYTHONPATH=. ./unstructured/ingest/main.py \
  local \
  --input-path example-docs/book-war-and-peace-1225p.txt \
  --output-dir local-output-to-clarifai \
  --strategy fast \
  --chunk-elements \
  --num-processes 2 \
  --verbose \
  clarifai \
  --app-id "<your clarifai app name>" \
  --user-id "<your clarifai user id>" \
  --api-key "<your clarifai PAT key>" \
  --batch-size 100
3 changes: 3 additions & 0 deletions requirements/ingest/clarifai.in
@@ -0,0 +1,3 @@
-c ../constraints.in
-c ../base.txt
clarifai