feature CORE-3985: add Clarifai destination connector #2633

Merged
merged 17 commits on Mar 21, 2024
1 change: 1 addition & 0 deletions .github/workflows/ci.yml
@@ -398,6 +398,7 @@ jobs:
VECTARA_CUSTOMER_ID: ${{secrets.VECTARA_CUSTOMER_ID}}
ASTRA_DB_TOKEN: ${{secrets.ASTRA_DB_TOKEN}}
ASTRA_DB_ENDPOINT: ${{secrets.ASTRA_DB_ENDPOINT}}
CLARIFAI_API_KEY: ${{secrets.CLARIFAI_API_KEY}}
TABLE_OCR: "tesseract"
OCR_AGENT: "unstructured.partition.utils.ocr_models.tesseract_ocr.OCRAgentTesseract"
CI: "true"
4 changes: 3 additions & 1 deletion CHANGELOG.md
@@ -1,4 +1,4 @@
## 0.12.7-dev8
## 0.12.7-dev9

### Enhancements

@@ -8,6 +8,7 @@
### Features

* **Chunking populates `.metadata.orig_elements` for each chunk.** This behavior allows the text and metadata of the elements combined to make each chunk to be accessed. This can be important for example to recover metadata such as `.coordinates` that cannot be consolidated across elements and so is dropped from chunks. This option is controlled by the `include_orig_elements` parameter to `partition_*()` or to the chunking functions. This option defaults to `True` so original-elements are preserved by default. This behavior is not yet supported via the REST APIs or SDKs but will be in a closely subsequent PR to other `unstructured` repositories. The original elements will also not serialize or deserialize yet; this will also be added in a closely subsequent PR.
* **Add Clarifai destination connector.** Adds support for writing partitioned and chunked documents into Clarifai.

### Fixes

@@ -24,6 +25,7 @@
* **Redefine `table_level_acc` metric for table evaluation.** `table_level_acc` now is an average of individual predicted table's accuracy. A predicted table's accuracy is defined as the sequence matching ratio between itself and its corresponding ground truth table.

### Features

* **Added Unstructured Platform Documentation** The Unstructured Platform is currently in beta. The documentation provides how-to guides for setting up workflow automation, job scheduling, and configuring source and destination connectors.

### Fixes
4 changes: 4 additions & 0 deletions Makefile
@@ -251,6 +251,10 @@ install-ingest-databricks-volumes:
install-ingest-astra:
python3 -m pip install -r requirements/ingest/astra.txt

.PHONY: install-ingest-clarifai
install-ingest-clarifai:
python3 -m pip install -r requirements/ingest/clarifai.txt

.PHONY: install-embed-huggingface
install-embed-huggingface:
python3 -m pip install -r requirements/ingest/embed-huggingface.txt
1 change: 1 addition & 0 deletions docs/source/ingest/destination_connectors.rst
@@ -13,6 +13,7 @@ in our community `Slack. <https://short.unstructured.io/pzw05l7>`_
destination_connectors/azure_cognitive_search
destination_connectors/box
destination_connectors/chroma
destination_connectors/clarifai
destination_connectors/databricks_volumes
destination_connectors/delta_table
destination_connectors/dropbox
34 changes: 34 additions & 0 deletions docs/source/ingest/destination_connectors/clarifai.rst
@@ -0,0 +1,34 @@
Clarifai
===========

Batch process all your records using ``unstructured-ingest`` to store unstructured outputs locally on your filesystem and upload them to Clarifai apps.

First, install the Clarifai dependencies:

.. code:: shell

   pip install "unstructured[clarifai]"

Create a Clarifai app with a base workflow. For more details, see `create a Clarifai app <https://docs.clarifai.com/clarifai-basics/applications/create-an-application/>`_.

Run Locally
-----------
The upstream connector can be any of the supported connectors; for convenience, the samples below use the local upstream connector.

.. tabs::

   .. tab:: Shell

      .. literalinclude:: ./code/bash/clarifai.sh
         :language: bash

   .. tab:: Python

      .. literalinclude:: ./code/python/clarifai.py
         :language: python

For a full list of the options the CLI accepts, run ``unstructured-ingest <upstream connector> clarifai --help``.

NOTE: If you are running this locally, you will need all the appropriate extras and dependencies for the file types of the documents contained in your data storage platform. You can find more information in the `installation guide <https://unstructured-io.github.io/unstructured/installing.html>`_.
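To make the note above concrete, the sketch below computes which ``unstructured`` extras a set of input files would need before a local run. The extension-to-extra mapping is a small illustrative assumption, not an official list; consult the installation guide for the authoritative extras.

```python
import os

# Hypothetical mapping from file extension to the `unstructured` extra
# that handles it locally (illustrative subset only).
EXTENSION_TO_EXTRA = {
    ".pdf": "pdf",
    ".docx": "docx",
    ".pptx": "pptx",
    ".txt": None,  # plain text needs no extra
}


def required_extras(paths):
    """Return the set of extras needed to partition the given files."""
    extras = set()
    for path in paths:
        extra = EXTENSION_TO_EXTRA.get(os.path.splitext(path)[1].lower())
        if extra:
            extras.add(extra)
    return extras
```

You could then run, for example, ``pip install "unstructured[pdf,clarifai]"`` when ``required_extras`` reports that PDFs are present.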


15 changes: 15 additions & 0 deletions docs/source/ingest/destination_connectors/code/bash/clarifai.sh
@@ -0,0 +1,15 @@
#!/usr/bin/env bash

unstructured-ingest \
  local \
  --input-path example-docs/book-war-and-peace-1225p.txt \
  --output-dir local-output-to-clarifai \
  --strategy fast \
  --chunk-elements \
  --num-processes 2 \
  --verbose \
  clarifai \
  --app-id "<your clarifai app name>" \
  --user-id "<your clarifai user id>" \
  --api-key "<your clarifai PAT key>" \
  --batch-size 100
48 changes: 48 additions & 0 deletions docs/source/ingest/destination_connectors/code/python/clarifai.py
@@ -0,0 +1,48 @@
from unstructured.ingest.connector.clarifai import (
    ClarifaiAccessConfig,
    ClarifaiWriteConfig,
    SimpleClarifaiConfig,
)
from unstructured.ingest.connector.local import SimpleLocalConfig
from unstructured.ingest.interfaces import (
    ChunkingConfig,
    PartitionConfig,
    ProcessorConfig,
    ReadConfig,
)
from unstructured.ingest.runner import LocalRunner
from unstructured.ingest.runner.writers.base_writer import Writer
from unstructured.ingest.runner.writers.clarifai import (
    ClarifaiWriter,
)


def get_writer() -> Writer:
    return ClarifaiWriter(
        connector_config=SimpleClarifaiConfig(
            access_config=ClarifaiAccessConfig(api_key="CLARIFAI_PAT"),
            app_id="CLARIFAI_APP",
            user_id="CLARIFAI_USER_ID",
        ),
        write_config=ClarifaiWriteConfig(),
    )


if __name__ == "__main__":
    writer = get_writer()
    runner = LocalRunner(
        processor_config=ProcessorConfig(
            verbose=True,
            output_dir="local-output-to-clarifai-app",
            num_processes=2,
        ),
        connector_config=SimpleLocalConfig(
            input_path="example-docs/book-war-and-peace-1225p.txt",
        ),
        read_config=ReadConfig(),
        partition_config=PartitionConfig(),
        chunking_config=ChunkingConfig(chunk_elements=True),
        writer=writer,
        writer_kwargs={},
    )
    runner.run()
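The ``--batch-size 100`` flag shown in the shell example caps how many records are sent to Clarifai per upload request. Conceptually, the writer splits its inputs into fixed-size batches, which can be sketched like this (a simplified illustration, not the actual connector code):

```python
from typing import Iterator, List, Sequence


def batched(items: Sequence, batch_size: int) -> Iterator[List]:
    """Yield successive batches of at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    for start in range(0, len(items), batch_size):
        # Slicing past the end is safe: the last batch is simply shorter.
        yield list(items[start : start + batch_size])
```

With 250 chunked elements and ``batch_size=100``, this produces two full batches of 100 followed by one batch of 50, each of which would be uploaded in a single request.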
2 changes: 1 addition & 1 deletion docs/source/introduction/key_concepts.rst
@@ -66,7 +66,7 @@ A RAG workflow can be broken down into the following steps:

4. **Embedding**: After chunking, you must convert the text into a numerical representation (vector embedding) that an LLM can understand. To use the various embedding models using Unstructured tools, please refer to `this page <https://unstructured-io.github.io/unstructured/core/embedding.html>`__.

5. **Vector Database**: The next step is to choose a location for storing your chunked embeddings. There are many options for your vector database (ChromaDB, Milvus, Pinecone, Qdrant, Weaviate, and more). For complete list of Unstructured ``Destination Connectors``, please visit `this page <https://unstructured-io.github.io/unstructured/ingest/destination_connectors.html>`__.
5. **Vector Database**: The next step is to choose a location for storing your chunked embeddings. There are many options for your vector database (AstraDB, ChromaDB, Clarifai, Milvus, Pinecone, Qdrant, Weaviate, and more). For a complete list of Unstructured ``Destination Connectors``, please visit `this page <https://unstructured-io.github.io/unstructured/ingest/destination_connectors.html>`__.

6. **User Prompt**: Take the user prompt and grab the most relevant chunks of information in the vector database via similarity search.

20 changes: 20 additions & 0 deletions examples/ingest/clarifai/ingest.sh
@@ -0,0 +1,20 @@
#!/usr/bin/env bash

# Uploads the structured output of the files within the given path to a clarifai app.

SCRIPT_DIR=$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" &>/dev/null && pwd)
cd "$SCRIPT_DIR"/../../.. || exit 1

PYTHONPATH=. ./unstructured/ingest/main.py \
  local \
  --input-path example-docs/book-war-and-peace-1225p.txt \
  --output-dir local-output-to-clarifai \
  --strategy fast \
  --chunk-elements \
  --num-processes 2 \
  --verbose \
  clarifai \
  --app-id "<your clarifai app name>" \
  --user-id "<your clarifai user id>" \
  --api-key "<your clarifai PAT key>" \
  --batch-size 100
3 changes: 3 additions & 0 deletions requirements/ingest/clarifai.in
@@ -0,0 +1,3 @@
-c ../constraints.in
-c ../base.txt
clarifai