
Add ingestion function for ingesting files to vector search #532

Open · wants to merge 13 commits into main

Conversation

@NikolaosPapailiou (Contributor) commented Mar 8, 2024:

This adds a one-click ingestion function for ingesting files to vector search.

Tested in cloud:


This pull request has been linked to Shortcut Story #42043: Trigger Task Graph for Indexing (Ingestion).

@ihnorton ihnorton self-requested a review March 8, 2024 13:04
@NikolaosPapailiou NikolaosPapailiou requested review from thetorpedodog and removed request for Tile-Kyle March 11, 2024 10:22
@Shelnutt2 (Member) left a comment:

Several comments in a first pass.

This is also completely missing interaction with TileDB Files. The requirement is to support loading into the TileDB Files store and performing indexing, not just indexing.

(7 resolved review threads on src/tiledb/cloud/vector_search/file_ingestion.py)
@NikolaosPapailiou (Contributor, Author) commented:
> This is also completely missing interaction with TileDB Files. The requirement is to support loading into the TileDB Files store and performing indexing, not just indexing.

I don't think we have discussed requirements for this. This needs to be designed in collaboration with cloud, and we need to understand who owns the file-ingestion code and implementation.

@Shelnutt2 (Member) commented:
> I don't think we have discussed requirements for this. This needs to be designed in collaboration with cloud, and we need to understand who owns the file-ingestion code and implementation.

When you are back, we need to sync on this. I believe we had discussed this, and the example POC code I provided handled all of it. The goal was to take the POC we used and re-implement the features in a production fashion. The first goal is to support file ingestion and index creation as part of one pipeline.

@JohnMoutafis (Contributor) left a comment:

We should avoid nesting functions, especially when they are used only once throughout the body of the method.
Functions like index_exists that act only as a passthrough to another function should also be avoided: they are unnecessary, add overhead, and complicate the code and any potential debugging process.
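For illustration (hypothetical names, not code from this PR), the passthrough pattern the comment discourages looks like this, next to the flattened version it recommends:

```python
KNOWN_INDEXES = {"tiledb://ns/index_a"}

def _index_exists_impl(uri: str) -> bool:
    return uri in KNOWN_INDEXES

# Discouraged: a passthrough that only forwards to another function,
# adding a stack frame and an extra name to follow while debugging.
def index_exists_passthrough(uri: str) -> bool:
    return _index_exists_impl(uri)

# Preferred: keep the actual logic in one function and call it directly.
def index_exists(uri: str) -> bool:
    return uri in KNOWN_INDEXES
```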

(3 resolved review threads on src/tiledb/cloud/vector_search/file_ingestion.py)
@NikolaosPapailiou (Contributor, Author) commented:
@Shelnutt2, are your comments addressed by the changes? This PR requires your approval to continue with merging.

embedding_class_ = getattr(embeddings_module, embedding_class)
embedding = embedding_class_(**embedding_kwargs)

with tiledb.scope_ctx(config):
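For context, the quoted snippet resolves an embedding class by its string name and instantiates it with keyword arguments. A minimal standalone sketch of that getattr pattern (using a stdlib class here instead of an embeddings module):

```python
import importlib

def instantiate_by_name(module_name: str, class_name: str, **kwargs):
    # Import the module, look the class up by its string name, and
    # construct it with the provided keyword arguments -- the same
    # shape as the embedding_class_ construction in the snippet above.
    module = importlib.import_module(module_name)
    cls = getattr(module, class_name)
    return cls(**kwargs)

# collections.Counter accepts keyword counts, e.g. Counter(a=2, b=1).
counts = instantiate_by_name("collections", "Counter", a=2, b=1)
```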
A reviewer (Member) commented:

This should be done as the first stage in the graph, not local to the caller.

@NikolaosPapailiou (Contributor, Author) replied:

I am not sure I understand this; ingest_files_udf is running within the taskgraph.

@NikolaosPapailiou (Contributor, Author) replied:

I created an alternative version of the PR (#547) that uses an extra taskgraph and creates the dataset as the first node in the taskgraph. Is this what you are expecting the ingestion structure to look like?

@NikolaosPapailiou (Contributor, Author) replied:

@Shelnutt2, does the alternative version match your expectation? Let me know if you still have concerns.

@NikolaosPapailiou (Contributor, Author) replied:

Applied the alternative taskgraph structure in this PR.

@NikolaosPapailiou (Contributor, Author) replied:

@Shelnutt2, this PR needs your approval to move forward. Let me know if you need any more changes.

@Shelnutt2 (Member) left a comment:

Several comments need to be fixed. This code also needs to be aligned with the goal of handling TileDB FileStore files as a primary source.

Additionally, there are some pylint errors related to variables that aren't passed through. Please address all lint errors.

driver_image: Optional[str] = None,
extra_driver_modules: Optional[List[str]] = None,
max_tasks_per_stage: int = -1,
embeddings_generation_mode: dag.Mode = dag.Mode.LOCAL,
A reviewer (Member) commented:

Everything must default to batch mode. Running this in local mode is unexpected. The goal is that, like all other verticals, we support and default to batch ingestion capabilities.

@NikolaosPapailiou (Contributor, Author) replied:

Document indexing has multiple execution steps that can be run in different modes:

  • ingest_files creates a BATCH taskgraph that runs all the indexing. This means that all processing happens within a BATCH taskgraph with access_credentials, even if the options here are set to LOCAL.
  • embeddings_generation reads the documents and creates text embeddings. This can spawn its own taskgraph.
  • vector_indexing creates a vector index from the produced embeddings. This can spawn its own taskgraph.

The default configuration at the moment is:

  • ingest_files creates a BATCH taskgraph that runs all the indexing.
  • embeddings_generation and vector_indexing run in LOCAL mode within a UDF of the ingest_files taskgraph. Both of these tasks can leverage the available parallelism within the single worker.

This is expected to be a good default execution configuration for cost and latency, even for sets of thousands of documents.

Do you want all the execution steps to be executed in BATCH mode by default?
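The split described in that comment can be sketched with a tiny dispatcher (illustrative only; `dag.Mode` is the real enum in tiledb-cloud, but `run_step` and the returned strings here are hypothetical):

```python
from enum import Enum

class Mode(Enum):
    # Stand-in for dag.Mode in tiledb-cloud.
    LOCAL = "local"   # run inside a UDF of the parent BATCH taskgraph
    BATCH = "batch"   # spawn a separate BATCH taskgraph for the step

def run_step(name: str, mode: Mode) -> str:
    # Decide whether a step spawns its own graph or runs inline.
    if mode is Mode.BATCH:
        return f"{name}: spawned its own BATCH taskgraph"
    return f"{name}: ran LOCAL within the ingest_files taskgraph"

# Current defaults described above: both steps LOCAL inside the
# (always-BATCH) ingest_files graph.
plan = [run_step("embeddings_generation", Mode.LOCAL),
        run_step("vector_indexing", Mode.LOCAL)]
```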

@Shelnutt2 (Member) commented Apr 26, 2024:

The requirement, again, as we discussed and as spelled out in the story, is to have a robust batch-mode ingestion that can scale to millions of documents. Local mode is a bad default and does not meet our intended goal; please change it, and please be sure you actually test at scale. These issues are easy to see even when running just our same test datasets.



def ingest_files(
file_dir_uri: str,
A reviewer (Member) commented:

This does not work as expected: passing in a TileDB file URI gets ignored. Please test this and add unit tests for the relevant cases. Currently this does not cover the required use cases.

@NikolaosPapailiou (Contributor, Author) replied:

As implemented at the moment, this should be the group URI, and we pick up the files from the group (applying regexp patterns if provided). What cases are you looking to support here?
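A hedged sketch of the behavior described here — selecting group members by regexp include/exclude patterns. The function name and patterns are illustrative, not the PR's actual API:

```python
import re
from typing import List, Optional

def select_group_files(
    member_uris: List[str],
    include: Optional[str] = None,
    exclude: Optional[str] = None,
) -> List[str]:
    # Keep members matching `include` (if set) and not matching
    # `exclude` (if set), mirroring regexp filtering over a group.
    kept = []
    for uri in member_uris:
        if include is not None and re.search(include, uri) is None:
            continue
        if exclude is not None and re.search(exclude, uri) is not None:
            continue
        kept.append(uri)
    return kept

members = [
    "tiledb://ns/group/doc1.pdf",
    "tiledb://ns/group/readme.txt",
    "tiledb://ns/group/doc2.pdf",
]
pdfs = select_group_files(members, include=r"\.pdf$")
```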

A reviewer (Member) commented:

The requirements are to support TileDB FileStore files or a group of files. This has been a hard requirement from day one and is outlined in our planning document.

def ingest_files(
file_dir_uri: str,
index_uri: str,
file_name: Optional[str] = None,
A reviewer (Member) commented:

This, along with include/exclude, doesn't make sense. How is this supposed to work with TileDB files? There is no check of the TileDB file name, nor any parsing of the TileDB URIs. The goal, again as outlined in the requirements, is to use this with TileDB files, either standalone or from a group.

# Index update params
index_timestamp: Optional[int] = None,
workers: int = -1,
worker_resources: Optional[Dict] = None,
A reviewer (Member) commented:

This is not plumbing through all the different resource parameters; is there a reason?

environment_variables=environment_variables,
load_embedding=False,
load_metadata_in_memory=False,
memory_budget=1,
A reviewer (Member) commented:

Why is this set to 1? Please add inline code comments; there should be a decent number of comments explaining the purpose of values such as this one. The goal is for others to be able to read the code and comments, understand the code, and be able to work on it.

mode=dag.Mode.BATCH,
)
if worker_resources is None:
driver_resources = {"cpu": "2", "memory": "8Gi"}
A reviewer (Member) commented:

Did you mean worker or driver here?
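The quoted snippet tests `worker_resources` but assigns `driver_resources`, which is the mismatch behind the question. A hypothetical sketch of defaulting each independently (the driver values below are illustrative; only the "2"/"8Gi" pair appears in the snippet):

```python
from typing import Dict, Optional

def resolve_resources(
    worker_resources: Optional[Dict] = None,
    driver_resources: Optional[Dict] = None,
) -> Dict[str, Dict]:
    # Default each dict under its own check, so the condition and the
    # variable it assigns stay aligned.
    if worker_resources is None:
        worker_resources = {"cpu": "2", "memory": "8Gi"}
    if driver_resources is None:
        driver_resources = {"cpu": "1", "memory": "4Gi"}
    return {"worker": worker_resources, "driver": driver_resources}

defaults = resolve_resources()
```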

4 participants