
Add ingestion function for ingesting files to vector search #532

Open · wants to merge 13 commits into main

Conversation

@NikolaosPapailiou (Contributor) commented Mar 8, 2024:

This adds a one-click ingestion function for ingesting files to vector search.

Tested in cloud:


This pull request has been linked to Shortcut Story #42043: Trigger Task Graph for Indexing (Ingestion).

@ihnorton ihnorton self-requested a review March 8, 2024 13:04
@NikolaosPapailiou NikolaosPapailiou requested review from thetorpedodog and removed request for Tile-Kyle March 11, 2024 10:22
@Shelnutt2 (Member) left a comment:

Several comments in a first pass.

This is also completely missing interaction with TileDB Files. The requirement is to support loading into the TileDB Files store and performing indexing, not just indexing.

(7 resolved review threads on src/tiledb/cloud/vector_search/file_ingestion.py)
@NikolaosPapailiou (Contributor, Author) commented:
> This is also completely missing interaction with TileDB Files. The requirement is to support loading into the TileDB Files store and performing indexing, not just indexing.

I don't think we have discussed requirements for this. This needs to be designed in collaboration with cloud, and we need to understand who owns the file-ingestion code and implementation.

@Shelnutt2 (Member) commented:
> I don't think we have discussed requirements for this. This needs to be designed in collaboration with cloud, and we need to understand who owns the file-ingestion code and implementation.

When you are back, we need to sync on this. I believe we had discussed this, and the example POC code I provided handled all of it. The goal was to take the POC we used and re-implement the features in a production fashion. The first goal is to support file ingestion and index creation as part of one pipeline.

@JohnMoutafis (Contributor) left a comment:

We should avoid nesting functions, especially when they are used only once throughout the body of the method.
Functions like index_exists that act only as a passthrough to another function should also be avoided: they are unnecessary, add overhead, and complicate the code and any potential debugging process.
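For illustration (hypothetical names, not code from this PR), the passthrough pattern the comment discourages looks like this, next to the flattened version it recommends:

```python
KNOWN_INDEXES = {"tiledb://ns/index_a"}

def _index_exists_impl(uri: str) -> bool:
    return uri in KNOWN_INDEXES

# Discouraged: a passthrough that only forwards to another function,
# adding a stack frame and an extra name to follow while debugging.
def index_exists_passthrough(uri: str) -> bool:
    return _index_exists_impl(uri)

# Preferred: keep the actual logic in one function and call it directly.
def index_exists(uri: str) -> bool:
    return uri in KNOWN_INDEXES
```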

(3 resolved review threads on src/tiledb/cloud/vector_search/file_ingestion.py)
@NikolaosPapailiou (Contributor, Author) commented:
@Shelnutt2, are your comments addressed by the changes? This PR requires your approval to continue with merging.

embedding_class_ = getattr(embeddings_module, embedding_class)
embedding = embedding_class_(**embedding_kwargs)

with tiledb.scope_ctx(config):
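For context, the quoted snippet resolves an embedding class by its string name and instantiates it with keyword arguments. A minimal standalone sketch of that getattr pattern (using a stdlib class here instead of an embeddings module):

```python
import importlib

def instantiate_by_name(module_name: str, class_name: str, **kwargs):
    # Import the module, look the class up by its string name, and
    # construct it with the provided keyword arguments -- the same
    # shape as the embedding_class_ construction in the snippet above.
    module = importlib.import_module(module_name)
    cls = getattr(module, class_name)
    return cls(**kwargs)

# collections.Counter accepts keyword counts, e.g. Counter(a=2, b=1).
counts = instantiate_by_name("collections", "Counter", a=2, b=1)
```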
A reviewer (Member) commented:

This should be done as the first stage in the graph, not local to the caller.

@NikolaosPapailiou (Contributor, Author) replied:

I am not sure I understand this; ingest_files_udf is running within the taskgraph.

@NikolaosPapailiou (Contributor, Author) replied:

I created an alternative version of the PR (#547) that uses an extra taskgraph and creates the dataset as the first node in the taskgraph. Is this what you are expecting the ingestion structure to look like?

@NikolaosPapailiou (Contributor, Author) replied:

@Shelnutt2, does the alternative version match your expectation? Let me know if you still have concerns.

@NikolaosPapailiou (Contributor, Author) replied:

Applied the alternative taskgraph structure in this PR.

@NikolaosPapailiou (Contributor, Author) replied:

@Shelnutt2, this PR needs your approval to move forward. Let me know if you need any more changes.

@Shelnutt2 (Member) left a comment:

Several comments need to be fixed. This code also needs to be aligned with the goal of handling TileDB FileStore files as a primary source.

Additionally, there are some pylint errors related to variables that aren't passed through. Please address all lint errors.

driver_image: Optional[str] = None,
extra_driver_modules: Optional[List[str]] = None,
max_tasks_per_stage: int = -1,
embeddings_generation_mode: dag.Mode = dag.Mode.LOCAL,
A reviewer (Member) commented:

Everything must default to batch mode. Running this in local mode is unexpected. The goal is that, like all other verticals, we support and default to batch ingestion capabilities.

@NikolaosPapailiou (Contributor, Author) replied:

Document indexing has multiple execution steps that can be run in different modes:

  • ingest_files creates a BATCH taskgraph that runs all the indexing. This means that all processing happens within a BATCH taskgraph with access_credentials, even if the options here are set to LOCAL.
  • embeddings_generation reads the documents and creates text embeddings. This can spawn its own taskgraph.
  • vector_indexing creates a vector index from the produced embeddings. This can spawn its own taskgraph.

The default configuration at the moment is:

  • ingest_files creates a BATCH taskgraph that runs all the indexing.
  • embeddings_generation and vector_indexing run in LOCAL mode within a UDF of the ingest_files taskgraph. Both of these tasks can leverage the available parallelism within the single worker.

This is expected to be a good default execution configuration for cost and latency, even for sets of thousands of documents.

Do you want all the execution steps to be executed in BATCH mode by default?
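The split described in that comment can be sketched with a tiny dispatcher (illustrative only; `dag.Mode` is the real enum in tiledb-cloud, but `run_step` and the returned strings here are hypothetical):

```python
from enum import Enum

class Mode(Enum):
    # Stand-in for dag.Mode in tiledb-cloud.
    LOCAL = "local"   # run inside a UDF of the parent BATCH taskgraph
    BATCH = "batch"   # spawn a separate BATCH taskgraph for the step

def run_step(name: str, mode: Mode) -> str:
    # Decide whether a step spawns its own graph or runs inline.
    if mode is Mode.BATCH:
        return f"{name}: spawned its own BATCH taskgraph"
    return f"{name}: ran LOCAL within the ingest_files taskgraph"

# Current defaults described above: both steps LOCAL inside the
# (always-BATCH) ingest_files graph.
plan = [run_step("embeddings_generation", Mode.LOCAL),
        run_step("vector_indexing", Mode.LOCAL)]
```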

@Shelnutt2 (Member) commented Apr 26, 2024:

The requirement, again, as we discussed and as spelled out in the story, is to have a robust batch-mode ingestion that can scale to millions of documents. Local mode is a bad default and does not meet our intended goal; please change it, and please be sure you actually test at scale. These issues are easy to see even when running just our same test datasets.



def ingest_files(
file_dir_uri: str,
A reviewer (Member) commented:

This does not work as expected: passing in a TileDB file URI gets ignored. Please test this and add unit tests for the relevant cases. Currently this does not cover the required use cases.

@NikolaosPapailiou (Contributor, Author) replied:

As implemented at the moment, this should be the group URI, and we pick up the files from the group (applying regexp patterns if provided). What cases are you looking to support here?
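A hedged sketch of the behavior described here — selecting group members by regexp include/exclude patterns. The function name and patterns are illustrative, not the PR's actual API:

```python
import re
from typing import List, Optional

def select_group_files(
    member_uris: List[str],
    include: Optional[str] = None,
    exclude: Optional[str] = None,
) -> List[str]:
    # Keep members matching `include` (if set) and not matching
    # `exclude` (if set), mirroring regexp filtering over a group.
    kept = []
    for uri in member_uris:
        if include is not None and re.search(include, uri) is None:
            continue
        if exclude is not None and re.search(exclude, uri) is not None:
            continue
        kept.append(uri)
    return kept

members = [
    "tiledb://ns/group/doc1.pdf",
    "tiledb://ns/group/readme.txt",
    "tiledb://ns/group/doc2.pdf",
]
pdfs = select_group_files(members, include=r"\.pdf$")
```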

A reviewer (Member) commented:

The requirements are to support TileDB FileStore files or a group of files. This has been a hard requirement from day one and is outlined in our planning document.

def ingest_files(
file_dir_uri: str,
index_uri: str,
file_name: Optional[str] = None,
A reviewer (Member) commented:

This, along with include/exclude, doesn't make sense. How is this supposed to work with TileDB files? There is no check of the TileDB file name, nor any parsing of the TileDB URIs. The goal, again as outlined in the requirements, is to use this with TileDB files, either standalone or from a group.

# Index update params
index_timestamp: Optional[int] = None,
workers: int = -1,
worker_resources: Optional[Dict] = None,
A reviewer (Member) commented:

This is not plumbing through all the different resource parameters; is there a reason?

environment_variables=environment_variables,
load_embedding=False,
load_metadata_in_memory=False,
memory_budget=1,
A reviewer (Member) commented:

Why is this set to 1? Please add inline code comments; there should be a decent number of comments explaining the purpose of values such as this one. The goal is for others to be able to read the code and comments, understand the code, and be able to work on it.

mode=dag.Mode.BATCH,
)
if worker_resources is None:
driver_resources = {"cpu": "2", "memory": "8Gi"}
A reviewer (Member) commented:

Did you mean worker or driver here?
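The quoted snippet tests `worker_resources` but assigns `driver_resources`, which is the mismatch behind the question. A hypothetical sketch of defaulting each independently (the driver values below are illustrative; only the "2"/"8Gi" pair appears in the snippet):

```python
from typing import Dict, Optional

def resolve_resources(
    worker_resources: Optional[Dict] = None,
    driver_resources: Optional[Dict] = None,
) -> Dict[str, Dict]:
    # Default each dict under its own check, so the condition and the
    # variable it assigns stay aligned.
    if worker_resources is None:
        worker_resources = {"cpu": "2", "memory": "8Gi"}
    if driver_resources is None:
        driver_resources = {"cpu": "1", "memory": "4Gi"}
    return {"worker": worker_resources, "driver": driver_resources}

defaults = resolve_resources()
```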

4 participants