Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add ingestion function for ingesting files to vector search #532
base: main
Are you sure you want to change the base?
Add ingestion function for ingesting files to vector search #532
Changes from 3 commits
f0691c9
f37de0a
987a26a
0b7f010
9b0e8a5
3befe31
9f8b5f8
fe0c981
fd8987b
7132369
011ffa1
61f389a
d483531
4b9c422
482bdbd
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This does not work as expected. Passing in a TileDB file URI get ignored. Please test this and add unit tests for the relevant cases. Currently this does not cover the required use cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As it is implemented atm this should be the group URI and we pick up the file from the group(applying regexp patterns if provided). What are the cases you are looking for supporting here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have some testcases for this here https://github.com/TileDB-Inc/TileDB-Vector-Search/blob/main/apis/python/test/test_directory_reader.py
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The requirements are to support TileDB FileStore files or a group of files. This has been a hard requirement from day one and is outlined in our planning document.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Renamed
file_dir_uri
tosearch_uri
to be consistent with other ingestion jobs.search_uri
supports FileStore URIs. If we want to index one file using the FileStore URI we can pass it assearch_uri
. @Shelnutt2 is this covering your expectations of the function signature?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This along with include/exclude don't make sense. How is this to work with TileDB files? There is no check if the TileDB file name or any parsing of the TileDB URIs. The goal again as outlined in the requirements is to use this with TileDB files, either standalone or from a group.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed the
file_name
option. FileStore URIs work directly using thesearch_uri
param.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Everything must default to batch mode. Running this in local is unexpected. The goals are that like all other verticals we support and default to batch ingestion capabilities.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Document indexing has multiple execution steps that can be run in different modes:
ingest_files
creates aBATCH
taskgraph that runs all the indexing. This means that all processing happens within aBATCH
taskgraph withaccess_credentials
even if the options here are set toLOCAL
embeddings_generation
: reads the documents and creates text embeddings. This can spawn its own taskgraph.vector_indexing
: creates a vector index from the produced embeddings. This can spawn its own taskgraph.The default configuration at the moment is:
ingest_files
creates aBATCH
taskgraph that runs all the indexing.embeddings_generation, vector_indexing
run inLOCAL
mode within a UDF of theingest_files
taskgraph. Both of these tasks can leverage the available parallelism within the single worker.This is expected to be a good default execution configuration for cost and latency even for sets of thousands of documents.
Do you want all the execution steps to be executed in
BATCH
mode by default?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The requirement again, as we discussed and as spelled out in the story is to have a robust batch mode ingestion that can scale to millions of documents. Local mode is a bad default and does not meet our intended goal, please change it and please be sure you actually test at scale. These issues are easy to see even running just our same test datasets.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed
embeddings_generation_mode, vector_indexing
modes toBATCH
.Note: this setting can introduce some latency and resource overhead for small to medium size document datasets.
There were some bugs that led to execution failure for this setup. This was addressed in TileDB-Inc/TileDB-Vector-Search#351 Also added tests for the BATCH execution in Cloud.