Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Postpone index build for empty table to first insert #209

Conversation

therealdarkknight
Copy link
Contributor

@therealdarkknight therealdarkknight commented Oct 19, 2023

Fixes #198

In the event of an index being built on an empty table where no dimension is specified in the options (using CREATE INDEX... WITH (dim=k)), we postpone the creation of this index to the first insertion, where we can get the dimension of the inserted vector and then build the index. This improves user UX as outlined in the issue.

I updated tests, and expanded on the hnsw_create test to test the above functionality.

Notes/Observations:
0. One thing to consider is what might happen when we have concurrent inserts/calls to aminsert when the index build is postponed. We don't want to trigger the index build more than once accidentally. Perhaps we should pass a Boolean state that marks the progress of a postponed index build and check against it? I have yet to look into Postgres' concurrency model. Feel free to share ideas.

  1. The BuildCallback immediately returns after the first tuple is read (explained in the comment there). This was done for simplicity's sake. But this also means that postgres is pointlessly calling this callback for every other tuple in the heap that exists at that time, since each callback after the first one will immediately return. So a future optimization may be, on postponed builds, to only read the first row and write it to the index without performing the full heap scan with the callback. However, note that this is only relevant in a specific scenario: when we construct an index on an empty table, specify no dimension, and do a bulk insert of some sort as part of the first insert (like \COPY on a csv file), which is when the heap would have more than one tuple at the time of building the postponed index. Since this scenario is pretty rare, we already have a simple implementation, and we are not guaranteed significant performance boosts if we opt for the alternative, I think that the current approach is well worth it.

  2. I noticed that the original codebase, when inferring a dimension (via the GetArrayLengthFromHeap method), we ditch inference entirely upon encountering a NULL vector. This means that we're not able to inference when the first encountered tuple is NULL but there are other non-NULL tuples. Wanted to point this out-- I don't think it would be a hard change: we can just keep going until we find a non-NULL vector or reach the end.

  3. I believe some of the code in GetHnswIndexDimensions can be refactored. It basically tries to do the same thing as InferDimension, and both use GetArrayLength. Idea is to have GetHnswIndexDimensions just call InferDimension and then refactor the InitBuildState as well, since it is currently calling both of these methods under certain circumstances which leads to the same operation effectively being done twice. But the way that GetHnswIndexDimensions and InferDimension use GetArraylength is also slightly different. Is the relevant part of GetHnswIndexDimensions and InferDimension trying to accomplish the same thing? Curious to hear thoughts on this.

@var77
Copy link
Collaborator

var77 commented Oct 27, 2023

Thank you @therealdarkknight !
I was curious, what if we build the index on empty table with let's say 1 dimension and then on the first insert if table is empty update the index header (with exclusive lock) by inferring the dimensions from the current tuple being inserted?

Will there be any problems with that approach or does postponing the index build have more advantages, what do you think?

cc: @Ngalstyan4

@Ngalstyan4
Copy link
Contributor

I agree with @var77's suggestion. That makes things cleaner.

Let's postpone this for now tho and concentrate on other things first.

@Ngalstyan4
Copy link
Contributor

Superseded by #251

@Ngalstyan4 Ngalstyan4 closed this Jan 13, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Allow inserting vectors of arbitrary size on an empty index and pick up index dimension on first insert
3 participants