rsc: Allow blob stores to dedup blobs #1567
Conversation
```rust
let result = blob::Entity::insert(blob)
    .on_conflict(
        OnConflict::columns(vec![blob::Column::Key, blob::Column::StoreId])
            .update_column(blob::Column::CreatedAt)
```
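The upsert semantics in this excerpt can be sketched in plain Rust (a minimal model, not the SeaORM call itself; `BlobTable` and the id/timestamp layout are illustrative): a conflicting insert on the `(store_id, key)` pair keeps the existing row and id, and only refreshes `created_at`.

```rust
use std::collections::hash_map::{Entry, HashMap};

// Sketch of the ON CONFLICT behavior: (store_id, key) is unique; a
// duplicate insert returns the existing row id and resets created_at.
struct BlobTable {
    rows: HashMap<(u32, String), (u64 /* id */, u64 /* created_at */)>,
    next_id: u64,
}

impl BlobTable {
    fn new() -> Self {
        BlobTable { rows: HashMap::new(), next_id: 1 }
    }

    fn upsert(&mut self, store_id: u32, key: &str, now: u64) -> u64 {
        let next = self.next_id;
        match self.rows.entry((store_id, key.to_string())) {
            Entry::Occupied(mut e) => {
                // Dedup hit: keep the id, refresh created_at (restarts TTL).
                e.get_mut().1 = now;
                e.get().0
            }
            Entry::Vacant(v) => {
                self.next_id += 1;
                v.insert((next, now)).0
            }
        }
    }
}

fn main() {
    let mut t = BlobTable::new();
    let a = t.upsert(1, "empty", 100);
    let b = t.upsert(1, "empty", 250); // duplicate upload from another job
    assert_eq!(a, b); // same row id handed back
    assert_eq!(t.rows.len(), 1); // flattened to one entry
    println!("id={} rows={}", b, t.rows.len());
}
```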
We need to update the CreatedAt timestamp for eviction reasons?
It seems like it would be best to keep the time of creation the same and just track the new use time.
We do, yeah. Every blob has an unreferenced TTL during which we guarantee the blob won't get evicted.
If we don't update created_at, the following can occur:
1. Job A references Blob 1.
2. Job B uploads a blob, but it's a dupe, so id = 1 is returned.
3. Job A gets evicted; Blob 1 is now unreferenced and beyond its TTL, so it's eligible for eviction and is deleted.
4. Job B's upload references id = 1, which no longer exists, so the upload fails.
Resetting created_at also resets the TTL window.
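The race above can be sketched with a hypothetical eviction check (the TTL value, timestamps, and field names are illustrative):

```rust
// Sketch of the eviction race. A blob is evictable once it is
// unreferenced AND its unreferenced-TTL window has passed.
const UNREFERENCED_TTL: u64 = 60; // illustrative value

struct Blob {
    created_at: u64,
    ref_count: u32,
}

fn evictable(b: &Blob, now: u64) -> bool {
    b.ref_count == 0 && now > b.created_at + UNREFERENCED_TTL
}

fn main() {
    // Job A uploaded blob 1 at t=0 and references it.
    let mut blob1 = Blob { created_at: 0, ref_count: 1 };

    // t=100: Job B uploads a duplicate; the store returns id = 1.
    // Without resetting created_at, the TTL window is already stale:
    blob1.ref_count -= 1; // t=101: Job A is evicted, blob 1 is unreferenced
    assert!(evictable(&blob1, 101)); // eligible for deletion before Job B commits

    // With the upsert resetting created_at at t=100, the window restarts:
    blob1.created_at = 100;
    assert!(!evictable(&blob1, 101)); // Job B's reference can land safely
}
```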
Is it worth the effort to create a second entry to separate the eviction data from the "metrics data"? Or renaming CreatedAt to note that it is just the most recent time this blob was referenced by a new job?
Not opposed. It could be something like TTLWall or similar. It could be interesting to have both data points.
OK, I think it's best to do that now so we bundle as many database updates as possible into a single release, but I'll leave it up to you whether you want to do that in this PR or a follow-up.
I'll do it in a follow-up after merging the two queued-up PRs.
LGTM
The dbonly blob store was creating hundreds of rows for the empty-string blob. In solving this, a more general implementation arose.
A blob store is now allowed to de-duplicate a blob by returning a previously returned key. The database automatically flattens that into one entry (and updates the TTL for blob eviction).
The uniqueness constraint is on the store id + key pair, so two different stores are allowed to return the same key without issues.
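The pair-wise uniqueness can be sketched with a set keyed on `(store_id, key)` (a minimal model; the store ids and key value are illustrative):

```rust
use std::collections::HashSet;

// Sketch of the uniqueness constraint: the unique index covers the
// (store_id, key) pair, so distinct stores may reuse a key freely.
fn main() {
    let mut unique: HashSet<(u32, &str)> = HashSet::new();
    assert!(unique.insert((1, "deadbeef")));  // store 1, key "deadbeef"
    assert!(unique.insert((2, "deadbeef")));  // same key, different store: fine
    assert!(!unique.insert((1, "deadbeef"))); // same pair: conflict (dedup path)
    println!("entries = {}", unique.len());
}
```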