Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rsc: Allow blob stores to dedup blobs #1567

Merged
merged 1 commit into from
May 17, 2024
Merged

rsc: Allow blob stores to dedup blobs #1567

merged 1 commit into from
May 17, 2024

Conversation

V-FEXrt
Copy link
Collaborator

@V-FEXrt V-FEXrt commented May 15, 2024

The dbonly blob store was creating hundreds of rows for the empty string blob. In solving this a more general implementation arose.

A blob store is now allowed to de-duplicate a blob by returning a previously return key. The database will automatically flatten that into one entry in the db (and update the TTL for blob eviction)

The uniqueness requirement is set for store id+key pair so two different stores are allowed to return the same key without issues

let result = blob::Entity::insert(blob)
.on_conflict(
OnConflict::columns(vec![blob::Column::Key, blob::Column::StoreId])
.update_column(blob::Column::CreatedAt)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to update the CreatedAt timestamp for eviction reasons?
It seems like it would be best to keep the the time of creation the same and just track the new use time

Copy link
Collaborator Author

@V-FEXrt V-FEXrt May 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do yeah. Every blob has an unreferenced TTL where we guarantee that the blob won't get evicted during.

If we don't update the created at then the following would occur.

Job A -- reference --> Blob 1
Job B uploads a blob but its a dupe so id = 1 is returned
Job A gets evicted and now blob 1 is unreferenced and beyond its TTL so its eligible for eviction so its deleted
Job B gets uploaded referencing id = 1 which doesn't exist so the upload fails

Reseting the created_at also resets the TTL window

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it worth the effort to create a second entry to separate the eviction data from the "metrics data"? Or renaming "CreatedAt" to note that it is just the most recent time this blob was reference by a new job?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not opposed. Could be something like TTLWall or similar. Could be interesting to have both data points

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok I think its best to do that now so we bundle as many database updates together in a single release, but I'll leave it up to you whether you want to do that in this PR or a follow up.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll do it in a follow up after merging the two queued up PRs

Copy link
Contributor

@colinschmidt colinschmidt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@V-FEXrt V-FEXrt merged commit 2f43fef into master May 17, 2024
12 checks passed
@V-FEXrt V-FEXrt deleted the rsc-blob-dedup branch May 17, 2024 19:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants