Timeline of S3 object changes to on-server resource index changes #790

jameshadfield · 2024-02-01T07:40:44Z

For context on the objects in question, and why they were deleted in the first place, see this slack thread

We recently deleted some S3 objects and it took a surprisingly long time for this to be reflected in the resource index used by the server. Here's a timeline (all times UTC):

Jan 30 03h00 (or thereabouts) All versions of the objects deleted from the S3 bucket

--- any nextstrain.org versioned (@YYYY-MM-DD) requests backed by
one of these objects will now 404, as the in-memory index on the server
instructs us to fetch a now-deleted S3 version ---

[+18h] Jan 30 21h37 S3 manifest published. Still lists now-deleted objects. 

[+25h] Jan 31 04h06 index rebuilds. Still includes now-deleted objects.

[+46h] Feb 1 00h59 S3 manifest published. Deleted objects do not appear.

[+49h] Feb 1 04h07 index rebuilds. No longer includes deleted objects.

[+49h] Feb 1 04h36 heroku dyno#2 updates resources index

--- The URLs should now resolve if heroku routes your request to this dyno.
For me, this was still 404¹ ---

[+50h] Feb 1 04h56 heroku dyno#1 updates resources index

--- URLs now resolve to the correct dataset (from my machine) ---
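The elapsed times can be recomputed from the timestamps. This is only a small sketch: the year and the event labels are copied from the timeline above, the `elapsed` helper is made up for illustration, and the deletion time is approximate, so the bracketed whole-hour offsets are rough values.

```python
from datetime import datetime

# Deletion time is approximate ("or thereabouts"), so every offset is too.
deleted = datetime(2024, 1, 30, 3, 0)

events = [
    ("S3 manifest published, still lists deleted objects", datetime(2024, 1, 30, 21, 37)),
    ("index rebuilds, still includes deleted objects", datetime(2024, 1, 31, 4, 6)),
    ("S3 manifest published, deleted objects gone", datetime(2024, 2, 1, 0, 59)),
    ("index rebuilds, deleted objects gone", datetime(2024, 2, 1, 4, 7)),
    ("heroku dyno#2 updates resources index", datetime(2024, 2, 1, 4, 36)),
    ("heroku dyno#1 updates resources index", datetime(2024, 2, 1, 4, 56)),
]


def elapsed(start: datetime, end: datetime) -> str:
    """Format the gap between two timestamps as +HHhMMm."""
    minutes = int((end - start).total_seconds() // 60)
    return f"+{minutes // 60}h{minutes % 60:02d}m"


for label, when in events:
    print(f"{elapsed(deleted, when):>8}  {label}")
```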

The main bottleneck, and the part that is out of our control, is how long it takes for changes to be reflected in the S3 manifest. It may be surprising that the manifest published 18h after deletion still included these objects, but this behaviour is consistent with S3 inventories, and also with the "eventually consistent" design I implemented for the resource indexing. From the S3 docs:

All of your objects might not appear in each inventory list. The inventory list provides eventual consistency for PUT requests (of both new objects and overwrites) and for DELETE requests. Each inventory list for a bucket is a snapshot of bucket items. These lists are eventually consistent (that is, a list might not include recently added or deleted objects).

Deletion / purging of objects should be rare, but PUT requests are not! Where this will affect us is in the following situation:

Latest dataset for "X" is from 2024 Feb 1, and for the purposes of this example it is included in the index.
On Feb 2, we make a request for X@2024-02-01. This accesses the latest (i.e. non-versioned) dataset, because a request that resolves to the most recent entry in the index always fetches the latest S3 object, not the specific S3 version, even though that version is in the index.
On Feb 3 we upload a new dataset. It's not yet in the manifest / index.
On Feb 4, the new dataset is still not yet in the manifest / index. Requests for X@2024-02-01 thus still point to the latest dataset, which is the Feb 3 upload, rather than the dataset from Feb 1.
Eventually, the manifest/index reflects the above data and requests for X@2024-02-01 access the dataset from Feb 1.
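The scenario above can be reproduced with a small model. This is a minimal sketch, not the nextstrain.org server code: the `IndexEntry` type, the `resolve` function, and the version IDs are all hypothetical. The only behaviour taken from the description is that a request matching the most recent index entry is served from the latest (unversioned) S3 object. Picking the newest entry on or before the requested date is an assumption.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class IndexEntry:
    """One indexed version of a dataset (hypothetical structure)."""
    day: date
    version_id: str


def resolve(index: list[IndexEntry], requested: date) -> str:
    """Return which S3 object a request for X@<requested> would fetch.

    Assumed rule: choose the newest index entry dated on or before the
    requested date. If that entry is also the newest entry in the index,
    fetch the latest (unversioned) object; otherwise fetch its version ID.
    """
    entries = sorted(index, key=lambda e: e.day)
    candidates = [e for e in entries if e.day <= requested]
    if not candidates:
        raise LookupError("no indexed version on or before the requested date")
    chosen = candidates[-1]
    if chosen == entries[-1]:
        return "latest"
    return chosen.version_id


feb1 = IndexEntry(date(2024, 2, 1), "v-feb1")
feb3 = IndexEntry(date(2024, 2, 3), "v-feb3")

# Feb 4: the Feb 3 upload exists in S3 but has not reached the index yet.
stale_index = [IndexEntry(date(2024, 1, 15), "v-jan15"), feb1]
# Later: the manifest and index have caught up.
fresh_index = stale_index + [feb3]

print(resolve(stale_index, date(2024, 2, 1)))  # "latest", which in S3 is now the Feb 3 upload
print(resolve(fresh_index, date(2024, 2, 1)))  # "v-feb1"
```

With the stale index the request silently returns whatever the unversioned object currently holds. Once the index catches up, the same URL is pinned to the Feb 1 version.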

I don't think this is too bad, and it was known up-front when we developed this functionality, but it's certainly not perfect. There are ways to improve this, but for the current usage I don't think the dev investment is worth prioritizing. But maybe one day!

¹ Heroku docs indicate this routing is stochastic, but I consistently got dyno #2.



joverlee521 · 2024-04-23T17:59:06Z

Ah, good to know. I was just wondering why https://nextstrain.org/flu/seasonal/h3n2/ha/2y@2024-04-21 pointed to a dataset from 2024-04-23.

corneliusroemer · 2024-04-30T17:17:21Z

Also ran into this independently. Thanks for explaining what's going on @jameshadfield!

How about never pointing at the latest if one uses @ syntax, but always using the latest indexed one? I guess it's a tradeoff between "getting builds from the future" vs "not getting builds that were actually available".

Could we maybe make it clearer whether we're pointing at "latest" or a particular file?

Alternatively, we could show a warning/info banner when one uses @ with a recent date (within, say, 2 days), noting that there is this known issue for recent dates. This might be quite easy to implement and would prevent confusion.
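The check behind such a banner could be as small as the sketch below. It is not taken from the nextstrain.org codebase: the function name and `RECENT_WINDOW` constant are hypothetical, and ignoring future dates is an assumption made here.

```python
from datetime import date, timedelta

# Window suggested above ("within, say, 2 days").
RECENT_WINDOW = timedelta(days=2)


def needs_recency_notice(requested: date, today: date) -> bool:
    """True if a versioned (@YYYY-MM-DD) request is recent enough that the
    index may not yet reflect the newest uploads. Future dates return False."""
    return timedelta(0) <= today - requested <= RECENT_WINDOW


# The request reported earlier in this thread: @2024-04-21, viewed on 2024-04-23.
print(needs_recency_notice(date(2024, 4, 21), date(2024, 4, 23)))  # True
print(needs_recency_notice(date(2024, 2, 1), date(2024, 4, 23)))   # False
```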

Original thoughts: https://bedfordlab.slack.com/archives/C0K3GS3J8/p1714496715691069

jameshadfield added a commit that referenced this issue Feb 1, 2024

[resource index] improve logging

2f87331

While creating a recent timeline of how S3 object changes end up as (on-server) resource index changes <#790> this is the logging that I really wanted.

jameshadfield added a commit that referenced this issue Feb 6, 2024

[resource index] improve logging

8ebcb4c

While creating a recent timeline of how S3 object changes end up as (on-server) resource index changes <#790> this is the logging that I really wanted.
