Timeline of S3 object changes to on-server resource index changes #790

Open
jameshadfield opened this issue Feb 1, 2024 · 2 comments

jameshadfield commented Feb 1, 2024

For context on the objects in question, and why they were deleted in the first place, see this slack thread

We recently deleted some S3 objects and it took a surprisingly long time for this to be reflected in the resource index used by the server. Here's a timeline (all times UTC):

Jan 30 03h00 (or thereabouts) All versions of the objects deleted from S3 bucket

--- any nextstrain.org versioned (@YYYY-MM-DD) requests backed by
one of these objects will now 404 as the in-memory index on the server
instructs us to fetch a now-deleted s3 version ---

[+18h] Jan 30 21h37 S3 manifest published. Still lists now-deleted objects. 

[+25h] Jan 31 04h06 index rebuilds. Still includes now-deleted objects.

[+46h] Feb 1 00h59 S3 manifest published. Deleted objects do not appear.

[+49h] Feb 1 04h07 index rebuilds. No longer includes deleted objects.

[+49h] Feb 1 04h36 heroku dyno#2 updates resources index

--- The URLs should now resolve if heroku routes your request to this dyno.
For me, this was still 404¹ ---

[+50h] Feb 1 04h56 heroku dyno#1 updates resources index

--- URLs now resolve to the correct dataset (from my machine) ---
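The bracketed offsets can be recomputed from the timestamps. This is only a small arithmetic sketch: the year 2024 is inferred from the issue date, the deletion time is approximate, and the `elapsed` helper and event labels are names made up for illustration.

```python
from datetime import datetime

# Timestamps copied from the timeline (UTC). The year 2024 is inferred from the
# issue date, and the deletion time is approximate ("or thereabouts").
deleted_at = datetime(2024, 1, 30, 3, 0)
events = {
    "S3 manifest published (still lists deleted objects)": datetime(2024, 1, 30, 21, 37),
    "index rebuilds (still includes deleted objects)": datetime(2024, 1, 31, 4, 6),
    "S3 manifest published (deleted objects gone)": datetime(2024, 2, 1, 0, 59),
    "index rebuilds (deleted objects gone)": datetime(2024, 2, 1, 4, 7),
    "dyno #2 updates resources index": datetime(2024, 2, 1, 4, 36),
    "dyno #1 updates resources index": datetime(2024, 2, 1, 4, 56),
}


def elapsed(since: datetime, until: datetime) -> str:
    """Format the gap between two timestamps as +HHhMMm."""
    minutes = int((until - since).total_seconds() // 60)
    return f"+{minutes // 60}h{minutes % 60:02d}m"


offsets = {label: elapsed(deleted_at, when) for label, when in events.items()}
for label, offset in offsets.items():
    print(f"{offset:>8}  {label}")
```

The whole-hour figures in the timeline are approximations of these values, which is reasonable given that the starting point itself is approximate.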

The main bottleneck, and the part that is out of our control, is the delay before changes are reflected in the S3 manifest. It may be surprising that the manifest published 18h after deletion still included these objects, but this behaviour is consistent with S3 inventories, and also consistent with the "eventually consistent" design I implemented for the resource indexing. S3 docs:

All of your objects might not appear in each inventory list. The inventory list provides eventual consistency for PUT requests (of both new objects and overwrites) and for DELETE requests. Each inventory list for a bucket is a snapshot of bucket items. These lists are eventually consistent (that is, a list might not include recently added or deleted objects).


Deletion / purging of objects should be rare, but PUT requests are not! Where this will affect us is in the following situation:

  1. Latest dataset for "X" is from 2024 Feb 1, and for the purposes of this example it is included in the index.

  2. On Feb 2, we make a request for X@2024-02-01. This accesses the latest (i.e. non-versioned) dataset, because a request that resolves to the most recent entry in the index always fetches the latest S3 object rather than the specific S3 version, even though that version is in the index.

  3. On Feb 3 we upload a new dataset. It's not yet in the manifest / index.

  4. On Feb 4, the new dataset is still not in the manifest / index. Requests for X@2024-02-01 thus still point to the latest dataset, which is now the Feb 3 version rather than the dataset from Feb 1.

  5. Eventually, the manifest/index reflects the above data and requests for X@2024-02-01 access the dataset from Feb 1.
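The five steps above can be sketched as a small simulation. This is a minimal sketch, not the server's actual code: `IndexEntry`, `resolve`, the `"LATEST"` marker, and the version IDs are all hypothetical. The only behaviour taken from the description is that the most recent entry in the index is served from the latest (unversioned) S3 object; the rule for picking an entry by date is an assumption.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class IndexEntry:
    """One dated version of a dataset as recorded in the (hypothetical) index."""
    day: date
    s3_version_id: str


def resolve(index: list[IndexEntry], requested: date) -> Optional[str]:
    """Return the S3 version ID to fetch, or "LATEST" for an unversioned fetch.

    Assumption: a request for @YYYY-MM-DD picks the newest indexed entry on or
    before that date, and the newest entry overall is fetched unversioned.
    """
    candidates = [e for e in index if e.day <= requested]
    if not candidates:
        return None
    chosen = max(candidates, key=lambda e: e.day)
    newest = max(index, key=lambda e: e.day)
    return "LATEST" if chosen == newest else chosen.s3_version_id


# Steps 1-4: the index only knows about the Feb 1 upload.
stale_index = [IndexEntry(date(2024, 2, 1), "v-feb1")]
# Step 5: the manifest and index have caught up with the Feb 3 upload.
fresh_index = stale_index + [IndexEntry(date(2024, 2, 3), "v-feb3")]

stale_result = resolve(stale_index, date(2024, 2, 1))
fresh_result = resolve(fresh_index, date(2024, 2, 1))
```

With the stale index the request resolves to the unversioned object, which after the Feb 3 upload holds the Feb 3 data. Once the index includes the Feb 3 entry, the same request resolves to the pinned Feb 1 version.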

I don't think this is too bad, and it was known up-front when we developed this functionality, but it's certainly not perfect. There are ways to improve this, but for the current usage I don't think the dev investment is worth prioritizing. But maybe one day!


¹ Heroku docs indicate this routing is stochastic, but I consistently got dyno #2.

jameshadfield added a commit that referenced this issue Feb 1, 2024
While creating a recent timeline of how S3 object changes end up as
(on-server) resource index changes
<#790> this is the
logging that I really wanted.
jameshadfield added a commit that referenced this issue Feb 6, 2024
While creating a recent timeline of how S3 object changes end up as
(on-server) resource index changes
<#790> this is the
logging that I really wanted.
@joverlee521

Ah, good to know. I was just wondering why https://nextstrain.org/flu/seasonal/h3n2/ha/2y@2024-04-21 pointed to a dataset from 2024-04-23.

@corneliusroemer

Also ran into this independently. Thanks for explaining what's going on @jameshadfield!

How about never pointing at the latest (non-versioned) object if one uses @ syntax, but always using the latest indexed version? I guess it's a tradeoff between "getting builds from the future" vs "not getting builds that were actually available".
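To make the tradeoff in this proposal concrete, here is a rough sketch of an "always pinned" lookup. It is not the server's code: the `Index` shape, `resolve_pinned`, and the version IDs are hypothetical, and picking the newest indexed entry on or before the requested date is an assumption.

```python
from datetime import date
from typing import Optional

# Hypothetical index for one dataset: (date, S3 version ID) pairs.
Index = list[tuple[date, str]]


def resolve_pinned(index: Index, requested: date) -> Optional[str]:
    """Always return an explicit indexed S3 version ID, never the unversioned object."""
    candidates = [(day, version_id) for day, version_id in index if day <= requested]
    if not candidates:
        return None
    # Tuples compare by date first, so max() picks the newest eligible entry.
    return max(candidates)[1]


# The index has not yet caught up with a Feb 3 upload.
stale_index: Index = [(date(2024, 1, 20), "v-jan20"), (date(2024, 2, 1), "v-feb1")]

on_feb1 = resolve_pinned(stale_index, date(2024, 2, 1))
on_feb4 = resolve_pinned(stale_index, date(2024, 2, 4))
```

Under this policy a request for @2024-02-01 can never return the Feb 3 upload, but a request for @2024-02-04 returns the Feb 1 data until the index catches up, even though the Feb 3 dataset was already available.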

Could we maybe make it clearer whether we're pointing at "latest" or a particular file?

Alternatively, we could show a warning/info banner when one uses @ with a recent date (within say 2 days), explaining this known issue. This might be quite easy to implement and would prevent confusion.
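The check behind such a banner could be as small as the sketch below. The function name, the inclusive 2-day window, and the decision not to flag future dates are all assumptions made for illustration.

```python
from datetime import date, timedelta


def needs_recent_date_notice(requested: date, today: date, window_days: int = 2) -> bool:
    """True when the requested @date is within `window_days` of today (inclusive).

    Dates in the future return False in this sketch.
    """
    return timedelta(0) <= today - requested <= timedelta(days=window_days)


show_banner = needs_recent_date_notice(date(2024, 4, 21), today=date(2024, 4, 23))
no_banner = needs_recent_date_notice(date(2024, 4, 21), today=date(2024, 4, 30))
```

Given the timeline at the top of this issue, where the index took roughly two days to catch up, a wider window may be worth considering.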

Original thoughts: https://bedfordlab.slack.com/archives/C0K3GS3J8/p1714496715691069
