Timeline of S3 object changes to on-server resource index changes #790
Comments
While creating a recent timeline of how S3 object changes end up as (on-server) resource index changes <#790> this is the logging that I really wanted.
Ah, good to know. I was just wondering why https://nextstrain.org/flu/seasonal/h3n2/ha/2y@2024-04-21 pointed to a dataset from 2024-04-23.
Also ran into this independently. Thanks for explaining what's going on @jameshadfield! How about never pointing at the if one uses Could we maybe make it clearer whether we're pointing at "latest" or a particular file? Alternatively, we could show a warning/info banner when one uses Original thoughts: https://bedfordlab.slack.com/archives/C0K3GS3J8/p1714496715691069
For context on the objects in question, and why they were deleted in the first place, see this Slack thread.
We recently deleted some S3 objects and it took a surprisingly long time for this to be reflected in the resource index used by the server. Here's a timeline (all times UTC):
The main bottleneck, and the part that is out of our control, is how long changes take to be reflected in the S3 manifest. It may be surprising that the manifest published 18h after deletion still included these objects, but this behaviour is consistent with S3 inventories, and also consistent with the "eventually consistent" design I implemented for the resource indexing. S3 docs:
Deletion / purging of objects should be rare, but PUT requests are not! Where this will affect us is in the following situation:
- The latest dataset for "X" is from 2024 Feb 1 and, for the purposes of this example, it is included in the index.
- On Feb 2, we make a request for `X@2024-02-01`. This accesses the latest (i.e. non-versioned) dataset, because a request that resolves to the most recent entry in the index always fetches the latest S3 object, not the specific S3 version, even though that version is in the index.
- On Feb 3 we upload a new dataset. It's not yet in the manifest / index.
- On Feb 4, the new dataset is still not in the manifest / index. Requests for `X@2024-02-01` thus still point to the latest dataset, which is now the Feb 3 version rather than the dataset from Feb 1.
- Eventually, the manifest / index reflects the above data and requests for `X@2024-02-01` access the dataset from Feb 1.

I don't think this is too bad, and it was known up-front when we developed this functionality, but it's certainly not perfect. There are ways to improve this, but for the current usage I don't think the dev investment is worth prioritizing. But maybe one day!
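To make the Feb 1 to Feb 4 scenario above concrete, here is a minimal Python sketch of it. This is a toy model, not the server's actual code: `Bucket`, `resolve`, `simulate`, and the version ids are hypothetical names, and the lookup rule ("pick the most recent indexed entry on or before the requested date") is my assumption about how `X@<date>` is matched against the index.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional, Tuple

Entry = Tuple[date, str]  # (upload date, version id); hypothetical shape


@dataclass
class Bucket:
    """Toy stand-in for a single versioned S3 key."""
    versions: List[Entry] = field(default_factory=list)

    def put(self, uploaded: date, version_id: str) -> None:
        self.versions.append((uploaded, version_id))

    def latest(self) -> str:
        # An unversioned GET returns whatever was uploaded most recently.
        return max(self.versions)[1]


def resolve(index: List[Entry], bucket: Bucket, requested: date) -> Optional[str]:
    """Return the version id served for a request of X@<requested>."""
    candidates = [entry for entry in index if entry[0] <= requested]
    if not candidates:
        return None
    chosen = max(candidates)
    if chosen == max(index):
        # Most recent *indexed* entry: fetch the unversioned (latest) object,
        # even though the index knows the specific version id.
        return bucket.latest()
    # Older entry: fetch that specific version.
    return chosen[1]


def simulate() -> List[Optional[str]]:
    bucket = Bucket()
    bucket.put(date(2024, 2, 1), "v-feb1")
    index = list(bucket.versions)  # index is up to date with the bucket

    served = []
    # Feb 2: request X@2024-02-01 while Feb 1 is the latest upload.
    served.append(resolve(index, bucket, date(2024, 2, 1)))

    # Feb 3: new upload; the manifest / index has not caught up yet.
    bucket.put(date(2024, 2, 3), "v-feb3")
    # Feb 4: same request, stale index.
    served.append(resolve(index, bucket, date(2024, 2, 1)))

    # Eventually the manifest / index reflects the new upload.
    index = list(bucket.versions)
    served.append(resolve(index, bucket, date(2024, 2, 1)))
    return served


if __name__ == "__main__":
    print(simulate())
```

The middle request is the problematic one: the index still thinks the Feb 1 entry is the newest, so the request goes to the unversioned object, which by then is the Feb 3 upload.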
¹ Heroku docs indicate this routing is stochastic, but I consistently got dyno #2.