Manifest fetching sometimes fails on new plugin release #1297

Open
DragaDoncila opened this issue Nov 29, 2023 · 4 comments
Labels: bug Something isn't working

@DragaDoncila (Collaborator)

Description

When a new plugin is released, fetching the manifest sometimes fails due to this line in the npe2 fetch code.

Because we never retry fetching a manifest, these plugins must have their manifests manually reprocessed. Currently this is done via this script.

This doesn't occur for all new releases (see for example MouseCHD, which was recently released and had its manifest processed correctly).

This also doesn't seem to happen (at least as frequently) in npe2api, which also uses npe2 fetch. When this error was first discovered, there were some instances of it logged in this errors file, but I have not been able to find recent instances of it happening in npe2api, even for plugins that failed for the napari hub. This could be because npe2api indexes every 10 minutes rather than every 5 (giving the PyPI API time to settle?), or it could be due to a quirk where the "latest" plugin version is available in PyPI search results before its dedicated version JSON is available?

Expected Results

A fix for this issue would mean that all new plugin releases have their manifests processed.

DragaDoncila added the bug label Nov 29, 2023
@DragaDoncila (Collaborator, Author)

I have been able to reproduce this locally now by releasing a plugin and then immediately (within less than 5 seconds) fetching it. Immediately rerunning the fetch command a second time fetched the manifest correctly. That seems to point to a race condition, but it's strange that we see it so often on the napari hub if we're only reindexing plugins every 5 minutes and the manifests seem to be available within just a few seconds...
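
For reference, the reproduction boils down to something like the snippet below. This assumes npe2 exposes a top-level fetch_manifest helper (the entry point for the fetch code linked in the description); the plugin name is a placeholder.

# Run this immediately after publishing a new release of the plugin;
# within the first few seconds the call can fail, and rerunning it
# shortly afterwards succeeds.
from npe2 import fetch_manifest

manifest = fetch_manifest("example-napari-plugin")  # placeholder plugin name
print(manifest.name)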

@DragaDoncila (Collaborator, Author)

Based on the behaviour we're seeing, I'm pretty confident this is a race condition with PyPI: it occurs when a plugin has been updated but the plugin's project JSON doesn't yet contain the latest version in the releases field.
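
For anyone who wants to poke at this directly, here is a minimal sketch of that consistency check against PyPI's project JSON (the plugin name and version are placeholders, and this is not the hub's actual code):

import requests

def release_visible(plugin: str, version: str) -> bool:
    """Return True once PyPI's project JSON lists `version` under `releases`.
    Right after an upload this can briefly lag behind the version that
    search/index results already report, which is the window where the
    manifest fetch fails."""
    resp = requests.get(f"https://pypi.org/pypi/{plugin}/json", timeout=10)
    resp.raise_for_status()
    return version in resp.json().get("releases", {})

if not release_visible("example-napari-plugin", "0.1.1"):
    print("project JSON not yet consistent; fetching the manifest now would fail")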

I think it's possible we're not really seeing this error in npe2api purely because it runs every 10 minutes while our cron job runs every 5 minutes, making it less likely we hit the API at the wrong time 🤷‍♀️. It's also harder to look for these errors because npe2api does retry these manifests, so the error is overwritten on subsequent runs.

In terms of how we can address this, there are a few options. I'll list them here for discussion:

  1. Delay: Introduce a delay between pypi metadata processing and invoking the plugins lambda. We could do this using a step function that waits e.g. 60 seconds and then invokes the plugins lambda. The step functions are billed "per step", so this would be a minimal cost, and allow time for the JSON to be populated. An example step function definition would be something like:
{
    "StartAt": "Delay",
    "Comment": "Invoke Lambda with delay",
    "States": {
        "Delay": {
            "Type": "Wait",
            "Seconds": 60,
            "Next": "Invoke Plugins Lambda"
        },
        "Invoke Plugins Lambda": {
            "Type": "Task",
            "Resource": "arn:aws:states:::lambda:invoke",
            "Parameters": {
                "FunctionName": "arn:aws:lambda:us-east-1:123456789012:function:plugins",
                "InvocationType": "Event",
            },
            "End": true
        }
    }
}
  2. Reprocess flag: Add a REPROCESS flag to the database. If a manifest fails with the error above, we write a REPROCESS=True flag to dynamo. When cloudwatch triggers update_plugin again, we check for all plugins whether the DISTRIBUTION metadata is missing, the ERROR matches the one above, and REPROCESS is True. If this is the case, we invoke the plugins lambda again, passing a flag that indicates this is a retry. Regardless of the outcome of fetch_manifest, we overwrite REPROCESS=False in dynamo, and never try again.

  3. Polling npe2api: Since npe2api also builds the manifests and has a retry mechanism built in, we could avoid needing to re-run the lambda or delay it by requesting the manifest from https://npe2api.vercel.app/api/manifest/{plugin} if the DISTRIBUTION metadata is missing and the ERROR matches the one above (a rough sketch follows after this list). If the returned manifest matches the latest version of the plugin, we write it to dynamo and exit. We could eventually get rid of the plugins lambda entirely if we wanted, and only retrieve manifests from npe2api.

  4. Run re-processing script periodically: Rather than retrying or relying on npe2api, we can add a lambda that periodically runs the processing script linked above, e.g. daily, and reprocesses any missing manifests.
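
To make option 3 a bit more concrete, here is a rough sketch of the fallback request to npe2api. This is not the actual lambda code: the plugin name is a placeholder and the version comparison is deliberately left open.

from typing import Optional

import requests

def fetch_manifest_from_npe2api(plugin: str) -> Optional[dict]:
    """If our own fetch failed with the error above, ask npe2api for the
    manifest it built (npe2api retries internally, so it should eventually
    have one for the latest release)."""
    resp = requests.get(
        f"https://npe2api.vercel.app/api/manifest/{plugin}", timeout=10
    )
    if resp.status_code != 200:
        return None
    return resp.json()

manifest = fetch_manifest_from_npe2api("example-napari-plugin")  # placeholder name
if manifest is not None:
    # Before writing to dynamo we would still verify that this manifest
    # corresponds to the latest released version; the exact field to compare
    # is not pinned down here.
    print("got a manifest from npe2api")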

Option 1 is likely to work, but it introduces an additional AWS component and does not guarantee we can retrieve the manifest (if for whatever reason the delay in availability is longer than the delay we use). Option 2 introduces some complexity and will require rerunning the plugins lambda, but it doesn't introduce external dependencies; since we only retry once, it's technically possible we still end up missing the manifest. Option 3 should be very simple to implement and guarantees the manifest is eventually populated, but it introduces a dependency on npe2api. Option 4 is very simple and also almost guaranteed to work, but it means manifest data will be incorrect until the script runs.

@manasaV3 @richaagarwal curious on your thoughts on the above options or any options I may not have considered!

@manasaV3 (Collaborator)

manasaV3 commented Jan 2, 2024

Thank you for such a detailed write-up of the solution options. I prefer the setup where we react to the error and fix it over reprocessing periodically.

What are your thoughts on using an SQS queue to trigger reprocessing?

We could add an SQS queue that triggers the lambda instead of directly invoking the lambda as we currently do. We could configure the queue with a maxReceiveCount greater than 1, so on failure the lambda tries at least once more before giving up. We can also set the visibilityTimeout to be greater than 1 minute so there is at least a 1-minute gap between two consecutive tries for a plugin's manifest fetch.

@DragaDoncila (Collaborator, Author)


Great, even better than a step function. Such a configuration for SQS sounds like exactly what we need. Thanks for the suggestion, @manasaV3!
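
For reference, a rough boto3 sketch of that queue configuration (queue names, the 90-second visibility timeout, and the maxReceiveCount of 2 are illustrative values only; the same thing could be expressed in whatever infrastructure-as-code we use instead):

import json

import boto3

sqs = boto3.client("sqs")

# Dead-letter queue for messages that exhaust their retries.
dlq = sqs.create_queue(QueueName="plugins-lambda-dlq")
dlq_arn = sqs.get_queue_attributes(
    QueueUrl=dlq["QueueUrl"], AttributeNames=["QueueArn"]
)["Attributes"]["QueueArn"]

# Main queue feeding the plugins lambda: a visibility timeout above 60s keeps
# consecutive tries for the same plugin at least a minute apart, and
# maxReceiveCount = 2 gives one retry before the message lands in the DLQ.
sqs.create_queue(
    QueueName="plugins-lambda-queue",
    Attributes={
        "VisibilityTimeout": "90",
        "RedrivePolicy": json.dumps(
            {"deadLetterTargetArn": dlq_arn, "maxReceiveCount": "2"}
        ),
    },
)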
