Fix frontend alarms #1276

Open
4 of 13 tasks
codemonkey800 opened this issue Sep 29, 2023 · 1 comment

codemonkey800 commented Sep 29, 2023

Caught Error Alarms

These are errors that are caught and logged in the frontend. The logged errors can be surfaced from CloudWatch Logs.

Error fetching spdx license data

This log can be found in CloudWatch when filtering using the following query:

image

This log is emitted when fetching the SPDX license data in the browser throws an error:

onError(err) {
  logger.error({
    message:
      'Error fetching spdx license data for MetadataListMetadataItem',
    error: getErrorMessage(err),
  });
},

Fetching this data in the browser is probably inefficient and more error-prone because it depends on the user’s environment. We can move this fetch to the server side to improve the reliability of this API call. If this doesn’t reduce the number of errors, we can look into reducing the log level of this message.
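A minimal sketch of the server-side approach, assuming the page can receive the license list as props. The names `loadSpdxLicenseProps`, `SpdxLicense`, and `Fetcher`, the URL, and the empty-list fallback are all assumptions for illustration, not the hub's actual code. In Next.js, logic like this would typically be called from `getServerSideProps`.

```typescript
// Hypothetical shape of one SPDX license entry; only the fields used here.
interface SpdxLicense {
  licenseId: string;
  name: string;
}

// Minimal fetch-like signature so the sketch can be exercised without a network.
type Fetcher = (url: string) => Promise<{
  ok: boolean;
  status: number;
  json(): Promise<unknown>;
}>;

// Assumed location of the SPDX license list; replace with the URL the app uses.
const SPDX_LICENSES_URL =
  'https://raw.githubusercontent.com/spdx/license-list-data/master/json/licenses.json';

// Intended to run on the server so the browser never issues this request itself.
async function loadSpdxLicenseProps(
  fetcher: Fetcher,
): Promise<{ props: { licenses: SpdxLicense[] } }> {
  try {
    const response = await fetcher(SPDX_LICENSES_URL);
    if (!response.ok) {
      throw new Error(`Unexpected status ${response.status}`);
    }
    const data = (await response.json()) as { licenses?: SpdxLicense[] };
    return { props: { licenses: data.licenses ?? [] } };
  } catch (err) {
    // Assumption: render the page without license data rather than fail it.
    console.warn('Error fetching spdx license data on the server', err);
    return { props: { licenses: [] } };
  }
}
```

Passing the fetch function in as a parameter keeps the sketch testable; on the server the global `fetch` could be supplied directly.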

Error loading route

This log can be found in CloudWatch when filtering using the following query:

image

This comes from code that logs when an error occurs during a page transition:

logger.error({
  message: 'Error loading route',
  error: getErrorMessage(error),
});

According to the docs, this error occurs if the route transition is cancelled or if an error is thrown, but the code above doesn’t distinguish between the two when logging the error message. We can refactor the code to use a different log level depending on whether the user cancelled the transition:

const level = error.cancelled ? 'info' : 'error'

logger[level]({
  message: 'Error loading route',
  error: getErrorMessage(error),
  cancelled: error.cancelled,
})

Ideally this should reduce the number of actual errors we encounter, but if not, we can look into filtering out this error from the logs metric filter if it’s something we can’t easily fix.

Uncaught Error Alarms

These are errors that are not handled within a try / catch block. Currently RUM has reported the following errors:

image

CWR: Failed to retrieve credentials from STS: TypeError: Failed to fetch

This error is raised when a network failure occurs while fetching credentials from AWS STS. The stack trace for this message looks like:

Error: CWR: Failed to retrieve credentials from STS: TypeError: Failed to fetch
    at nS.<anonymous> (www.napari-hub.org/_next/static/chunks/pages/_app-ab7d999ffe90ca01.js:165:375226)
    at www.napari-hub.org/_next/static/chunks/pages/_app-ab7d999ffe90ca01.js:165:373992
    at Object.throw (www.napari-hub.org/_next/static/chunks/pages/_app-ab7d999ffe90ca01.js:165:374097)
    at s (www.napari-hub.org/_next/static/chunks/pages/_app-ab7d999ffe90ca01.js:165:375420)

Unfortunately we can't really fix this error since we can't control user network conditions. Instead, we can try filtering this event from being tracked by the alarm.

To do this, we will need to refactor the alarm infrastructure to:

  1. Export RUM events to a log stream
  2. Create a logs metric filter that filters out STS fetch errors
  3. Update the frontend alarm to use data from the logs metric filter

Error details: CWR: Failed to retrieve Cognito OpenId token: TypeError: Failed to fetch

Similar to the above error, this is out of our control due to user network conditions. We can remove this from the frontend alarm by ignoring this specific error message.

The provided href (/plugins/[name]) value is missing query values

According to the docs, this error occurs when the UI tries to navigate to a dynamic route without providing a value for every parameter in the pathname (here, name).

This error is a bit complex to debug because it happens intermittently and is not easy to reproduce. The frequency appears to be 1-2 instances per week:

image

The plugin page also does not link to itself or to other plugin pages, so in principle this error should not be able to occur there.

One thing we can try is updating all references to /plugins/[name] to check that name is defined before creating a link or navigating to a route.
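A minimal sketch of such a guard. The helper name `getPluginHref` is hypothetical, and trimming and URL-encoding the name are assumptions about how plugin names should be handled.

```typescript
// Hypothetical helper: returns the plugin page href, or null when the name is
// missing so callers can skip rendering the link or navigating.
function getPluginHref(name: string | null | undefined): string | null {
  const trimmed = name?.trim();
  if (!trimmed) {
    return null;
  }
  return `/plugins/${encodeURIComponent(trimmed)}`;
}
```

Callers would render a link or call the router only when the result is not null, which keeps the missing-name case out of the navigation code path.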

If this does not reduce the errors, we could reduce the log level since this type of error doesn't have a huge impact on the functionality of the page. It's possible this error could be a result of an intermittent loading state since some of the errors happen in the loading state for the plugin page.

Script error

These are unknown errors thrown during JavaScript execution that seemingly only occur on desktop Safari:

image

This error may occur when the frontend tries to load JavaScript from another domain. Based on this article, we can possibly fix this by updating references to external JavaScript to include the crossorigin attribute in the <script> tag. Note that this only helps if the external host also serves the script with CORS headers.

The only such reference is the script we use for HubSpot:

<Script
  onLoad={() => {
    hubspotStore.ready = true;
  }}
  src="//js.hsforms.net/forms/v2.js?pre=1"
/>

If this does not reduce the errors, we can look into filtering out this specific error message.

Request aborted

This error occurs when a request is cancelled, which may happen if the user navigates away from a page with an in-progress request. It should be safe to filter out.

ResizeObserver loop completed with undelivered notifications.

This error occurs when ResizeObserver is trying to notify subscribers of a recent resize. It may occur if the user's page resizes during a notification. Unfortunately we can't control this because of differences in users' environments, such as viewport and browser, so this is something we can look into filtering out.
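The four messages flagged above as safe to filter could be matched with a predicate like the one below. This is only a sketch of the matching logic: the real filter would be expressed as a CloudWatch Logs metric filter pattern, and the names `IGNORED_ERROR_PATTERNS` and `isIgnorableError` are hypothetical.

```typescript
// Messages described in this issue as outside our control.
const IGNORED_ERROR_PATTERNS: RegExp[] = [
  /CWR: Failed to retrieve credentials from STS/,
  /CWR: Failed to retrieve Cognito OpenId token/,
  /Request aborted/,
  /ResizeObserver loop completed with undelivered notifications/,
];

// Returns true when the message matches one of the ignored patterns and
// should therefore not count toward the frontend alarm.
function isIgnorableError(message: string): boolean {
  return IGNORED_ERROR_PATTERNS.some((pattern) => pattern.test(message));
}
```

Keeping the patterns in one list makes it easy to compare them against whatever the metric filter ends up excluding.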

Action Items

  • Filter out errors
    • Export RUM events to CloudWatch Logs
    • Create logs metric filter
      • STS fetch errors
      • CWR fetch errors
      • Request aborted errors
      • ResizeObserver undelivered notification errors
    • Create metrics alarm based on metric filter
  • Investigate fixes for errors (fix frontend alarm bugs #1277)
    • Verify name is defined for all references to /plugins/[name]
    • Move SPDX license data fetching to SSR
    • Change log level for when page transitions are cancelled
    • Assign lower log level to 4xx errors

codemonkey800 commented

got some 400 errors today related to a user somehow accessing the plugins page using the template variable [name]:

image

this would mean they accessed /plugins/[name] somehow. overall this isn't necessarily an error we have to worry about since it's a client error, so we can filter these out by reducing the log level to warning. I've added the task Assign lower log level to 4xx errors to capture this 🫡
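a rough sketch of what that task could look like, assuming the HTTP status code is available where the error is logged. the names `LogLevel` and `getLogLevelForStatus` are hypothetical:

```typescript
type LogLevel = 'warn' | 'error';

// Assumption: 4xx responses are client errors and are logged as warnings so
// they do not count toward the error alarm; everything else stays an error.
function getLogLevelForStatus(status: number): LogLevel {
  return status >= 400 && status < 500 ? 'warn' : 'error';
}
```

the result can be used to pick the logger method, the same way the route transition example picks between info and error.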
