Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DOCK-2299: Correctly handle thrown Error during push #5290

Draft
wants to merge 11 commits into
base: develop
Choose a base branch
from

Conversation

svonworl
Copy link
Contributor

@svonworl svonworl commented Dec 21, 2022

Description
This PR modifies the webservice's GitHub apps push-processing code to more gracefully handle an Error thrown by the code within. Previously, Exceptions were handled properly, but Errors slightly not: pushes resulting in Errors were logged as "success"-ful in the GitHub apps logs. We discovered this problem because Ash created a WDL workflow that referenced itself via a http url, which Cromwell happily imported until overflowing the stack, resulting in a StackOverflowError.

The integration test processes a similarly-recursive WDL, which it assumes will trigger a StackOverflowError. The test will break if Cromwell changes that behavior. I tried another approach wherein Errors were triggered via mocks, but it didn't work, and it'd be fragile, anyways...

Currently, to give the user a bit of a clue as to what has happened, this PR includes the short class name of the Error in the github apps log message. For example, StackOverflowError. Could this leak any sensitive information? Please consider.

We catch RuntimeException | Error on purpose, rather than Throwable, so that if we change the inner code so it throws any new check exceptions, we'll know about it.

The codebots really don't like us catching Errors, even though I believe it is the right call in this case. They also want us to update to more modern Java features and constructs. But I didn't see anything really objectionable in their feedback.

I updated TransactionHelper to handle Errors similarly. There is probably other code scattered throughout our system that will skip logging/cleanup operations if an Error is thrown.

Review Instructions
Fork https://github.com/dockstore-testing/recursive-wdl and register it on both qa and staging. Check the github apps logs on each. qa should report that the push failed, and staging should report that the push succeeded.

Issue
https://ucsc-cgl.atlassian.net/browse/DOCK-2299
#5274

Security
There may be related issues, I have contacted the appropriate personnel.

Please make sure that you've checked the following before submitting your pull request. Thanks!

  • Check that you pass the basic style checks and unit tests by running mvn clean install
  • Ensure that the PR targets the correct branch. Check the milestone or fix version of the ticket.
  • Follow the existing JPA patterns for queries, using named parameters, to avoid SQL injection
  • If you are changing dependencies, check the Snyk status check or the dashboard to ensure you are not introducing new high/critical vulnerabilities
  • Assume that inputs to the API can be malicious, and sanitize and/or check for Denial of Service type values, e.g., massive sizes
  • Do not serve user-uploaded binary images through the Dockstore API
  • Ensure that endpoints that only allow privileged access enforce that with the @RolesAllowed annotation
  • Do not create cookies, although this may change in the future
  • If this PR is for a user-facing feature, create and link a documentation ticket for this feature (usually in the same milestone as the linked issue). Style points if you create a documentation PR directly and link that instead.

@codecov
Copy link

codecov bot commented Dec 21, 2022

Codecov Report

Attention: 18 lines in your changes are missing coverage. Please review.

Comparison is base (fa76e2a) 73.54% compared to head (217c0ea) 73.48%.
Report is 251 commits behind head on develop.

❗ Current head 217c0ea differs from pull request most recent head 9399d9c. Consider uploading reports for the commit 9399d9c to get more accurate results

Files Patch % Lines
...ockstore/webservice/helpers/TransactionHelper.java 19.04% 15 Missing and 2 partials ⚠️
...webservice/resources/AbstractWorkflowResource.java 94.11% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@              Coverage Diff              @@
##             develop    #5290      +/-   ##
=============================================
- Coverage      73.54%   73.48%   -0.07%     
+ Complexity      4471     4463       -8     
=============================================
  Files            296      296              
  Lines          16914    16928      +14     
  Branches        1862     1865       +3     
=============================================
  Hits           12440    12440              
- Misses          3590     3597       +7     
- Partials         884      891       +7     
Flag Coverage Δ
bitbuckettests 28.30% <0.00%> (-0.03%) ⬇️
integrationtests 58.34% <52.63%> (-0.02%) ⬇️
languageparsingtests 10.82% <0.00%> (-0.01%) ⬇️
toolintegrationtests 30.45% <0.00%> (-0.03%) ⬇️
unit-tests_and_non-confidential-tests 27.13% <0.00%> (-0.03%) ⬇️
workflowintegrationtests 40.65% <0.00%> (-0.04%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Member

@denis-yuen denis-yuen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, if the issue is that the cromwell/scala bridge code can throw Errors, shouldn't our catch clauses be closer to the Scala bridge code or just inside it? This approach seems to result in collateral damage or too much "by-catch. For example, this seems to affect updateUserWorkflows as well which has nothing to do with WDL parsing.

I do wonder since

Catching either Throwable or Error will also catch OutOfMemoryError and InternalError, from which an application should not attempt to recover.

could happen in big parts of "our" code too.

The integration test processes a similarly-recursive WDL, which it assumes will trigger a StackOverflowError. The test will break if Cromwell changes that behavior. I tried another approach wherein Errors were triggered via mocks, but it didn't work, and it'd be fragile, anyways...

Sounds ok, in general in favour of a high-level integration test approach without mocks. That said, if cromwell changes their code to avoid a stack overflow, I think I'm ok with that. Just make a code comment

@@ -370,16 +370,16 @@ protected void githubWebhookRelease(String repository, String username, String g
isSuccessful &= createWorkflowsAndVersionsFromDockstoreYml(dockstoreYaml12.getWorkflows(), repository, gitReference, installationId, username, dockstoreYml, BioWorkflow.class, messageWriter);
isSuccessful &= createWorkflowsAndVersionsFromDockstoreYml(dockstoreYaml12.getTools(), repository, gitReference, installationId, username, dockstoreYml, AppTool.class, messageWriter);

} catch (Exception ex) {
} catch (RuntimeException | DockstoreYamlHelper.DockstoreYamlException | Error throwable) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok with the general idea, but if we're breaking with conventions because the Scala/Cromwell bridge code is acting against expectations, shouldn't this be a lot "closer" to the Scala bridge class to avoid "by-catch"?

@@ -34,7 +34,7 @@ public final class TransactionHelper {

private static final Logger LOG = LoggerFactory.getLogger(TransactionHelper.class);
private Session session;
private RuntimeException thrown;
private Throwable thrown;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By the same token, the changes in this file seem to affect /updateUserWorkflows which should maybe be deleted anyway (we probably don't use it)

@svonworl
Copy link
Contributor Author

svonworl commented Dec 21, 2022

In general, if the issue is that the cromwell/scala bridge code can throw Errors, shouldn't our catch clauses be closer to the Scala bridge code or just inside it? This approach seems to result in collateral damage or too much "by-catch. For example, this seems to affect updateUserWorkflows as well which has nothing to do with WDL parsing.

I do wonder since

Catching either Throwable or Error will also catch OutOfMemoryError and InternalError, from which an application should not attempt to recover.

could happen in big parts of "our" code too.

The integration test processes a similarly-recursive WDL, which it assumes will trigger a StackOverflowError. The test will break if Cromwell changes that behavior. I tried another approach wherein Errors were triggered via mocks, but it didn't work, and it'd be fragile, anyways...

Sounds ok, in general in favour of a high-level integration test approach without mocks. That said, if cromwell changes their code to avoid a stack overflow, I think I'm ok with that. Just make a code comment

Any of our code can potentially throw an Error, and we should respond to that eventuality in a reasonable way. This PR modifies the enclosing catch so that when it attempts to log an Error-related failure, it does so correctly. After that, the resource handler ends. It's not trying to recover, but rather provide the user with correct information about the failed push, because what it was reporting previously, if any Error was thrown, was wrong. If we don't change the code, it tells the user something incorrect, and if it simply exits without logging anything, the user won't know what has happened, because this endpoint is invoked behind the scenes via the webhook callback...

Similarly, the TransactionHelper mods are designed to do the same, to attempt some cleanup and to log some information before rethrowing the Error, so that we have more information about what happened. If you feel super strongly about this, I'm fine with reverting the TransactionHelper changes, they're not as essential...

And sure, we can also wrap Cromwell to translate the StackOverflowError into something else. But, I kinda feel like it's a severe enough issue that maybe we let it ripple out and terminate the thread with the other Errors. We still need the enclosing catch to make sure that the failed push is logged correctly...

Copy link
Contributor

@coverbeck coverbeck left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good discussion...

Researching it, I lean towards catching this particular StackOverflowError error closer to Cromwell:

  • Referring to the Joshua Bloch "Effective Java" book, he says, although it encompasses unchecked exceptions as well, which seems overkill: "If a program throws an unchecked exception or an error, it is generally the case that recovery is impossible and continued execution would do more harm than good"
  • In particular, if we get an OOM error and catch it, I worry that this might put the web service in a indeterminate state.
  • The JavaDoc for VirtualError, which OOM and StackOverflow subclass, says "Thrown to indicate that the Java Virtual Machine is broken or has run out of resources necessary for it to continue operating."
  • Confusingly, Dockstore seems to recover just fine from a StackOverflowError in this case.
  • But, according to a random user on Stack Overflow in this post, it is not always guaranteed you can recover from a StackOverflowError.

Currently, to give the user a bit of a clue as to what has happened, this PR includes the short class name of the Error in the github apps log message. For example, StackOverflowError. Could this leak any sensitive information? Please consider.

I feel it would, a little bit -- the user would have the keys on how to trigger VM exception(s) in our web service.

Writing/researching all the above, I do think it's best to catch the StackOverflowError, only, not errors in general, closer to Cromwell.

@denis-yuen
Copy link
Member

Any of our code can potentially throw an Error, and we should respond to that eventuality in a reasonable way.

Yes, but my understanding of the Java documentation is that this is usually an OS problem so "we" as a platform can respond to it but not in the Java code.

For example as previously noted

An Error is a subclass of Throwable that indicates serious problems that a reasonable application should not try to catch. Most such errors are abnormal conditions. The ThreadDeath error, though a "normal" condition, is also a subclass of Error because most applications should not try to catch it.

If our belief is that the Scala libraries are indicating a serious problem incorrectly, i.e. a non-serious problem that we can deal with, then reiterating that the Error (or StackOverflowError) catch should occur "closer" to the problem and not accidentally catch Errors that may be thrown by non-Scala code.

e.g.

In particular, if we get an OOM error and catch it, I worry that this might put the web service in a indeterminate state.

@denis-yuen
Copy link
Member

If we don't change the code, it tells the user something incorrect, and if it simply exits without logging anything, the user won't know what has happened, because this endpoint is invoked behind the scenes via the webhook callback...

This seems like a little of a different problem though. For this, maybe we should log a failure first and then only overwrite it with a success afterwards. Thinking that if the whole application crashes due to a genuine error (or if Scala does have a real problem), then we would leave behind evidence of a problem in the github app logs.

@svonworl
Copy link
Contributor Author

svonworl commented May 4, 2023

test

@denis-yuen
Copy link
Member

test

Copy link
Member

@denis-yuen denis-yuen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Old PR with merge conflicts
Looking at stale PRs and noise in the daily reminders, is this PR still alive and/or worth keeping @svonworl ?

@denis-yuen
Copy link
Member

One other observation @svonworl I think "draft" PRs do not show up on the daily reminder, so if you wanted to keep this for inspiration/cherry-picking, that can be a route

Copy link
Contributor

@coverbeck coverbeck left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought this was fixed, perhaps by a different PR/branch?

@svonworl
Copy link
Contributor Author

svonworl commented Mar 12, 2024

I thought this was fixed, perhaps by a different PR/branch?

The original ticket was to address bad registration behavior due to recursive WDLs: specifically, Ash noticed that when he attempted to register a recursive WDL, no entry appeared, and nothing appeared in the app logs. In a different PR, we fixed the original problem by patching the code closer to where the corresponding StackOverflowError occurs, for that specific case.

However, the merged PR didn't address the general case of an Error thrown in the registration code. This PR was intended to remedy that, but there was pushback, and I was not able to merge. I still believe that when an Error is thrown, it is preferable for the surrounding code to attempt some basic cleanup and logging, rather than let the Error pass through, silently, unhandled. Thus, I kept this PR around, in case the team ever subscribes to that point of view.

@denis-yuen
Copy link
Member

denis-yuen commented Mar 12, 2024

Thus, I kept this PR around, in case the team ever subscribes to that point of view.

In the meantime, though, we get a reminder every day about it.
I'm going to put this back into draft and I think that will (may?) stop that reminder

Edit to add: ok, worked

@denis-yuen denis-yuen marked this pull request as draft March 12, 2024 18:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants