-
Notifications
You must be signed in to change notification settings - Fork 28
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DOCK-2299: Correctly handle thrown Error during push #5290
base: develop
Are you sure you want to change the base?
DOCK-2299: Correctly handle thrown Error during push #5290
Conversation
Codecov ReportAttention:
Additional details and impacted files@@ Coverage Diff @@
## develop #5290 +/- ##
=============================================
- Coverage 73.54% 73.48% -0.07%
+ Complexity 4471 4463 -8
=============================================
Files 296 296
Lines 16914 16928 +14
Branches 1862 1865 +3
=============================================
Hits 12440 12440
- Misses 3590 3597 +7
- Partials 884 891 +7
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general, if the issue is that the cromwell/scala bridge code can throw Errors, shouldn't our catch
clauses be closer to the Scala bridge code or just inside it? This approach seems to result in collateral damage or too much "by-catch. For example, this seems to affect updateUserWorkflows
as well which has nothing to do with WDL parsing.
I do wonder since
Catching either Throwable or Error will also catch OutOfMemoryError and InternalError, from which an application should not attempt to recover.
could happen in big parts of "our" code too.
The integration test processes a similarly-recursive WDL, which it assumes will trigger a StackOverflowError. The test will break if Cromwell changes that behavior. I tried another approach wherein Errors were triggered via mocks, but it didn't work, and it'd be fragile, anyways...
Sounds ok, in general in favour of a high-level integration test approach without mocks. That said, if cromwell changes their code to avoid a stack overflow, I think I'm ok with that. Just make a code comment
@@ -370,16 +370,16 @@ protected void githubWebhookRelease(String repository, String username, String g | |||
isSuccessful &= createWorkflowsAndVersionsFromDockstoreYml(dockstoreYaml12.getWorkflows(), repository, gitReference, installationId, username, dockstoreYml, BioWorkflow.class, messageWriter); | |||
isSuccessful &= createWorkflowsAndVersionsFromDockstoreYml(dockstoreYaml12.getTools(), repository, gitReference, installationId, username, dockstoreYml, AppTool.class, messageWriter); | |||
|
|||
} catch (Exception ex) { | |||
} catch (RuntimeException | DockstoreYamlHelper.DockstoreYamlException | Error throwable) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok with the general idea, but if we're breaking with conventions because the Scala/Cromwell bridge code is acting against expectations, shouldn't this be a lot "closer" to the Scala bridge class to avoid "by-catch"?
@@ -34,7 +34,7 @@ public final class TransactionHelper { | |||
|
|||
private static final Logger LOG = LoggerFactory.getLogger(TransactionHelper.class); | |||
private Session session; | |||
private RuntimeException thrown; | |||
private Throwable thrown; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By the same token, the changes in this file seem to affect /updateUserWorkflows
which should maybe be deleted anyway (we probably don't use it)
Any of our code can potentially throw an Similarly, the And sure, we can also wrap Cromwell to translate the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good discussion...
Researching it, I lean towards catching this particular StackOverflowError error closer to Cromwell:
- Referring to the Joshua Bloch "Effective Java" book, he says, although it encompasses unchecked exceptions as well, which seems overkill: "If a program throws an unchecked exception or an error, it is generally the case that recovery is impossible and continued execution would do more harm than good"
- In particular, if we get an OOM error and catch it, I worry that this might put the web service in a indeterminate state.
- The JavaDoc for VirtualError, which OOM and StackOverflow subclass, says "Thrown to indicate that the Java Virtual Machine is broken or has run out of resources necessary for it to continue operating."
- Confusingly, Dockstore seems to recover just fine from a StackOverflowError in this case.
- But, according to a random user on Stack Overflow in this post, it is not always guaranteed you can recover from a StackOverflowError.
Currently, to give the user a bit of a clue as to what has happened, this PR includes the short class name of the Error in the github apps log message. For example, StackOverflowError. Could this leak any sensitive information? Please consider.
I feel it would, a little bit -- the user would have the keys on how to trigger VM exception(s) in our web service.
Writing/researching all the above, I do think it's best to catch the StackOverflowError, only, not errors in general, closer to Cromwell.
Yes, but my understanding of the Java documentation is that this is usually an OS problem so "we" as a platform can respond to it but not in the Java code. For example as previously noted
If our belief is that the Scala libraries are indicating a serious problem incorrectly, i.e. a non-serious problem that we can deal with, then reiterating that the e.g.
|
This seems like a little of a different problem though. For this, maybe we should log a failure first and then only overwrite it with a success afterwards. Thinking that if the whole application crashes due to a genuine error (or if Scala does have a real problem), then we would leave behind evidence of a problem in the github app logs. |
test |
❓ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Old PR with merge conflicts
Looking at stale PRs and noise in the daily reminders, is this PR still alive and/or worth keeping @svonworl ?
One other observation @svonworl I think "draft" PRs do not show up on the daily reminder, so if you wanted to keep this for inspiration/cherry-picking, that can be a route |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought this was fixed, perhaps by a different PR/branch?
The original ticket was to address bad registration behavior due to recursive WDLs: specifically, Ash noticed that when he attempted to register a recursive WDL, no entry appeared, and nothing appeared in the app logs. In a different PR, we fixed the original problem by patching the code closer to where the corresponding StackOverflowError occurs, for that specific case. However, the merged PR didn't address the general case of an Error thrown in the registration code. This PR was intended to remedy that, but there was pushback, and I was not able to merge. I still believe that when an Error is thrown, it is preferable for the surrounding code to attempt some basic cleanup and logging, rather than let the Error pass through, silently, unhandled. Thus, I kept this PR around, in case the team ever subscribes to that point of view. |
In the meantime, though, we get a reminder every day about it. Edit to add: ok, worked |
Description
This PR modifies the webservice's GitHub apps push-processing code to more gracefully handle an
Error
thrown by the code within. Previously,Exception
s were handled properly, butError
s slightly not: pushes resulting inError
s were logged as "success"-ful in the GitHub apps logs. We discovered this problem because Ash created a WDL workflow that referenced itself via a http url, which Cromwell happily imported until overflowing the stack, resulting in aStackOverflowError
.The integration test processes a similarly-recursive WDL, which it assumes will trigger a
StackOverflowError
. The test will break if Cromwell changes that behavior. I tried another approach whereinError
s were triggered via mocks, but it didn't work, and it'd be fragile, anyways...Currently, to give the user a bit of a clue as to what has happened, this PR includes the short class name of the
Error
in the github apps log message. For example,StackOverflowError
. Could this leak any sensitive information? Please consider.We catch
RuntimeException | Error
on purpose, rather thanThrowable
, so that if we change the inner code so it throws any new check exceptions, we'll know about it.The codebots really don't like us catching
Error
s, even though I believe it is the right call in this case. They also want us to update to more modern Java features and constructs. But I didn't see anything really objectionable in their feedback.I updated
TransactionHelper
to handleError
s similarly. There is probably other code scattered throughout our system that will skip logging/cleanup operations if anError
is thrown.Review Instructions
Fork https://github.com/dockstore-testing/recursive-wdl and register it on both qa and staging. Check the github apps logs on each. qa should report that the push failed, and staging should report that the push succeeded.
Issue
https://ucsc-cgl.atlassian.net/browse/DOCK-2299
#5274
Security
There may be related issues, I have contacted the appropriate personnel.
Please make sure that you've checked the following before submitting your pull request. Thanks!
mvn clean install
@RolesAllowed
annotation