Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ZEPPELIN-5981] Check binded interpreter inside Paragraph.jobAbort() #4693

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

elbakramer
Copy link
Contributor

What is this PR for?

Inside an open zeppelin notebook, after running a paragraph by clicking the play icon button, canceling the running paragraph by clicking the pause icon button (or the Ctrl+Option+C shortcut) does not interrupt the running process and cancel the paragraph.

I've checked that CANCEL_PARAGRAPH message gets well passed through the websocket connection but server was somehow completely ignoring the message.

After some more investigation, I've found that the reason for that was the Paragraph's jobAbort() method, which does nothing due to the null interpreter member.

I've checked some other methods of Paragraph such as completion(), execute(), jobRun(), recover() and they were initializing interpreter before doing something by calling getBindedInterpreter(), so I've changed the jobAbort() method to do the same.

Other than that, I have some doubts on whether it's right thing to return true on failing what was requested.

But after all, I just sticked to the original behavior (returning true for every case) to prevent some unexpected side effects.

What type of PR is it?

Bug Fix

Todos

N/A

What is the Jira issue?

How should this be tested?

  • Run a paragraph that can be canceled, like a python interpreter paragraph with for i in range(10): time.sleep(1)
  • Click the cancel button and check if it gets canceled .

Screenshots (if appropriate)

N/A

Questions:

  • Does the license files need to update? No
  • Is there breaking changes for older versions? No
  • Does this needs documentation? No

@jongyoul
Copy link
Member

@elbakramer Thank you for the contribution. BTW, could you please check the CI? All tests for core-modules failed. I hope you investigate and fix them before starting reviews.

@elbakramer
Copy link
Contributor Author

@jongyoul I think the failures are due to CI actions taking too long, hopefully not by the code change. Could you please check if it's the CI issue first? Or is there any way to investigate the CI or just rerun by myself?
스크린샷 2023-11-21 오후 2 33 18

@jongyoul
Copy link
Member

@elbakramer I don't think we have any critical issues in CI now but let me restart it. FYI, I checked the log briefly and it looks like hanging for a long time in some tests.

@jongyoul
Copy link
Member

Hello, I re-ran the test but there are still some error with the same situation before. I checked the current CI and it didn't have a problem. Could you please check why the CI stops and hangs?

@elbakramer
Copy link
Contributor Author

@jongyoul

I think now CI does not hang and all tests for core modules finish properly with the green check icons. But I'm still curious whether the errors and warnings shown in the Annotations block below should still be resolved. Could you please check if it's ok or not?

https://github.com/elbakramer/zeppelin/actions/runs/7013954933

스크린샷 2023-12-07 오후 6 01 01

What I've found about the hang is that, it seems like the getBindedInterpreter() method works like the get-or-create manner and it unintentionally creates a new interpreter instance upon jobAbort() if none exists. And that unexpected new interpreter makes all tests hang. My first guess about that method was like get-or-null manner but turned out to be wrong.

In order to prevent that creation, I've added some checks before actually calling the method:

  • do nothing if the Job is aborted already
  • only call getBindedInterpreter() if
    • the target for the abort is considered to exist already
    • and current this.interpreter member is null (and should bind to the existing one in order to cancel that)

I couldn't think of a direct way to check the existence of the interpreter instance. So it just indirectly checks whether if the job is currently running or not, assuming that there would be an interpreter instance if the job is in a running state.

@jongyoul
Copy link
Member

jongyoul commented Dec 7, 2023

Thank you for checking and fixing it. I think your analysis looks good. Let's see how the CI goes. I ran it again.

@jongyoul
Copy link
Member

jongyoul commented Feb 5, 2024

Could you please rebase this PR to trigger the latest CI again? one of them still fails.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants