Investigate if jobs can enter monitoring while in submitting stage #2121

Open
egede opened this issue Feb 27, 2023 · 2 comments

egede commented Feb 27, 2023

When a master job is submitting, it can take a very long time to submit the subjobs for certain remote backends (e.g. several hours for around 3000 subjobs). At the moment the subjobs are not monitored during this period, so if some have already finished, we effectively have dead time in the system. Monitoring during submission would have another benefit: if a job submission is terminated because the Ganga process gets killed, at least the already submitted subjobs will be recoverable. The current policy of reverting the job to the new status when a submission fails should probably be changed to make this work.

egede added the Core label Feb 27, 2023

egede commented May 1, 2023

@abhijeetsharma200 See further information here

At the moment the behaviour around submission and monitoring is the following:

  • On submission, a job is split into subjobs. Then, if keep_going is True, Ganga will attempt to submit all the subjobs, even if there are some failures along the way. The failed submissions will be left in the submitting state.
  • The overall state of a job is determined from the status of all subjobs. If a single subjob is in submitting, the complete job will be declared as submitting (see full status calculation).
  • Master jobs in submitting status are not monitored. The consequence is that monitoring will not start until all subjobs are submitted (which can take well over an hour), and if a single subjob submission fails, the job will never be monitored.
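
The effect of the last two points can be illustrated with a small sketch. This is not Ganga's status calculation: `master_status` and `is_monitored` are hypothetical names, and only the "any subjob in submitting makes the master submitting" rule comes from the description above. The other roll-up rules are placeholders.

```python
from typing import Iterable


def master_status(subjob_statuses: Iterable[str]) -> str:
    """Hypothetical, simplified roll-up of subjob statuses to a master status."""
    statuses = list(subjob_statuses)
    # Rule described above: one subjob in "submitting" is enough to make
    # the whole master job "submitting".
    if "submitting" in statuses:
        return "submitting"
    # Placeholder rules (assumptions, not Ganga's real calculation).
    if statuses and all(s == "completed" for s in statuses):
        return "completed"
    if "failed" in statuses:
        return "failed"
    return "running"


def is_monitored(status: str) -> bool:
    """Master jobs in "submitting" are skipped by monitoring (as described above)."""
    return status != "submitting"


# 2999 finished subjobs and one stuck submission: nothing gets monitored.
stuck = ["completed"] * 2999 + ["submitting"]
print(master_status(stuck), is_monitored(master_status(stuck)))
```

With a single failed submission left in submitting, the master job stays in submitting indefinitely, which is why it is never picked up by monitoring.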

I think we want a few changes in behaviour.

  • Subjobs that fail to submit should be put into the failed state rather than left in submitting.
  • Subjobs should start to be monitored as soon as they are submitted, even while other subjobs are not yet submitted. This code seems to indicate that this is already the case, but I do not think it is. Some careful debugging might be required to understand what actually happens.
  • The submitting status is a transient status. So if the Ganga process has been killed, then on startup all subjobs in the submitting status should be changed to failed.
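
The three proposed changes could look roughly like the following. This is a minimal sketch under stated assumptions, not a patch against Ganga: the `Subjob` class and the functions `submit_all`, `monitorable` and `recover_on_startup` are hypothetical, and the status names other than submitting, failed and new are assumed for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Subjob:
    """Hypothetical stand-in for a subjob; only id and status are modelled."""
    id: int
    status: str = "new"


def submit_all(subjobs: List[Subjob],
               submit_one: Callable[[Subjob], None],
               keep_going: bool = True) -> None:
    """Submit subjobs one by one; a failed submission ends up in "failed"."""
    for sj in subjobs:
        sj.status = "submitting"
        try:
            submit_one(sj)
            sj.status = "submitted"
        except Exception:
            # Proposed change 1: do not leave the subjob in "submitting".
            sj.status = "failed"
            if not keep_going:
                break


def monitorable(subjobs: List[Subjob]) -> List[Subjob]:
    """Proposed change 2: select subjobs by their own status,
    regardless of whether the master job is still submitting."""
    return [sj for sj in subjobs if sj.status in ("submitted", "running")]


def recover_on_startup(subjobs: List[Subjob]) -> None:
    """Proposed change 3: "submitting" is transient, so anything still in
    that state after a restart is marked as failed."""
    for sj in subjobs:
        if sj.status == "submitting":
            sj.status = "failed"
```

Selecting on the subjob status rather than the master status means a slow or partially failed submission no longer blocks monitoring of the subjobs that did get through.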


egede commented May 1, 2023

I think the first step will be to make a set of tests in which subjobs can be made to fail on command and to submit very slowly, as a way of testing whether monitoring starts while submission is still in progress. The TestSubmitter is a dummy backend that can be used for this.
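
As a starting point, the shape of such a test could be sketched outside Ganga like this. Everything here is hypothetical: `SlowFlakyBackend` is a stand-in written for this example and does not use the real TestSubmitter interface, and the monitoring loop is a plain thread rather than Ganga's monitoring service.

```python
import threading
import time


class SlowFlakyBackend:
    """Hypothetical dummy backend: slow submission, failures on command."""

    def __init__(self, delay=0.05, fail_ids=()):
        self.delay = delay
        self.fail_ids = set(fail_ids)

    def submit(self, subjob_id):
        time.sleep(self.delay)  # simulate a slow remote submission
        if subjob_id in self.fail_ids:
            raise RuntimeError(f"forced failure for subjob {subjob_id}")


def run_scenario(n_subjobs=5, delay=0.05, fail_ids=(2,)):
    backend = SlowFlakyBackend(delay, fail_ids)
    status = {i: "new" for i in range(n_subjobs)}
    lock = threading.Lock()
    done = threading.Event()
    seen_during_submission = set()

    def submitter():
        for i in range(n_subjobs):
            with lock:
                status[i] = "submitting"
            try:
                backend.submit(i)
                result = "submitted"
            except RuntimeError:
                result = "failed"
            with lock:
                status[i] = result
        done.set()

    def monitor():
        # Record which subjobs were visible as submitted while the
        # submission loop was still running.
        while not done.is_set():
            with lock:
                seen_during_submission.update(
                    i for i, s in status.items() if s == "submitted")
            time.sleep(delay / 5)

    threads = [threading.Thread(target=submitter),
               threading.Thread(target=monitor)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return status, seen_during_submission
```

A real test would replace the monitoring thread with Ganga's own monitoring and assert that already submitted subjobs change status before the last subjob has been submitted.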
