Add option to return SUCCEED when training is completed with some failed job tasks #420

Open
charliechen211 opened this issue Jan 16, 2020 · 6 comments
@charliechen211
Contributor

image

The app is stopped when the chief worker (or a task of certain other job types) fails, but training continues when any other job task fails. That seems reasonable.

image

But in this case, training finishes and the application still returns FAILED on YARN.

I think this is a contradiction.
If training continues after one worker fails and eventually completes, why should the application return FAILED?
Users would be confused and would not know what to do next: rerun training for a more accurate model, or use the model from a run with a FAILED status?

They could configure #412 if they need to fail fast, or have the application return SUCCEEDED when training completes if they need training to continue.

I believe it would be better to add an extra option, which users can specify on a per-job basis, that controls whether the application returns SUCCEEDED or FAILED when training completes with some failed job tasks.
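A minimal sketch of the proposed decision logic, written in Python for brevity (TonY itself is Java). Every name here is hypothetical, including `final_status`, `stop_on_failure_job_types`, and the `succeed_with_failed_tasks` flag; none of them is an existing TonY configuration key or API. The sketch assumes the function is called only after training has ended.

```python
from dataclasses import dataclass
from enum import Enum


class FinalStatus(Enum):
    SUCCEEDED = "SUCCEEDED"
    FAILED = "FAILED"


@dataclass(frozen=True)
class TaskResult:
    job_type: str   # e.g. "chief", "worker", "ps"
    index: int
    exit_code: int  # 0 means the task exited cleanly

    @property
    def failed(self) -> bool:
        return self.exit_code != 0


def final_status(results, stop_on_failure_job_types, succeed_with_failed_tasks):
    """Decide the status reported to YARN once training has ended.

    stop_on_failure_job_types: job types whose failure stops the app
        (for example the chief worker).
    succeed_with_failed_tasks: the hypothetical per-job option proposed above.
    """
    failed = [r for r in results if r.failed]
    if not failed:
        return FinalStatus.SUCCEEDED
    # A failure in a critical job type always fails the application.
    if any(r.job_type in stop_on_failure_job_types for r in failed):
        return FinalStatus.FAILED
    # Training completed despite failures in non-critical tasks:
    # the per-job option decides what is reported.
    if succeed_with_failed_tasks:
        return FinalStatus.SUCCEEDED
    return FinalStatus.FAILED


if __name__ == "__main__":
    results = [
        TaskResult("chief", 0, 0),
        TaskResult("worker", 0, 0),
        TaskResult("worker", 1, 1),  # one non-critical worker failed
    ]
    print(final_status(results, {"chief"}, succeed_with_failed_tasks=True).value)
    print(final_status(results, {"chief"}, succeed_with_failed_tasks=False).value)
```

With the option enabled, the run above reports SUCCEEDED; with it disabled, the same run reports FAILED, which matches the current behavior described in this issue. A chief failure reports FAILED in both cases.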

@charliechen211
Contributor Author

@oliverhu

@oliverhu
Member

Agreed, but I'd like to refrain from making the APIs overly complicated. Any thoughts on what the configuration would look like?

@charliechen211
Contributor Author

@oliverhu
Hmm... I made a PR, though the configuration naming in it is not very good.
https://github.com/linkedin/TonY/pull/421/files

I also agree that it's better to keep the APIs simple. Alternatively, we could drop this configuration and have every application that completes with failed tasks return SUCCEEDED, which is more reasonable.

@charliechen211
Contributor Author

CI failed with the following log:
Too long with no output (exceeded 10m0s): context deadline exceeded

Link: https://circleci.com/gh/morenn520/TonY/2?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link

Maybe the tests ran for too long?

@oliverhu
Member

Not sure, will take a look tomorrow.

@oliverhu
Member

oliverhu commented Feb 1, 2020

Assigning to Ankit to follow up on this issue.
