
Hanging training jobs (old version of pipeline?!?) #68

Open
jfri3d opened this issue Nov 6, 2019 · 0 comments
Labels
bug Something isn't working
jfri3d commented Nov 6, 2019

What is the current behaviour?

Hanging training jobs due to an old version of the Training Pipeline.

What is the expected behaviour?

Not this!

How to reproduce? (e.g. logs, minimal example, etc...)

Deploying a training job occasionally deploys multiple jobs due to some "lag" between versions of the Training Pipeline.

$ kaos train list
+--------------------------------------------------------------------------------------------------+
|                                             TRAINING                                             |
+-----+----------+----------+----------------------------------+---------------------+-------------+
| ind | duration | hyperopt |              job_id              |       started       |    state    |
+-----+----------+----------+----------------------------------+---------------------+-------------+
|  0  |    97    |  False   | 862e2de2a8c3424e8b39839831040a95 | 2019-11-06 13:36:24 | JOB_SUCCESS |
+-----+----------+----------+----------------------------------+---------------------+-------------+
|  1  |    24    |  False   | 1d35a8a88af04fb4b5e8a8c4087e0271 | 2019-11-06 13:21:04 | JOB_FAILURE |
+-----+----------+----------+----------------------------------+---------------------+-------------+
|  2  |    10    |  False   | 4c1bc7a26f5b4556bfa4ecf6bac60b1f | 2019-11-06 13:18:38 | JOB_SUCCESS |
+-----+----------+----------+----------------------------------+---------------------+-------------+
|  3  |    23    |  False   | 135aefb208c8410ab5381b78547b36b1 | 2019-11-06 12:48:15 | JOB_SUCCESS |
+-----+----------+----------+----------------------------------+---------------------+-------------+
|  4  |    1     |  False   | 77e866c4c3dc49109cbce71f45b8d0e3 | 2019-11-06 11:44:49 | JOB_FAILURE |
+-----+----------+----------+----------------------------------+---------------------+-------------+
|  5  |    ?     |    ?     | c57d7ce324e74efebc7a77cdc129b41c | 2019-11-06 13:36:10 | JOB_RUNNING |
+-----+----------+----------+----------------------------------+---------------------+-------------+
|  6  |    ?     |    ?     | 820ebfb0e4594001ad701c9c75415ec8 | 2019-11-06 13:20:48 | JOB_RUNNING |
+-----+----------+----------+----------------------------------+---------------------+-------------+
|  7  |    ?     |    ?     | 308f317cf37d47df9b23e123f6210cb6 | 2019-11-06 12:47:58 | JOB_RUNNING |
+-----+----------+----------+----------------------------------+---------------------+-------------+
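The hanging jobs are easy to spot in the table above: they sit in JOB_RUNNING with `?` for both duration and hyperopt. As a quick diagnostic, the `kaos train list` output can be scanned for such jobs. The parser below is a minimal sketch written for this issue (it is not part of kaos) and assumes the ASCII-table layout shown above:

```python
def find_hanging_jobs(table_text):
    """Scan `kaos train list` output for jobs stuck in JOB_RUNNING.

    Illustrative only -- assumes the pipe-delimited table layout
    shown in this issue, with six columns per data row.
    """
    hanging = []
    for line in table_text.splitlines():
        # Data rows look like: | ind | duration | hyperopt | job_id | started | state |
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        if len(cells) == 6 and cells[-1] == "JOB_RUNNING":
            job_id, started = cells[3], cells[4]
            hanging.append((job_id, started))
    return hanging
```

Running this against the table above would flag the three JOB_RUNNING entries (indices 5-7), all of which were started *before* newer jobs that already finished.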

Logs are also included:

$ kaos train logs -j 308f317cf37d47df9b23e123f6210cb6
[2019-11-06 12:48:06] skipping job 77e866c4c3dc49109cbce71f45b8d0e3 as it is already in state JOB_FAILURE
[2019-11-06 12:48:36] skipping job 308f317cf37d47df9b23e123f6210cb6 as it uses old pipeline version 5
[2019-11-06 12:48:36] processing job 135aefb208c8410ab5381b78547b36b1
[2019-11-06 12:48:36] blocking on parent commit "f2f542a48f3b49e6926e31c93669a6d1" before writing to output commit "2178eb2a76834f23a6b6208f3217e7dd"
[2019-11-06 12:48:36] starting to download data
[2019-11-06 12:48:36] finished downloading data after 508.2152ms
[2019-11-06 12:48:36] beginning to run user code
[2019-11-06 12:48:38] Hello worldddddd!
[2019-11-06 12:48:38] We are training! Are we?
[2019-11-06 12:48:38] cwd        /opt/program
[2019-11-06 12:48:38] basedir    /opt/program
[2019-11-06 12:48:38] os.listdir()       ['stargazers', 'dist', 'stargazers.egg-info', 'build', 'train', 'README.md', 'requirements.txt', '.DS_Store', 'setup.py']
[2019-11-06 12:48:38] /pfs/hyper         ['params.null']
[2019-11-06 12:48:38] /pfs       ['data', 'build-train', 'hyper', '.scratch', 'out']
[2019-11-06 12:48:38] finished running user code after 1.8068292s
[2019-11-06 12:48:38] starting to upload output
[2019-11-06 12:48:38] finished uploading output after 13.5419ms
[2019-11-06 12:48:38] starting to merge chunk
[2019-11-06 12:48:38] finished merging chunk after 481.2µs
[2019-11-06 12:48:38] starting to merge output
[2019-11-06 12:48:38] finished merging output after 19.8903ms
[2019-11-06 12:48:38] job "135aefb208c8410ab5381b78547b36b1" put in terminal state "JOB_SUCCESS"; cancelling
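The telling log line is `skipping job ... as it uses old pipeline version 5`: the worker skips jobs bound to a stale pipeline version but apparently never moves them into a terminal state, so they stay JOB_RUNNING forever. The sketch below is a guess at that control flow, not actual kaos or Pachyderm source, written only to illustrate the suspected bug:

```python
def process_jobs(jobs, current_pipeline_version):
    """Hypothetical sketch of the suspected worker loop.

    Each job is a dict with "state" and "pipeline_version" keys
    (names assumed for illustration). Jobs on an old pipeline
    version are skipped WITHOUT a state transition -- which would
    explain why they appear to hang in JOB_RUNNING.
    """
    for job in jobs:
        if job["state"] in ("JOB_SUCCESS", "JOB_FAILURE"):
            # matches: "skipping job ... as it is already in state JOB_FAILURE"
            continue
        if job["pipeline_version"] != current_pipeline_version:
            # matches: "skipping job ... as it uses old pipeline version 5"
            # Note: no terminal state is set here, so the job stays
            # JOB_RUNNING indefinitely -- the hang reported above.
            continue
        # stand-in for actually running the user code
        job["state"] = "JOB_SUCCESS"
    return jobs
```

If this reading is right, the fix would be to cancel (or mark failed) any non-terminal job whose pipeline version no longer matches, instead of silently skipping it.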

Context (Environment)

LOCAL

@jfri3d jfri3d added the bug Something isn't working label Nov 6, 2019
@jfri3d jfri3d added this to To do in kaos via automation Nov 6, 2019