The training-operator UAT fails after bumping the k8s version to 1.28 on AKS, with AssertionError: Job pytorch-dist-mnist-gloo was not successful. This is the case for both CKF latest/edge and 1.8/stable. Unfortunately, we do not have more detailed logs due to a known limitation of how our UATs run: canonical/charmed-kubeflow-uats#4.
for 1.8 juju status
Model Controller Cloud/Region Version SLA Timestamp
kubeflow aks-controller aks/westeurope 3.1.8 unsupported 09:19:35Z
App Version Status Scale Charm Channel Rev Address Exposed Message
admission-webhook active 1 admission-webhook 1.8/stable 301 10.0.245.250 no
argo-controller active 1 argo-controller 3.3.10/stable 424 10.0.249.157 no
dex-auth active 1 dex-auth 2.36/stable 422 10.0.185.107 no
envoy res:oci-image@cc06b3e active 1 envoy 2.0/stable 101 10.0.244.49 no
istio-ingressgateway active 1 istio-gateway 1.17/stable 723 10.0.216.118 no
istio-pilot active 1 istio-pilot 1.17/stable 827 10.0.173.92 no
jupyter-controller active 1 jupyter-controller 1.8/stable 849 10.0.75.253 no
jupyter-ui active 1 jupyter-ui 1.8/stable 858 10.0.184.139 no
katib-controller res:oci-image@b6a6100 active 1 katib-controller 0.16/stable 446 10.0.106.5 no
katib-db 8.0.35-0ubuntu0.22.04.1 active 1 mysql-k8s 8.0/stable 127 10.0.233.45 no
katib-db-manager active 1 katib-db-manager 0.16/stable 411 10.0.188.36 no
katib-ui active 1 katib-ui 0.16/stable 422 10.0.126.70 no
kfp-api active 1 kfp-api 2.0/stable 1035 10.0.86.37 no
kfp-db 8.0.35-0ubuntu0.22.04.1 active 1 mysql-k8s 8.0/stable 127 10.0.57.119 no
kfp-metadata-writer active 1 kfp-metadata-writer 2.0/stable 118 10.0.61.100 no
kfp-persistence active 1 kfp-persistence 2.0/stable 1039 10.0.131.226 no
kfp-profile-controller active 1 kfp-profile-controller 2.0/stable 998 10.0.184.246 no
kfp-schedwf active 1 kfp-schedwf 2.0/stable 1052 10.0.234.76 no
kfp-ui active 1 kfp-ui 2.0/stable 1034 10.0.225.138 no
kfp-viewer active 1 kfp-viewer 2.0/stable 1064 10.0.229.253 no
kfp-viz active 1 kfp-viz 2.0/stable 985 10.0.134.29 no
knative-eventing active 1 knative-eventing 1.10/stable 353 10.0.44.250 no
knative-operator active 1 knative-operator 1.10/stable 328 10.0.68.158 no
knative-serving active 1 knative-serving 1.10/stable 354 10.0.61.216 no
kserve-controller active 1 kserve-controller 0.11/stable 523 10.0.11.66 no
kubeflow-dashboard active 1 kubeflow-dashboard 1.8/stable 454 10.0.147.5 no
kubeflow-profiles active 1 kubeflow-profiles 1.8/stable 355 10.0.68.5 no
kubeflow-roles active 1 kubeflow-roles 1.8/stable 187 10.0.196.222 no
kubeflow-volumes res:oci-image@2261827 active 1 kubeflow-volumes 1.8/stable 260 10.0.29.7 no
metacontroller-operator active 1 metacontroller-operator 3.0/stable 252 10.0.66.178 no
minio res:oci-image@1755999 active 1 minio ckf-1.8/stable 278 10.0.247.208 no
mlmd res:oci-image@44abc5d active 1 mlmd 1.14/stable 127 10.0.219.231 no
oidc-gatekeeper active 1 oidc-gatekeeper ckf-1.8/stable 350 10.0.38.12 no
pvcviewer-operator active 1 pvcviewer-operator 1.8/stable 30 10.0.238.124 no
seldon-controller-manager active 1 seldon-core 1.17/stable 664 10.0.22.127 no
tensorboard-controller active 1 tensorboard-controller 1.8/stable 257 10.0.44.54 no
tensorboards-web-app active 1 tensorboards-web-app 1.8/stable 245 10.0.204.180 no
training-operator active 1 training-operator 1.7/stable 347 10.0.91.235 no
Unit Workload Agent Address Ports Message
admission-webhook/0* active idle 10.244.0.10
argo-controller/0* active idle 10.244.1.6
dex-auth/0* active idle 10.244.0.12
envoy/0* active idle 10.244.1.34 9090,9901/TCP
istio-ingressgateway/0* active idle 10.244.1.7
istio-pilot/0* active idle 10.244.0.13
jupyter-controller/0* active idle 10.244.0.14
jupyter-ui/0* active idle 10.244.0.15
katib-controller/0* active idle 10.244.0.34 443,8080/TCP
katib-db-manager/0* active idle 10.244.1.10
katib-db/0* active idle 10.244.0.16 Primary
katib-ui/0* active idle 10.244.1.11
kfp-api/0* active idle 10.244.1.12
kfp-db/0* active idle 10.244.1.13 Primary
kfp-metadata-writer/0* active idle 10.244.0.18
kfp-persistence/0* active idle 10.244.0.20
kfp-profile-controller/0* active idle 10.244.0.22
kfp-schedwf/0* active idle 10.244.0.23
kfp-ui/0* active idle 10.244.0.24
kfp-viewer/0* active idle 10.244.1.15
kfp-viz/0* active idle 10.244.0.26
knative-eventing/0* active idle 10.244.0.17
knative-operator/0* active idle 10.244.0.28
knative-serving/0* active idle 10.244.0.21
kserve-controller/0* active idle 10.244.1.18
kubeflow-dashboard/0* active idle 10.244.0.27
kubeflow-profiles/0* active idle 10.244.1.19
kubeflow-roles/0* active idle 10.244.1.14
kubeflow-volumes/0* active idle 10.244.1.21 5000/TCP
metacontroller-operator/0* active idle 10.244.0.25
minio/0* active idle 10.244.0.35 9000-9001/TCP
mlmd/0* active idle 10.244.1.35 8080/TCP
oidc-gatekeeper/0* active idle 10.244.1.16
pvcviewer-operator/0* active idle 10.244.1.20
seldon-controller-manager/0* active idle 10.244.1.17
tensorboard-controller/0* active idle 10.244.0.30
tensorboards-web-app/0* active idle 10.244.0.31
training-operator/0* active idle 10.244.0.32
for latest/edge juju status
Model Controller Cloud/Region Version SLA Timestamp
kubeflow aks-controller aks/westeurope 3.1.8 unsupported 09:26:07Z
App Version Status Scale Charm Channel Rev Address Exposed Message
admission-webhook active 1 admission-webhook latest/edge 308 10.0.16.94 no
argo-controller active 1 argo-controller latest/edge 468 10.0.100.236 no
dex-auth active 1 dex-auth latest/edge 458 10.0.254.87 no
envoy active 1 envoy latest/edge 183 10.0.245.125 no
istio-ingressgateway active 1 istio-gateway latest/edge 900 10.0.44.117 no
istio-pilot active 1 istio-pilot latest/edge 872 10.0.21.240 no
jupyter-controller active 1 jupyter-controller latest/edge 936 10.0.131.139 no
jupyter-ui active 1 jupyter-ui latest/edge 856 10.0.90.58 no
katib-controller active 1 katib-controller latest/edge 526 10.0.152.253 no
katib-db 8.0.36-0ubuntu0.22.04.1 active 1 mysql-k8s 8.0/edge 138 10.0.152.2 no
katib-db-manager active 1 katib-db-manager latest/edge 490 10.0.236.4 no
katib-ui active 1 katib-ui latest/edge 501 10.0.92.22 no
kfp-api active 1 kfp-api latest/edge 1244 10.0.176.102 no
kfp-db 8.0.36-0ubuntu0.22.04.1 active 1 mysql-k8s 8.0/edge 138 10.0.3.211 no
kfp-metadata-writer active 1 kfp-metadata-writer latest/edge 298 10.0.201.207 no
kfp-persistence active 1 kfp-persistence latest/edge 1251 10.0.6.212 no
kfp-profile-controller active 1 kfp-profile-controller latest/edge 1209 10.0.253.135 no
kfp-schedwf active 1 kfp-schedwf latest/edge 1263 10.0.6.119 no
kfp-ui active 1 kfp-ui latest/edge 1246 10.0.221.196 no
kfp-viewer active 1 kfp-viewer latest/edge 1276 10.0.137.58 no
kfp-viz active 1 kfp-viz latest/edge 1197 10.0.127.237 no
knative-eventing active 1 knative-eventing latest/edge 393 10.0.110.159 no
knative-operator active 1 knative-operator latest/edge 368 10.0.145.205 no
knative-serving active 1 knative-serving latest/edge 394 10.0.147.68 no
kserve-controller active 1 kserve-controller latest/edge 538 10.0.154.87 no
kubeflow-dashboard active 1 kubeflow-dashboard latest/edge 517 10.0.52.98 no
kubeflow-profiles active 1 kubeflow-profiles latest/edge 379 10.0.164.223 no
kubeflow-roles active 1 kubeflow-roles latest/edge 207 10.0.205.101 no
kubeflow-volumes active 1 kubeflow-volumes latest/edge 279 10.0.83.113 no
metacontroller-operator active 1 metacontroller-operator latest/edge 280 10.0.153.8 no
minio res:oci-image@1755999 active 1 minio latest/edge 306 10.0.52.197 no
mlmd active 1 mlmd latest/edge 174 10.0.188.218 no
oidc-gatekeeper active 1 oidc-gatekeeper latest/edge 371 10.0.125.250 no
pvcviewer-operator active 1 pvcviewer-operator latest/edge 74 10.0.97.108 no
seldon-controller-manager active 1 seldon-core latest/edge 691 10.0.87.195 no
tensorboard-controller active 1 tensorboard-controller latest/edge 281 10.0.30.201 no
tensorboards-web-app active 1 tensorboards-web-app latest/edge 269 10.0.24.183 no
training-operator active 1 training-operator latest/edge 378 10.0.16.237 no
Unit Workload Agent Address Ports Message
admission-webhook/0* active idle 10.244.0.7
argo-controller/0* active idle 10.244.1.9
dex-auth/0* active idle 10.244.0.8
envoy/0* active idle 10.244.1.11
istio-ingressgateway/0* active idle 10.244.1.10
istio-pilot/0* active idle 10.244.0.9
jupyter-controller/0* active idle 10.244.1.12
jupyter-ui/0* active idle 10.244.1.14
katib-controller/0* active idle 10.244.1.15
katib-db-manager/0* active idle 10.244.1.16
katib-db/0* active idle 10.244.0.12 Primary
katib-ui/0* active idle 10.244.1.17
kfp-api/0* active idle 10.244.1.18
kfp-db/0* active idle 10.244.1.19 Primary
kfp-metadata-writer/0* active idle 10.244.0.13
kfp-persistence/0* active idle 10.244.0.15
kfp-profile-controller/0* active idle 10.244.0.16
kfp-schedwf/0* active idle 10.244.0.18
kfp-ui/0* active idle 10.244.1.20
kfp-viewer/0* active idle 10.244.0.19
kfp-viz/0* active idle 10.244.0.20
knative-eventing/0* active idle 10.244.0.14
knative-operator/0* active idle 10.244.0.22
knative-serving/0* active idle 10.244.0.17
kserve-controller/0* active idle 10.244.1.25
kubeflow-dashboard/0* active idle 10.244.1.23
kubeflow-profiles/0* active idle 10.244.0.24
kubeflow-roles/0* active idle 10.244.1.21
kubeflow-volumes/0* active idle 10.244.0.21
metacontroller-operator/0* active idle 10.244.1.22
minio/0* active idle 10.244.1.24 9000-9001/TCP
mlmd/0* active idle 10.244.1.28
oidc-gatekeeper/0* active idle 10.244.0.23
pvcviewer-operator/0* active idle 10.244.0.26
seldon-controller-manager/0* active idle 10.244.1.26
tensorboard-controller/0* active idle 10.244.1.27
tensorboards-web-app/0* active idle 10.244.0.25
training-operator/0* active idle 10.244.1.29
Relevant Log Output
test_notebooks.py::test_notebook[training-integration]
-------------------------------- live log call ---------------------------------
INFO test_notebooks:test_notebooks.py:44 Running training-integration.ipynb...
ERROR test_notebooks:test_notebooks.py:58 Cell In[4], line 8, in assert_job_succeeded(client, job_name, job_kind)
1 @retry(
2 wait=wait_exponential(multiplier=2, min=1, max=30),
3 stop=stop_after_attempt(50),
4 reraise=True,
5 )
6 def assert_job_succeeded(client, job_name, job_kind):
7 """Wait for the Job to complete successfully."""
----> 8 assert client.is_job_succeeded(
9 name=job_name, job_kind=job_kind
10 ), f"Job ***job_name*** was not successful."
AssertionError: Job pytorch-dist-mnist-gloo was not successful.
FAILED [100%]
=================================== FAILURES ===================================
_______________________ test_notebook[katib-integration] _______________________
test_notebook = '/tests/.worktrees/4ca5f8e7474193b125daecbd2dc157f3fe1ab017/tests/notebooks/katib/katib-integration.ipynb'
@pytest.mark.ipynb
@pytest.mark.parametrize(
# notebook - ipynb file to execute
"test_notebook",
NOTEBOOKS.values(),
ids=NOTEBOOKS.keys(),
)
def test_notebook(test_notebook):
"""Test Notebook Generic Wrapper."""
os.chdir(os.path.dirname(test_notebook))
with open(test_notebook) as nb:
notebook = nbformat.read(nb, as_version=nbformat.NO_CONVERT)
ep = ExecutePreprocessor(
timeout=-1, kernel_name="python3", on_notebook_start=install_python_requirements
)
ep.skip_cells_with_tag = "pytest-skip"
try:
log.info(f"Running ***os.path.basename(test_notebook)***...")
output_notebook, _ = ep.preprocess(notebook, ***"metadata": ***"path": "./"***)
# persist the notebook output to the original file for debugging purposes
save_notebook(output_notebook, test_notebook)
except CellExecutionError as e:
# handle underlying error
pytest.fail(f"Notebook execution failed with ***e.ename***: ***e.evalue***")
for cell in output_notebook.cells:
metadata = cell.get("metadata", dict)
if "raises-exception" in metadata.get("tags", []):
for cell_output in cell.outputs:
if cell_output.output_type == "error":
# extract the error message from the cell output
log.error(format_error_message(cell_output.traceback))
> pytest.fail(cell_output.traceback[-1])
E Failed: AssertionError: Katib Experiment was not successful.
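Since the assertion above only reports that the job "was not successful", one possible way to get more detail out of the notebook is to include the job's last reported conditions in the failure message. The following is a minimal sketch under assumptions, not the UATs' actual code: `wait_for_job_success`, `describe`, `JobNotSuccessful`, and the `get_conditions` callable are hypothetical names, and the condition dictionaries are assumed to follow the usual Kubernetes-style shape (`type`, `status`, `reason`, `message`). It uses a plain polling loop from the standard library instead of the `@retry` decorator seen in the log, and, unlike the original, it stops retrying as soon as a `Failed` condition is seen.

```python
import time


class JobNotSuccessful(AssertionError):
    """Raised when a job does not reach a successful state (hypothetical name)."""


def describe(condition):
    """Render one condition dict as 'type: reason: message', skipping empty parts."""
    parts = [condition.get("type", "Unknown")]
    for key in ("reason", "message"):
        if condition.get(key):
            parts.append(condition[key])
    return ": ".join(parts)


def wait_for_job_success(get_conditions, job_name, attempts=50, delay=1.0,
                         sleep=time.sleep):
    """Poll until the job reports a Succeeded condition.

    get_conditions is an assumed callable that takes the job name and returns
    a list of condition dicts; it stands in for whatever client call is used.
    """
    conditions = []
    for _ in range(attempts):
        conditions = get_conditions(job_name)
        # Only conditions whose status is "True" are currently in effect.
        active = {c.get("type") for c in conditions if c.get("status") == "True"}
        if "Succeeded" in active:
            return conditions
        if "Failed" in active:
            break  # a failed job will not recover, so stop polling early
        sleep(delay)
    summary = "; ".join(describe(c) for c in conditions) or "no conditions reported"
    raise JobNotSuccessful(
        f"Job {job_name} was not successful. Last conditions: {summary}"
    )
```

In a notebook, `get_conditions` could wrap whichever client call returns the job's status conditions; the raised message would then carry the reason and message of the failing condition into the test log instead of only the job name.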
Additional Context
No response
To Reproduce
Run CI for k8s version 1.28
Environment
AKS k8s 1.28
Juju 3.1