ci(aks): Training-operator UAT fails on AKS k8s 1.28 #894

Open
orfeas-k opened this issue May 9, 2024 · 1 comment
Labels
bug Something isn't working

Comments

Contributor

orfeas-k commented May 9, 2024

Bug Description

The training-operator UAT started failing on AKS after bumping k8s-version to 1.28, with the error "AssertionError: Job pytorch-dist-mnist-gloo was not successful." This happens for both CKF latest/edge and 1.8/stable. Unfortunately, more detailed logs are not available because of a known limitation in how our UATs run (canonical/charmed-kubeflow-uats#4).
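
Since the UAT only surfaces the final assertion, one way to get more detail would be to print the job's status conditions before failing. The following is a minimal sketch, not part of the UATs. It assumes the PyTorchJob has already been fetched as a plain dict and that it follows the training-operator `status.conditions` layout with `type`, `status`, `reason`, and `message` fields. The helper name `summarize_job_conditions` and the sample data are hypothetical and are not taken from the CI run.

```python
from typing import Any


def summarize_job_conditions(job: dict[str, Any]) -> list[str]:
    """Return one readable line per entry in job['status']['conditions']."""
    # Missing or empty status is treated as "no conditions reported yet".
    conditions = job.get("status", {}).get("conditions") or []
    lines = []
    for cond in conditions:
        lines.append(
            f"{cond.get('type', '?')}={cond.get('status', '?')} "
            f"reason={cond.get('reason', '')} message={cond.get('message', '')}"
        )
    return lines


# Hypothetical sample shaped like a failed PyTorchJob (assumed values).
sample_job = {
    "metadata": {"name": "pytorch-dist-mnist-gloo"},
    "status": {
        "conditions": [
            {"type": "Created", "status": "True",
             "reason": "PyTorchJobCreated", "message": "created"},
            {"type": "Failed", "status": "True",
             "reason": "PyTorchJobFailed", "message": "worker exited"},
        ]
    },
}

for line in summarize_job_conditions(sample_job):
    print(line)
```

Logging these lines just before the assertion would show whether the job failed outright or was still running when the retries ran out.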

Example runs

To Reproduce

Run CI for k8s version 1.28

Environment

AKS k8s 1.28
Juju 3.1

juju status for 1.8/stable

Model     Controller      Cloud/Region    Version  SLA          Timestamp
kubeflow  aks-controller  aks/westeurope  3.1.8    unsupported  09:19:35Z

App                        Version                  Status  Scale  Charm                    Channel          Rev  Address       Exposed  Message
admission-webhook                                   active      1  admission-webhook        1.8/stable       301  10.0.245.250  no       
argo-controller                                     active      1  argo-controller          3.3.10/stable    424  10.0.249.157  no       
dex-auth                                            active      1  dex-auth                 2.36/stable      422  10.0.185.107  no       
envoy                      res:oci-image@cc06b3e    active      1  envoy                    2.0/stable       101  10.0.244.49   no       
istio-ingressgateway                                active      1  istio-gateway            1.17/stable      723  10.0.216.118  no       
istio-pilot                                         active      1  istio-pilot              1.17/stable      827  10.0.173.92   no       
jupyter-controller                                  active      1  jupyter-controller       1.8/stable       849  10.0.75.253   no       
jupyter-ui                                          active      1  jupyter-ui               1.8/stable       858  10.0.184.139  no       
katib-controller           res:oci-image@b6a6100    active      1  katib-controller         0.16/stable      446  10.0.106.5    no       
katib-db                   8.0.35-0ubuntu0.22.04.1  active      1  mysql-k8s                8.0/stable       127  10.0.233.45   no       
katib-db-manager                                    active      1  katib-db-manager         0.16/stable      411  10.0.188.36   no       
katib-ui                                            active      1  katib-ui                 0.16/stable      422  10.0.126.70   no       
kfp-api                                             active      1  kfp-api                  2.0/stable      1035  10.0.86.37    no       
kfp-db                     8.0.35-0ubuntu0.22.04.1  active      1  mysql-k8s                8.0/stable       127  10.0.57.119   no       
kfp-metadata-writer                                 active      1  kfp-metadata-writer      2.0/stable       118  10.0.61.100   no       
kfp-persistence                                     active      1  kfp-persistence          2.0/stable      1039  10.0.131.226  no       
kfp-profile-controller                              active      1  kfp-profile-controller   2.0/stable       998  10.0.184.246  no       
kfp-schedwf                                         active      1  kfp-schedwf              2.0/stable      1052  10.0.234.76   no       
kfp-ui                                              active      1  kfp-ui                   2.0/stable      1034  10.0.225.138  no       
kfp-viewer                                          active      1  kfp-viewer               2.0/stable      1064  10.0.229.253  no       
kfp-viz                                             active      1  kfp-viz                  2.0/stable       985  10.0.134.29   no       
knative-eventing                                    active      1  knative-eventing         1.10/stable      353  10.0.44.250   no       
knative-operator                                    active      1  knative-operator         1.10/stable      328  10.0.68.158   no       
knative-serving                                     active      1  knative-serving          1.10/stable      354  10.0.61.216   no       
kserve-controller                                   active      1  kserve-controller        0.11/stable      523  10.0.11.66    no       
kubeflow-dashboard                                  active      1  kubeflow-dashboard       1.8/stable       454  10.0.147.5    no       
kubeflow-profiles                                   active      1  kubeflow-profiles        1.8/stable       355  10.0.68.5     no       
kubeflow-roles                                      active      1  kubeflow-roles           1.8/stable       187  10.0.196.222  no       
kubeflow-volumes           res:oci-image@2261827    active      1  kubeflow-volumes         1.8/stable       260  10.0.29.7     no       
metacontroller-operator                             active      1  metacontroller-operator  3.0/stable       252  10.0.66.178   no       
minio                      res:oci-image@1755999    active      1  minio                    ckf-1.8/stable   278  10.0.247.208  no       
mlmd                       res:oci-image@44abc5d    active      1  mlmd                     1.14/stable      127  10.0.219.231  no       
oidc-gatekeeper                                     active      1  oidc-gatekeeper          ckf-1.8/stable   350  10.0.38.12    no       
pvcviewer-operator                                  active      1  pvcviewer-operator       1.8/stable        30  10.0.238.124  no       
seldon-controller-manager                           active      1  seldon-core              1.17/stable      664  10.0.22.127   no       
tensorboard-controller                              active      1  tensorboard-controller   1.8/stable       257  10.0.44.54    no       
tensorboards-web-app                                active      1  tensorboards-web-app     1.8/stable       245  10.0.204.180  no       
training-operator                                   active      1  training-operator        1.7/stable       347  10.0.91.235   no       

Unit                          Workload  Agent  Address      Ports          Message
admission-webhook/0*          active    idle   10.244.0.10                 
argo-controller/0*            active    idle   10.244.1.6                  
dex-auth/0*                   active    idle   10.244.0.12                 
envoy/0*                      active    idle   10.244.1.34  9090,9901/TCP  
istio-ingressgateway/0*       active    idle   10.244.1.7                  
istio-pilot/0*                active    idle   10.244.0.13                 
jupyter-controller/0*         active    idle   10.244.0.14                 
jupyter-ui/0*                 active    idle   10.244.0.15                 
katib-controller/0*           active    idle   10.244.0.34  443,8080/TCP   
katib-db-manager/0*           active    idle   10.244.1.10                 
katib-db/0*                   active    idle   10.244.0.16                 Primary
katib-ui/0*                   active    idle   10.244.1.11                 
kfp-api/0*                    active    idle   10.244.1.12                 
kfp-db/0*                     active    idle   10.244.1.13                 Primary
kfp-metadata-writer/0*        active    idle   10.244.0.18                 
kfp-persistence/0*            active    idle   10.244.0.20                 
kfp-profile-controller/0*     active    idle   10.244.0.22                 
kfp-schedwf/0*                active    idle   10.244.0.23                 
kfp-ui/0*                     active    idle   10.244.0.24                 
kfp-viewer/0*                 active    idle   10.244.1.15                 
kfp-viz/0*                    active    idle   10.244.0.26                 
knative-eventing/0*           active    idle   10.244.0.17                 
knative-operator/0*           active    idle   10.244.0.28                 
knative-serving/0*            active    idle   10.244.0.21                 
kserve-controller/0*          active    idle   10.244.1.18                 
kubeflow-dashboard/0*         active    idle   10.244.0.27                 
kubeflow-profiles/0*          active    idle   10.244.1.19                 
kubeflow-roles/0*             active    idle   10.244.1.14                 
kubeflow-volumes/0*           active    idle   10.244.1.21  5000/TCP       
metacontroller-operator/0*    active    idle   10.244.0.25                 
minio/0*                      active    idle   10.244.0.35  9000-9001/TCP  
mlmd/0*                       active    idle   10.244.1.35  8080/TCP       
oidc-gatekeeper/0*            active    idle   10.244.1.16                 
pvcviewer-operator/0*         active    idle   10.244.1.20                 
seldon-controller-manager/0*  active    idle   10.244.1.17                 
tensorboard-controller/0*     active    idle   10.244.0.30                 
tensorboards-web-app/0*       active    idle   10.244.0.31                 
training-operator/0*          active    idle   10.244.0.32

juju status for latest/edge

Model     Controller      Cloud/Region    Version  SLA          Timestamp
kubeflow  aks-controller  aks/westeurope  3.1.8    unsupported  09:26:07Z

App                        Version                  Status  Scale  Charm                    Channel       Rev  Address       Exposed  Message
admission-webhook                                   active      1  admission-webhook        latest/edge   308  10.0.16.94    no       
argo-controller                                     active      1  argo-controller          latest/edge   468  10.0.100.236  no       
dex-auth                                            active      1  dex-auth                 latest/edge   458  10.0.254.87   no       
envoy                                               active      1  envoy                    latest/edge   183  10.0.245.125  no       
istio-ingressgateway                                active      1  istio-gateway            latest/edge   900  10.0.44.117   no       
istio-pilot                                         active      1  istio-pilot              latest/edge   872  10.0.21.240   no       
jupyter-controller                                  active      1  jupyter-controller       latest/edge   936  10.0.131.139  no       
jupyter-ui                                          active      1  jupyter-ui               latest/edge   856  10.0.90.58    no       
katib-controller                                    active      1  katib-controller         latest/edge   526  10.0.152.253  no       
katib-db                   8.0.36-0ubuntu0.22.04.1  active      1  mysql-k8s                8.0/edge      138  10.0.152.2    no       
katib-db-manager                                    active      1  katib-db-manager         latest/edge   490  10.0.236.4    no       
katib-ui                                            active      1  katib-ui                 latest/edge   501  10.0.92.22    no       
kfp-api                                             active      1  kfp-api                  latest/edge  1244  10.0.176.102  no       
kfp-db                     8.0.36-0ubuntu0.22.04.1  active      1  mysql-k8s                8.0/edge      138  10.0.3.211    no       
kfp-metadata-writer                                 active      1  kfp-metadata-writer      latest/edge   298  10.0.201.207  no       
kfp-persistence                                     active      1  kfp-persistence          latest/edge  1251  10.0.6.212    no       
kfp-profile-controller                              active      1  kfp-profile-controller   latest/edge  1209  10.0.253.135  no       
kfp-schedwf                                         active      1  kfp-schedwf              latest/edge  1263  10.0.6.119    no       
kfp-ui                                              active      1  kfp-ui                   latest/edge  1246  10.0.221.196  no       
kfp-viewer                                          active      1  kfp-viewer               latest/edge  1276  10.0.137.58   no       
kfp-viz                                             active      1  kfp-viz                  latest/edge  1197  10.0.127.237  no       
knative-eventing                                    active      1  knative-eventing         latest/edge   393  10.0.110.159  no       
knative-operator                                    active      1  knative-operator         latest/edge   368  10.0.145.205  no       
knative-serving                                     active      1  knative-serving          latest/edge   394  10.0.147.68   no       
kserve-controller                                   active      1  kserve-controller        latest/edge   538  10.0.154.87   no       
kubeflow-dashboard                                  active      1  kubeflow-dashboard       latest/edge   517  10.0.52.98    no       
kubeflow-profiles                                   active      1  kubeflow-profiles        latest/edge   379  10.0.164.223  no       
kubeflow-roles                                      active      1  kubeflow-roles           latest/edge   207  10.0.205.101  no       
kubeflow-volumes                                    active      1  kubeflow-volumes         latest/edge   279  10.0.83.113   no       
metacontroller-operator                             active      1  metacontroller-operator  latest/edge   280  10.0.153.8    no       
minio                      res:oci-image@1755999    active      1  minio                    latest/edge   306  10.0.52.197   no       
mlmd                                                active      1  mlmd                     latest/edge   174  10.0.188.218  no       
oidc-gatekeeper                                     active      1  oidc-gatekeeper          latest/edge   371  10.0.125.250  no       
pvcviewer-operator                                  active      1  pvcviewer-operator       latest/edge    74  10.0.97.108   no       
seldon-controller-manager                           active      1  seldon-core              latest/edge   691  10.0.87.195   no       
tensorboard-controller                              active      1  tensorboard-controller   latest/edge   281  10.0.30.201   no       
tensorboards-web-app                                active      1  tensorboards-web-app     latest/edge   269  10.0.24.183   no       
training-operator                                   active      1  training-operator        latest/edge   378  10.0.16.237   no       

Unit                          Workload  Agent  Address      Ports          Message
admission-webhook/0*          active    idle   10.244.0.7                  
argo-controller/0*            active    idle   10.244.1.9                  
dex-auth/0*                   active    idle   10.244.0.8                  
envoy/0*                      active    idle   10.244.1.11                 
istio-ingressgateway/0*       active    idle   10.244.1.10                 
istio-pilot/0*                active    idle   10.244.0.9                  
jupyter-controller/0*         active    idle   10.244.1.12                 
jupyter-ui/0*                 active    idle   10.244.1.14                 
katib-controller/0*           active    idle   10.244.1.15                 
katib-db-manager/0*           active    idle   10.244.1.16                 
katib-db/0*                   active    idle   10.244.0.12                 Primary
katib-ui/0*                   active    idle   10.244.1.17                 
kfp-api/0*                    active    idle   10.244.1.18                 
kfp-db/0*                     active    idle   10.244.1.19                 Primary
kfp-metadata-writer/0*        active    idle   10.244.0.13                 
kfp-persistence/0*            active    idle   10.244.0.15                 
kfp-profile-controller/0*     active    idle   10.244.0.16                 
kfp-schedwf/0*                active    idle   10.244.0.18                 
kfp-ui/0*                     active    idle   10.244.1.20                 
kfp-viewer/0*                 active    idle   10.244.0.19                 
kfp-viz/0*                    active    idle   10.244.0.20                 
knative-eventing/0*           active    idle   10.244.0.14                 
knative-operator/0*           active    idle   10.244.0.22                 
knative-serving/0*            active    idle   10.244.0.17                 
kserve-controller/0*          active    idle   10.244.1.25                 
kubeflow-dashboard/0*         active    idle   10.244.1.23                 
kubeflow-profiles/0*          active    idle   10.244.0.24                 
kubeflow-roles/0*             active    idle   10.244.1.21                 
kubeflow-volumes/0*           active    idle   10.244.0.21                 
metacontroller-operator/0*    active    idle   10.244.1.22                 
minio/0*                      active    idle   10.244.1.24  9000-9001/TCP  
mlmd/0*                       active    idle   10.244.1.28                 
oidc-gatekeeper/0*            active    idle   10.244.0.23                 
pvcviewer-operator/0*         active    idle   10.244.0.26                 
seldon-controller-manager/0*  active    idle   10.244.1.26                 
tensorboard-controller/0*     active    idle   10.244.1.27                 
tensorboards-web-app/0*       active    idle   10.244.0.25                 
training-operator/0*          active    idle   10.244.1.29

Relevant Log Output

test_notebooks.py::test_notebook[training-integration] 
-------------------------------- live log call ---------------------------------
INFO     test_notebooks:test_notebooks.py:44 Running training-integration.ipynb...
ERROR    test_notebooks:test_notebooks.py:58 Cell In[4], line 8, in assert_job_succeeded(client, job_name, job_kind)
      1 @retry(
      2     wait=wait_exponential(multiplier=2, min=1, max=30),
      3     stop=stop_after_attempt(50),
      4     reraise=True,
      5 )
      6 def assert_job_succeeded(client, job_name, job_kind):
      7     """Wait for the Job to complete successfully."""
----> 8     assert client.is_job_succeeded(
      9         name=job_name, job_kind=job_kind
     10     ), f"Job {job_name} was not successful."
AssertionError: Job pytorch-dist-mnist-gloo was not successful.
FAILED                                                                   [100%]

=================================== FAILURES ===================================
_______________________ test_notebook[katib-integration] _______________________

test_notebook = '/tests/.worktrees/4ca5f8e7474193b125daecbd2dc157f3fe1ab017/tests/notebooks/katib/katib-integration.ipynb'

    @pytest.mark.ipynb
    @pytest.mark.parametrize(
        # notebook - ipynb file to execute
        "test_notebook",
        NOTEBOOKS.values(),
        ids=NOTEBOOKS.keys(),
    )
    def test_notebook(test_notebook):
        """Test Notebook Generic Wrapper."""
        os.chdir(os.path.dirname(test_notebook))
    
        with open(test_notebook) as nb:
            notebook = nbformat.read(nb, as_version=nbformat.NO_CONVERT)
    
        ep = ExecutePreprocessor(
            timeout=-1, kernel_name="python3", on_notebook_start=install_python_requirements
        )
        ep.skip_cells_with_tag = "pytest-skip"
    
        try:
            log.info(f"Running {os.path.basename(test_notebook)}...")
            output_notebook, _ = ep.preprocess(notebook, {"metadata": {"path": "./"}})
            # persist the notebook output to the original file for debugging purposes
            save_notebook(output_notebook, test_notebook)
        except CellExecutionError as e:
            # handle underlying error
            pytest.fail(f"Notebook execution failed with {e.ename}: {e.evalue}")
    
        for cell in output_notebook.cells:
            metadata = cell.get("metadata", dict)
            if "raises-exception" in metadata.get("tags", []):
                for cell_output in cell.outputs:
                    if cell_output.output_type == "error":
                        # extract the error message from the cell output
                        log.error(format_error_message(cell_output.traceback))
>                       pytest.fail(cell_output.traceback[-1])
E                       Failed: AssertionError: Katib Experiment was not successful.
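
The retry settings in the traceback above (`wait_exponential(multiplier=2, min=1, max=30)` with `stop_after_attempt(50)`) put an upper bound on how long the test waits for the job, which helps when judging whether the job is merely slower on AKS 1.28 or actually failing. The sketch below estimates that bound in plain Python without importing tenacity. It assumes the wait before retry n is `multiplier * 2**(n - 1)` clamped to `[min, max]`; the exact formula can differ between tenacity versions, and the time spent inside each `is_job_succeeded` call is not counted. The function names are hypothetical.

```python
def exponential_wait(attempt: int, multiplier: float = 2,
                     min_wait: float = 1, max_wait: float = 30) -> float:
    """Assumed model: multiplier * 2**(attempt - 1), clamped to [min_wait, max_wait]."""
    return max(min_wait, min(multiplier * 2 ** (attempt - 1), max_wait))


def total_wait(attempts: int = 50) -> float:
    """Sum the sleeps between attempts; no sleep follows the final attempt."""
    return sum(exponential_wait(n) for n in range(1, attempts))


# First few sleeps, then the overall budget in minutes.
print([exponential_wait(n) for n in range(1, 7)])
print(total_wait() / 60)
```

Under these assumptions the test sleeps for about 23 minutes in total before the assertion is re-raised, so a job that needs longer than that on a fresh AKS cluster would also be reported as "not successful".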

Additional Context

No response

orfeas-k added the bug label May 9, 2024

Thank you for reporting your feedback to us!

The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-5650.

This message was autogenerated
