Polyaxon can't get plxlogs for pytorchjob in dashboard #1505

hongqing1986 · 2022-06-06T13:56:50Z

When I run a pytorchjob, I can't get the plxlogs in the dashboard after the job has finished.

But if I click the logs button for this job in the dashboard before the job finishes, the plxlogs are collected.

If I run a regular job, the plxlogs work as expected.

YAML config:

version: 1
kind: component
tags: [examples, pytorch, kubeflow]
run:
  kind: pytorchjob
  master:
    replicas: 1
    init:
      - git: {"url": "https://github.com/polyaxon/polyaxon-examples"}
    container:
      image: pytorch/pytorch:1.0-cuda10.0-cudnn7-runtime
      command: ["sh", "-c", "python -u {{ globals.artifacts_path }}/polyaxon-examples/in_cluster/kubeflow/pytorchjob/mnist.py"]
      resources:
        requests:
          nvidia.com/gpu: 1
  worker:
    replicas: 1
    init:
      - git: {"url": "https://github.com/polyaxon/polyaxon-examples"}
    container:
      image: pytorch/pytorch:1.0-cuda10.0-cudnn7-runtime
      command: ["sh", "-c", "python -u {{ globals.artifacts_path }}/polyaxon-examples/in_cluster/kubeflow/pytorchjob/mnist.py"]
      resources:
        requests:
          nvidia.com/gpu: 1

polyaxon-team · 2022-06-06T13:59:34Z

What version are you using?

hongqing1986 · 2022-06-06T14:01:25Z

1.18.0

polyaxon-team · 2022-06-06T14:03:15Z

Please run:

polyaxon version --check

Also, please report which version of the training-operator you are using.

hongqing1986 · 2022-06-06T14:05:02Z

Current cli version: 1.18.0

Platform version:

KEY 182a99f9ce4853e38cc57644500fd73a
VERSION 1.18.1
DIST ce

Compatibility versions:

cli {"min": "1.10.0", "latest": "1.18.2"}
platform {"min": "1.10.0", "latest": "1.18.2"}
agent {"min": "1.10.0", "latest": "1.18.2"}
ui {"min": "1.10.0", "latest": "1.18.2"}

New version of Polyaxon CLI (1.18.2) is now available. To upgrade run:
pip install -U polyaxon

polyaxon-team · 2022-06-06T14:08:35Z

We will try to reproduce the problem and report back.

hongqing1986 · 2022-06-06T14:08:59Z

-- Also please report what version of the training-operator you are using.
How can I get the training-operator version?
I just used helm to install trainingjobs.

polyaxon-team · 2022-06-06T14:12:12Z

Yes, I assume you are using the latest version of polyaxon/trainingjobs, 1.15.1. You can check that using:

helm search repo polyaxon

hongqing1986 · 2022-06-06T14:30:17Z

helm search repo polyaxon
NAME                        CHART VERSION  APP VERSION  DESCRIPTION
polyaxon/polyaxon           1.18.1         1.18.1       An enterprise-grade open-source platform for bu...
polyaxon/agent              1.18.1         1.18.1       An enterprise-grade open-source platform for bu...
polyaxon/mpijob             1.4.0          1.4.0        Kubeflow MPIJob integration with Polyaxon
polyaxon/nfs-provisioner    0.4.1                       Polyaxon in-cluster NFS provisioner to simplify...
polyaxon/pytorchjob         1.4.0          1.4.0        Kubeflow PytorchJob integration with Polyaxon
polyaxon/tfjob              1.4.0          1.4.0        Kubeflow TFJob integration with Polyaxon
polyaxon/trainingjobs       1.15.1         1.15.1       Kubeflow training operators integration with Po...
polyaxon/trainingoperators  1.14.0         1.14.0       Kubeflow training operators integration with Po...

hongqing1986 · 2022-06-06T14:37:20Z

Where is the source code for the polyaxon-operator?
Is it also in https://github.com/polyaxon/polyaxon?

hongqing1986 · 2022-06-07T07:14:38Z

Is there any feedback?

hongqing1986 · 2022-06-08T03:54:37Z

@polyaxon-team Have you reproduced the problem?

hongqing1986 · 2022-06-13T03:19:50Z

Is there any feedback?

polyaxon-team · 2022-06-13T10:24:13Z

I think you are using the default artifacts store (https://polyaxon.com/docs/setup/connections/artifacts/#default-behavior), which is a temporary host path. This will not work on a multi-node cluster.

We could not reproduce the issue.

hongqing1986 · 2022-06-13T12:56:43Z

I have configured the artifactsStore, which is an NFS directory.

namespace: polyaxon
rbac:
  enabled: true
postgresql:
  persistence:
    enabled: true
    existingClaim: polyaxon-pg-pvc
artifactsStore:
  name: artifacts-store
  kind: volume_claim
  schema:
    mountPath: "/artifacts-store"
    volumeClaim: "polyaxon-artifacts-pvc"
intervals:
  compatibilityCheck: -1
operators:
  tfjob: true
  pytorchjob: true
  mpijob: true

hongqing1986 · 2022-06-13T13:06:35Z

These are the Polyaxon pods. I don't know if I am missing any.
NAME                                         READY  STATUS
polyaxon-polyaxon-api-7476497d9f-zr6xl       1/1    Running
polyaxon-polyaxon-gateway-7fb978d658-rss9m   1/1    Running
polyaxon-polyaxon-operator-8987bb494-hzxsh   1/1    Running
polyaxon-polyaxon-streams-5bff76b98d-hlqww   1/1    Running
polyaxon-postgresql-0                        1/1    Running
trainingjobs-trainingjobs-7989c46878-s46wd   1/1    Running

polyaxon-team added the question label Jun 6, 2022
