Polyaxon can't get plxlogs for pytorchjob in dashboard #1505

hongqing1986 opened this issue Jun 6, 2022 · 15 comments
@hongqing1986

When I run a pytorchjob, I can't get the plxlogs in the dashboard after the job has finished.

However, if I click the job's logs button in the dashboard before the job finishes, I can collect the plxlogs.

If I run a regular job, the plxlogs work normally.

YAML config:

version: 1
kind: component
tags: [examples, pytorch, kubeflow]
run:
  kind: pytorchjob
  master:
    replicas: 1
    init:
      - git: {"url": "https://github.com/polyaxon/polyaxon-examples"}
    container:
      image: pytorch/pytorch:1.0-cuda10.0-cudnn7-runtime
      command: ["sh", "-c", "python -u {{ globals.artifacts_path }}/polyaxon-examples/in_cluster/kubeflow/pytorchjob/mnist.py"]
      resources:
        requests:
          nvidia.com/gpu: 1
  worker:
    replicas: 1
    init:
      - git: {"url": "https://github.com/polyaxon/polyaxon-examples"}
    container:
      image: pytorch/pytorch:1.0-cuda10.0-cudnn7-runtime
      command: ["sh", "-c", "python -u {{ globals.artifacts_path }}/polyaxon-examples/in_cluster/kubeflow/pytorchjob/mnist.py"]
      resources:
        requests:
          nvidia.com/gpu: 1

@polyaxon-team
Contributor

What version are you using?

@hongqing1986
Author

1.18.0

@polyaxon-team
Contributor

Please run:

polyaxon version --check

Also, please report which version of the training-operator you are using.

@hongqing1986
Author

Current cli version: 1.18.0

Platform version:

KEY      182a99f9ce4853e38cc57644500fd73a
VERSION  1.18.1
DIST     ce

Compatibility versions:

cli       {"min": "1.10.0", "latest": "1.18.2"}
platform  {"min": "1.10.0", "latest": "1.18.2"}
agent     {"min": "1.10.0", "latest": "1.18.2"}
ui        {"min": "1.10.0", "latest": "1.18.2"}

New version of Polyaxon CLI (1.18.2) is now available. To upgrade run:
pip install -U polyaxon
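
For readers following along, the compatibility table in the output above can be read as a simple range check per component. The sketch below is illustrative only and is not how `polyaxon version --check` is implemented; the function names `parse` and `check` are hypothetical, and it assumes plain `major.minor.patch` version strings.

```python
from typing import Tuple


def parse(version: str) -> Tuple[int, ...]:
    """Turn '1.18.0' into (1, 18, 0) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def check(current: str, minimum: str, latest: str) -> str:
    """Classify a version against a {"min": ..., "latest": ...} range."""
    cur, low, high = parse(current), parse(minimum), parse(latest)
    if cur < low:
        return "unsupported"
    if cur < high:
        return "supported, upgrade available"
    return "up to date"


# Values copied from the CLI output in this comment.
cli_status = check("1.18.0", "1.10.0", "1.18.2")
platform_status = check("1.18.1", "1.10.0", "1.18.2")
print("cli:", cli_status)
print("platform:", platform_status)
```

Under these assumptions, both the CLI (1.18.0) and the platform (1.18.1) fall inside the supported range, so the version gap alone would not obviously explain the missing logs.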

@polyaxon-team
Contributor

We will try to reproduce the problem and report back.

@hongqing1986
Author

-- Also, please report which version of the training-operator you are using.
How can I find the training-operator version?
I just used Helm to install trainingjobs.

@polyaxon-team
Contributor

Yes, I assumed you were using the latest version of polyaxon/trainingjobs, 1.15.1. You can check that using:

helm search repo polyaxon

@hongqing1986
Author

helm search repo polyaxon
NAME                        CHART VERSION  APP VERSION  DESCRIPTION
polyaxon/polyaxon           1.18.1         1.18.1       An enterprise-grade open-source platform for bu...
polyaxon/agent              1.18.1         1.18.1       An enterprise-grade open-source platform for bu...
polyaxon/mpijob             1.4.0          1.4.0        Kubeflow MPIJob integration with Polyaxon
polyaxon/nfs-provisioner    0.4.1                       Polyaxon in-cluster NFS provisioner to simplify...
polyaxon/pytorchjob         1.4.0          1.4.0        Kubeflow PytorchJob integration with Polyaxon
polyaxon/tfjob              1.4.0          1.4.0        Kubeflow TFJob integration with Polyaxon
polyaxon/trainingjobs       1.15.1         1.15.1       Kubeflow training operators integration with Po...
polyaxon/trainingoperators  1.14.0         1.14.0       Kubeflow training operators integration with Po...

@hongqing1986
Author

Where is the source code for the polyaxon-operator?
Is it also in https://github.com/polyaxon/polyaxon?

@hongqing1986
Author

Is there any feedback?

@hongqing1986
Author

@polyaxon-team Have you reproduced the problem?

@hongqing1986
Author

Is there any feedback?

@polyaxon-team
Contributor

polyaxon-team commented Jun 13, 2022

I think you are using the default artifacts store (https://polyaxon.com/docs/setup/connections/artifacts/#default-behavior), which is a temporary host path. This will not work on a multi-node cluster.

We could not reproduce the issue.

@hongqing1986
Author

hongqing1986 commented Jun 13, 2022

I have configured the artifactsStore, which is an NFS directory:

namespace: polyaxon
rbac:
  enabled: true
postgresql:
  persistence:
    enabled: true
    existingClaim: polyaxon-pg-pvc
artifactsStore:
  name: artifacts-store
  kind: volume_claim
  schema:
    mountPath: "/artifacts-store"
    volumeClaim: "polyaxon-artifacts-pvc"
intervals:
  compatibilityCheck: -1
operators:
  tfjob: true
  pytorchjob: true
  mpijob: true
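
As a rough illustration of the distinction being discussed (a node-local default store versus a configured one), the check can be sketched as below. This is a hypothetical helper, not part of Polyaxon; the name `store_is_node_local` and the `NODE_LOCAL_KINDS` set are assumptions made for the example, and the values are passed as an already-parsed dictionary.

```python
# Hypothetical helper for illustration; not a Polyaxon API.
# Assumption: "host_path" is the only connection kind treated as node-local here.
NODE_LOCAL_KINDS = {"host_path"}


def store_is_node_local(values: dict) -> bool:
    """Return True if logs would land on a single node's filesystem."""
    store = values.get("artifactsStore")
    if store is None:
        # No store configured: per the maintainer's comment, the default
        # is a temporary host path, which is local to one node.
        return True
    return store.get("kind") in NODE_LOCAL_KINDS


# The relevant part of the deployment values shown above.
values = {
    "artifactsStore": {
        "name": "artifacts-store",
        "kind": "volume_claim",
        "schema": {
            "mountPath": "/artifacts-store",
            "volumeClaim": "polyaxon-artifacts-pvc",
        },
    }
}

print(store_is_node_local({}))      # default store
print(store_is_node_local(values))  # configured volume claim
```

Note that a `volume_claim` store is only shared across nodes if the underlying claim allows it (for example, NFS-backed storage mounted by pods on different nodes), which this sketch does not verify.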

@hongqing1986
Author

The following are the Polyaxon pods; I don't know whether I am missing any.

NAME                                         READY  STATUS
polyaxon-polyaxon-api-7476497d9f-zr6xl       1/1    Running
polyaxon-polyaxon-gateway-7fb978d658-rss9m   1/1    Running
polyaxon-polyaxon-operator-8987bb494-hzxsh   1/1    Running
polyaxon-polyaxon-streams-5bff76b98d-hlqww   1/1    Running
polyaxon-postgresql-0                        1/1    Running
trainingjobs-trainingjobs-7989c46878-s46wd   1/1    Running
