Polyaxon can't get plxlogs for pytorchjob in dashboard #1505
Comments
What version are you using?
1.18.0
Please run:
Also, please report which version of the training-operator you are using.
Current cli version: 1.18.0
Platform version: KEY 182a99f9ce4853e38cc57644500fd73a
Compatibility versions: cli {"min": "1.10.0", "latest": "1.18.2"}
New version of Polyaxon CLI (1.18.2) is now available. To upgrade run:
We will try to reproduce the problem and report back.
-- Also, please report which version of the training-operator you are using.
Yes, I assumed you are using the latest version of polyaxon/trainingjobs.
helm search repo polyaxon
Where can I find the source code of the polyaxon-operator?
Is there any feedback?
@polyaxon-team Have you reproduced the problem?
Is there any feedback?
We could not reproduce the issue. I think you are using the default artifacts store (https://polyaxon.com/docs/setup/connections/artifacts/#default-behavior), which is a temporary host path. That will not work on a multi-node cluster.
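For reference, a persistent artifacts store is usually declared in the Helm deployment values along the lines below. This is only a minimal sketch: it assumes a pre-existing PersistentVolumeClaim backed by storage that every node can reach, and the names `shared-artifacts` and `plx-artifacts-pvc` and the mount path are hypothetical placeholders. The exact fields should be checked against the connections documentation linked above.

```yaml
# Hypothetical deployment values; every name and path here is a placeholder.
artifactsStore:
  name: shared-artifacts            # assumed connection name
  kind: volume_claim                # use a PVC instead of the default temp host path
  schema:
    mountPath: "/plx-artifacts"     # assumed mount path inside the pods
    volumeClaim: "plx-artifacts-pvc"  # assumed PVC on storage shared across nodes
```

With a store like this, logs archived when a run finishes end up on storage that the API and streams services can read from any node.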
I have configured the artifactsStore, which is an NFS directory.
namespace: polyaxon
The following are Polyaxon's pods; I don't know if I am missing any.
When I run the pytorchjob, I can't get the plxlogs in the dashboard after the job has finished.
But if I click the logs button for this job in the dashboard before the job finishes, the plxlogs are collected.
If I run a regular job, the plxlogs work normally.
YAML config:
version: 1
kind: component
tags: [examples, pytorch, kubeflow]
run:
  kind: pytorchjob
  master:
    replicas: 1
    init:
      - git: {"url": "https://github.com/polyaxon/polyaxon-examples"}
    container:
      image: pytorch/pytorch:1.0-cuda10.0-cudnn7-runtime
      command: ["sh", "-c", "python -u {{ globals.artifacts_path }}/polyaxon-examples/in_cluster/kubeflow/pytorchjob/mnist.py"]
      resources:
        requests:
          nvidia.com/gpu: 1
  worker:
    replicas: 1
    init:
      - git: {"url": "https://github.com/polyaxon/polyaxon-examples"}
    container:
      image: pytorch/pytorch:1.0-cuda10.0-cudnn7-runtime
      command: ["sh", "-c", "python -u {{ globals.artifacts_path }}/polyaxon-examples/in_cluster/kubeflow/pytorchjob/mnist.py"]
      resources:
        requests:
          nvidia.com/gpu: 1