MPI-Operator run example failed #598

Open
q443048756 opened this issue Oct 25, 2023 · 8 comments
@q443048756

q443048756 commented Oct 25, 2023

I set up mpi-operator v0.4.0

and tried to deploy the example:
mpi-operator-0.4.0/examples/v2beta1/tensorflow-benchmarks/tensorflow-benchmarks.yaml

My k8s cluster has three nodes, and each node has a 3060 graphics card,

but it does not seem to run correctly:
1. Using the default configuration, I don't see any pods starting, so it appears to fail.

apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: tensorflow-benchmarks
spec:
  slotsPerWorker: 1
  runPolicy:
    cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
          - image: mpioperator/tensorflow-benchmarks:latest
            name: tensorflow-benchmarks
            command:
            - mpirun
            - --allow-run-as-root
            - -np
            - "2"
            - -bind-to
            - none
            - -map-by
            - slot
            - -x
            - NCCL_DEBUG=INFO
            - -x
            - LD_LIBRARY_PATH
            - -x
            - PATH
            - -mca
            - pml
            - ob1
            - -mca
            - btl
            - ^openib
            - python
            - scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py
            - --model=resnet101
            - --batch_size=64
            - --variable_update=horovod
    Worker:
      replicas: 2
      template:
        spec:
          containers:
          - image: mpioperator/tensorflow-benchmarks:latest
            name: tensorflow-benchmarks
            resources:
              limits:
                nvidia.com/gpu: 1
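
(A minimal sketch of how to see why no pods are created; the label and commands below are assumptions based on the v2beta1 controller, adjust them to your setup:)

# Inspect the MPIJob status and conditions reported by the operator
kubectl describe mpijob tensorflow-benchmarks

# List any pods the operator created for this job
# (alternatively: kubectl get pods | grep tensorflow-benchmarks)
kubectl get pods -l training.kubeflow.org/job-name=tensorflow-benchmarks -o wide

# Recent events often show why scheduling or pod creation failed
kubectl get events --sort-by=.metadata.creationTimestamp | tail -n 20
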
2. With replicas: 1, the pods start normally, so I suspect the job cannot use GPUs across nodes.

apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: tensorflow-benchmarks
spec:
  slotsPerWorker: 1
  runPolicy:
    cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
          - image: mpioperator/tensorflow-benchmarks:latest
            name: tensorflow-benchmarks
            command:
            - mpirun
            - --allow-run-as-root
            - -np
            - "1"
            - -bind-to
            - none
            - -map-by
            - slot
            - -x
            - NCCL_DEBUG=INFO
            - -x
            - LD_LIBRARY_PATH
            - -x
            - PATH
            - -mca
            - pml
            - ob1
            - -mca
            - btl
            - ^openib
            - python
            - scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py
            - --model=resnet101
            - --batch_size=64
            - --variable_update=horovod
    Worker:
      replicas: 1
      template:
        spec:
          containers:
          - image: mpioperator/tensorflow-benchmarks:latest
            name: tensorflow-benchmarks
            resources:
              limits:
                nvidia.com/gpu: 1

3. After the pods start, the launcher reports an error:
2023-10-25 09:53:08.464568: E tensorflow/c/c_api.cc:2184] Internal: CUDA runtime implicit initialization on GPU:0 failed. Status: device kernel image is invalid
Traceback (most recent call last):
  File "scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py", line 73, in <module>
    app.run(main)  # Raises error on invalid flags, unlike tf.app.run()
  File "/usr/local/lib/python3.7/dist-packages/absl/app.py", line 300, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.7/dist-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py", line 61, in main
    params = benchmark_cnn.setup(params)
  File "/tensorflow/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py", line 3538, in setup
    with tf.Session(config=create_config_proto(params)) as sess:
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/client/session.py", line 1586, in __init__
    super(Session, self).__init__(target, graph, config=config)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/client/session.py", line 701, in __init__
    self._session = tf_session.TF_NewSessionRef(self._graph._c_graph, opts)
tensorflow.python.framework.errors_impl.InternalError: CUDA runtime implicit initialization on GPU:0 failed. Status: device kernel image is invalid

Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.


mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

Process name: [[12892,1],0]
Exit code: 1
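
(For context: "device kernel image is invalid" usually means the CUDA kernels baked into the image were not built for the GPU's compute capability; the 3060 is an Ampere card, which generally needs a CUDA 11.1+ / matching TensorFlow build. A rough way to compare the two sides, as a sketch; the image tag is the one from the example, everything else is an assumption:)

# On a GPU node: confirm the driver sees the card and note the driver/CUDA version it reports
nvidia-smi

# Check which TensorFlow build ships in the benchmark image
docker run --rm mpioperator/tensorflow-benchmarks:latest \
    python -c "import tensorflow as tf; print(tf.__version__)"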

@tenzen-y
Member

Does the example without GPUs work fine?

https://github.com/kubeflow/mpi-operator/blob/master/examples/v2beta1/pi/pi.yaml

@alculquicondor
Collaborator

Alternatively, did you install the nvidia drivers on the nodes?
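
For example, a quick way to confirm the drivers and the NVIDIA device plugin are visible to Kubernetes (a sketch; the kube-system namespace is an assumption and depends on how the device plugin was installed):

# Each GPU node should advertise nvidia.com/gpu under Capacity and Allocatable
kubectl describe nodes | grep -i "nvidia.com/gpu"

# The device plugin (or gpu-operator) pods should be running
kubectl get pods -n kube-system | grep -i nvidia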

@q443048756
Author

Does the example without GPUs work fine?

https://github.com/kubeflow/mpi-operator/blob/master/examples/v2beta1/pi/pi.yaml
Earlier I used other methods to test GPU availability. The first point I mentioned is that the job cannot use GPUs across multiple nodes: with replicas: 2 the pods cannot start, but with replicas: 1 the pods start and run normally.

https://github.com/kubeflow/mpi-operator/blob/master/examples/v2beta1/pi/pi.yaml: this example runs successfully.

@q443048756
Author

Alternatively, did you install the nvidia drivers on the nodes?

Yes, I installed them and have tested them; the GPUs work normally. The first point I mentioned is that the job cannot use GPUs across multiple nodes: with replicas: 2 the pods cannot start, but with replicas: 1 the pods start and run normally.

@alculquicondor
Collaborator

Uhm... interesting. It sounds like a networking problem that you need to work out with your provider, though. I don't think it's related to mpi-operator.

@kuizhiqing
Member

@q443048756 It is likely that the CUDA version in the default image mpioperator/tensorflow-benchmarks:latest is NOT compatible with your local environment. I suggest you find the right base image from https://hub.docker.com/r/nvidia/cuda and build your own.
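
(A small compatibility check before rebuilding, as a sketch: the tag 11.8.0-base-ubuntu22.04 is only an example, and --gpus all assumes the NVIDIA container toolkit is installed on the node:)

# Verify the node driver can actually run the chosen CUDA base before building on it
docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi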

@tenzen-y
Member

@q443048756 It is likely that the CUDA version in the default image mpioperator/tensorflow-benchmarks:latest is NOT compatible with your local environment. I suggest you find the right base image from https://hub.docker.com/r/nvidia/cuda and build your own.

That sounds reasonable. Thank you for the help.

@wang-mask
Contributor

I hit the same failure; maybe it is time to update the example.
