[QST] Status: CUDA driver version is insufficient for CUDA runtime version #1090

Open
dking21st opened this issue Dec 24, 2023 · 1 comment

❓ Questions & Help

I'm using the Merlin TensorFlow container to build a Docker image, but the job fails with this error:

[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO - Traceback (most recent call last):
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO -   File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO -     return _run_code(code, main_globals, None,
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO -   File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO -     exec(code, run_globals)
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO -   File "/ads_content/batch/scripts/ads/ads_content/preranking/train_ohouse_ads_content_merlin.py", line 15, in <module>
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO -     import merlin.models.tf as mm
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO -   File "/usr/local/lib/python3.8/dist-packages/merlin/models/tf/__init__.py", line 108, in <module>
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO -     from merlin.models.tf.models.retrieval import (
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO -   File "/usr/local/lib/python3.8/dist-packages/merlin/models/tf/models/retrieval.py", line 22, in <module>
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO -     from merlin.models.tf.prediction_tasks.retrieval import ItemRetrievalTask
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO -   File "/usr/local/lib/python3.8/dist-packages/merlin/models/tf/prediction_tasks/retrieval.py", line 33, in <module>
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO -     class ItemRetrievalTask(MultiClassClassificationTask):
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO -   File "/usr/local/lib/python3.8/dist-packages/merlin/models/tf/prediction_tasks/retrieval.py", line 70, in ItemRetrievalTask
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO -     DEFAULT_METRICS = TopKMetricsAggregator.default_metrics(top_ks=[10])
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO - [INFO]: sparse_operation_kit is imported
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO - [SOK INFO] Import /usr/local/lib/python3.8/dist-packages/merlin_sok-1.2.0-py3.8-linux-x86_64.egg/sparse_operation_kit/lib/libsok_experiment.so
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO - [SOK INFO] Import /usr/local/lib/python3.8/dist-packages/merlin_sok-1.2.0-py3.8-linux-x86_64.egg/sparse_operation_kit/lib/libsok_experiment.so
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO - [SOK INFO] Initialize finished, communication tool: horovod
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO -   File "/usr/local/lib/python3.8/dist-packages/merlin/models/tf/metrics/topk.py", line 491, in default_metrics
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO -     metrics.extend([RecallAt(k), MRRAt(k), NDCGAt(k), AvgPrecisionAt(k), PrecisionAt(k)])
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO -   File "/usr/local/lib/python3.8/dist-packages/merlin/models/tf/metrics/topk.py", line 362, in __init__
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO -     super().__init__(recall_at, k=k, pre_sorted=pre_sorted, name=name, **kwargs)
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO -   File "/usr/local/lib/python3.8/dist-packages/merlin/models/tf/metrics/topk.py", line 234, in __init__
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO -     super().__init__(name=name, **kwargs)
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO -   File "/usr/local/lib/python3.8/dist-packages/keras/dtensor/utils.py", line 144, in _wrap_function
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO -     init_method(instance, *args, **kwargs)
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO -   File "/usr/local/lib/python3.8/dist-packages/keras/metrics/base_metric.py", line 613, in __init__
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO -     super().__init__(
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO -   File "/usr/local/lib/python3.8/dist-packages/keras/metrics/base_metric.py", line 430, in __init__
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO -     self.total = self.add_weight("total", initializer="zeros")
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO -   File "/usr/local/lib/python3.8/dist-packages/keras/metrics/base_metric.py", line 366, in add_weight
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO -     return super().add_weight(
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO -   File "/usr/local/lib/python3.8/dist-packages/keras/engine/base_layer.py", line 712, in add_weight
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO -     variable = self._add_variable_with_custom_getter(
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO -   File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/trackable/base.py", line 489, in _add_variable_with_custom_getter
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO -     new_variable = getter(
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO -   File "/usr/local/lib/python3.8/dist-packages/keras/engine/base_layer_utils.py", line 134, in make_variable
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO -     return tf1.Variable(
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO -   File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/util/traceback_utils.py", line 153, in error_handler
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO -     raise e.with_traceback(filtered_tb) from None
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO -   File "/usr/local/lib/python3.8/dist-packages/keras/initializers/initializers.py", line 171, in __call__
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO -     return tf.zeros(shape, dtype)
[2023-12-22, 04:15:17 KST] {pod_manager.py:203} INFO - tensorflow.python.framework.errors_impl.InternalError: cudaGetDevice() failed. Status: CUDA driver version is insufficient for CUDA runtime version
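
The final line of the traceback means the CUDA driver visible to the process is older than what the CUDA runtime inside the image requires. One way to see what the driver reports from inside the container is to ask the driver library directly. The following is a minimal sketch, assuming a Linux container where the driver library is exposed as `libcuda.so.1`; the helper names `decode_cuda_version` and `driver_cuda_version` are hypothetical and not part of Merlin or TensorFlow.

```python
import ctypes


def decode_cuda_version(raw):
    """Split CUDA's integer encoding (1000*major + 10*minor) into (major, minor)."""
    return raw // 1000, (raw % 1000) // 10


def driver_cuda_version(lib_name="libcuda.so.1"):
    """Return the newest CUDA version the installed driver supports, or None.

    None means the driver library could not be loaded (for example, no GPU
    driver is mounted into the container) or the query itself failed.
    """
    try:
        libcuda = ctypes.CDLL(lib_name)
        raw = ctypes.c_int(0)
        # cuDriverGetVersion returns 0 (CUDA_SUCCESS) when it succeeds.
        status = libcuda.cuDriverGetVersion(ctypes.byref(raw))
    except (OSError, AttributeError):
        return None
    if status != 0:
        return None
    return decode_cuda_version(raw.value)


if __name__ == "__main__":
    print("Driver supports CUDA up to:", driver_cuda_version())
```

If this prints `None`, the container probably cannot see a GPU driver at all. If it prints a version lower than the CUDA version shipped in the image, the host driver likely needs an upgrade or an older image is needed.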

Details

Here is my Dockerfile:

FROM --platform=linux/amd64 nvcr.io/nvidia/merlin/merlin-tensorflow:23.06 as prod

WORKDIR /ads_content

COPY ./data-airflow .
COPY ./ads/images/requirements.txt .

WORKDIR /root

RUN pip install tf2onnx==1.15.1 
RUN pip install -r /ads_content/requirements.txt
RUN pip install requests "urllib3<2"

WORKDIR /ads_content

ENTRYPOINT ["python3"]

I'm trying to deploy a Merlin TF model training and AWS S3 upload job using the Airflow KubernetesPodOperator and a Docker image. I'm new to Docker and Airflow, so I'm running into a fair amount of trouble.
I think I kept my Dockerfile pretty simple. What am I doing wrong? Should I install cudf again on top of that base image, or is it something else?

dking21st added the question label on Dec 24, 2023
Contributor

rnyak commented Dec 28, 2023

@dking21st Hello. Can you please share the hardware specs, CUDA version, and driver version of your AWS instance? Are you able to see nvidia-smi output on that instance?
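
The requested details could be collected with a short script run on the instance or inside the pod. This is only a sketch: it assumes `nvidia-smi` is on the PATH, and the helper names `parse_gpu_report` and `query_gpus` are hypothetical.

```python
import subprocess


def parse_gpu_report(text):
    """Parse 'name, driver_version' CSV lines into a list of dicts."""
    gpus = []
    for line in text.strip().splitlines():
        # Split on the last comma so a GPU name containing commas stays whole.
        name, _, driver = line.rpartition(",")
        gpus.append({"name": name.strip(), "driver_version": driver.strip()})
    return gpus


def query_gpus():
    """Run nvidia-smi; return the parsed GPUs, or None if it is missing or fails."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=name,driver_version",
        "--format=csv,noheader",
    ]
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    except (FileNotFoundError, subprocess.CalledProcessError):
        return None
    return parse_gpu_report(result.stdout)


if __name__ == "__main__":
    gpus = query_gpus()
    if gpus is None:
        print("nvidia-smi is unavailable or failed; no GPU driver is visible here.")
    else:
        for gpu in gpus:
            print(gpu["name"], "driver", gpu["driver_version"])
```

Running it both on the host and inside the pod helps tell apart a host with an old driver from a pod that was scheduled without GPU access.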
