Fix install_gpu_driver.sh failures in rocky 2.0 and 2.1 images #1116

Open
wants to merge 4 commits into
base: master
from

Conversation

bcheena

@bcheena bcheena commented Dec 1, 2023

  1. install_gpu_driver.sh init script fails with the following error in rocky 2.0 and 2.1 images:
++ dnf -y -q update
Error: 
 Problem: The operation would result in removing the following protected packages: systemd

We should exclude systemd from dnf update.
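
One way to express that exclusion is dnf's `--exclude` option, which leaves matching packages out of the transaction. The following is a minimal sketch, not the change in this PR: the helper name `update_cmd` is hypothetical, the `systemd*` glob (which also covers subpackages such as systemd-libs) is an assumption, and the function only prints the command so it can be reviewed before being run.

```shell
# Hypothetical helper: prints the update command instead of running it.
# --exclude is a standard dnf option; the default glob is an assumption.
update_cmd() {
  exclude_glob="${1:-systemd*}"
  printf 'dnf -y -q update --exclude=%s\n' "$exclude_glob"
}

update_cmd   # prints: dnf -y -q update --exclude=systemd*
```

When the command is run directly in a script, the glob should stay quoted (`--exclude='systemd*'`) so the shell does not expand it against files in the working directory.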

  2. In 2.0 rocky, the available version of kernel-devel is 4.18.0-513.9.1.el8_9.x86_64 which does not match the running kernel version 4.18.0-477.27.1.el8_8.x86_64.
++ dnf -y -q install kernel-devel-4.18.0-477.27.1.el8_8.x86_64
Error: Unable to find a match: kernel-devel-4.18.0-477.27.1.el8_8.x86_64

There should be a condition to check if the kernel needs to be upgraded.
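
Such a condition could be as small as a version comparison. The sketch below assumes the running version comes from `uname -r` and that the script has some way to look up the kernel-devel version offered by the repos; the function name `needs_kernel_upgrade` is hypothetical, and the two version strings are the ones quoted above.

```shell
# Hypothetical helper: succeeds (exit 0) when the running kernel differs from
# the kernel-devel version available in the repos.
needs_kernel_upgrade() {
  running="$1"     # e.g. the output of: uname -r
  available="$2"   # version-release.arch of the available kernel-devel package
  [ "$running" != "$available" ]
}

if needs_kernel_upgrade "4.18.0-477.27.1.el8_8.x86_64" "4.18.0-513.9.1.el8_9.x86_64"; then
  echo "kernel-devel does not match the running kernel"
fi
```

With the versions from the error above the check succeeds, so the script would know that a plain `dnf install kernel-devel-$(uname -r)` cannot work.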

upgrade_kernel method has been taken from https://github.com/GoogleCloudDataproc/initialization-actions/blob/master/spark-rapids/spark-rapids.sh#L474.

@bcheena
Author

bcheena commented Dec 1, 2023

/gcbrun

@bcheena
Author

bcheena commented Dec 1, 2023

/gcbrun

@bcheena
Author

bcheena commented Dec 1, 2023

/gcbrun

@cjac
Contributor

cjac commented Dec 1, 2023

/gcbrun

Contributor

@cjac cjac left a comment

This feels unsafe. I stopped working on this a while ago because it felt flimsy...

# Get latest version available in repos
if [[ "${OS_NAME}" == "debian" ]]; then
  apt-get -qq update
  TARGET_VERSION=$(apt-cache show --no-all-versions linux-image-amd64 | awk '/^Version/ {print $2}')
Contributor

Perhaps allow the user to specify a target version via a metadata value?
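
A minimal sketch of what that could look like, assuming the image ships the `/usr/share/google/get_metadata_value` helper. The metadata key `kernel-target-version` is hypothetical, and `METADATA_TOOL` exists only so the lookup can be exercised off-cluster.

```shell
# Path of the metadata helper (assumed present on the image); overridable for local testing.
METADATA_TOOL="${METADATA_TOOL:-/usr/share/google/get_metadata_value}"

# Print the value of a metadata attribute, or the given default if the lookup fails.
get_metadata_attribute() {
  attribute_name="$1"
  default_value="$2"
  "$METADATA_TOOL" "attributes/${attribute_name}" 2>/dev/null || printf '%s' "$default_value"
}

# "kernel-target-version" is a hypothetical key; an empty result would mean
# "fall back to the latest version available in the repos".
TARGET_VERSION="$(get_metadata_attribute kernel-target-version "")"
```

Keeping the default empty leaves the current behaviour unchanged for users who do not set the key.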

Contributor

Have you taken a look at this? It does look like some of your code mirrors the work I did in this incomplete patch...

https://github.com/cjac/initialization-actions/blob/change-kernel-version-202211/kernel/

@cjac
Contributor

cjac commented Dec 1, 2023

upgrade_kernel method has been taken from https://github.com/GoogleCloudDataproc/initialization-actions/blob/master/spark-rapids/spark-rapids.sh#L474.

oh gosh. I forgot that we put that into production. So sketchy...

@bcheena
Author

bcheena commented Dec 4, 2023

Thanks for your comments, @cjac! I had assumed that the upgrade_kernel() function in https://github.com/GoogleCloudDataproc/initialization-actions/blob/master/spark-rapids/spark-rapids.sh#L474 was working as intended. I now see that it reruns all startup scripts and initialization actions and might leave the cluster in an unexpected state.

I tried creating a 2.0-rocky8 cluster today (Dec 4), and the running kernel version had somehow already been upgraded to 4.18.0-513.9.1.el8_9.x86_64. A workaround for now is to skip the upgrade_kernel method altogether, but we should definitely revisit this properly later by adding checks in the agent to skip the startup script if it has already run once.
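
To illustrate the run-once idea, here is a script-level sketch using a marker file. It is not the agent-side check: the marker path and the `run_once` name are hypothetical.

```shell
# Hypothetical marker location; it must be on a disk that survives reboots.
MARKER_FILE="${MARKER_FILE:-/var/lib/init-action-kernel-upgrade.done}"

# Run the given command only if the marker does not exist yet,
# and create the marker after the command succeeds.
run_once() {
  if [ -e "$MARKER_FILE" ]; then
    echo "already ran; skipping"
    return 0
  fi
  "$@" && touch "$MARKER_FILE"
}
```

A step that reboots the machine would have to write the marker before rebooting, since nothing after the reboot call runs.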

@bcheena
Author

bcheena commented Dec 4, 2023

/gcbrun

@bcheena
Author

bcheena commented Dec 4, 2023

dataproc-initialization-actions-presubmit-pr seems to be failing with an unrelated error.

It looks like gcloud config get-value project is unable to fetch the project ID cloud-dataproc-ci. I'm not sure what changed, but I can see one more PR failing with the same error.

Step #7 - "dataproc-2.0-ubuntu18-tests": 2023-12-04T12:13:59.424567803Z ==================== Test output for //gpu:test_gpu (shard 3 of 15):
Step #7 - "dataproc-2.0-ubuntu18-tests": 2023-12-04T12:13:59.424574406Z Running tests under Python 3.8.10: /usr/bin/python3
Step #7 - "dataproc-2.0-ubuntu18-tests": 2023-12-04T12:13:59.424581292Z [  FAILED  ] setUpClass (__main__.NvidiaGpuDriverTestCase)
Step #7 - "dataproc-2.0-ubuntu18-tests": 2023-12-04T12:13:59.424588473Z ======================================================================
Step #7 - "dataproc-2.0-ubuntu18-tests": 2023-12-04T12:13:59.424595301Z ERROR: setUpClass (__main__.NvidiaGpuDriverTestCase)
Step #7 - "dataproc-2.0-ubuntu18-tests": 2023-12-04T12:13:59.424602035Z ----------------------------------------------------------------------
Step #7 - "dataproc-2.0-ubuntu18-tests": 2023-12-04T12:13:59.424608389Z Traceback (most recent call last):
Step #7 - "dataproc-2.0-ubuntu18-tests": 2023-12-04T12:13:59.424615416Z   File "/home/ia-tests/.cache/bazel/_bazel_ia-tests/83b1ae36bb04ea5432b9efccee83c25f/execroot/__main__/bazel-out/k8-fastbuild/bin/gpu/test_gpu.runfiles/__main__/integration_tests/dataproc_test_case.py", line 62, in setUpClass
Step #7 - "dataproc-2.0-ubuntu18-tests": 2023-12-04T12:13:59.424622487Z     assert cls.PROJECT
Step #7 - "dataproc-2.0-ubuntu18-tests": 2023-12-04T12:13:59.424629333Z AssertionError
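
To check this theory outside the test harness, a small guard like the one below could be run in the CI environment. It assumes the project really does come from `gcloud config get-value project`, which the log does not show; the function name `require_project` is hypothetical.

```shell
# Hypothetical helper: fail loudly when no project ID was resolved.
require_project() {
  project="$1"
  if [ -z "$project" ]; then
    echo "no default project configured" >&2
    return 1
  fi
  printf '%s\n' "$project"
}

# Usage in the CI environment:
#   require_project "$(gcloud config get-value project 2>/dev/null)"
```

An empty result here would match the `assert cls.PROJECT` failure in the traceback above.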

@cjac
Contributor

cjac commented Dec 13, 2023 via email

Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants