Fix install_gpu_driver.sh failures in rocky 2.0 and 2.1 images #1116
base: master
/gcbrun
This feels unsafe. I stopped working on this a while ago because it felt flimsy...
gpu/install_gpu_driver.sh
Outdated
# Get latest version available in repos
if [[ "${OS_NAME}" == "debian" ]]; then
  apt-get -qq update
  TARGET_VERSION=$(apt-cache show --no-all-versions linux-image-amd64 | awk '/^Version/ {print $2}')
Perhaps allow the user to specify a target version with a metadata value?
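One way to read this suggestion, as a minimal sketch only: look up an optional instance metadata attribute and fall back to the repository's latest version when it is unset. The attribute name `kernel-target-version` and the helpers `fetch_metadata` and `resolve_target_version` are hypothetical and do not come from this PR; the lookup assumes the standard Compute Engine metadata server endpoint.

```shell
#!/usr/bin/env bash

# Hypothetical helper: fetch an instance metadata attribute. Prints nothing
# if the attribute is unset or the metadata server cannot be reached.
fetch_metadata() {
  local key="$1"
  curl -sf --connect-timeout 2 --max-time 5 \
    -H 'Metadata-Flavor: Google' \
    "http://metadata.google.internal/computeMetadata/v1/instance/attributes/${key}" \
    2>/dev/null || true
}

# Prefer a user-supplied version; otherwise use the default passed in
# (for example, the latest version reported by the package repository).
resolve_target_version() {
  local default_version="$1"
  local requested
  # The attribute name below is an assumption made for illustration.
  requested="$(fetch_metadata 'kernel-target-version')"
  echo "${requested:-${default_version}}"
}
```

Under these assumptions, the script could pass the value it already computes from `apt-cache` as the default, so clusters that do not set the attribute would behave as before.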
Have you taken a look at this? It does look like some of your code mirrors the work I did in this incomplete patch...
https://github.com/cjac/initialization-actions/blob/change-kernel-version-202211/kernel/
Oh gosh. I forgot that we put that into production. So sketchy...
Thanks for your comments @cjac! I assumed that the upgrade_kernel() function in https://github.com/GoogleCloudDataproc/initialization-actions/blob/master/spark-rapids/spark-rapids.sh#L474 was working as intended. I now see that it reruns all startup scripts and initialization actions and might leave the cluster in an unexpected state. I tried creating a 2.0-rocky8 cluster today (Dec 4), and somehow the running kernel version had already been upgraded to
/gcbrun
Hey there Cheena,
I've been talking with Gregory from Rocky. I think I should set up a call
with them, Nvidia, and some representatives from the Dataproc team to
discuss the problem.
I hope this helps me to remember to set it up!
C.J.
On Mon, Dec 4, 2023, 06:55 Cheena Budhiraja ***@***.***> wrote:
dataproc-initialization-actions-presubmit-pr seems to be failing with an
unrelated error.
Looks like gcloud config get-value project is unable to fetch the
project-id cloud-dataproc-ci? Not sure what changed - I can see one more
PR failing with the same error.
Step #7 - "dataproc-2.0-ubuntu18-tests": 2023-12-04T12:13:59.424567803Z ==================== Test output for //gpu:test_gpu (shard 3 of 15):
Step #7 - "dataproc-2.0-ubuntu18-tests": 2023-12-04T12:13:59.424574406Z Running tests under Python 3.8.10: /usr/bin/python3
Step #7 - "dataproc-2.0-ubuntu18-tests": 2023-12-04T12:13:59.424581292Z [ FAILED ] setUpClass (__main__.NvidiaGpuDriverTestCase)
Step #7 - "dataproc-2.0-ubuntu18-tests": 2023-12-04T12:13:59.424588473Z ======================================================================
Step #7 - "dataproc-2.0-ubuntu18-tests": 2023-12-04T12:13:59.424595301Z ERROR: setUpClass (__main__.NvidiaGpuDriverTestCase)
Step #7 - "dataproc-2.0-ubuntu18-tests": 2023-12-04T12:13:59.424602035Z ----------------------------------------------------------------------
Step #7 - "dataproc-2.0-ubuntu18-tests": 2023-12-04T12:13:59.424608389Z Traceback (most recent call last):
Step #7 - "dataproc-2.0-ubuntu18-tests": 2023-12-04T12:13:59.424615416Z File "/home/ia-tests/.cache/bazel/_bazel_ia-tests/83b1ae36bb04ea5432b9efccee83c25f/execroot/__main__/bazel-out/k8-fastbuild/bin/gpu/test_gpu.runfiles/__main__/integration_tests/dataproc_test_case.py", line 62, in setUpClass
Step #7 - "dataproc-2.0-ubuntu18-tests": 2023-12-04T12:13:59.424622487Z assert cls.PROJECT
Step #7 - "dataproc-2.0-ubuntu18-tests": 2023-12-04T12:13:59.424629333Z AssertionError
We should exclude systemd from dnf update.
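A minimal sketch of what that exclusion could look like, using dnf's `--exclude` option. The function name `update_excluding_systemd`, the optional dry-run argument, and the `systemd*` glob (which also matches related packages such as `systemd-libs`) are assumptions made for illustration, not code from this PR.

```shell
#!/usr/bin/env bash

# Update all packages except those matching systemd*.
# Pass "echo" as the first argument to print the command instead of running it.
update_excluding_systemd() {
  local runner="${1:-}"
  # The glob is quoted so the shell hands it to dnf unexpanded.
  ${runner} dnf -y -q update --exclude='systemd*'
}
```

Calling `update_excluding_systemd echo` prints the command without touching the system, which makes it easy to check before running it on a cluster node.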
4.18.0-513.9.1.el8_9.x86_64, which does not match the running kernel version 4.18.0-477.27.1.el8_8.x86_64. There should be a condition to check if the kernel needs to be upgraded.
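Such a condition might be sketched as follows, assuming GNU `sort -V` is available and that its version ordering is close enough to the package manager's for kernel release strings. The function name `kernel_needs_upgrade` is hypothetical, and this is not the check the PR implements.

```shell
#!/usr/bin/env bash

# Succeeds (exit 0) only when the target release is strictly newer than the
# running one according to GNU sort's version ordering (-V).
kernel_needs_upgrade() {
  local running="$1" target="$2"
  # Identical releases never need an upgrade.
  [[ "${running}" == "${target}" ]] && return 1
  local newest
  newest="$(printf '%s\n%s\n' "${running}" "${target}" | sort -V | tail -n 1)"
  [[ "${newest}" == "${target}" ]]
}
```

A caller could then guard the upgrade with something like `if kernel_needs_upgrade "$(uname -r)" "${TARGET_VERSION}"; then upgrade_kernel; fi`, so clusters whose kernel is already current skip the reboot path.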
The upgrade_kernel method has been taken from https://github.com/GoogleCloudDataproc/initialization-actions/blob/master/spark-rapids/spark-rapids.sh#L474.