Add debugging output to spack and ramble installations #2568

Merged
merged 2 commits into from
May 15, 2024

Conversation

cdunbar13
Contributor

@cdunbar13 cdunbar13 commented May 10, 2024

When the Spack and Ramble installations run and something fails, it is difficult to know which node to inspect, because the install happens on whichever node acquired the lock first. This is a first step toward making debugging easier.

This update records the hostname of the node that holds the lock in the lock directory. If a node times out waiting for the lock, it prints the contents of the lock directory, which include the hostname of the node that held the lock and failed to finish the install.
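
The mechanism described above can be sketched in Python as follows. This is an illustration only, not the PR's code, which lives in an Ansible playbook. The function names `try_acquire_lock` and `describe_lock` are hypothetical, and the sketch assumes the lock is a directory created atomically and that the holder is recorded as an empty marker file named after its host, as the sample output in this PR suggests.

```python
import os
import socket
from pathlib import Path


def try_acquire_lock(lock_dir: Path, hostname: str) -> bool:
    """Try to take the lock by creating lock_dir (hypothetical helper).

    Assumption: os.mkdir either creates the directory or raises
    FileExistsError, so only one caller can succeed.
    """
    try:
        os.mkdir(lock_dir)
    except FileExistsError:
        return False  # another node already holds the lock
    # Record the lock holder as an empty file named after the host.
    (lock_dir / hostname).touch()
    return True


def describe_lock(lock_dir: Path) -> list[str]:
    """List the entries in lock_dir, e.g. to print after a wait timeout."""
    return sorted(p.name for p in lock_dir.iterdir())


if __name__ == "__main__":
    # Illustrative path only; the real lock directory is set by the module.
    lock = Path("/tmp/.example_install_lock")
    if try_acquire_lock(lock, socket.gethostname()):
        print("lock acquired")
    else:
        print("lock held by:", describe_lock(lock))
```

A node that fails to get the lock can call `describe_lock` when its wait times out, so its own log names the host to inspect.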

After this PR is approved, the next step should be to capture the stderr of any command that fails, write it to a file in the lock directory, and print that file's contents in the rescue block of the Ansible playbook.
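
The proposed follow-up could look roughly like the Python sketch below. It is not part of this PR and not an Ansible implementation. The helper name `run_and_record` and the file name `stderr.log` are hypothetical and chosen only for illustration.

```python
import subprocess
from pathlib import Path


def run_and_record(cmd: list[str], lock_dir: Path,
                   log_name: str = "stderr.log") -> int:
    """Run cmd; if it fails, save its stderr in lock_dir (hypothetical helper).

    Returns the command's exit code. The log file is written only on failure.
    """
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        # Other nodes can print this file after they time out on the lock.
        (lock_dir / log_name).write_text(result.stderr)
    return result.returncode
```

Writing the file into the lock directory means the existing directory listing step would already reveal that an error log exists, and a rescue step could then print it.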

This was tested by deploying a blueprint that uses Spack and Ramble, waiting until one node had acquired the lock, then suspending that node, which forced the other nodes to time out and print the new debug messages.

Output from a failed run looks like:

May 10 13:57:17 ramble-test-0 google_metadata_script_runner: startup-script: TASK [Wait for lock] ***********************************************************
May 10 14:01:01 ramble-test-0 systemd: Started Session 4 of user root.
May 10 14:02:43 ramble-test-0 systemd: Starting GCE Workload Certificate refresh...
May 10 14:02:43 ramble-test-0 gce_workload_cert_refresh: 2024/05/10 14:02:43: Done
May 10 14:02:43 ramble-test-0 systemd: Started GCE Workload Certificate refresh.
May 10 14:07:17 ramble-test-0 google_metadata_script_runner: startup-script: fatal: [localhost]: FAILED! => {
May 10 14:07:17 ramble-test-0 google_metadata_script_runner: startup-script:     "changed": false,
May 10 14:07:17 ramble-test-0 google_metadata_script_runner: startup-script:     "elapsed": 600
May 10 14:07:17 ramble-test-0 google_metadata_script_runner: startup-script: }
May 10 14:07:17 ramble-test-0 google_metadata_script_runner: startup-script:
May 10 14:07:17 ramble-test-0 google_metadata_script_runner: startup-script: MSG:
May 10 14:07:17 ramble-test-0 google_metadata_script_runner: startup-script:
May 10 14:07:17 ramble-test-0 google_metadata_script_runner: startup-script: Timeout when waiting for file /shared/.install_spack_lock/done
May 10 14:07:17 ramble-test-0 google_metadata_script_runner: startup-script:
May 10 14:07:17 ramble-test-0 google_metadata_script_runner: startup-script: TASK [Timed out on lock, get install directory contents] ***********************
May 10 14:07:18 ramble-test-0 google_metadata_script_runner: startup-script: changed: [localhost]
May 10 14:07:18 ramble-test-0 google_metadata_script_runner: startup-script:
May 10 14:07:18 ramble-test-0 google_metadata_script_runner: startup-script: TASK [Print install directory contents with host that failed to install spack] ***
May 10 14:07:18 ramble-test-0 google_metadata_script_runner: startup-script: ok: [localhost] => {}
May 10 14:07:18 ramble-test-0 google_metadata_script_runner: startup-script:
May 10 14:07:18 ramble-test-0 google_metadata_script_runner: startup-script: MSG:
May 10 14:07:18 ramble-test-0 google_metadata_script_runner: startup-script:
May 10 14:07:18 ramble-test-0 google_metadata_script_runner: startup-script: total 8
May 10 14:07:18 ramble-test-0 google_metadata_script_runner: startup-script: -rw-r--r-- 1 root root    0 May 10 13:57 ramble-test-4
May 10 14:07:18 ramble-test-0 google_metadata_script_runner: startup-script: drwxr-xr-x 2 root root 4096 May 10 13:57 .
May 10 14:07:18 ramble-test-0 google_metadata_script_runner: startup-script: drwxr-xr-x 4 root root 4096 May 10 13:57 ..
May 10 14:07:18 ramble-test-0 google_metadata_script_runner: startup-script:
May 10 14:07:18 ramble-test-0 google_metadata_script_runner: startup-script: TASK [Failed to get lock] ******************************************************
May 10 14:07:18 ramble-test-0 google_metadata_script_runner: startup-script: fatal: [localhost]: FAILED! => {
May 10 14:07:18 ramble-test-0 google_metadata_script_runner: startup-script:     "changed": false
May 10 14:07:18 ramble-test-0 google_metadata_script_runner: startup-script: }
May 10 14:07:18 ramble-test-0 google_metadata_script_runner: startup-script:
May 10 14:07:18 ramble-test-0 google_metadata_script_runner: startup-script: MSG:
May 10 14:07:18 ramble-test-0 google_metadata_script_runner: startup-script:
May 10 14:07:18 ramble-test-0 google_metadata_script_runner: startup-script: Failed to get lock, exiting

@cdunbar13 cdunbar13 added the release-module-improvements Added to release notes under the "Module Improvements" heading. label May 10, 2024
@nick-stroud nick-stroud assigned cdunbar13 and unassigned nick-stroud May 10, 2024
@cdunbar13 cdunbar13 removed their assignment May 13, 2024
@nick-stroud nick-stroud assigned cdunbar13 and unassigned nick-stroud May 14, 2024
@cdunbar13 cdunbar13 force-pushed the ramble_spack_update branch 2 times, most recently from 76f1bb7 to 8f5e9c8 Compare May 15, 2024 15:34
@cdunbar13 cdunbar13 merged commit efee798 into GoogleCloudPlatform:develop May 15, 2024
8 of 48 checks passed
@cdunbar13 cdunbar13 deleted the ramble_spack_update branch May 20, 2024 14:32
@harshthakkar01 harshthakkar01 mentioned this pull request May 23, 2024
Labels
release-module-improvements Added to release notes under the "Module Improvements" heading.

2 participants