MemoryRetry doesn't work in cromwell v85 (f34251c) on GCP #7205

Open
doron-st opened this issue Aug 15, 2023 · 1 comment

@doron-st

Hi!

I have been trying to make memory retry work on our system without success.
I have read all the docs and previous issues I could find, but it still doesn't work for us.

I have written a test WDL with two tasks. Both write "Killed" to stderr, and both are supposed to be retried with more memory.

The first task, TestBadCommandRetry is designed to fail regularly with rc 127, due to a bad command.
The purpose of this task is to prove the memory-retry mechanism is configured correctly in our system.

Result of TestBadCommandRetry:
The memory-error-key is caught and memory is increased as defined in memory-retry-multiplier.
I also see this failure message in metadata.json:
"message": "stderr for job MemoryRetryTest.TestBadCommandRetry:NA:1 contained one of the memory-retry-error-keys: [Killed] specified in the Cromwell config. Job might have run out of memory."

Grepping metadata for memory of this job, I see the expected behaviour:
"memory": "1 GB",
"memory": "2 GB",
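
For anyone who wants to repeat this check without grep, here is a minimal Python sketch that collects every "memory" value from a metadata document. The `collect_values` helper and the trimmed `sample_metadata` layout are hypothetical and only for illustration; a real metadata.json contains many more fields and may be structured differently.

```python
import json


def collect_values(node, key):
    """Recursively collect every value stored under `key` in nested JSON data."""
    found = []
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                found.append(v)
            else:
                found.extend(collect_values(v, key))
    elif isinstance(node, list):
        for item in node:
            found.extend(collect_values(item, key))
    return found


# Hypothetical, heavily trimmed metadata used only to demonstrate the helper.
sample_metadata = json.loads("""
{
  "calls": {
    "MemoryRetryTest.TestBadCommandRetry": [
      {"attempt": 1, "runtimeAttributes": {"memory": "1 GB"}},
      {"attempt": 2, "runtimeAttributes": {"memory": "2 GB"}}
    ]
  }
}
""")

memory_per_attempt = collect_values(sample_metadata, "memory")
print(memory_per_attempt)  # ['1 GB', '2 GB']
```

The same helper can be called with "returnCode" to check whether a return code was recorded for an attempt; it returns an empty list when the key is absent.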

The second task, TestOutOfMemoryRetry, is designed to fail due to a real out-of-memory error.
The purpose of this task is to show that the memory-retry mechanism does not work when a task actually runs out of memory, even if "Killed" is written to stderr.

Result of TestOutOfMemoryRetry:
When this task is run, it fails but the job is retried with the same amount of memory.
This time I see the following failure message:
"message": "Task MemoryRetryTest.TestOutOfMemoryRetry:NA:1 failed. The job was stopped before the command finished. PAPI error code 9. Execution failed: generic::failed_precondition: while running "/cromwell_root/script": unexpected exit status 137 was not ignored\n[UserAction] Unexpected exit status 137 while running "/cromwell_root/script": Killed\n",

Grepping metadata for memory of this job, I see that the memory expansion is not working:
"memory": "1 GB",
"memory": "1 GB",

I have verified that "Killed" is written correctly to stderr:

gsutil cat gs://<out_bucket>/cromwell-execution/MemoryRetryTest/3035199e-bf2b-49a2-be87-4839e96a08eb/call-TestOutOfMemoryRetry/stderr
Killed    

We have also noticed that in the out-of-memory case, no returnCode is written to the metadata.

Test wdl for reproduction:
`version 1.0

workflow MemoryRetryTest {
  input {
    String message = "Killed"
  }
  call TestOutOfMemoryRetry {}
  call TestBadCommandRetry {}
}

# Fails with a real out-of-memory error (exit status 137).
task TestOutOfMemoryRetry {
  command <<<
    echo "Killed" >&2
    tail /dev/zero
  >>>
  runtime {
    docker: "ubuntu:latest"
    cpu: "1"
    memory: "1 GB"
    disks: "local-disk " + 16 + " HDD"
    maxRetries: 1
    preemptible: 0
  }
}

# Fails with rc 127 because bedtools is not installed in the image.
task TestBadCommandRetry {
  command <<<
    echo "Killed" >&2
    bedtools intersect nothing with nothing
  >>>
  runtime {
    docker: "ubuntu:latest"
    cpu: "1"
    memory: "1 GB"
    disks: "local-disk " + 16 + " HDD"
    maxRetries: 1
    preemptible: 0
  }
}`

input_json:
{ "MemoryRetryTest.message": "Killed" }

Would appreciate your kind assistance!
Doron Shem-Tov

@kshakir
Contributor

kshakir commented Sep 8, 2023

@doron-st TL;DR: Can you try again?


While debugging this issue it just suddenly started working again... 🤷

Looking at old runs, it seems that for a few days this was appearing in the Cromwell logs when a job ran out of memory:

The job was stopped before the command finished. PAPI error code 9. Execution failed: generic::failed_precondition: while running "/cromwell_root/script": unexpected exit status 137 was not ignored

But PAPI (Google's LifeSciences API) should ignore container errors. I have no clue who reported and fixed the issue, but thanks all from afar.

The failed Life Sciences jobs triggered a very different code path in Cromwell. The memory retry logic here runs only when PAPI returns Success, that is, when no error is reported by the Life Sciences API.
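
To make that concrete, here is a rough Python model of the behaviour described above. It is not Cromwell's actual code: the function name, the status strings, and the parameters are all made up for illustration, and it only captures the one point that the stderr check is skipped whenever the backend reports a failure.

```python
def should_retry_with_more_memory(backend_status, stderr_text, error_keys):
    """Simplified model of the described behaviour (hypothetical, not Cromwell code).

    The stderr scan for memory-retry error keys only happens when the
    backend reported success for the job; a backend-level failure takes
    a different path and never reaches this check.
    """
    if backend_status != "Success":
        return False
    return any(key in stderr_text for key in error_keys)


error_keys = ["Killed"]

# TestBadCommandRetry: the backend reported success, stderr contained "Killed".
bad_command = should_retry_with_more_memory("Success", "Killed\n", error_keys)

# TestOutOfMemoryRetry during the bad period: the backend reported a failure
# (PAPI error code 9), so stderr was never inspected.
out_of_memory = should_retry_with_more_memory("Failed", "Killed\n", error_keys)

print(bad_command, out_of_memory)  # True False
```

Under this model, both tasks write the same text to stderr, and only the status reported by the backend decides whether the retry gets more memory.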

Anyway, I'm just glad the Google LifeSciences API isn't returning this error anymore, and I hope it stays that way until I can switch our lab's cromwell over to the Google Batch API 🤞
