We've been looking into migrating from the PAPIv2 backend to the GCPBATCH backend. Call caching fails on GCPBATCH but not on PAPIv2 when using a private Docker image in gcr.io.
Is this a missing feature or a bug? The documentation could be read either way, depending on whether GCPBATCH counts among the other backends or as part of the Pipelines backend (https://cromwell.readthedocs.io/en/latest/cromwell_features/CallCaching/).
I don't think this is a configuration error, since the same config works with the PAPIv2 backend, but if it is, what configuration options would be needed to set up gcr.io authentication when using GCPBATCH?
Errors from the Cromwell logs when a task is being call cached:
cromwell_1 | 2024-01-11 11:09:38 pool-9-thread-9 INFO - Manifest request failed for docker manifest V2, falling back to OCI manifest. Image: DockerImageIdentifierWithoutHash(Some(eu.gcr.io),Some(project),image_name,tag)
cromwell_1 | cromwell.docker.registryv2.DockerRegistryV2Abstract$Unauthorized: 401 Unauthorized {"errors":[{"code":"UNAUTHORIZED","message":"You don't have the needed permissions to perform this operation, and you may have invalid credentials. To authenticate your request, follow the steps in: https://cloud.google.com/container-registry/docs/advanced-authentication"}]}
cromwell_1 | at cromwell.docker.registryv2.DockerRegistryV2Abstract.$anonfun$getDigestFromResponse$1(DockerRegistryV2Abstract.scala:321)
cromwell_1 | at map @ fs2.internal.CompileScope.$anonfun$close$9(CompileScope.scala:246)
cromwell_1 | at flatMap @ fs2.internal.CompileScope.$anonfun$close$6(CompileScope.scala:245)
cromwell_1 | at map @ fs2.internal.CompileScope.fs2$internal$CompileScope$$traverseError(CompileScope.scala:222)
cromwell_1 | at flatMap @ fs2.internal.CompileScope.$anonfun$close$4(CompileScope.scala:244)
cromwell_1 | at map @ fs2.internal.CompileScope.fs2$internal$CompileScope$$traverseError(CompileScope.scala:222)
cromwell_1 | at flatMap @ fs2.internal.CompileScope.$anonfun$close$2(CompileScope.scala:242)
cromwell_1 | at flatMap @ fs2.internal.CompileScope.close(CompileScope.scala:241)
cromwell_1 | at unsafeRunAsyncAndForget @ cromwell.docker.DockerInfoActor.$anonfun$startAndRegisterStream$2(DockerInfoActor.scala:163)
cromwell_1 | at flatMap @ fs2.internal.CompileScope.$anonfun$openAncestor$2(CompileScope.scala:261)
cromwell_1 | at flatMap @ fs2.internal.FreeC$.$anonfun$compile$17(Algebra.scala:545)
cromwell_1 | at map @ fs2.internal.CompileScope.$anonfun$close$9(CompileScope.scala:246)
cromwell_1 | at flatMap @ fs2.internal.CompileScope.$anonfun$close$6(CompileScope.scala:245)
cromwell_1 | at map @ fs2.internal.CompileScope.fs2$internal$CompileScope$$traverseError(CompileScope.scala:222)
cromwell_1 | at flatMap @ fs2.internal.CompileScope.$anonfun$close$4(CompileScope.scala:244)
cromwell_1 | at map @ fs2.internal.CompileScope.fs2$internal$CompileScope$$traverseError(CompileScope.scala:222)
cromwell_1 | 2024-01-11 11:09:38 cromwell-system-akka.dispatchers.engine-dispatcher-33 WARN - BackendPreparationActor_for_0845428a:myworkflow.mytask:-1:1 [UUID(0845428a)]: Docker lookup failed
cromwell_1 | java.lang.Exception: Unauthorized to get docker hash eu.gcr.io/project/image_name:tag
cromwell_1 | at cromwell.engine.workflow.WorkflowDockerLookupActor.cromwell$engine$workflow$WorkflowDockerLookupActor$$handleLookupFailure(WorkflowDockerLookupActor.scala:279)
cromwell_1 | at cromwell.engine.workflow.WorkflowDockerLookupActor$$anonfun$3.applyOrElse(WorkflowDockerLookupActor.scala:93)
cromwell_1 | at cromwell.engine.workflow.WorkflowDockerLookupActor$$anonfun$3.applyOrElse(WorkflowDockerLookupActor.scala:78)
cromwell_1 | at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:35)
cromwell_1 | at akka.actor.FSM.processEvent(FSM.scala:707)
cromwell_1 | at akka.actor.FSM.processEvent$(FSM.scala:704)
cromwell_1 | at cromwell.engine.workflow.WorkflowDockerLookupActor.akka$actor$LoggingFSM$$super$processEvent(WorkflowDockerLookupActor.scala:45)
cromwell_1 | at akka.actor.LoggingFSM.processEvent(FSM.scala:847)
cromwell_1 | at akka.actor.LoggingFSM.processEvent$(FSM.scala:829)
cromwell_1 | at cromwell.engine.workflow.WorkflowDockerLookupActor.processEvent(WorkflowDockerLookupActor.scala:45)
cromwell_1 | at akka.actor.FSM.akka$actor$FSM$$processMsg(FSM.scala:701)
cromwell_1 | at akka.actor.FSM$$anonfun$receive$1.applyOrElse(FSM.scala:695)
cromwell_1 | at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:35)
cromwell_1 | at cromwell.docker.DockerClientHelper$$anonfun$dockerResponseReceive$1.applyOrElse(DockerClientHelper.scala:16)
cromwell_1 | at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:269)
cromwell_1 | at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:270)
cromwell_1 | at akka.actor.Actor.aroundReceive(Actor.scala:539)
cromwell_1 | at akka.actor.Actor.aroundReceive$(Actor.scala:537)
cromwell_1 | at cromwell.engine.workflow.WorkflowDockerLookupActor.aroundReceive(WorkflowDockerLookupActor.scala:45)
cromwell_1 | at akka.actor.ActorCell.receiveMessage(ActorCell.scala:614)
cromwell_1 | at akka.actor.ActorCell.invoke(ActorCell.scala:583)
cromwell_1 | at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:268)
cromwell_1 | at akka.dispatch.Mailbox.run(Mailbox.scala:229)
cromwell_1 | at akka.dispatch.Mailbox.exec(Mailbox.scala:241)
cromwell_1 | at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
cromwell_1 | at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
cromwell_1 | at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
cromwell_1 | at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
cromwell_1 |
cromwell_1 | 2024-01-11 11:09:38 cromwell-system-akka.dispatchers.engine-dispatcher-38 INFO - BT-322 0845428a:myworkflow.mytask:-1:1 is not eligible for call caching
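The 401 above is raised by Cromwell's own HTTP request to the registry's manifest endpoint (the JVM process, not the job VM). As a standalone sketch of how we reproduce it outside Cromwell, using the hypothetical image coordinates from the log line above:

```shell
# Hypothetical coordinates taken from the log above; substitute your own.
REGISTRY="eu.gcr.io"
IMAGE="project/image_name"
TAG="tag"

# Cromwell's remote hash lookup queries the Docker Registry v2 manifest endpoint:
MANIFEST_URL="https://${REGISTRY}/v2/${IMAGE}/manifests/${TAG}"
echo "${MANIFEST_URL}"

# An anonymous request should reproduce the 401 from the Cromwell log:
#   curl -sI "${MANIFEST_URL}"
# With a token minted from the same service account the Cromwell host runs as,
# the request should succeed, which would confirm the credentials themselves
# are fine and only the lookup path differs between backends:
#   TOKEN=$(gcloud auth print-access-token)
#   curl -sI -H "Authorization: Bearer ${TOKEN}" "${MANIFEST_URL}"
```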
Used backend:
GCPBATCH. Call caching works with PAPIv2 but not with GCPBATCH.
We are running Cromwell via the broadinstitute/cromwell:87-ecd44b6 image.
Cromwell configuration:
include required(classpath("application"))
system.new-workflow-poll-rate=1
// Increase the timeout for HTTP requests; fetching metadata can time out for large workflows.
akka.http.server.request-timeout=600s
# Maximum number of input file bytes allowed in order to read each type.
# If exceeded a FileSizeTooBig exception will be thrown.
system {
job-rate-control {
jobs = 100
per = 1 second
}
input-read-limits {
lines = 128000000
bool = 7
int = 19
float = 50
string = 1280000
json = 12800000
tsv = 1280000000
map = 128000000
object = 128000000
}
# If 'true', a SIGTERM or SIGINT will trigger Cromwell to attempt to gracefully shutdown in server mode,
# in particular clearing up all queued database writes before letting the JVM shut down.
# The shutdown is a multi-phase process, each phase having its own configurable timeout. See the Dev Wiki for more details.
graceful-server-shutdown = true
max-concurrent-workflows = 5000
io {
throttle {
# # Global Throttling - This is mostly useful for GCS and can be adjusted to match
# # the quota available on the GCS API
number-of-requests = 100000
per = 100 seconds
}
}
}
akka {
# Optionally set / override any akka settings
http {
server {
# Increasing these timeouts allows REST API responses for very large jobs
# to be returned to the user. When the timeout is reached the server responds with
# `The server was not able to produce a timely response to your request.`
# https://gatkforums.broadinstitute.org/wdl/discussion/10209/retrieving-metadata-for-large-workflows
request-timeout = 600s
idle-timeout = 600s
}
}
}
services {
MetadataService {
#class = "cromwell.services.metadata.impl.MetadataServiceActor"
config {
metadata-read-row-number-safety-threshold = 2000000
# # For normal usage the default value of 200 should be fine but for larger/production environments we recommend a
# # value of at least 500. There'll be no one size fits all number here so we recommend benchmarking performance and
# # tuning the value to match your environment.
db-batch-size = 700
}
}
}
google {
application-name = "cromwell"
auths = [
{
name = "application-default"
scheme = "application_default"
}
]
}
docker {
hash-lookup {
method = "remote"
}
}
engine {
filesystems {
gcs {
auth = "application-default"
}
}
}
call-caching {
enabled = true
}
backend {
default = GCPBATCH
providers {
GCPBATCH {
// GCP Batch
actor-factory = "cromwell.backend.google.batch.GcpBatchBackendLifecycleActorFactory"
config {
## Google project
project = "$PROJECT"
## Base bucket for workflow executions
root = "$BUCKET"
name-for-call-caching-purposes: PAPI
#60000/min in google
##genomics-api-queries-per-100-seconds = 90000
virtual-private-cloud {
network-name = "$NET"
subnetwork-name = "$SUBNET"
}
// Polling for completion backs-off gradually for slower-running jobs.
// This is the maximum polling interval (in seconds):
maximum-polling-interval = 600
request-workers = 4
batch-timeout = 7 days
# Emit a warning if jobs last longer than this amount of time. This might indicate that something got stuck in PAPI.
slow-job-warning-time: 24 hours
genomics {
// A reference to an auth defined in the `google` stanza at the top. This auth is used to create
// Pipelines and manipulate auth JSONs.
auth = "application-default"
compute-service-account = "default"
# Restrict access to VM metadata. Useful in cases when untrusted containers are running under a service
# account not owned by the submitting user
restrict-metadata-access = false
## Location
location = "europe-west1"
}
filesystems {
gcs {
// A reference to a potentially different auth for manipulating files via engine functions.
auth = "application-default"
project = "$PROJECT"
caching {
# When a cache hit is found, the following duplication strategy will be followed to use the cached outputs
# Possible values: "copy", "reference". Defaults to "copy"
# "copy": Copy the output files
# "reference": DO NOT copy the output files but point to the original output files instead.
# Will still make sure that all the original output files exist and are accessible before
# going forward with the cache hit.
duplication-strategy = "reference"
}
}
}
default-runtime-attributes {
cpu: 1
failOnStderr: false
continueOnReturnCode: 0
memory: "2 GB"
bootDiskSizeGb: 10
# Allowed to be a String, or a list of Strings
disks: "local-disk 10 HDD"
noAddress: false
preemptible: 1
zones: ["europe-west1-b"]
}
}
}
}
}
database {
...
}
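One workaround we are considering, assuming `docker.hash-lookup` behaves for GCPBATCH as the call-caching docs describe for other backends: switch the digest lookup from the in-JVM "remote" method to the "local" method, which asks the docker daemon on the Cromwell host instead, so credentials set up with `gcloud auth configure-docker` apply. A sketch, not a verified fix:

```hocon
docker {
  hash-lookup {
    # "remote" (our current setting) performs the registry HTTP call from the
    # Cromwell JVM, which is where the 401 above is raised.
    # "local" asks the docker daemon on the Cromwell host for the digest,
    # reusing whatever registry auth the host's docker config already has.
    method = "local"
  }
}
```

This would require docker to be available inside our Cromwell container, so it may not fit every deployment.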
I believe that in Life Sciences and its predecessors, pull access to private GCR images was granted by the credentials on the job VM. Since Batch is a much larger step change, it could be that this behavior no longer holds true.
@Lipastomies what steps do you take to configure your system to use those private images?
Hi, we have Cromwell running in docker on a GCP VM, and the service account of the GCP VM has access to the image registry. I don't think we are doing anything else to gain access to the private registry.
I don't think we do anything other than give the service account the required permissions. The VMs have been able to pull the images fine; that hasn't been a problem when running GCPBATCH.
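For completeness, this is roughly how we check which identity the VM exposes. The metadata endpoints only resolve from inside a GCE VM, so the curl calls are left commented out here:

```shell
# GCE metadata server paths for the VM's default service account
# (only reachable from inside a GCE VM).
MD_BASE="http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default"
echo "${MD_BASE}/email"
echo "${MD_BASE}/scopes"

# On the VM itself:
#   curl -s -H "Metadata-Flavor: Google" "${MD_BASE}/email"
#   curl -s -H "Metadata-Flavor: Google" "${MD_BASE}/scopes"
# For eu.gcr.io pulls, that account needs read access to the registry's
# underlying storage bucket (eu.artifacts.<project>.appspot.com),
# e.g. via roles/storage.objectViewer.
```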