Cromwell workflow engine support #825

Open
tom-dyar opened this issue Jun 21, 2018 · 7 comments

@tom-dyar

The current WDL files were developed with DNAnexus in mind. I am trying to modify them for use with Cromwell, and it seems there are differences in the way paths for sub-workflows are handled vs. DNAnexus. I got it to "work" by putting all the tasks and workflows into a single directory.

@dpark01
Member

dpark01 commented Jun 21, 2018

Hi @tom-dyar, yes, that's exactly what we do too. Paths for sub-workflows seem to have shifting interpretations, and I think the DNAnexus parser has changed its handling of this at one point as well. For now, the pipes/WDL directory is intended as a set of source files that may need manipulation prior to use, but we may end up flattening that directory structure in the end.
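As a rough sketch of that flattening workaround (directory and file names here are illustrative rather than the repo's exact layout), Cromwell's command-line runner can also take the sub-workflow sources as a zipped imports bundle:

```bash
# Copy every task and workflow WDL into one flat directory so that relative
# "import" statements resolve the same way under Cromwell.
mkdir -p flat_wdl
cp pipes/WDL/workflows/*.wdl pipes/WDL/tasks/*.wdl flat_wdl/

# Or bundle the dependencies and hand them to Cromwell via -p/--imports.
(cd flat_wdl && zip ../imports.zip *.wdl)
java -jar cromwell.jar run flat_wdl/demux_plus.wdl -i inputs.json -p imports.zip
```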

@tom-dyar
Author

Great, thanks! I am hoping that's all there is regarding compatibility, so it's great you have it on your radar.

@tom-dyar
Author

@dpark01 -- I am now trying to get this running on Google Compute Engine from Cromwell. Do you have configuration files (machine requirements and reference file locations) for Google Cloud, similar to the dx-**.json files in pipes/WDL? I bumped up the local disk to 2 TB for a large run I tried, but Kraken still never finished after 24 hours, probably due to a RAM issue with my NextSeq 500 run...

Thanks,
Tom

@dpark01
Member

dpark01 commented Jul 16, 2018

@tom-dyar here is a json config file that we use (though see #843 for some caveats about it). The machine requirement specs should be fully derivable from the WDL task runtimes (in fact they were designed primarily with GCP instances in mind, with dx_instance_type specifying the AWS/DNAnexus ones separately). But the config json defines a default disk setup that is important to get it to work (two LOCAL disks and a larger bootDiskSizeGb).
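As a general sketch of how Cromwell picks up a backend/runtime config like the one attached (file names below are placeholders), the config is passed as a Java system property when launching Cromwell:

```bash
# google.conf is the backend config discussed above (placeholder name);
# inputs.json carries the workflow inputs.
java -Dconfig.file=google.conf -jar cromwell.jar run demux_plus.wdl -i inputs.json
```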

As for default databases, I don't have them all linked in properly, and they might not be the latest versions, but see gs://sabeti-public/meta_dbs and /depletion_dbs.
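Those can be browsed or copied down with gsutil (bucket paths as above; the database name below is just a placeholder):

```bash
# See what is currently published in the public database buckets.
gsutil ls gs://sabeti-public/meta_dbs/
gsutil ls gs://sabeti-public/depletion_dbs/

# Copy one database locally (object name is a placeholder).
gsutil -m cp -r gs://sabeti-public/meta_dbs/SOME_KRAKEN_DB ./dbs/
```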

@tom-dyar
Author

OK, not sure if I should submit a new ticket or not...

SamToFastq is "hanging" when running demux_plus.wdl, so kraken.py never completes. I have a couple of 5-7 GB BAM files, and it is failing on one of them. I am using the new Google Pipelines API v2alpha1 and Cromwell version 34. I have bumped up the disk sizes, so I have two local disks of 500 GB each and a 100 GB boot disk. Below is my configuration file. I wonder how I should debug this, since there is no output in the log files; perhaps there is a Picard VERBOSITY option I could set, but it seems I would have to update the container to put that in.

Thanks for any help!

```hocon
include required(classpath("application"))

# Add customizations
#webservice.port = 8090


#MYSQL_DATABASE=cromwell_db -e MYSQL_USER=cromwell -e MYSQL_PASSWORD=cromwell

database {
  db.url = "jdbc:mysql://mysql-db/cromwell_db?useSSL=false&rewriteBatchedStatements=true"
  db.user = "cromwell"
  db.password = "cromwell"
  db.driver = "com.mysql.jdbc.Driver"
  profile = "slick.jdbc.MySQLProfile$"
}

google {

  application-name = "cromwell"

  auths = [
    {
      name = "application-default"
      scheme = "application_default"
    }
  ]
}

engine {
  filesystems {
    gcs {
      auth = "application-default"
      project = "014A9F-BB2CDD-822772"
    }
  }
}

backend {
  default = "Jes"
  providers {
    Jes {
      actor-factory = "cromwell.backend.google.pipelines.v2alpha1.PipelinesApiLifecycleActorFactory"
      config {
        // Google project
        project = "atvirology"

        // Base bucket for workflow executions
        root = "gs://atvir-cromwell/cromwell-execution"
        genomics-api-queries-per-100-seconds = 1000

        // Polling for completion backs-off gradually for slower-running jobs.
        // This is the maximum polling interval (in seconds):
        maximum-polling-interval = 300

        // Optional Dockerhub Credentials. Can be used to access private docker images.
        dockerhub {
          // account = ""
          // token = ""
        }

        genomics {
          // A reference to an auth defined in the `google` stanza at the top.  This auth is used to create
          // Pipelines and manipulate auth JSONs.
          auth = "application-default"
          // Endpoint for APIs, no reason to change this unless directed by Google.
          endpoint-url = "https://genomics.googleapis.com/"
          // Restrict access to VM metadata. Useful in cases when untrusted containers are running under a service
          // account not owned by the submitting user
          restrict-metadata-access = false
          // This allows you to use an alternative service account to launch jobs, by default uses default service account
          compute-service-account = "default"

          // Pipelines v2 only: specify the number of times localization and delocalization operations should be attempted
          // There is no logic to determine if the error was transient or not, everything is retried upon failure
          // Defaults to 3
          localization-attempts = 3
        }

        filesystems {
          gcs {
            // A reference to a potentially different auth for manipulating files via engine functions.
            auth = "application-default"
            project = "014A9F-BB2CDD-822772"

            caching {
              // When a cache hit is found, the following duplication strategy will be followed to use the cached outputs
              // Possible values: "copy", "reference". Defaults to "copy"
              // "copy": Copy the output files
              // "reference": DO NOT copy the output files but point to the original output files instead.
              //              Will still make sure than all the original output files exist and are accessible before
              //              going forward with the cache hit.
              duplication-strategy = "copy"
            }
          }
        }

        default-runtime-attributes {
          cpu: 2
          memory: "4 GB"
          failOnStderr: false
          continueOnReturnCode: 0
          bootDiskSizeGb: 100
          # Allowed to be a String, or a list of Strings. NB: was "LOCAL" instead of "HDD"
          disks: "local-disk 500 HDD, /mnt/tmp 500 HDD"
          noAddress: false
          preemptible: 1
          zones: [ "us-central1-a", "us-central1-b", "us-central1-c", "us-east1-b", "us-east1-c", "us-east1-d" ]
        }

        #default-runtime-attributes {
        #  cpu: 2
        #  memory: "15G"
        #  failOnStderr: false
        #  continueOnReturnCode: 0
        #  bootDiskSizeGb: 50
        #  // Allowed to be a String, or a list of Strings
        #  disks: "local-disk 2000 LOCAL, /mnt/tmp 2000 LOCAL"
        #  noAddress: false
        #  preemptible: 1
        #  zones: [ "us-central1-a", "us-central1-b", "us-central1-c", "us-east1-b", "us-east1-c", "us-east1-d" ]
        #}
      }
    }
  }
}

call-caching {
  enabled = true
  invalidate-bad-cache-results = true
}
```

@dpark01
Member

dpark01 commented Aug 17, 2018

Hi Tom, interesting... you should at least be able to deduce, from the stdout/stderr log files that Cromwell normally produces for the kraken task, which BAM file it was processing at the time. And given that you have the input BAM files, perhaps you could try reproducing the effect manually: spin up a GCE VM, pull the docker image, and run it interactively (docker run -it --rm quay.io/....), which would give you an interactive shell as root within the container. You can then run metagenomics.py kraken on your input BAM by hand and watch the output, and since you have root, you could edit the source for more verbosity. But my real guess is that this has less to do with Picard and more to do with whatever is consuming its output pipes.

If it's reproducible and if your data isn't sensitive, we'd be happy to look at an example bam file.
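A rough sketch of that interactive debugging session (the image name/tag and paths below are placeholders; check the --help output inside the container for the exact arguments your viral-ngs version expects):

```bash
# Copy the problematic input BAM from the bucket onto the VM (path is illustrative).
gsutil cp gs://atvir-cromwell/path/to/suspect_sample.bam .

# Pull the viral-ngs image and get an interactive root shell inside it
# (use the image/tag referenced in the WDL task's runtime section).
docker run -v "$PWD":/data -it --rm quay.io/broadinstitute/viral-ngs bash

# Inside the container: confirm the exact kraken arguments for this version,
# then run the step by hand on the suspect BAM and watch where it stalls.
metagenomics.py --version
metagenomics.py kraken --help
```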

dpark01 reopened this Aug 17, 2018
@tom-dyar
Author

Thanks @dpark01 -- good tips and I will try to reproduce. Nothing particularly sensitive; here is the path to my logs (I tried to make my buckets publicly readable): gs://atvir-cromwell/cromwell-execution/demux_plus/2021156e-a3c3-45b1-9eb3-9171f70595f4/call-kraken
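(The per-shard stdout/stderr can be pulled from that call directory with gsutil, e.g.:)

```bash
# List the kraken call's execution directory, then copy it locally to inspect
# the stdout/stderr files for each shard.
gsutil ls -r gs://atvir-cromwell/cromwell-execution/demux_plus/2021156e-a3c3-45b1-9eb3-9171f70595f4/call-kraken/
mkdir -p kraken-logs
gsutil -m cp -r gs://atvir-cromwell/cromwell-execution/demux_plus/2021156e-a3c3-45b1-9eb3-9171f70595f4/call-kraken kraken-logs/
```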
