a little cleanup #422 (Open)

tedsharpe wants to merge 8 commits into main
Conversation

tedsharpe (Contributor)

No description provided.

@SHuang-Broad (Collaborator) left a comment:

Hey @tedsharpe, thanks for the cleanup!
I have some questions and think some small changes are needed.

Let me know!
BTW, the automated tests are catching these errors:
https://github.com/broadinstitute/long-read-pipelines/actions/runs/6225233549/job/16895206594?pr=422#step:7:312

@@ -16,9 +16,10 @@ ENV PATH=/miniconda/bin/:/miniconda/envs/lr-metrics/bin/:/root/google-cloud-sdk/

# install conda packages
COPY ./environment.yml /
RUN conda install -n base conda-libmamba-solver && conda config --set solver libmamba
SHuang-Broad (Collaborator):

I'm not familiar with how mamba helps with dependency resolution, in particular where it should be installed.
But if you look at the environment.yml, it's trying to create an env named lr-metrics, and here you're installing mamba into the base env.
Is this usually what people do?

tedsharpe (Author):

This has been reverted, so it's no longer relevant, but here's some info for posterity.
I tried to build this Docker image on a desktop with 16 GB of memory. No dice. If I was lucky it just got OOM'd after a few hours. So I ran out to Micro Center and bought 32 GB of memory. After leaving it overnight it was still trying to solve the environment. Adding libmamba let the Docker image build promptly (a few minutes; I don't remember exactly) using little memory. I don't know about the correctness of the environments, but I'd think you'd want the solver in the base environment.

RUN conda env create -f /environment.yml && conda clean -a

# install gatk
# install super-special version of gatk
SHuang-Broad (Collaborator):

I've created a ticket (#427) to help us track this.

tedsharpe (Author):

Current version eliminates it.

@@ -86,7 +85,7 @@ task IsLocusDeleted {
boot_disk_gb: 10,
preemptible_tries: 2,
max_retries: 1,
docker: "us.gcr.io/broad-dsp-lrma/lr-mosdepth:0.3.1"
docker: "us.gcr.io/broad-dsp-lrma/lr-mosdepth:0.3.2"
SHuang-Broad (Collaborator):

I know we have 0.3.2 on GCR, but given that the main branch is using 0.3.1, we can do one of two things here:

  • not update the version here
  • find out what 0.3.2 changed from 0.3.1. I see you included several more packages in lr-mosdepth, so if you built that, then let's also bump the version in the Makefile in that Docker folder.

tedsharpe (Author):

Makefile version bumped.

preemptible_tries: 2,
max_retries: 1,
docker: "us.gcr.io/broad-dsp-lrma/lr-metrics:0.1.11"
docker: "us.gcr.io/broad-dsp-lrma/lr-mosdepth:0.3.2"
SHuang-Broad (Collaborator):

Similarly, mosdepth's Docker image is being updated. Can we look into what changed in 0.3.2?

tedsharpe (Author):

I just added a couple of packages so that we didn't have to needlessly break this task in two.

String basename = basename(bam, ".bam")
Int disk_size = 2*ceil(size(bam, "GB"))

command <<<
set -euxo pipefail

java -jar /usr/local/bin/gatk.jar ComputeLongReadMetrics -I ~{bam} -O ~{basename}.read_metrics -DF WellformedReadFilter
samtools flagstat ~{bam} > ~{basename}.flag_stats.txt
SHuang-Broad (Collaborator):

This is great.
Should it have its own task, though? I know it's always a balance (localization vs. separation of concerns).

tedsharpe (Author):

There's an astounding amount of overhead in invoking a task: provisioning the VM, downloading the docker image, checking the call-cache database, localizing input files, and so on. This is especially true of small tasks. For example, in this workflow the task MakeChrIntervalList (an extreme example, I admit) runs for 2 minutes; of that, the useful work (the UserAction step) takes 1.36 seconds. I'm voting for a rule that says: combine all the operations that process the same data into a single task, unless it's a very long-running task that might escape preemption, or unless parallelizing it would significantly reduce the workflow's wall-clock time.
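
To illustrate, the two commands from the diff above can live in one task, so the BAM is localized and the VM provisioned only once. A rough sketch (the task name, outputs, and runtime block are assumptions for illustration, not the exact code in this PR):

task CombinedReadMetrics {
    input {
        File bam
    }

    String basename = basename(bam, ".bam")
    Int disk_size = 2*ceil(size(bam, "GB"))

    command <<<
        set -euxo pipefail

        # Both commands read the same localized BAM, so a single VM and a single
        # localization pay the per-task overhead only once.
        java -jar /usr/local/bin/gatk.jar ComputeLongReadMetrics -I ~{bam} -O ~{basename}.read_metrics -DF WellformedReadFilter
        samtools flagstat ~{bam} > ~{basename}.flag_stats.txt
    >>>

    output {
        Array[File] read_metrics = glob("~{basename}.read_metrics*")
        File flag_stats = "~{basename}.flag_stats.txt"
    }

    runtime {
        disks: "local-disk " + disk_size + " HDD"
        docker: "us.gcr.io/broad-dsp-lrma/lr-mosdepth:0.3.2"   # image taken from the diff above
    }
}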

@@ -30,6 +31,9 @@ RUN git clone https://github.com/broadinstitute/gatk.git -b kvg_pbeap \
# install picard
RUN wget -O /usr/local/bin/picard.jar https://github.com/broadinstitute/picard/releases/download/2.22.1/picard.jar

# install gsutil
RUN curl -sSL https://sdk.cloud.google.com | bash
SHuang-Broad (Collaborator):

Unrelated to this command, just a convention in this repo.

We usually bump the version in the companion Makefile when the Dockerfile or environment file is changed.
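
For illustration, that bump is typically a one-line change in the Docker folder's Makefile, roughly like the following (a hypothetical sketch; the actual variable names and layout in this repo's Makefiles may differ):

# docker/lr-mosdepth/Makefile (names and layout assumed for illustration)
VERSION = 0.3.2   # bumped from 0.3.1 alongside the Dockerfile/environment.yml change
IMAGE = us.gcr.io/broad-dsp-lrma/lr-mosdepth

build:
	docker build -t $(IMAGE):$(VERSION) .

push: build
	docker push $(IMAGE):$(VERSION)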

tedsharpe (Author):

Reverted.


import "../../structs/Structs.wdl"

workflow MergeVCFs {
SHuang-Broad (Collaborator):

Is this taken from the GATK-SV pipeline?

And is this integrated into other pipelines in this repo?
I checked out your branch and deleted that file, then went to validate all pipelines. All WDLs validated, so I suspect it is not integrated.

If it is indeed not integrated into other pipelines here, do you mind making a separate PR?

tedsharpe (Author):

This was something I wrote for the GMKF data import. It doesn't belong in this PR. I'll remove it.

tedsharpe (Author):

Thanks for the review, Steve. I think we're probably ready for another look.
