Add TRGT Special-purpose caller #412

mdanzi · 2023-07-20T15:41:19Z

I would like to propose the incorporation of TRGT as a special-purpose caller of tandem repeat genotypes into the PacBio CCS variant calling workflow. I have added a task and sub-workflow for TRGT, details of how to create the docker image used for its run, and incorporated it into the PBCCSWholeGenome.wdl pipeline.

TRGT requires an input specification file. I am currently hosting the file used on the Phase 1 data and link to it in the TRGT.wdl file. This should be moved to a gcs bucket for better performance in production.

Added task and workflow for running TRGT to genotype tandem repeats in each sample. Included URL to the catalog used on the Phase1 Long Reads data. The WDL currently uses a container stored in AWS's container registry.

Added the lr-trgt folder with Dockerfile and Makefile which specifies how the lr-trgt docker image was created that is referenced in the TRGT.wdl task/workflow. The Makefile is currently configured to push a new build of this image to us.gcr.io/broad-dsp-lrma/lr-trgt. If done, the TRGT.wdl task should be updated to use that container as well.

Added logic to run the TRGT sub-workflow as part of the HiFi variant calling workflow

SHuang-Broad

@mdanzi
Thanks for issuing a PR for this!

I have some general comments that I'd value your input on.

docker/lr-trgt/Dockerfile

docker/lr-trgt/Makefile

SHuang-Broad · 2023-07-25T15:42:33Z

wdl/pipelines/PacBio/VariantCalling/PBCCSWholeGenome.wdl

@@ -4,6 +4,7 @@ import "../../../tasks/Utility/PBUtils.wdl" as PB
 import "../../../tasks/Utility/Utils.wdl" as Utils
 import "../../../tasks/VariantCalling/CallVariantsPBCCS.wdl" as VAR
 import "../../../tasks/Utility/Finalize.wdl" as FF
+import "../../../tasks/VariantCalling/TRGT.wdl" as TRGT


I have a general comment about including trgt in this production pipeline.
It's a bit complicated, so I'll outline the main logic here:

The production pipeline is intended for running routine and stable tools on all production samples. Its primary responsibility is two-fold: alignments merging (and associated QC/metric) + variant calling (SV, SNV, indel). The tools incorporated here are well accepted by the community and undergo less frequent updates. This means we don't need to update & rerun the workflow constantly.

There's another set of callers that are not quite "validated" yet, i.e. they are more special-purpose focused and/or target a specific type of variants. They are under active development and evaluation so their release frequencies are much higher. IMO trgt falls under this category.

So here's my proposal:
Let's leave trgt out of the main variant calling pipeline, and promote it to a standalone workflow (i.e. place it under /wdl/pipelines/PacBio/VariantCalling/). We have other means to ensure that such "special-purpose" callers are also run routinely, in addition to the main ones. This also has the benefit that if we decide to update the version of trgt or how to run it (added annotations, etc), we only need to update its WDLs and rerun that workflow, i.e. separation of concerns .

How does this sound?

You make an excellent point. I definitely agree that TRGT is much more 'under active development' than may be ideal for the main pipeline.

I agree with your plan and have gone ahead and created a new pipeline under /wdl/pipelines/PacBio/VariantCalling/ which is called PBCCS_CallTRs.wdl. This pipeline runs the processWithTRGT task in the TRGT.wdl file (under /wdl/tasks/VariantCalling/).

SHuang-Broad · 2023-07-25T15:52:04Z

wdl/tasks/VariantCalling/TRGT.wdl

+    File ref_dict
+    File ref_fasta
+    File ref_fasta_index
+    File repeatCatalog = "https://zuchnerlab.s3.amazonaws.com/RepeatExpansions/TRGT/adotto_TRregions_TRGTFormatWithFlankingSeq_v1.0.bed"


I've never tried running workflows that try to localize HTTP files. So I worry this default value will fail the workflow's executions.

This points to a larger issue: where should these annotations be stored?

I don't mean we should resolve this here, but just getting it out there for us to think about it.

You are correct. Apparently you can import WDLs from https, but can't localize files that way. But yes, this is obviously just because I didn't have a good place to store the annotation file. It is about 275MB, so not really appropriate for Github. Ideally, we would put it somewhere on gs. But I don't have an account where I can make a file publicly available there.

For now, I went ahead and moved this to a wget command within the Dockerfile so the annotation is built in to the image (and then referenced locally). But if we can make a place for me to host the annotation on gs, that would obviously be a much better solution as the annotation will certainly change over time and re-building the docker image for that is clearly not optimal.

Updated trgt to v0.8.0 and added T2T repeat catalog

Update TRGT to use v0.8.0 and to allow karyotype-specific analyses

Version bump

updated trgt pipeline to support karyotype-specific calling

Correct repeat catalog filename

T2T example

Corrected values in example

mdanzi · 2024-03-01T20:24:57Z

Hi Kiran and Steve,

Per conversation with Kiran today, I updated the wdls in this PR to use the current version of TRGT so that the results can be phased with all other calls using HiPhase. These will work with either the GRCh38 or T2T genomes, they just require whoever runs the WDLs to input the correct reference fasta file and select the correct repeat catalog (which are stored on the docker image). The other change is that now users have the option to specify XX or XY for individuals in order to potentially get haploid calls on X and Y. I did it this way in my recent processing run of the data on T2T, but not on my original GRCh38 run, so it will be nice to have it all consistent going forward.

I also added example input JSON files to hopefully make it easier for implementation.

mdanzi · 2024-05-21T21:09:09Z

Updated TRGT to v1.0 and edited the scripts to accept an int variable called 'is_female' which will be based off the imputed sex. This should be suitable for re-processing of the phase 1 samples for HiPhase work on both hg38 and T2T.

mdanzi added 3 commits July 20, 2023 10:21

Added TRGT.wdl

98df965

Added task and workflow for running TRGT to genotype tandem repeats in each sample. Included URL to the catalog used on the Phase1 Long Reads data. The WDL currently uses a container stored in AWS's container registry.

Add TRGT to PBCCSWholeGenome.wdl

d20883a

Added logic to run the TRGT sub-workflow as part of the HiFi variant calling workflow

bshifaw requested review from SHuang-Broad and kvg July 21, 2023 18:16

bshifaw assigned mdanzi Jul 21, 2023

SHuang-Broad requested changes Jul 25, 2023

View reviewed changes

Move TRGT to separate pipeline

7206419

mdanzi requested a review from SHuang-Broad August 1, 2023 20:34

mdanzi added 8 commits March 1, 2024 13:53

Update Dockerfile

53f8e5d

Updated trgt to v0.8.0 and added T2T repeat catalog

Update TRGT.wdl

d433589

Update TRGT to use v0.8.0 and to allow karyotype-specific analyses

Update Makefile

3974b44

Version bump

Update PBCCS_CallTRs.wdl

673bb03

updated trgt pipeline to support karyotype-specific calling

Update PBCCS_CallTRs.wdl

15adde1

Correct repeat catalog filename

Create PBCCS_CallTRs_exampleInputs_GRCh38.json

12b0413

Create PBCCS_CallTRs_exampleInputs_T2T.json

f6f8c10

T2T example

Update PBCCS_CallTRs_exampleInputs_GRCh38.json

37eaee2

Corrected values in example

mdanzi added 9 commits May 21, 2024 15:29

Merge branch 'broadinstitute:main' into main

9454e44

Update to TRGT v1.0 in Docker image

723b058

Update Makefile for TRGT to v1.0

78bfab5

Update TRGT.wdl to v1.0

0a895f0

Update PBCCS_CallTRs.wdl to v1.0

c2c4a98

Update Dockerfile

a54ed63

Update Dockerfile

aa66981

Update PBCCS_CallTRs_exampleInputs_GRCh38.json

6c7fc74

Update PBCCS_CallTRs_exampleInputs_T2T.json

269d0d9

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add TRGT Special-purpose caller #412

Add TRGT Special-purpose caller #412

mdanzi commented Jul 20, 2023

SHuang-Broad left a comment

SHuang-Broad Jul 25, 2023

mdanzi Aug 1, 2023

SHuang-Broad Jul 25, 2023

mdanzi Aug 1, 2023

mdanzi commented Mar 1, 2024

mdanzi commented May 21, 2024

Add TRGT Special-purpose caller #412

Are you sure you want to change the base?

Add TRGT Special-purpose caller #412

Conversation

mdanzi commented Jul 20, 2023

SHuang-Broad left a comment

Choose a reason for hiding this comment

SHuang-Broad Jul 25, 2023

Choose a reason for hiding this comment

mdanzi Aug 1, 2023

Choose a reason for hiding this comment

SHuang-Broad Jul 25, 2023

Choose a reason for hiding this comment

mdanzi Aug 1, 2023

Choose a reason for hiding this comment

mdanzi commented Mar 1, 2024

mdanzi commented May 21, 2024