Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add TRGT Special-purpose caller #412

Open
wants to merge 21 commits into
base: main
Choose a base branch
from
Open

Conversation

mdanzi
Copy link

@mdanzi mdanzi commented Jul 20, 2023

I would like to propose the incorporation of TRGT as a special-purpose caller of tandem repeat genotypes into the PacBio CCS variant calling workflow. I have added a task and sub-workflow for TRGT, details of how to create the docker image used for its run, and incorporated it into the PBCCSWholeGenome.wdl pipeline.

TRGT requires an input specification file. I am currently hosting the file used on the Phase 1 data and link to it in the TRGT.wdl file. This should be moved to a gcs bucket for better performance in production.

Added task and workflow for running TRGT to genotype tandem repeats in each sample. Included URL to the catalog used on the Phase1 Long Reads data. The WDL currently uses a container stored in AWS's container registry.
Added the lr-trgt folder with Dockerfile and Makefile which specifies how the lr-trgt docker image was created that is referenced in the TRGT.wdl task/workflow. The Makefile is currently configured to push a new build of this image to us.gcr.io/broad-dsp-lrma/lr-trgt. If done, the TRGT.wdl task should be updated to use that container as well.
Added logic to run the TRGT sub-workflow as part of the HiFi variant calling workflow
Copy link
Collaborator

@SHuang-Broad SHuang-Broad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mdanzi
Thanks for issuing a PR for this!

I have some general comments that I'd value your input on.

docker/lr-trgt/Dockerfile Outdated Show resolved Hide resolved
docker/lr-trgt/Makefile Outdated Show resolved Hide resolved
@@ -4,6 +4,7 @@ import "../../../tasks/Utility/PBUtils.wdl" as PB
import "../../../tasks/Utility/Utils.wdl" as Utils
import "../../../tasks/VariantCalling/CallVariantsPBCCS.wdl" as VAR
import "../../../tasks/Utility/Finalize.wdl" as FF
import "../../../tasks/VariantCalling/TRGT.wdl" as TRGT
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a general comment about including trgt in this production pipeline.
It's a bit complicated, so I'll outline the main logic here:

The production pipeline is intended for running routine and stable tools on all production samples. Its primary responsibility is two-fold: alignments merging (and associated QC/metric) + variant calling (SV, SNV, indel). The tools incorporated here are well accepted by the community and undergo less frequent updates. This means we don't need to update & rerun the workflow constantly.

There's another set of callers that are not quite "validated" yet, i.e. they are more special-purpose focused and/or target a specific type of variants. They are under active development and evaluation so their release frequencies are much higher. IMO trgt falls under this category.

So here's my proposal:
Let's leave trgt out of the main variant calling pipeline, and promote it to a standalone workflow (i.e. place it under /wdl/pipelines/PacBio/VariantCalling/). We have other means to ensure that such "special-purpose" callers are also run routinely, in addition to the main ones. This also has the benefit that if we decide to update the version of trgt or how to run it (added annotations, etc), we only need to update its WDLs and rerun that workflow, i.e. separation of concerns .

How does this sound?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You make an excellent point. I definitely agree that TRGT is much more 'under active development' than may be ideal for the main pipeline.

I agree with your plan and have gone ahead and created a new pipeline under /wdl/pipelines/PacBio/VariantCalling/ which is called PBCCS_CallTRs.wdl. This pipeline runs the processWithTRGT task in the TRGT.wdl file (under /wdl/tasks/VariantCalling/).

File ref_dict
File ref_fasta
File ref_fasta_index
File repeatCatalog = "https://zuchnerlab.s3.amazonaws.com/RepeatExpansions/TRGT/adotto_TRregions_TRGTFormatWithFlankingSeq_v1.0.bed"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've never tried running workflows that try to localize HTTP files. So I worry this default value will fail the workflow's executions.

This points to a larger issue: where should these annotations be stored?

I don't mean we should resolve this here, but just getting it out there for us to think about it.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are correct. Apparently you can import WDLs from https, but can't localize files that way. But yes, this is obviously just because I didn't have a good place to store the annotation file. It is about 275MB, so not really appropriate for Github. Ideally, we would put it somewhere on gs. But I don't have an account where I can make a file publicly available there.

For now, I went ahead and moved this to a wget command within the Dockerfile so the annotation is built in to the image (and then referenced locally). But if we can make a place for me to host the annotation on gs, that would obviously be a much better solution as the annotation will certainly change over time and re-building the docker image for that is clearly not optimal.

Updated trgt to v0.8.0 and added T2T repeat catalog
Update TRGT to use v0.8.0 and to allow karyotype-specific analyses
Version bump
updated trgt pipeline to support karyotype-specific calling
Correct repeat catalog filename
@mdanzi
Copy link
Author

mdanzi commented Mar 1, 2024

Hi Kiran and Steve,

Per conversation with Kiran today, I updated the wdls in this PR to use the current version of TRGT so that the results can be phased with all other calls using HiPhase. These will work with either the GRCh38 or T2T genomes, they just require whoever runs the WDLs to input the correct reference fasta file and select the correct repeat catalog (which are stored on the docker image). The other change is that now users have the option to specify XX or XY for individuals in order to potentially get haploid calls on X and Y. I did it this way in my recent processing run of the data on T2T, but not on my original GRCh38 run, so it will be nice to have it all consistent going forward.

I also added example input JSON files to hopefully make it easier for implementation.

@mdanzi
Copy link
Author

mdanzi commented May 21, 2024

Updated TRGT to v1.0 and edited the scripts to accept an int variable called 'is_female' which will be based off the imputed sex. This should be suitable for re-processing of the phase 1 samples for HiPhase work on both hg38 and T2T.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants