Refine Contamination Analysis Workflow for Extra Input Resources #443

shadizaheri · 2024-02-29T21:38:38Z

Summary:
This pull request refines the LongReadsContaminationEstimation and Contamination workflows. It adds the .mu, .UD, and modified .bed resources, removes inputs that are no longer needed, and uses our newly updated Docker image.

Changes:
Hardcoded Resource Paths: Updated the workflow to utilize hardcoded paths within our custom Docker image for BED, UD, and MU files. This change simplifies the input requirements and ensures consistency across runs.

Input Optimization: Removed the gt_sites_bed input from LongReadsContaminationEstimation as the necessary BED file is now embedded within the Docker image, reducing manual input errors and streamlining data processing.

Workflow Clarity: Enhanced readability and maintainability of the WDL scripts by removing unused variables and inputs, particularly focusing on deprecated is_hgdp_sites and is_100k_sites flags that no longer affect the workflow due to internalized resource paths.

Documentation Updates: Revised comments and documentation within WDL files to accurately reflect the new workflow process and input requirements.
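To make the effect of the hardcoded resource paths concrete, here is a minimal Python sketch of how a VerifyBamID2 command line could be assembled once the BED, UD, and MU files live inside the image. The directory `/resources`, the file names, the executable name, and the function `build_verifybamid2_args` are assumptions for illustration only; the PR does not state the real paths. The sketch also assumes the task feeds a pileup file to VerifyBamID2. Only the flags `--BedPath`, `--UDPath`, `--MeanPath`, `--PileupFile`, and `--Reference` are taken from VerifyBamID2 itself.

```python
from pathlib import PurePosixPath
from typing import List

# Hypothetical locations inside the Docker image (not stated in this PR).
RESOURCE_DIR = PurePosixPath("/resources")
BED_PATH = RESOURCE_DIR / "sites.bed"
UD_PATH = RESOURCE_DIR / "sites.UD"
MU_PATH = RESOURCE_DIR / "sites.mu"


def build_verifybamid2_args(
    pileup: str,
    reference: str,
    executable: str = "VerifyBamID",  # assumed name of the binary in the image
) -> List[str]:
    """Return an argv list that passes the three resources explicitly.

    Because each resource is given by its own flag, no --SVDPrefix is needed
    and the caller no longer supplies a BED file.
    """
    return [
        executable,
        "--BedPath", str(BED_PATH),
        "--UDPath", str(UD_PATH),
        "--MeanPath", str(MU_PATH),
        "--PileupFile", pileup,
        "--Reference", reference,
    ]


if __name__ == "__main__":
    print(" ".join(build_verifybamid2_args("sample.pileup", "ref.fa")))
```

With this layout the only per-sample inputs are the pileup and the reference, which is the reduction in inputs the changes above describe.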

Rationale:
These changes were motivated by the need to streamline our contamination estimation process for long-read datasets and reduce the complexity of workflow inputs. We minimize user error and ensure a more uniform analysis framework by embedding critical resources directly within our Docker environment.

Testing:
I will test the updated workflows and update this PR with the results.

Overhaul how small variants are called in the WG pipelines
* default to use DV to call small variants, Clair3 analysis needs to be requested explicitly
* retire the Pepper toolchain completely from the CCS pipeline, using DV directly
* for R10.4+ ONT data, also use DV directly
* older ONT data would still use the PEPPER-DV-Margin pipeline
* offers GPU version (though based on, it's not worth it yet)
* update how bam haplotagging is done

Cleanup structural variants calling
* experiment with SNF2 phasing SV calls (implicitly depends on small variants calling now)
* tune PBSV calling
  - discover now supports --hifi
  - output vcf.gz and tbi
  - less verbose logging by default

Misc.:
* optimizations to BAM merging and metrics workflow
* updates coverage collection step
* new R script to visualize log from vm_monitoring_script.sh

* organize dockstore.yml file a bit
* make WDL validation shell script more usable
* update pbmm2 and pbindex to versions in SMRTLink
* update GeneralUtils.wdl
  - two bash-like new tasks [CoerceMapToArrayOfPairs, CoerceArrayOfPairsToMap]
  - cleanup task CollapseArrayOfStrings
* update resource allocations to tasks
  - NanoplotFromBam (also changes docker)
  - MosDepthWGS
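The implementations of CoerceMapToArrayOfPairs and CoerceArrayOfPairsToMap are not shown on this page. As a rough illustration of the conversion their names suggest, here is a minimal Python sketch. The function names, the string-only types, and the choice to reject duplicate keys are assumptions, not the behaviour of the WDL tasks.

```python
from typing import Dict, Iterable, List, Tuple


def coerce_map_to_array_of_pairs(mapping: Dict[str, str]) -> List[Tuple[str, str]]:
    """Turn a map into a list of (key, value) pairs, keeping insertion order."""
    return list(mapping.items())


def coerce_array_of_pairs_to_map(pairs: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Turn (key, value) pairs back into a map.

    Assumption: a repeated key is treated as an error rather than silently
    overwritten.
    """
    result: Dict[str, str] = {}
    for key, value in pairs:
        if key in result:
            raise ValueError(f"duplicate key: {key!r}")
        result[key] = value
    return result


if __name__ == "__main__":
    original = {"sample": "NA12878", "platform": "ONT"}
    pairs = coerce_map_to_array_of_pairs(original)
    assert coerce_array_of_pairs_to_map(pairs) == original
```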

incorporate new tasks and optimize them
* [CountMethylCallReads, GatherReadsWithoutMethylCalls] from sh_beans
* [GetPileup, BamToRelevantPileup] from sh_more_atomic_qc
* [GetReadGroupLines, GetSortOrder, SplitNameSortedUbam] from sh_ont_fc
* [SamtoolsFlagStats, ParseFlagStatsJson] from sh_trvial_stats
* [FilterBamByLen, InferSampleName] from sh_seqkit
* [CountAlignmentRecords, StreamingBamErrored, CountAlignmentRecordsByFlag] from sh_maha_aln_metrics
* [ResetSamplename] from sh_ingest_singlerg
* [MergeBamsWithSamtools] from sh_ont_fc.Utils.wdl
* [BamToFastq] from sh_more_bam_qcs and optimize it with sh_ingest_singlerg.Utils.wdl

delete
* GetSortOrder as that's now implemented in GatherBamMetadata
* Drop2304Alignments as that's no longer used

update dockers to the latest

CHERRY-PICK FROM VARIOUS QC/METRICS BRANCHES:
* collect information about ML/MM tags in a long-read BAM (sh_beans)
* a heuristic way to find peaks in a distribution (using dyst) (sh_dyst_peaker)
* filter reads by length in a BAM
* collect some read quality stats from (length-filtered) FASTQ/BAM (sh_seq_kit)
* VerifyBamID2 (for contamination estimation)
* naive sex-concordance check (sh_more_atomic_qc)
* check fingerprint of a single BAM file (sh_sample_fp)
* collect SAM flag stats (sh_trivial_stats)

* make BeanCounter finalization optional (wdl/pipelines/TechAgnostic/Utility/CountTheBeans.wdl)
* custom struct for sub-workflow config using a JSON (wdl/pipelines/TechAgnostic/Utility/LongReadsContaminationEstimation.wdl)
* make fingerprint checking subworkflow control size filtering (wdl/tasks/QC/FPCheckAoU.wdl) (wdl/pipelines/TechAgnostic/Utility/VerifyBamFingerprint.wdl)
* fix a warning by IDE/miniwdl complaining WDL stdlib function length only applies to Array (wdl/tasks/Utility/BAMutils.wdl)
* various updates to Finalize (wdl/tasks/Utility/Finalize.wdl)

New tasks in (wdl/tasks/Utility/GeneralUtils.wdl) to
* correctly convert Map to TSV
* concatenate files
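The two new GeneralUtils.wdl tasks are only named here, so the following Python sketch illustrates the general idea rather than the tasks themselves. The function names `map_to_tsv` and `concatenate_files` are hypothetical, and rejecting tabs or line breaks inside a field is one assumed interpretation of converting a map "correctly".

```python
import shutil
from typing import Dict, Iterable


def map_to_tsv(mapping: Dict[str, str]) -> str:
    """Serialize a map as two-column TSV, one key/value pair per line.

    Assumption: fields containing a tab or a line break are rejected,
    because they would corrupt the column structure.
    """
    lines = []
    for key, value in mapping.items():
        for field in (key, value):
            if any(ch in field for ch in "\t\n\r"):
                raise ValueError(f"field not representable in TSV: {field!r}")
        lines.append(f"{key}\t{value}\n")
    return "".join(lines)


def concatenate_files(sources: Iterable[str], destination: str) -> None:
    """Append the bytes of each source file to destination, in order."""
    with open(destination, "wb") as out:
        for source in sources:
            with open(source, "rb") as handle:
                shutil.copyfileobj(handle, out)


if __name__ == "__main__":
    print(map_to_tsv({"sample": "NA12878", "platform": "ONT"}), end="")
```

Every line, including the last, ends with a newline so that concatenated TSV fragments do not run together.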

… for efficiency
- Remove unnecessary BED file input from LongReadsContaminationEstimation workflow as BED paths are now hardcoded in the Docker image.
- Modify the inputs and commands in Contamination.wdl to align with new Docker setup and work with the .mu, .UD, and .bed files from the docker.
- Adjust workflow parameters to better reflect current data processing requirements and practices.

SHuang-Broad and others added 30 commits November 8, 2023 12:08

UPDATES TO THE Hifiasm pipeline:

e09dc65

* update Hifiasm to version 0.19.5
* update how Hifiasm outputs are compressed (bgz replacing gz), also
* monitor hifiasm resources usage

For both CCS/ONT, update PBSV

d3afc4b

* update docker used in PBSV tasks to the version coming with official SMRTLink releases (2.9.0)
* change how the 2-step PBSV process is done (following the recommended way now)

For both CCS/ONT, update Sniffles-2

3a2ac5c

* to version 2.0.7
* using TRF bed
* conditionally phase sv (requires phased bam)
* generates its own vcf.gz and tbi

CRITICAL: updating pbindex to that from the SMRTLink 12 release

29aa964

New docker that's intended to replace lr-basic:

1ff0912

* incorporates gcloud cli (not just gsutil)
* integrate libdeflate for more speedups

New docker that offers a new mode to remove duplicates from a queryname-sorted ONT bam

4224ab4

New docker that updates the samtools in GATK to the latest (1.18)

a13a541

update some legacy code due to new interfaces in updated BAMutils.wdl

b16c619

New workflow to save files (avoid multiple calls to Finalize)

cbfed4e

New workflow to unify QC and metrics collection on aligned BAM

8039c88

Update grouping of pipelines in .dockstore.yml

b16155f

update WDL validation bash script

5bf0b2f

DEPRECATION

0d61935

* AlignAndCheckFingerprintCCS.wdl
* CollectPacBioAlignedMetrics.wdl
* CollectSMRTCellUnalignedMetrics.wdl

Fix bug in deduplicating aligned ONT BAM

989ead1

(CHERRY-PICK & follow up to PR 406)

DEPRECATION:

b8e67ff

* SampleLevelAlignedMetrics.wdl
* PBCLRWholeGenome.wdl

Update utility code:

e5a79c2

* new struct in AlignedBamQCandMetrics.wdl to facilitate as-sub-workflow calling
* change parameters name for fingerprint workflows

a few tweaks to AlignedBamQCandMetrics:

b5dc978

* make saving of reads without methylation SAM tags optional
* better parameter naming

correct two mistakes: 1. a typo; 2. should append not replace

1abcc7a

make read-length util tasks pre-emptible

da5bfb0

make fingerprint reads extraction more efficient/resilient

7975d6d

fix a stupid logical typo error (in saving files)

5620d74

option to force run fingerprintcheck on small bams

e910bac

make pbindex task more frugal

04b82f8

Safer and more efficient way to do targeted pileup conversion

a03ff34

(affects contamination estimation)

shadizaheri added 2 commits February 29, 2024 16:53

Update Contamination.wdl

7db20f6

Removing SVDPrefix from the command line.

Update LongReadsContaminationEstimation.wdl

784081a
