
bamQC

QC metrics for BAM files

Overview

bamQC merges and filters input BAM files, optionally downsamples them, marks duplicates with Picard, computes depth of coverage with mosdepth, and collates the resulting metrics and the supplied metadata into a single JSON output.

Dependencies

samtools 1.14
picard 2.21.2
gatk 4.1.6.0
python 3.6
bam-qc-metrics 0.2.5
mosdepth 0.2.9

Usage

Cromwell

java -jar cromwell.jar run bamQC.wdl --inputs inputs.json

Inputs

Required workflow parameters:

Parameter Value Description
inputGroups Array[InputGroup] Array of objects describing sets of bams to merge together and on which to compute QC metrics
metadata Map[String,String] JSON file containing metadata
mode String Running mode for the workflow; the only allowed values are 'lane_level' and 'call_ready'
bamQCMetrics.refFasta String Path to human genome FASTA reference
bamQCMetrics.refSizesBed String Path to human genome BED reference with chromosome sizes
bamQCMetrics.workflowVersion String Workflow version string
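
For reference, a minimal sketch of an inputs.json covering the required parameters, written as Python that emits the file for the Cromwell command above. All paths, the workflow version, and the InputGroup placeholder are hypothetical; the exact fields of an InputGroup object are defined in bamQC.wdl and should be checked there.

  import json

  # Hypothetical required inputs; the "bamQC." prefix follows Cromwell's
  # <workflow>.<parameter> naming convention.
  inputs = {
      "bamQC.inputGroups": [
          # one entry per set of BAMs to merge; the field names are
          # defined in bamQC.wdl (placeholder shown here)
          {"...": "..."},
      ],
      "bamQC.metadata": {"sample": "EXAMPLE"},  # hypothetical metadata map
      "bamQC.mode": "lane_level",               # or "call_ready"
      "bamQC.bamQCMetrics.refFasta": "/path/to/reference.fa",
      "bamQC.bamQCMetrics.refSizesBed": "/path/to/chromosome_sizes.bed",
      "bamQC.bamQCMetrics.workflowVersion": "x.y.z",
  }

  with open("inputs.json", "w") as out:
      json.dump(inputs, out, indent=2)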

Optional workflow parameters:

Parameter Value Default Description
outputFileNamePrefix String "bamQC" Prefix for output files
intervalsToParallelizeBy String "chr1,chr2,chr3,chr4,chr5,chr6,chr7,chr8,chr9,chr10,chr11,chr12,chr13,chr14,chr15,chr16,chr17,chr18,chr19,chr20,chr21,chr22,chrX,chrY,chrM" Comma-separated list of intervals to split by (e.g. chr1,chr2,chr3,chr4).

Optional task parameters:

Parameter Value Default Description
splitStringToArray.lineSeparator String "," Interval group separator - these are the intervals to split by.
splitStringToArray.recordSeparator String "+" Record separator - a delimiter for joining records
splitStringToArray.jobMemory Int 1 Memory allocated to job (in GB).
splitStringToArray.cores Int 1 The number of cores to allocate to the job.
splitStringToArray.timeout Int 1 Maximum amount of time (in hours) the task can run for.
splitStringToArray.modules String "" Environment module name and version to load (space separated) before command execution.
getChrCoefficient.memory Int 2 Memory allocated for this job
getChrCoefficient.timeout Int 1 Hours before task timeout
getChrCoefficient.modules String "samtools/1.14" Names and versions of modules to load
preFilter.filterFlags Int 260 Samtools filter flags to apply.
preFilter.minMapQuality Int? None Minimum alignment quality to pass filter
preFilter.filterAdditionalParams String? None Additional parameters to pass to samtools.
preFilter.modules String "samtools/1.14" required environment modules
preFilter.jobMemory Int 6 Memory allocated for this job
preFilter.minMemory Int 2 Minimum amount of RAM allocated to the task
preFilter.threads Int 4 Requested CPU threads
preFilter.timeout Int 4 hours before task timeout
mergeSplitByIntervalFiles.suffix String ".merge" suffix to use for merged bam
mergeSplitByIntervalFiles.jobMemory Int 24 Memory allocated to job (in GB).
mergeSplitByIntervalFiles.overhead Int 6 Java overhead memory (in GB). jobMemory - overhead == java Xmx/heap memory.
mergeSplitByIntervalFiles.cores Int 1 The number of cores to allocate to the job.
mergeSplitByIntervalFiles.timeout Int 24 Maximum amount of time (in hours) the task can run for.
mergeSplitByIntervalFiles.modules String "gatk/4.1.6.0" Environment module name and version to load (space separated) before command execution.
mergeFiles.suffix String ".merge" suffix to use for merged bam
mergeFiles.jobMemory Int 24 Memory allocated to job (in GB).
mergeFiles.overhead Int 6 Java overhead memory (in GB). jobMemory - overhead == java Xmx/heap memory.
mergeFiles.cores Int 1 The number of cores to allocate to the job.
mergeFiles.timeout Int 24 Maximum amount of time (in hours) the task can run for.
mergeFiles.modules String "gatk/4.1.6.0" Environment module name and version to load (space separated) before command execution.
filter.minQuality Int 30 Minimum alignment quality to pass filter
filter.modules String "samtools/1.14" required environment modules
filter.jobMemory Int 16 Memory allocated for this job
filter.threads Int 4 Requested CPU threads
filter.timeout Int 4 hours before task timeout
updateMetadata.modules String "python/3.6" required environment modules
updateMetadata.jobMemory Int 16 Memory allocated for this job
updateMetadata.threads Int 4 Requested CPU threads
updateMetadata.timeout Int 4 hours before task timeout
findDownsampleParams.targetReads Int 100000 Desired number of reads in downsampled output
findDownsampleParams.minReadsAbsolute Int 10000 Minimum value of targetReads to allow pre-downsampling
findDownsampleParams.minReadsRelative Int 2 Minimum value of (inputReads)/(targetReads) to allow pre-downsampling
findDownsampleParams.precision Int 8 Number of decimal places in fraction for pre-downsampling
findDownsampleParams.preDSMultiplier Float 1.5 Determines target size for pre-downsampled set (if any). Must have (preDSMultiplier) < (minReadsRelative).
findDownsampleParams.modules String "python/3.6" required environment modules
findDownsampleParams.jobMemory Int 16 Memory allocated for this job
findDownsampleParams.threads Int 4 Requested CPU threads
findDownsampleParams.timeout Int 4 hours before task timeout
findDownsampleParamsMarkDup.threshold Int 10000000 Minimum number of reads to conduct downsampling
findDownsampleParamsMarkDup.chromosomes Array[String] ["chr12", "chr13", "chrXII", "chrXIII"] Array of chromosome identifiers for downsampled subset
findDownsampleParamsMarkDup.baseInterval Int 15000 Base width of interval in each chromosome, for very large BAMs
findDownsampleParamsMarkDup.intervalStart Int 100000 Start of interval in each chromosome, for very large BAMs
findDownsampleParamsMarkDup.customRegions String "" Custom downsample regions; overrides chromosome and interval parameters
findDownsampleParamsMarkDup.modules String "python/3.6" required environment modules
findDownsampleParamsMarkDup.jobMemory Int 16 Memory allocated for this job
findDownsampleParamsMarkDup.threads Int 4 Requested CPU threads
findDownsampleParamsMarkDup.timeout Int 4 hours before task timeout
downsample.downsampleSuffix String "downsampled.bam" Suffix for output file
downsample.randomSeed Int 42 Random seed for pre-downsampling (if any)
downsample.modules String "samtools/1.14" required environment modules
downsample.jobMemory Int 16 Memory allocated for this job
downsample.threads Int 4 Requested CPU threads
downsample.timeout Int 4 hours before task timeout
downsampleRegion.modules String "samtools/1.14" required environment modules
downsampleRegion.jobMemory Int 16 Memory allocated for this job
downsampleRegion.threads Int 4 Requested CPU threads
downsampleRegion.timeout Int 4 hours before task timeout
markDuplicates.opticalDuplicatePixelDistance Int 100 Maximum offset between optical duplicate clusters
markDuplicates.picardMaxMemMb Int 6000 Memory requirement in MB for running Picard JAR
markDuplicates.modules String "picard/2.21.2" required environment modules
markDuplicates.jobMemory Int 16 Memory allocated for this job
markDuplicates.threads Int 4 Requested CPU threads
markDuplicates.timeout Int 4 hours before task timeout
bamQCMetrics.normalInsertMax Int 1500 Maximum of expected insert size range
bamQCMetrics.modules String "bam-qc-metrics/0.2.5" required environment modules
bamQCMetrics.jobMemory Int 16 Memory allocated for this job
bamQCMetrics.threads Int 4 Requested CPU threads
bamQCMetrics.timeout Int 4 hours before task timeout
runMosdepth.modules String "mosdepth/0.2.9" required environment modules
runMosdepth.jobMemory Int 16 Memory allocated for this job
runMosdepth.threads Int 4 Requested CPU threads
runMosdepth.timeout Int 4 hours before task timeout
cumulativeDistToHistogram.modules String "python/3.6" required environment modules
cumulativeDistToHistogram.jobMemory Int 8 Memory allocated for this job
cumulativeDistToHistogram.threads Int 4 Requested CPU threads
cumulativeDistToHistogram.timeout Int 1 hours before task timeout
collateResults.modules String "python/3.6" required environment modules
collateResults.jobMemory Int 8 Memory allocated for this job
collateResults.threads Int 4 Requested CPU threads
collateResults.timeout Int 1 hours before task timeout

Outputs

Output Type Description
result File json file that contains metrics and meta data described in https://github.com/oicr-gsi/bam-qc-metrics/blob/master/metrics.md

This section lists command(s) run by the bamQC workflow

Run getChrCoefficient

 
  # length of the target chromosome, from the @SQ lines of the BAM header
  CHROM_LEN=$(samtools view -H ~{bamFile} | grep ^@SQ | grep -v _ | grep -w ~{chromosome} | cut -f 3 | sed 's/LN://')
  # length of the largest chromosome
  LARGEST=$(samtools view -H ~{bamFile} | grep ^@SQ | grep -v _ | cut -f 3 | sed 's/LN://' | sort -n | tail -n 1)
  # ratio of the two, nudged up by 0.1 and truncated to one decimal place
  echo | awk -v chrom_len=$CHROM_LEN -v largest=$LARGEST '{print int((chrom_len/largest + 0.1) * 10)/10}'
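
The awk step computes (chromosome length)/(largest chromosome length), adds 0.1, and truncates to one decimal place. A Python equivalent of the arithmetic, for clarity:

  # Python equivalent of the awk expression above; the +0.1 offset means
  # even the smallest chromosome gets a non-zero coefficient.
  def chr_coefficient(chrom_len: int, largest: int) -> float:
      return int((chrom_len / largest + 0.1) * 10) / 10

  print(chr_coefficient(248_956_422, 248_956_422))  # 1.1 for the largest chromosome
  print(chr_coefficient(100, 250))                  # 0.5 for one at 40% of the largest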
 

Run bamQCMetrics

  run_bam_qc.py \
   -b ~{bamFile} \
   -d ~{markDuplicates} \
   --debug \
   -i ~{normalInsertMax} \
   -o ~{resultName} \
   -r ~{refFasta} \
   -t ~{refSizesBed} \
   -T . \
   -w ~{workflowVersion} \
   ~{dsInput}

Run collateResults

  python3 <<CODE
  import json
  data = json.loads(open("~{bamQCMetricsResult}").read())
  histogram = json.loads(open("~{histogram}").read())
  data["coverage histogram"] = histogram
  metadata = json.loads(open("~{metadata}").read())
  for key in metadata.keys():
      data[key] = metadata[key]
  out = open("~{outputFileName}", "w")
  json.dump(data, out, sort_keys=True)
  out.close()
  CODE

Run cumulativeDistToHistogram

  python3 <<CODE
  import csv, json
  summary = open("~{summary}").readlines()
  globalDist = open("~{globalDist}").readlines()
  # read chromosome lengths from the summary
  summaryReader = csv.reader(summary, delimiter="\t")
  lengthByChr = {}
  for row in summaryReader:
      if row[0] == 'chrom' or row[0] == 'total':
          continue  # skip initial header row, and final total row
      lengthByChr[row[0]] = int(row[1])
  chromosomes = sorted(lengthByChr.keys())
  # read the cumulative distribution for each chromosome
  globalReader = csv.reader(globalDist, delimiter="\t")
  cumDist = {}
  for k in chromosomes:
      cumDist[k] = {}
  for row in globalReader:
      if row[0] == "total":
          continue
      cumDist[row[0]][int(row[1])] = float(row[2])
  # convert the cumulative distributions to non-cumulative and populate histogram
  # if the input BAM is empty, chromosomes and histogram will also be empty
  histogram = {}
  for k in chromosomes:
      depths = sorted(cumDist[k].keys())
      dist = {}
      for i in range(len(depths)-1):
          depth = depths[i]
          nextDepth = depths[i+1]
          dist[depth] = cumDist[k][depth] - cumDist[k][nextDepth]
      maxDepth = max(depths)
      dist[maxDepth] = cumDist[k][maxDepth]
      # now find the number of loci at each depth of coverage to construct the histogram
      for depth in depths:
          loci = int(round(dist[depth]*lengthByChr[k], 0))
          histogram[depth] = histogram.get(depth, 0) + loci
  # if histogram is non-empty, fill in zero values for missing depths
  for i in range(max(histogram.keys(), default=0)):
      if i not in histogram:
          histogram[i] = 0
  out = open("~{outFileName}", "w")
  json.dump(histogram, out, sort_keys=True)
  out.close()
  CODE
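
The summary and globalDist inputs here are mosdepth outputs (see runMosdepth below); each globalDist row gives the fraction of bases covered at depth >= d. A toy check of the cumulative-to-histogram conversion, with hypothetical values:

  # cumDist[d] = fraction of a 100 bp chromosome covered at depth >= d
  cumDist = {0: 1.0, 1: 0.6, 2: 0.2}  # hypothetical values
  length = 100
  depths = sorted(cumDist)
  dist = {depths[i]: cumDist[depths[i]] - cumDist[depths[i + 1]]
          for i in range(len(depths) - 1)}
  dist[depths[-1]] = cumDist[depths[-1]]  # deepest bin is already non-cumulative
  histogram = {d: round(f * length) for d, f in dist.items()}
  print(histogram)  # {0: 40, 1: 40, 2: 20} loci at each exact depth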

Run downsample

  set -e
  set -o pipefail
  samtools view -b -h ~{bamFile} | \
  ~{preDownsampleCommand} ~{downsampleCommand} \
  samtools view -b > ~{resultName}

Run downsampleRegion

  set -e
  # ensure BAM file and index are symlinked to working directory
  ln -s ~{bamFile}
  ln -s ~{bamIndex}
  samtools view -b -h ~{bamFileName} ~{region} > ~{resultName}

Run filter

  set -e
  set -o pipefail
  samtools view -h -b -F 2304 -U ~{nonPrimaryReadsFile} ~{bamFile} | \
  samtools view -h -b -F 4 -U ~{unmappedReadsFile} | \
  samtools view -h -b -q ~{minQuality} -U ~{lowQualityReadsFile} \
  > ~{resultName}
  samtools view -c ~{bamFile} > ~{totalInputReadsFile}
  samtools view -c ~{nonPrimaryReadsFile} > ~{totalNonPrimaryReadsFile}
  samtools view -c ~{unmappedReadsFile} > ~{totalUnmappedReadsFile}
  samtools view -c ~{lowQualityReadsFile} > ~{totalLowQualityReadsFile}
  # index the filtered result; samtools index takes the index path as a second argument
  samtools index ~{resultName} ~{resultIndexName}

Run findDownsampleParams

  python3 <<CODE
  import json, math, sys
  readsIn = ~{inputReads}
  readsTarget = ~{targetReads}
  precision = ~{precision}
  print("Input reads param =", readsIn, file=sys.stderr)
  print("Target reads param =", readsTarget, file=sys.stderr)
  minReadsAbsolute = ~{minReadsAbsolute}
  minReadsRelative = ~{minReadsRelative}
  preDownsampleMultiplier = ~{preDSMultiplier}
  if readsIn <= readsTarget:
    # absolutely no downsampling
    applyPreDownsample = False
    applyDownsample = False
    preDownsampleTarget = "no_pre_downsample"
    downSampleTarget = "no_downsample"
  elif readsIn < readsTarget * minReadsRelative or readsTarget < minReadsAbsolute:
    # no predownsampling
    applyPreDownsample = False
    applyDownsample = True
    preDownsampleTarget = "no_pre_downsample"
    downSampleTarget = str(readsTarget)
  else:
    # predownsampling and downsampling
    applyPreDownsample = True
    applyDownsample = True
    probability = (readsTarget * preDownsampleMultiplier)/readsIn
    formatString = "{:0"+str(precision)+"d}"
    preDownsampleTarget = formatString.format(int(math.floor(probability * 10**precision)))
    downSampleTarget = str(readsTarget)
  status = {
    "pre_ds": applyPreDownsample,
    "ds": applyDownsample
  }
  targets = {
    "pre_ds": preDownsampleTarget,
    "ds": downSampleTarget
  }
  statusFile = open("~{statusFile}", "w")
  json.dump(status, statusFile)
  statusFile.close()
  targetFile = open("~{targetsFile}", "w")
  json.dump(targets, targetFile)
  targetFile.close()
  CODE
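
With the default parameters (targetReads=100000, minReadsAbsolute=10000, minReadsRelative=2, preDSMultiplier=1.5), the three branches behave as in this sketch with hypothetical read counts:

  # Same branch logic as above, with defaults from the parameter tables.
  def plan(readsIn, readsTarget=100000, minReadsAbsolute=10000, minReadsRelative=2):
      if readsIn <= readsTarget:
          return (False, False)  # no downsampling at all
      if readsIn < readsTarget * minReadsRelative or readsTarget < minReadsAbsolute:
          return (False, True)   # downsample only; too close to target to pre-downsample
      return (True, True)        # pre-downsample to ~1.5x the target, then downsample

  print(plan(50_000))      # (False, False)
  print(plan(150_000))     # (False, True)
  print(plan(10_000_000))  # (True, True)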

Run findDownsampleParamsMarkDup

  python3 <<CODE
  readsIn = ~{inputReads}
  threshold = ~{threshold}
  interval = ~{baseInterval}
  start = ~{intervalStart} + 1 # start of sub-chromosome window, if needed; exclude telomeres
  chromosomes = [line.strip() for line in open("~{chromosomesText}").readlines()]
  customRegions = "~{customRegions}" # overrides other chromosome/interval parameters
  ds = True # True if downsampling, false otherwise
  end = None # end of window, if needed
  if readsIn <= threshold:
      ds = False # no downsampling
  elif readsIn <= threshold*10:
      pass # default to chr12 & chr13 =~ 8% of genome
  elif readsIn <= threshold*10**2:
      end = start + interval*10**3 - 1 # default 2*15 million base window ~ 1% of genome
  elif readsIn <= threshold*10**3:
      end = start + interval*10**2 - 1
  elif readsIn <= threshold*10**4:
      end = start + interval*10 - 1
  else:
      end = start + interval - 1
  if ds:
      status = "true"
      if customRegions != "":
          region = customRegions
      elif end is None:
          region = " ".join(chromosomes)
      else:
          regions = ["%s:%i-%i" % (chromosome, start, end) for chromosome in chromosomes]
          region = " ".join(regions)
  else:
      status = "false"
      region = ""
  outStatus = open("~{outputStatus}", "w")
  print(status, file=outStatus)
  outStatus.close()
  outRegion = open("~{outputRegion}", "w")
  print(region, file=outRegion)
  outRegion.close()
  CODE
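
With the defaults (threshold=10000000, baseInterval=15000, intervalStart=100000), each additional order of magnitude of input reads shrinks the per-chromosome window by a factor of 10. A worked example with a hypothetical read count:

  readsIn = 5 * 10**9  # hypothetical; between threshold*10**2 and threshold*10**3
  threshold, interval = 10**7, 15000
  start = 100000 + 1
  end = start + interval * 10**2 - 1  # the branch selected for this readsIn
  print("chr12:%i-%i" % (start, end))  # chr12:100001-1600000, a 1.5 Mb window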

Run markDuplicates

	java -Xmx~{picardMaxMemMb}M \
	-jar ${PICARD_ROOT}/picard.jar \
	MarkDuplicates \
	INPUT=~{bamFile} \
	OUTPUT=~{outFileBam} \
	VALIDATION_STRINGENCY=SILENT \
	TMP_DIR=${PWD} \
	METRICS_FILE=~{outFileText} \
	OPTICAL_DUPLICATE_PIXEL_DISTANCE=~{opticalDuplicatePixelDistance}

Run mergeFiles

      set -euo pipefail
  
      gatk --java-options "-Xmx~{jobMemory - overhead}G" MergeSamFiles \
      ~{sep=" " prefix("--INPUT=", bams)} \
      --OUTPUT="~{outputFileName}~{suffix}.bam" \
      --CREATE_INDEX=true \
      --SORT_ORDER=coordinate \
      --ASSUME_SORTED=false \
      --USE_THREADING=true \
      --VALIDATION_STRINGENCY=SILENT 
  

Run preFilter

     set -e
     set -o pipefail
     samtools view -b \
          -F ~{filterFlags} \
          ~{"-q " + minMapQuality} \
          ~{filterAdditionalParams} \
          ~{bamFile} \
          ~{sep=" " intervals} > ~{resultName}
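
The default filterFlags value of 260 here, like the -F 2304 used by the filter task above, is a sum of standard SAM flag bits; samtools view -F excludes any read with one of those bits set. A quick decode:

  # Standard SAM flag bits relevant to the two filtering tasks.
  SAM_FLAGS = {4: "read unmapped", 256: "not primary alignment",
               2048: "supplementary alignment"}
  for flags in (260, 2304):  # preFilter default; filter's -F value
      bits = [name for bit, name in SAM_FLAGS.items() if flags & bit]
      print(flags, "->", ", ".join(bits))
  # 260 -> read unmapped, not primary alignment
  # 2304 -> not primary alignment, supplementary alignment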

Run runMosdepth

  set -eo pipefail
  # ensure BAM file and index are symlinked to working directory
  ln -s ~{bamFile}
  ln -s ~{bamIndex}
  # run mosdepth
  MOSDEPTH_PRECISION=8 mosdepth -x -n -t 3 bamqc ~{bamFileName}

Run splitStringToArray

  set -euo pipefail

  echo "~{str}" | tr '~{lineSeparator}' '\n' | tr '~{recordSeparator}' '\t'
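
Expressed in Python for clarity: with the defaults lineSeparator="," and recordSeparator="+", each comma-separated entry becomes its own interval group, and "+" joins several intervals into one group (an inference from the separator descriptions in the table above; the input string is hypothetical):

  s = "chr1,chr2+chr3"  # hypothetical intervalsToParallelizeBy value
  groups = [group.split("+") for group in s.split(",")]
  print(groups)  # [['chr1'], ['chr2', 'chr3']]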

Run updateMetadata

  python3 <<CODE
  import json
  metadata = json.loads(open("~{metadataJson}").read())
  metadata["total input reads meta"] = ~{totalInputReads}
  metadata["non-primary reads meta"] = ~{nonPrimaryReads}
  metadata["unmapped reads meta"] = ~{unmappedReads}
  metadata["low-quality reads meta"] = ~{lowQualityReads}
  outFile = open("~{outFileName}", "w")
  json.dump(metadata, outFile)
  outFile.close()
  CODE

Support

For support, please file an issue on the GitHub project or send an email to gsi@oicr.on.ca.

Generated with generate-markdown-readme (https://github.com/oicr-gsi/gsi-wdl-tools/)