QC metrics for BAM files
```
java -jar cromwell.jar run bamQC.wdl --inputs inputs.json
```
Parameter | Value | Description |
---|---|---|
inputGroups | Array[InputGroup] | Array of objects describing sets of BAMs to merge together and on which to compute QC metrics |
metadata | Map[String,String] | JSON file containing metadata |
mode | String | Running mode for the workflow; the only allowed values are 'lane_level' and 'call_ready' |
bamQCMetrics.refFasta | String | Path to the human genome FASTA reference |
bamQCMetrics.refSizesBed | String | Path to the human genome BED reference with chromosome sizes |
bamQCMetrics.workflowVersion | String | Workflow version string |
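For reference, a minimal inputs.json matching the required parameters above might look like the sketch below (assuming the workflow is imported as bamQC). The inner fields of each InputGroup, and all paths and values, are illustrative assumptions; check the WDL source for the exact member names.

```json
{
  "bamQC.inputGroups": [
    {
      "outputIdentifier": "SAMPLE_1",
      "bam": "/path/to/sample1.bam",
      "bamIndex": "/path/to/sample1.bam.bai"
    }
  ],
  "bamQC.metadata": {
    "sample": "SAMPLE_1"
  },
  "bamQC.mode": "lane_level",
  "bamQC.bamQCMetrics.refFasta": "/path/to/genome.fa",
  "bamQC.bamQCMetrics.refSizesBed": "/path/to/genome_chrom_sizes.bed",
  "bamQC.bamQCMetrics.workflowVersion": "1.0"
}
```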
Parameter | Value | Default | Description |
---|---|---|---|
outputFileNamePrefix | String | "bamQC" | Prefix for output files |
intervalsToParallelizeByString | String | "chr1,chr2,chr3,chr4,chr5,chr6,chr7,chr8,chr9,chr10,chr11,chr12,chr13,chr14,chr15,chr16,chr17,chr18,chr19,chr20,chr21,chr22,chrX,chrY,chrM" | Comma-separated list of intervals to split by (e.g. chr1,chr2,chr3,chr4). |
Parameter | Value | Default | Description |
---|---|---|---|
splitStringToArray.lineSeparator | String | "," | Interval group separator - these are the intervals to split by. |
splitStringToArray.recordSeparator | String | "+" | Record separator - a delimiter for joining records. |
splitStringToArray.jobMemory | Int | 1 | Memory allocated to the job (in GB). |
splitStringToArray.cores | Int | 1 | Number of cores to allocate to the job. |
splitStringToArray.timeout | Int | 1 | Maximum amount of time (in hours) the task can run for. |
splitStringToArray.modules | String | "" | Environment module name and version to load (space separated) before command execution. |
getChrCoefficient.memory | Int | 2 | Memory allocated for this job (in GB). |
getChrCoefficient.timeout | Int | 1 | Hours before task timeout. |
getChrCoefficient.modules | String | "samtools/1.14" | Names and versions of modules to load. |
preFilter.filterFlags | Int | 260 | Samtools filter flags to apply. |
preFilter.minMapQuality | Int? | None | Minimum alignment quality to pass the filter. |
preFilter.filterAdditionalParams | String? | None | Additional parameters to pass to samtools. |
preFilter.modules | String | "samtools/1.14" | Required environment modules. |
preFilter.jobMemory | Int | 6 | Memory allocated for this job (in GB). |
preFilter.minMemory | Int | 2 | Minimum amount of RAM allocated to the task (in GB). |
preFilter.threads | Int | 4 | Requested CPU threads. |
preFilter.timeout | Int | 4 | Hours before task timeout. |
mergeSplitByIntervalFiles.suffix | String | ".merge" | Suffix to use for the merged BAM. |
mergeSplitByIntervalFiles.jobMemory | Int | 24 | Memory allocated to the job (in GB). |
mergeSplitByIntervalFiles.overhead | Int | 6 | Java overhead memory (in GB); jobMemory - overhead == Java Xmx/heap memory. |
mergeSplitByIntervalFiles.cores | Int | 1 | Number of cores to allocate to the job. |
mergeSplitByIntervalFiles.timeout | Int | 24 | Maximum amount of time (in hours) the task can run for. |
mergeSplitByIntervalFiles.modules | String | "gatk/4.1.6.0" | Environment module name and version to load (space separated) before command execution. |
mergeFiles.suffix | String | ".merge" | Suffix to use for the merged BAM. |
mergeFiles.jobMemory | Int | 24 | Memory allocated to the job (in GB). |
mergeFiles.overhead | Int | 6 | Java overhead memory (in GB); jobMemory - overhead == Java Xmx/heap memory. |
mergeFiles.cores | Int | 1 | Number of cores to allocate to the job. |
mergeFiles.timeout | Int | 24 | Maximum amount of time (in hours) the task can run for. |
mergeFiles.modules | String | "gatk/4.1.6.0" | Environment module name and version to load (space separated) before command execution. |
filter.minQuality | Int | 30 | Minimum alignment quality to pass the filter. |
filter.modules | String | "samtools/1.14" | Required environment modules. |
filter.jobMemory | Int | 16 | Memory allocated for this job (in GB). |
filter.threads | Int | 4 | Requested CPU threads. |
filter.timeout | Int | 4 | Hours before task timeout. |
updateMetadata.modules | String | "python/3.6" | Required environment modules. |
updateMetadata.jobMemory | Int | 16 | Memory allocated for this job (in GB). |
updateMetadata.threads | Int | 4 | Requested CPU threads. |
updateMetadata.timeout | Int | 4 | Hours before task timeout. |
findDownsampleParams.targetReads | Int | 100000 | Desired number of reads in the downsampled output. |
findDownsampleParams.minReadsAbsolute | Int | 10000 | Minimum value of targetReads to allow pre-downsampling. |
findDownsampleParams.minReadsRelative | Int | 2 | Minimum value of (inputReads)/(targetReads) to allow pre-downsampling. |
findDownsampleParams.precision | Int | 8 | Number of decimal places in the fraction for pre-downsampling. |
findDownsampleParams.preDSMultiplier | Float | 1.5 | Determines the target size for the pre-downsampled set (if any). Must have (preDSMultiplier) < (minReadsRelative). |
findDownsampleParams.modules | String | "python/3.6" | Required environment modules. |
findDownsampleParams.jobMemory | Int | 16 | Memory allocated for this job (in GB). |
findDownsampleParams.threads | Int | 4 | Requested CPU threads. |
findDownsampleParams.timeout | Int | 4 | Hours before task timeout. |
findDownsampleParamsMarkDup.threshold | Int | 10000000 | Minimum number of reads required to trigger downsampling. |
findDownsampleParamsMarkDup.chromosomes | Array[String] | ["chr12", "chr13", "chrXII", "chrXIII"] | Array of chromosome identifiers for the downsampled subset. |
findDownsampleParamsMarkDup.baseInterval | Int | 15000 | Base width of the interval in each chromosome, for very large BAMs. |
findDownsampleParamsMarkDup.intervalStart | Int | 100000 | Start of the interval in each chromosome, for very large BAMs. |
findDownsampleParamsMarkDup.customRegions | String | "" | Custom downsample regions; overrides the chromosome and interval parameters. |
findDownsampleParamsMarkDup.modules | String | "python/3.6" | Required environment modules. |
findDownsampleParamsMarkDup.jobMemory | Int | 16 | Memory allocated for this job (in GB). |
findDownsampleParamsMarkDup.threads | Int | 4 | Requested CPU threads. |
findDownsampleParamsMarkDup.timeout | Int | 4 | Hours before task timeout. |
downsample.downsampleSuffix | String | "downsampled.bam" | Suffix for the output file. |
downsample.randomSeed | Int | 42 | Random seed for pre-downsampling (if any). |
downsample.modules | String | "samtools/1.14" | Required environment modules. |
downsample.jobMemory | Int | 16 | Memory allocated for this job (in GB). |
downsample.threads | Int | 4 | Requested CPU threads. |
downsample.timeout | Int | 4 | Hours before task timeout. |
downsampleRegion.modules | String | "samtools/1.14" | Required environment modules. |
downsampleRegion.jobMemory | Int | 16 | Memory allocated for this job (in GB). |
downsampleRegion.threads | Int | 4 | Requested CPU threads. |
downsampleRegion.timeout | Int | 4 | Hours before task timeout. |
markDuplicates.opticalDuplicatePixelDistance | Int | 100 | Maximum offset between optical duplicate clusters. |
markDuplicates.picardMaxMemMb | Int | 6000 | Memory requirement (in MB) for running the Picard JAR. |
markDuplicates.modules | String | "picard/2.21.2" | Required environment modules. |
markDuplicates.jobMemory | Int | 16 | Memory allocated for this job (in GB). |
markDuplicates.threads | Int | 4 | Requested CPU threads. |
markDuplicates.timeout | Int | 4 | Hours before task timeout. |
bamQCMetrics.normalInsertMax | Int | 1500 | Maximum of the expected insert size range. |
bamQCMetrics.modules | String | "bam-qc-metrics/0.2.5" | Required environment modules. |
bamQCMetrics.jobMemory | Int | 16 | Memory allocated for this job (in GB). |
bamQCMetrics.threads | Int | 4 | Requested CPU threads. |
bamQCMetrics.timeout | Int | 4 | Hours before task timeout. |
runMosdepth.modules | String | "mosdepth/0.2.9" | Required environment modules. |
runMosdepth.jobMemory | Int | 16 | Memory allocated for this job (in GB). |
runMosdepth.threads | Int | 4 | Requested CPU threads. |
runMosdepth.timeout | Int | 4 | Hours before task timeout. |
cumulativeDistToHistogram.modules | String | "python/3.6" | Required environment modules. |
cumulativeDistToHistogram.jobMemory | Int | 8 | Memory allocated for this job (in GB). |
cumulativeDistToHistogram.threads | Int | 4 | Requested CPU threads. |
cumulativeDistToHistogram.timeout | Int | 1 | Hours before task timeout. |
collateResults.modules | String | "python/3.6" | Required environment modules. |
collateResults.jobMemory | Int | 8 | Memory allocated for this job (in GB). |
collateResults.threads | Int | 4 | Requested CPU threads. |
collateResults.timeout | Int | 1 | Hours before task timeout. |
Output | Type | Description |
---|---|---|
result | File | JSON file containing the metrics and metadata described in https://github.com/oicr-gsi/bam-qc-metrics/blob/master/metrics.md |
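As a minimal sketch of consuming the result, the snippet below loads the JSON and reads the "coverage histogram" key that the collation step (shown below) adds; the output file name here is hypothetical and follows outputFileNamePrefix in practice.

```python
import json

# Hypothetical output name; the actual name is derived from outputFileNamePrefix.
with open("bamQC.json") as f:
    data = json.load(f)

# "coverage histogram" is inserted by the cumulativeDistToHistogram and
# collateResults steps below; JSON object keys are strings, hence int().
histogram = data["coverage histogram"]
print("deepest coverage bin:", max(int(depth) for depth in histogram))
```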
This section lists the commands run by the bamQC workflow.
getChrCoefficient:

```
# Chromosome length, read from the @SQ lines of the BAM header (LN: field)
CHROM_LEN=$(samtools view -H ~{bamFile} | grep ^@SQ | grep -v _ | grep -w ~{chromosome} | cut -f 3 | sed 's/LN://')
# Length of the largest chromosome in the header
LARGEST=$(samtools view -H ~{bamFile} | grep ^@SQ | grep -v _ | cut -f 3 | sed 's/LN://' | sort -n | tail -n 1)
# Ratio of the two, offset by 0.1 and truncated to one decimal place
echo | awk -v chrom_len=$CHROM_LEN -v largest=$LARGEST '{print int((chrom_len/largest + 0.1) * 10)/10}'
```
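For illustration, the same arithmetic in Python with hypothetical GRCh38 lengths: chr21 against chr1 (the largest contig) yields a coefficient of 0.2, presumably used to scale per-chromosome resource requests.

```python
# Hypothetical lengths: chr21 and chr1 (the largest contig) in GRCh38
chrom_len = 46_709_983
largest = 248_956_422

# Same formula as the awk expression above: ratio, +0.1 offset,
# truncated to one decimal place
coefficient = int((chrom_len / largest + 0.1) * 10) / 10
print(coefficient)  # 0.2
```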
bamQCMetrics:

```
run_bam_qc.py \
    -b ~{bamFile} \
    -d ~{markDuplicates} \
    --debug \
    -i ~{normalInsertMax} \
    -o ~{resultName} \
    -r ~{refFasta} \
    -t ~{refSizesBed} \
    -T . \
    -w ~{workflowVersion} \
    ~{dsInput}
```
collateResults:

```
python3 <<CODE
import json
data = json.loads(open("~{bamQCMetricsResult}").read())
histogram = json.loads(open("~{histogram}").read())
data["coverage histogram"] = histogram
metadata = json.loads(open("~{metadata}").read())
for key in metadata.keys():
    data[key] = metadata[key]
out = open("~{outputFileName}", "w")
json.dump(data, out, sort_keys=True)
out.close()
CODE
```
cumulativeDistToHistogram:

```
python3 <<CODE
import csv, json
summary = open("~{summary}").readlines()
globalDist = open("~{globalDist}").readlines()
# read chromosome lengths from the summary
summaryReader = csv.reader(summary, delimiter="\t")
lengthByChr = {}
for row in summaryReader:
    if row[0] == 'chrom' or row[0] == 'total':
        continue  # skip initial header row, and final total row
    lengthByChr[row[0]] = int(row[1])
chromosomes = sorted(lengthByChr.keys())
# read the cumulative distribution for each chromosome
globalReader = csv.reader(globalDist, delimiter="\t")
cumDist = {}
for k in chromosomes:
    cumDist[k] = {}
for row in globalReader:
    if row[0] == "total":
        continue
    cumDist[row[0]][int(row[1])] = float(row[2])
# convert the cumulative distributions to non-cumulative and populate histogram
# if the input BAM is empty, chromosomes and histogram will also be empty
histogram = {}
for k in chromosomes:
    depths = sorted(cumDist[k].keys())
    dist = {}
    for i in range(len(depths)-1):
        depth = depths[i]
        nextDepth = depths[i+1]
        dist[depth] = cumDist[k][depth] - cumDist[k][nextDepth]
    maxDepth = max(depths)
    dist[maxDepth] = cumDist[k][maxDepth]
    # now find the number of loci at each depth of coverage to construct the histogram
    for depth in depths:
        loci = int(round(dist[depth]*lengthByChr[k], 0))
        histogram[depth] = histogram.get(depth, 0) + loci
# if histogram is non-empty, fill in zero values for missing depths
for i in range(max(histogram.keys(), default=0)):
    if i not in histogram:
        histogram[i] = 0
out = open("~{outFileName}", "w")
json.dump(histogram, out, sort_keys=True)
out.close()
CODE
```
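A toy check of the conversion above, under the same convention (the cumulative value at depth d is the fraction of loci with coverage of at least d); all numbers are hypothetical:

```python
# Hypothetical cumulative distribution for one 100-base chromosome
cum = {0: 1.0, 1: 0.6, 2: 0.2}
length = 100

depths = sorted(cum)
dist = {}
for i in range(len(depths) - 1):
    # fraction at exactly this depth = cumulative here minus cumulative at the next depth
    dist[depths[i]] = cum[depths[i]] - cum[depths[i + 1]]
dist[depths[-1]] = cum[depths[-1]]  # the deepest bin keeps its cumulative value

histogram = {d: round(dist[d] * length) for d in depths}
print(histogram)  # {0: 40, 1: 40, 2: 20}
```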
downsample:

```
set -e
set -o pipefail
# preDownsampleCommand and downsampleCommand are optional pipeline stages,
# filled in from the findDownsampleParams outputs (empty when not applied)
samtools view -b -h ~{bamFile} | \
~{preDownsampleCommand} ~{downsampleCommand} \
samtools view -b > ~{resultName}
```
downsampleRegion:

```
set -e
# ensure BAM file and index are symlinked to working directory
ln -s ~{bamFile}
ln -s ~{bamIndex}
samtools view -b -h ~{bamFileName} ~{region} > ~{resultName}
```
filter:

```
set -e
set -o pipefail
# -F 2304 removes secondary (0x100) and supplementary (0x800) alignments,
# -F 4 removes unmapped reads, -q removes low-quality alignments;
# each -U file collects the reads removed at that stage
samtools view -h -b -F 2304 -U ~{nonPrimaryReadsFile} ~{bamFile} | \
samtools view -h -b -F 4 -U ~{unmappedReadsFile} | \
samtools view -h -b -q ~{minQuality} -U ~{lowQualityReadsFile} \
> ~{resultName}
# record read counts for the input and for each filtered-out category
samtools view -c ~{bamFile} > ~{totalInputReadsFile}
samtools view -c ~{nonPrimaryReadsFile} > ~{totalNonPrimaryReadsFile}
samtools view -c ~{unmappedReadsFile} > ~{totalUnmappedReadsFile}
samtools view -c ~{lowQualityReadsFile} > ~{totalLowQualityReadsFile}
samtools index ~{bamFile} ~{resultIndexName}
```
findDownsampleParams:

```
python3 <<CODE
import json, math, sys
readsIn = ~{inputReads}
readsTarget = ~{targetReads}
precision = ~{precision}
print("Input reads param =", readsIn, file=sys.stderr)
print("Target reads param =", readsTarget, file=sys.stderr)
minReadsAbsolute = ~{minReadsAbsolute}
minReadsRelative = ~{minReadsRelative}
preDownsampleMultiplier = ~{preDSMultiplier}
if readsIn <= readsTarget:
    # absolutely no downsampling
    applyPreDownsample = False
    applyDownsample = False
    preDownsampleTarget = "no_pre_downsample"
    downSampleTarget = "no_downsample"
elif readsIn < readsTarget * minReadsRelative or readsTarget < minReadsAbsolute:
    # no predownsampling
    applyPreDownsample = False
    applyDownsample = True
    preDownsampleTarget = "no_pre_downsample"
    downSampleTarget = str(readsTarget)
else:
    # predownsampling and downsampling
    applyPreDownsample = True
    applyDownsample = True
    probability = (readsTarget * preDownsampleMultiplier)/readsIn
    formatString = "{:0"+str(precision)+"d}"
    preDownsampleTarget = formatString.format(int(math.floor(probability * 10**precision)))
    downSampleTarget = str(readsTarget)
status = {
    "pre_ds": applyPreDownsample,
    "ds": applyDownsample
}
targets = {
    "pre_ds": preDownsampleTarget,
    "ds": downSampleTarget
}
statusFile = open("~{statusFile}", "w")
json.dump(status, statusFile)
statusFile.close()
targetFile = open("~{targetsFile}", "w")
json.dump(targets, targetFile)
targetFile.close()
CODE
```
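A worked example of the branching above with the default parameters: a hypothetical 2,000,000 input reads against the 100,000-read target passes both pre-downsampling guards, giving a pre-downsample fraction of 0.075 encoded as eight fraction digits (presumably consumed as the fractional part of a samtools subsampling argument).

```python
import math

readsIn, readsTarget = 2_000_000, 100_000        # hypothetical input count, default target
minReadsAbsolute, minReadsRelative = 10_000, 2   # defaults
preDSMultiplier, precision = 1.5, 8              # defaults

# readsIn >= readsTarget * minReadsRelative and readsTarget >= minReadsAbsolute,
# so both pre-downsampling and downsampling apply (the final 'else' branch above)
probability = (readsTarget * preDSMultiplier) / readsIn   # 0.075
preDownsampleTarget = "{:08d}".format(math.floor(probability * 10**precision))
print(preDownsampleTarget)  # "07500000", i.e. keep roughly 7.5% of reads
print(readsTarget)          # 100000, the final downsampling target
```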
findDownsampleParamsMarkDup:

```
python3 <<CODE
readsIn = ~{inputReads}
threshold = ~{threshold}
interval = ~{baseInterval}
start = ~{intervalStart} + 1  # start of sub-chromosome window, if needed; exclude telomeres
chromosomes = [line.strip() for line in open("~{chromosomesText}").readlines()]
customRegions = "~{customRegions}"  # overrides other chromosome/interval parameters
ds = True  # True if downsampling, False otherwise
end = None  # end of window, if needed
if readsIn <= threshold:
    ds = False  # no downsampling
elif readsIn <= threshold*10:
    pass  # default to chr12 & chr13 =~ 8% of genome
elif readsIn <= threshold*10**2:
    end = start + interval*10**3 - 1  # default 2*15 million base window =~ 1% of genome
elif readsIn <= threshold*10**3:
    end = start + interval*10**2 - 1
elif readsIn <= threshold*10**4:
    end = start + interval*10 - 1
else:
    end = start + interval - 1
if ds:
    status = "true"
    if customRegions != "":
        region = customRegions
    elif end is None:
        region = " ".join(chromosomes)
    else:
        regions = ["%s:%i-%i" % (chromosome, start, end) for chromosome in chromosomes]
        region = " ".join(regions)
else:
    status = "false"
    region = ""
outStatus = open("~{outputStatus}", "w")
print(status, file=outStatus)
outStatus.close()
outRegion = open("~{outputRegion}", "w")
print(region, file=outRegion)
outRegion.close()
CODE
```
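A worked example of the tiering above with the defaults (threshold 10,000,000; baseInterval 15,000; intervalStart 100,000): a hypothetical 500 million reads falls in the threshold*10 to threshold*10**2 tier, so each default chromosome contributes a 15-megabase window.

```python
readsIn = 500_000_000                     # hypothetical read count
threshold, interval = 10_000_000, 15_000  # defaults
start = 100_000 + 1                       # default start, skipping the telomere

# threshold*10 < readsIn <= threshold*10**2, so the window spans interval*10**3 bases
end = start + interval * 10**3 - 1
print(["%s:%i-%i" % (c, start, end) for c in ["chr12", "chr13"]])
# ['chr12:100001-15100000', 'chr13:100001-15100000']
```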
markDuplicates:

```
java -Xmx~{picardMaxMemMb}M \
    -jar ${PICARD_ROOT}/picard.jar \
    MarkDuplicates \
    INPUT=~{bamFile} \
    OUTPUT=~{outFileBam} \
    VALIDATION_STRINGENCY=SILENT \
    TMP_DIR=${PWD} \
    METRICS_FILE=~{outFileText} \
    OPTICAL_DUPLICATE_PIXEL_DISTANCE=~{opticalDuplicatePixelDistance}
```
mergeFiles / mergeSplitByIntervalFiles:

```
set -euo pipefail
gatk --java-options "-Xmx~{jobMemory - overhead}G" MergeSamFiles \
    ~{sep=" " prefix("--INPUT=", bams)} \
    --OUTPUT="~{outputFileName}~{suffix}.bam" \
    --CREATE_INDEX=true \
    --SORT_ORDER=coordinate \
    --ASSUME_SORTED=false \
    --USE_THREADING=true \
    --VALIDATION_STRINGENCY=SILENT
```
preFilter:

```
set -e
set -o pipefail
# with the default filterFlags of 260, -F removes unmapped (0x4) and
# secondary (0x100) alignments; -q applies the optional mapping-quality cutoff
samtools view -b \
    -F ~{filterFlags} \
    ~{"-q " + minMapQuality} \
    ~{filterAdditionalParams} \
    ~{bamFile} \
    ~{sep=" " intervals} > ~{resultName}
```
runMosdepth:

```
set -eo pipefail
# ensure BAM file and index are symlinked to working directory
ln -s ~{bamFile}
ln -s ~{bamIndex}
# run mosdepth: -x fast mode, -n no per-base output, -t 3 threads;
# the "bamqc" prefix yields bamqc.mosdepth.summary.txt and
# bamqc.mosdepth.global.dist.txt, consumed by cumulativeDistToHistogram
MOSDEPTH_PRECISION=8 mosdepth -x -n -t 3 bamqc ~{bamFileName}
```
splitStringToArray:

```
set -euo pipefail
# with the defaults, "chr1,chr2" becomes one line per interval group,
# and any "+"-joined records within a group become tab-separated
echo "~{str}" | tr '~{lineSeparator}' '\n' | tr '~{recordSeparator}' '\t'
```
updateMetadata:

```
python3 <<CODE
import json
metadata = json.loads(open("~{metadataJson}").read())
metadata["total input reads meta"] = ~{totalInputReads}
metadata["non-primary reads meta"] = ~{nonPrimaryReads}
metadata["unmapped reads meta"] = ~{unmappedReads}
metadata["low-quality reads meta"] = ~{lowQualityReads}
outFile = open("~{outFileName}", "w")
json.dump(metadata, outFile)
outFile.close()
CODE
```
For support, please file an issue on the GitHub project or send an email to gsi@oicr.on.ca.
Generated with generate-markdown-readme (https://github.com/oicr-gsi/gsi-wdl-tools/)