Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

No reads for mitochondrial chromosome and additional contigs after recalibration #1443

Open
edg1983 opened this issue Mar 20, 2024 · 4 comments
Labels
bug Something isn't working

Comments

@edg1983
Copy link

edg1983 commented Mar 20, 2024

Description of the bug

We have run Sarek pipeline v3.1.2 on 3 WGS batches, including around 2500 samples, using the GATK.GRCh38 reference genome from iGenomes and mostly default parameters (see the command).

In all of these samples, the alignment CRAM files from the mapping and markduplicate steps look fine, but in the CRAM file after the recalibration step, all reads outside canonical chromosomes (i.e., chr1-22, X, Y) are lost.
This means that we have no reads on the mitochondrial genome or any other additional contig, and therefore, we can't call variants or perform other analyses on chrM, for example.

Is this a known behavior related to recalibration, or is something unexpected happening during the recalibration step?

Command used and terminal output

#variables
pipeline_name="nf-core/sarek"
pipeline_revision="3.1.2"
pipeline_hub="github"
pipeline_work_dir="/scratch/nextflow_works/primary_analysis"

#run nextflow
nextflow run \
	$pipeline_name -r $pipeline_revision -hub $pipeline_hub \
	-c nextflow.config \
	-profile genfac \
	-params-file params.yaml \
	-work-dir $pipeline_work_dir \
	-resume;

#Content of params.yml
#genome: GATK.GRCh38
#igenomes_base: /processing_data/reference_datasets/iGenomes/2023.1
#input: metadata.csv
#outdir: output_nf-core@sarek_3.1.2/
#save_mapped: true

Relevant files

No response

System information

  • nextflow v23.10.0
  • sarek v3.1.2
  • system: HPC
  • executor: SLURM
  • container engine: singularity v3.8.5
@edg1983 edg1983 added the bug Something isn't working label Mar 20, 2024
@FriederikeHanssen
Copy link
Contributor

FriederikeHanssen commented Mar 20, 2024

Hey! Yes this is due to the intervals file that is used. We use the one provided by GATK. You can change the intervals file or skip the intervals usage all together (beware this will increase runtime quite a bit because then no parallelization can get done).

@edg1983
Copy link
Author

edg1983 commented Mar 20, 2024

Thanks for the explanation.
So you are using this file iGenomes/2023.1/Homo_sapiens/GATK/GRCh38/Annotation/intervals/wgs_calling_regions.hg38.bed from the iGenomes release, right?

I'm fine with excluding additional contigs (like decoys, HLA, ALT, etc.), but also excluding chrM is not the best solution, especially when this is applied to the alignments. I guess most people expect to have chrM reads in all the BAM/CRAM files produced by the pipeline.
Overall, the current behavior seems a little confusing since the data is present in some BAM/CRAM (mapped and markduplicates), but then removed in the recalibrated one.

@FriederikeHanssen
Copy link
Contributor

I need to read up on it a bit more. We basically just took the reference files as is from GATK and ran with it: https://gatk.broadinstitute.org/hc/en-us/articles/360035889551-When-should-I-restrict-my-analysis-to-specific-intervals

So I am guessing they had a good reason to not have this in the intervals file? In any case you can easily circumvent this by adding the mitochondrial coordinates to the file and specifying it with --intervals.

We are actually using this file:

intervals = "${params.igenomes_base}/Homo_sapiens/GATK/GRCh38/Annotation/intervals/wgs_calling_regions_noseconds.hg38.bed"
which is the same except that we removed the seconds because it was easier for grouping them together. Regions are all the same though.

The duplicate marking is not parallelised along genomic coordinates but everything is processed together which is why nothing is removed.

@edg1983
Copy link
Author

edg1983 commented Mar 21, 2024

Thanks for the advice. I will procede preparing my custom intervals file for future use.

In think GATK compiled this list of regions mainly for variant calling, to remove noisy or difficult to re-assemble regions that would slow down the variant caller and give low confidence calls. They probably excluded chrM from these regions because they have a dedicated pipeline and calling variants on chrM is usually treated with dedicated methods.

I can understand this approach in the context of variant calling, but keep in mind that using these regions has the subtle consequence of not only skipping chrM, but also specific regions across chromosomes. If this is the default behaviour, this needs to be strongly highlighted in the documentation so that users have the right expectations on the output. For example, some downstream analysis expect variants called genome without any exclusion regions and this is relevant also when performing datasets comparison.

Finally, IMHO, it is never a good practice to remove reads from the BAM files since usually these are supposed to losslessly represent the input sequences (after reads QC and trimming). If you decide not to change the current approach with recalibration then my suggestion is to at least strongly highlight this behaviour in the docs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants