Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add option to allow mapping to pangenome #274

Open
wants to merge 19 commits into
base: master
Choose a base branch
from

Conversation

huzuner
Copy link

@huzuner huzuner commented Nov 16, 2023

This PR allows an optional pangenome alignment with vg giraffe.
The following additions/changes are made:

  • A new option pangenome is added to the config.
  • A rule to download the vg pangenome index is added into ref.smk.
  • Six other rules for aligning reads to the pangenome and preprocessing the resulting bam were added, into mapping.smk.
  • All input functions are updated to get pangenome aligned bam upon activation of pangenome in the config file.

@huzuner huzuner changed the title add option to allow mapping to pangenome feat: add option to allow mapping to pangenome Nov 17, 2023
@huzuner
Copy link
Author

huzuner commented Feb 15, 2024

Bug with vg: alignments with deletions on ends: vgteam/vg#4204. They fixed the issue and the new release is coming on Feb. 26.

@@ -112,7 +115,7 @@ calling:
gene_list_filter:
aux-files:
super_interesting_genes: "config/super_interesting_genes.tsv"
expression: "ANN['GENE'] in AUX['super_interesting_genes']"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@FelixMoelder please have a look. Did VEP change the column name for the gene?

Comment on lines 142 to 143
- somatic_tumor_high
- somatic_tumor_medium
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's rather not change the scenario of the default config (same below in config/scenario.yaml).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated to use default workflow config and scenario.

Comment on lines 243 to 245
return "results/mapped/vg/{{sample}}_rg_added.{ext}".format(ext=ext)
else:
return "results/mapped/bwa/{{sample}}.{ext}".format(ext=ext)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use f-string in both cases, use {mapper} or something like that in order to not mostly repeat the f-string.

Comment on lines +422 to +429
if get_sample_datatype(wildcards.sample) == "rna":
aligner = "star"
elif get_sample_datatype(wildcards.sample) == "dna" & is_activated(
"ref/pangenome"
):
aligner = "vg"
else:
aligner = "bwa"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same code as above, create new function get_aligner(wildcards) and use in both blocks.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and below as well

@@ -19,6 +19,7 @@ rule filter_candidates_by_annotation:
rule filter_by_annotation:
input:
bcf=get_annotated_bcf,
csi=lambda wc: get_annotated_bcf(wc, index=True),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
csi=lambda wc: get_annotated_bcf(wc, index=True),
csi=partial(get_annotated_bcf, index=True),

rule map_reads_vg_giraffe:
input:
reads=get_map_reads_input,
idx="resources/pangenome/hprc-v1.0-mc-grch38.xg",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should not be hardcoded. What if a different pangenome is used? The pangenome index should be configurable via the config.

benchmark:
"benchmarks/samtools_view_primary_chr/{sample}.tsv"
params:
region="GRCh38.chr1 GRCh38.chr2 GRCh38.chr3 GRCh38.chr4 GRCh38.chr5 GRCh38.chr6 GRCh38.chr7 GRCh38.chr8 GRCh38.chr9 GRCh38.chr10 GRCh38.chr11 GRCh38.chr12 GRCh38.chr13 GRCh38.chr14 GRCh38.chr15 GRCh38.chr16 GRCh38.chr17 GRCh38.chr18 GRCh38.chr19 GRCh38.chr20 GRCh38.chr21 GRCh38.chr22 GRCh38.chrX GRCh38.chrY GRCh38.chrM",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should not be hardcoded. Workflow has to work for all species in principle (as long as there is a pangenome index). We have a selection for the first n chromosomes already implemented in the workflow (at the level of the variants). Isn't that enough?

Comment on lines +313 to +325
rule add_rg:
input:
"results/mapped/vg/{sample}_reheadered.bam",
output:
"results/mapped/vg/{sample}.bam",
log:
"logs/picard/add_rg/{sample}.log",
params:
extra="--RGLB lib1 --RGPL illumina --RGPU {sample} --RGSM {sample}",
resources:
mem_mb=60000,
wrapper:
"v2.3.2/bio/picard/addorreplacereadgroups"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure that it is impossible to define the readgroup directly when calling vg giraffe?


rule get_vg_pangenome:
output:
"resources/pangenome/hprc-v1.0-mc-grch38.xg",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather name this analogously to the bwa index output, hence:

Suggested change
"resources/pangenome/hprc-v1.0-mc-grch38.xg",
f"{genome}.xg",

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants