SCAN2 full run example

The following commands exemplify a full run of SCAN2, including cross-sample panel building and indel calling, on a SLURM cluster.

Prerequisites

You will need:

  1. the BAM files of interest (your scWGS samples and at least one bulk sample)
  2. BAM files from scWGS data (including a bulk sample) from a completely different individual. These files are needed to construct a cross-sample panel for filtering the indel calls that SCAN2 will make. You may also want to call sSNVs and sIDs from these BAM files, but BAMs from a different individual are required regardless to build the cross-sample filter. For our purposes, we used a set of 52 neurons assayed by PTA and scWGS from Luquette et al 2022.
  3. A CSV file (you can title it panel_meta.csv) containing the following metadata for each single cell and bulk in #1 and #2: an ID for the donor of each single cell (e.g., one human brain or one unique cell line), the single-cell (or bulk) sample ID as specified in the BAM, and the amplification type ("PTA", "MDA" or "bulk"). The tags used in the donor field can be chosen by the user; however, the sample tags MUST match the SM: tag in the associated BAM file (a sketch for checking this with samtools appears after this list). For 4 samples (two scWGS from PTA, one scWGS from MDA, and one bulk sample), it would look like so:
donor,sample,amp
d1,single_cell_1,PTA
d1,single_cell_2,PTA
d2,single_cell_3,MDA
d1,hbulk,bulk
  4. The required sequence/variant references installed per the instructions in the SCAN2 github repository.
  5. Ideally, you are running this entire single-cell mutation calling workflow on a cluster with a job scheduler (e.g., SLURM) to allow for job parallelization. The instructions that follow provide options for running on our local SLURM cluster.
  6. A list of genomic intervals over which you can split the GATK runs. We provide an example file (in hg19 coordinates) with SCAN2 that is suitable for analyses with <10 cells: gatk_regions_file_hg19.txt. This is strongly recommended so that you can parallelize the jobs effectively. For typical runs of <5 single cells, dividing chrs 1-22 of the human genome into 300 windows works well. For much larger projects (and especially when building a cross-sample panel with 50-100 single cells and bulks), we often divide chrs 1-22 into ~30,000 windows of ~100 kb each. These windows can be generated using standard tools such as bedtools makewindows -g <genome file> -n <number of desired windows> (see the sketch after this list). An example file with ~30,000 windows is provided: gatk_regions_file_hg19.30970windows_100kb.chr2_1kb_window_removed.autosome_only.txt. Ensure that chromosomes are named in the same way as in the aligned BAMs (i.e., with or without a 'chr' prefix). DO NOT INCLUDE SEX CHROMOSOMES IN YOUR PANEL. THESE ARE CURRENTLY UNSUPPORTED.
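Since the sample tags in panel_meta.csv must match the SM: tags in the BAM read groups, a quick way to check them is shown below (the BAM path is a placeholder):

samtools view -H /path/to/single_cell_1.bam | grep '^@RG' | tr '\t' '\n' | grep '^SM:'

As a hedged sketch of the window generation, the commands below build ~100 kb autosome-only windows from the reference's .fai index. The file names, window size, and 'chr'-less chromosome names are assumptions, and the exact interval format SCAN2 expects should be checked against the example regions files it ships.

# Keep only chrs 1-22 from the fasta index to make a bedtools 'genome' file
awk '$1 ~ /^([1-9]|1[0-9]|2[0-2])$/ {print $1"\t"$2}' \
    /path/to/reference/genome/fasta/file.fasta.fai > autosomes.genome
# Cut the autosomes into ~100 kb windows and convert BED output to chr:start-end
bedtools makewindows -g autosomes.genome -w 100000 \
    | awk '{print $1":"$2+1"-"$3}' > my_regions_100kb_windows.txt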

Step 1: construct the cross-sample panel

An automated version of the steps below will be provided in a script, but the basic instructions are as follows. Suppose we go back to the panel_meta.csv from the Prerequisites step. We can set up a panel like so:

#!/bin/bash

scan2 -d panel init
scan2 -d panel config \
	--verbose \
	--analysis makepanel \
	--regions-file /path/to/panel_regions_300windows.txt \
	--ref /path/to/reference/genome/fasta/file.fasta \
	--dbsnp /path/to/dbsnp/germline/variants.vcf \
	--shapeit-refpanel /path/to/1000GP_Phase3 \
	--makepanel-metadata panel_meta.csv \
	--bam /path/to/bulk.bam \
	--bam /path/to/single_cell_1.bam \
	--bam /path/to/single_cell_2.bam \
	--bam /path/to/single_cell_3.bam
    
scan2 -d panel validate

scan2 -d panel \
        makepanel --joblimit 300 \
        --snakemake-args '--slurm --default-resources slurm_account=park_contrib slurm_partition=park runtime=2880 --keep-going --max-status-checks-per-second 0.1 ' # --dryrun'

The first command initializes your job and creates a folder called panel. The second command configures the entire run (essentially, by creating a file called panel/scan.yaml which will contain all of the parameters for this run). The third command will make sure the job has been specified correctly. The fourth command will actually run the job. Note a few things about this command.

  1. --joblimit 300 specifies the maximum number of jobs that will be submitted at once (here, at most 300). You could increase this number, but be wary of how many cores your queue can access.
  2. --slurm inside --snakemake-args tells Snakemake to submit your jobs to SLURM. The option --default-resources specifies the values of slurm_account, slurm_partition and runtime (in minutes) that all jobs will use by default. See this link for more information on default resources. Log files are automatically generated in the .snakemake directory inside the panel directory (a sketch for tailing the most recent log appears below). NOTE: if not using SLURM, the option --cluster can be used instead of --slurm with a command of the following format:
scan2 -d panel \
        makepanel --joblimit 300 \
        --cluster ' sbatch -p park -A park_contrib -c {threads} --mem={resources.mem_mb} -t 6:00:00 -o %logdir/slurm-%A.log' \
        --snakemake-args ' --keep-going --max-status-checks-per-second 0.1'

See this link for more information on using --cluster and on executing jobs on clusters in Snakemake in general.
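To monitor progress, you can follow the most recent top-level Snakemake log. This is a minimal sketch assuming Snakemake's default .snakemake/log/ layout, which may vary by version:

# Follow the newest Snakemake log for the panel run
tail -f "$(ls -t panel/.snakemake/log/*.snakemake.log | head -n 1)"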

If you want to see the set of operations for the job and not actually execute the operations, you can replace the last command with the following:

scan2 -d panel \
        makepanel --joblimit 1000 \
        --snakemake-args ' --dryrun'

This will just do a dry run, i.e. print out all of the jobs that will be run and exit.

NOTE: memory is allocated per job, so when running the analysis on many single cells some jobs may fail, as certain genomic regions require more memory than others. The Snakemake option --retries can be added inside the --snakemake-args string to handle this, e.g.

--snakemake-args '--slurm --default-resources slurm_account=park_contrib slurm_partition=park runtime=2880 --keep-going --max-status-checks-per-second 0.1 --retries 3' # --dryrun'

With --retries 3 added, the program attempts to resubmit each failed job 3 times, each time with a higher memory allocation, hence increasing the chances that all jobs are completed successfully. If jobs are still failing, you may have to modify the associated Snakefiles' resources (e.g. increasing mem_mb) manually to ensure completion of all jobs. This could require some trial and error. See here for information on Snakefile resources.
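If some jobs did fail, you do not need to start over. A minimal sketch of resuming, relying on Snakemake's default behavior of only rerunning jobs whose outputs are missing or incomplete, is to re-issue the same makepanel command with --retries added:

scan2 -d panel \
        makepanel --joblimit 300 \
        --snakemake-args '--slurm --default-resources slurm_account=park_contrib slurm_partition=park runtime=2880 --keep-going --max-status-checks-per-second 0.1 --retries 3'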

Once you successfully run the job, you will end up with two new files:

$ ls panel/panel/panel.tab.gz{,.tbi}
panel/panel/panel.tab.gz  panel/panel/panel.tab.gz.tbi

This is your panel used for filtering. Make sure that you keep the panel.tab.gz.tbi file with the panel.tab.gz file.
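As a hedged sanity check (assuming zcat and htslib's tabix are on your PATH), you can peek at the panel and confirm its index is usable; the chromosome name and coordinates below are placeholders that must match your reference naming:

# Inspect the first few records of the cross-sample panel
zcat panel/panel/panel.tab.gz | head
# Query a small region to confirm the tabix index works
tabix panel/panel/panel.tab.gz 22:30000000-31000000 | head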

Step 2: run SCAN2 on your desired BAM files of interest

The SCAN2 repository gives basic instructions for running the demo. The instructions here modify that demo to use the cross-sample panel you just created and to add the other steps needed for parallelizing jobs. Let's say we want to analyze just single_cell_1 and single_cell_2. Run the commands below in the same directory where you created the panel folder (i.e., the run_job and panel folders should sit in the same directory).

scan2 -d run_job init
cd run_job

# Configure analysis parameters. Can be run multiple times to change
# parameters.
scan2 config \
	--verbose \
	--ref /path/to/reference/genome/fasta/file.fasta \
	--dbsnp /path/to/dbsnp/germline/variants.vcf \
	--shapeit-refpanel /path/to/1000GP_Phase3 \
	--regions-file /path/to/panel_regions_300windows.txt \
	--bulk-bam /path/to/bulk.bam \
	--sc-bam /path/to/single_cell_1.bam \
	--sc-bam /path/to/single_cell_2.bam \
	--abmodel-n-cores 10 \
	--cross-sample-panel ../panel/panel/panel.tab.gz

scan2 validate

scan2 run \
        --joblimit 300 \
        --snakemake-args '--slurm --default-resources slurm_account=park_contrib slurm_partition=park runtime=2880 --keep-going --max-status-checks-per-second 0.1 ' # --dryrun'

To explain some of these options:

  1. --abmodel-n-cores 10 sets the number of cores used to fit the allele balance model. You could go up to 20, but 10 works much better on our machines.
  2. Note that --regions-file /path/to/panel_regions_300windows.txt specifies the regions over which the GATK mutation calls are made.
  3. Note that the cross-sample panel constructed in Step 1 is provided as --cross-sample-panel ../panel/panel/panel.tab.gz; the path is relative to the run_job directory because of the earlier cd run_job.
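If you want to double-check the configuration before running, a hedged sanity check (run from inside run_job/) is to grep the generated config file for the panel path; the exact key name inside scan.yaml is an assumption and may differ between SCAN2 versions:

grep -i panel scan.yaml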

Step 3: run the signature rescue steps

The rescue step uses the signature-based rescue described in the SCAN2 paper to recover additional mutations. Unlike steps 1 and 2, which take a while to run, step 3 runs relatively quickly. Run this, as before, in the same directory where the panel and run_job folders sit.

scan2 -d rescue init

scan2 -d rescue config \
	--verbose \
	--analysis rescue \
	--rescue-target-fdr 0.01 \
	--scan2-object single_cell_1 run_job/call_mutations/single_cell_1/scan2_object.rda \
	--scan2-object single_cell_2 run_job/call_mutations/single_cell_2/scan2_object.rda

scan2 -d rescue validate

scan2 -d rescue rescue --joblimit 20

Some notes:

  1. If you want to load mutations from another SCAN2 run (say other_muts.csv), you can add them like so:
scan2 -d rescue config \
    --verbose \
    --analysis rescue \
    --add-muts other_muts.csv \
    --rescue-target-fdr 0.01 \
    --scan2-object single_cell_1 run_job/call_mutations/single_cell_1/scan2_object.rda \
    --scan2-object single_cell_2 run_job/call_mutations/single_cell_2/scan2_object.rda
  2. You can parallelize this job if you want, but it runs within a few minutes, so you could simply let it run in your terminal rather than submitting it as an sbatch job (a hedged sbatch sketch follows this list if you prefer to submit it).
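If you would rather submit the rescue step to SLURM anyway, here is a minimal sketch of a wrapper script; the partition, account, and resource values are assumptions to adapt to your own cluster.

#!/bin/bash
#SBATCH -p park                 # partition (assumption; use your own)
#SBATCH -A park_contrib         # account (assumption; use your own)
#SBATCH -c 4                    # cores for the rescue step (a guess)
#SBATCH --mem=16G               # memory request (a guess; adjust as needed)
#SBATCH -t 1:00:00              # rescue usually finishes within minutes
#SBATCH -o rescue-%j.log

# Run from the directory that contains the panel, run_job and rescue folders.
scan2 -d rescue rescue --joblimit 20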

A note on timing

The panel and run steps are the most time-consuming parts of this workflow, particularly the GATK steps. It helps to estimate how many compute-hours your jobs will require and, given your level of parallelization, how much real-world time your runs will take. Here is a breakdown for a run of 40 PTA single cells that we are interested in analyzing, along with the 52 PTA neurons from Luquette et al 2022 used to construct our cross-sample panel:

  1. We estimate that these 92 cells will take about 80,000 compute hours to construct the cross-sample panel.
  2. The core SCAN2 run has three major time-consuming units:
     a. There are two GATK runs during the actual SCAN2 run. For the 40 cells of interest, each GATK run will take about 40,000 compute hours, so we expect the GATK portion of the SCAN2 run itself to take about 80,000 more compute hours.
     b. After the GATK runs, the allele balance model is fit, taking about 6 compute hours per chromosome per cell. This brings us up to about 5,000 more compute hours.
     c. The rest of the core pipeline takes about the same number of hours to finish, or another 5,000 compute hours.
  3. The signature-based rescue takes only a few minutes.

In all, we estimate a run of about 170,000 compute hours. We suggested --joblimit 300, which means the above jobs would take about 23-24 days to run on SLURM (with DRMAA, on our local cluster). If you used --joblimit 800 (a rather large number of simultaneous jobs), we are looking at about 8-9 days.
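As a rough sketch of the arithmetic, assuming the joblimit stays saturated and each job occupies roughly one core:

# Wall-clock days ~= total compute hours / concurrent jobs / 24
echo "170000 / 300 / 24" | bc -l   # ~23.6 days at --joblimit 300
echo "170000 / 800 / 24" | bc -l   # ~8.9 days at --joblimit 800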