RNAIndel

RNAIndel calls coding indels from tumor RNA-Seq data and classifies them as somatic, germline, or artifactual. RNAIndel supports GRCh38 and GRCh37.

What's new in Version 3

New implementation with indelpost, an indel realigner/phaser.

  • faster analysis (typically < 20 min with 8 cores)
  • somatic complex indel calling in RNA-Seq
  • ensemble calling with your own caller (e.g., GATK HaplotypeCaller/MuTect2)
  • improved sensitivity for homopolymer indels by error-profile outlier analysis

Quick Start

RNAIndel can be run via Docker or installed locally from PyPI.

Docker

We publish our latest Docker builds on GitHub. You can run the latest code base with the following command:

> docker run --rm -v ${PWD}:/data ghcr.io/stjude/rnaindel:latest

If you want a more native feel, you can add an alias to your shell's rc file. Single quotes keep ${PWD} from being expanded until the alias is used.

> alias rnaindel='docker run --rm -v ${PWD}:/data ghcr.io/stjude/rnaindel:latest'

Note: the first time you execute the docker run command, Docker prints its progress as it downloads the image.
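
Because the docker run command mounts the current directory at /data, files passed to RNAIndel presumably need to be referenced by their path inside the container. The following is a minimal sketch, assuming the image's entrypoint is the rnaindel executable (as the alias suggests) and that all inputs live under the current directory. The function name rnaindel_docker is hypothetical.

```sh
# Hypothetical wrapper: forwards its arguments to the RNAIndel container.
# Assumes the image entrypoint is rnaindel and ${PWD} is mounted at /data.
rnaindel_docker() {
    docker run --rm -v "${PWD}:/data" ghcr.io/stjude/rnaindel:latest "$@"
}

# Example (file names are placeholders; paths are as seen inside the container):
# rnaindel_docker PredictIndels -i /data/input.bam -o /data/output.vcf \
#     -r /data/ref.fa -d /data/data_dir -p 8
```

Absolute /data paths are used in the example so that it does not depend on the container's working directory, which is not stated here.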

PyPI

RNAIndel depends on python>=3.8.0 and java>=1.8.0.
The following pip commands install indelpost and RNAIndel:

> pip install indelpost --no-binary indelpost --no-build-isolation  
> pip install rnaindel
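
The version requirements above can be checked before installing. This is a sketch only: it assumes the interpreter is available as python3 and that java -version prints the version in double quotes on its first line, as OpenJDK builds commonly do. All function names are hypothetical.

```sh
# Succeeds when dotted version $1 is greater than or equal to dotted version $2.
# Non-numeric suffixes such as "_292" are dropped by awk's numeric conversion.
version_ge() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        n = split(a, x, "."); m = split(b, y, ".")
        max = (n > m) ? n : m
        for (i = 1; i <= max; i++) {
            xi = (i <= n) ? x[i] + 0 : 0
            yi = (i <= m) ? y[i] + 0 : 0
            if (xi > yi) exit 0
            if (xi < yi) exit 1
        }
        exit 0
    }'
}

# Prints the Python version as major.minor.micro (assumes python3 is on PATH).
python_version() {
    python3 -c 'import sys; print("%d.%d.%d" % sys.version_info[:3])'
}

# Prints the quoted version from the first line of `java -version` (assumed format).
java_version() {
    java -version 2>&1 | awk -F'"' 'NR == 1 { print $2 }'
}

# Example:
# version_ge "$(python_version)" 3.8.0 || echo "python >= 3.8.0 required" >&2
# version_ge "$(java_version)" 1.8.0 || echo "java >= 1.8.0 required" >&2
```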

Test the installation.

> rnaindel -h
usage: rnaindel <subcommand> [<args>]

subcommands are:
    SetUp                     Initialize predicition models
    PredictIndels             Predict somatic/germline/artifact indels from tumor RNA-Seq data
    CalculateFeatures         Calculate and report features for training
    Train                     Perform model training
    CountOccurrence           Count occurrence within cohort to filter false somatic predictions
positional arguments:
  subcommand  PredictIndels, CalculateFeatures, Train, CountOccurrence

optional arguments:
  -h, --help  show this help message and exit
  --version   show program's version number and exit

DataPackage

Download the data package. Version 3 is not compatible with previous data packages.

#GRCh38
curl -LO https://zenodo.org/records/10552784/files/data_dir_grch38.tar.gz
tar -zxf data_dir_grch38.tar.gz

#GRCh37
curl -LO https://zenodo.org/records/10552784/files/data_dir_grch37.tar.gz
tar -zxf data_dir_grch37.tar.gz

Usage

RNAIndel has 5 subcommands:

  • SetUp: pretrain the model with the user's sklearn version
  • PredictIndels: analyze RNA-Seq data for indel discovery
  • CalculateFeatures: calculate features for training
  • Train: train models with the user's dataset
  • CountOccurrence: annotate over-represented somatic predictions

Subcommands are invoked as follows:

> rnaindel subcommand [subcommand-specific options]

Set up

Run this command once before first use. It takes 5 to 10 minutes to complete.
NOTE: this step is not required when running the Docker image.

> rnaindel SetUp -d data_dir

Discover somatic indels

Input BAM file

RNAIndel expects a STAR 2-pass mapped BAM file that is sorted by coordinate and has duplicates marked (MarkDuplicates). Further preprocessing, such as indel realignment, may prevent the desired behavior.
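
Some of these expectations can be spot-checked from the BAM header. The sketch below reads a SAM header on standard input (for example from samtools view -H) and assumes that STAR and MarkDuplicates recorded themselves in @PG lines, which is common but not guaranteed. The function names are hypothetical.

```sh
# Succeeds if the SAM header on stdin declares coordinate sorting (@HD ... SO:coordinate).
header_is_coordinate_sorted() {
    grep '^@HD' | grep -q 'SO:coordinate'
}

# Succeeds if any @PG line of the SAM header on stdin mentions the name in $1
# (case-insensitive).
header_mentions_program() {
    grep '^@PG' | grep -qi -- "$1"
}

# Example (requires samtools; input.bam is a placeholder):
# samtools view -H input.bam | header_is_coordinate_sorted || echo "not coordinate-sorted" >&2
# samtools view -H input.bam | header_mentions_program STAR || echo "no STAR @PG line" >&2
# samtools view -H input.bam | header_mentions_program MarkDuplicates || echo "no MarkDuplicates @PG line" >&2
```

A missing @PG line does not prove that a step was skipped, and the header check cannot confirm that STAR was run in 2-pass mode, so treat the result as a hint rather than a verdict.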

Standard calling

This mode uses the built-in caller to analyze simple and complex indels.

> rnaindel PredictIndels -i input.bam -o output.vcf -r ref.fa -d data_dir -p 8
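
To run standard calling over several samples, the command can be wrapped in a loop. This is a sketch under stated assumptions: the helper name predict_each, the REF, DATA_DIR, and CORES variables, and the output naming scheme are all hypothetical choices, and only the flags shown above are used.

```sh
# Hypothetical helper: runs standard calling on each BAM given as an argument.
# REF and DATA_DIR must be set by the caller; CORES defaults to 8 here.
predict_each() {
    for bam in "$@"; do
        out="${bam%.bam}.rnaindel.vcf"   # output naming is an arbitrary choice
        rnaindel PredictIndels -i "$bam" -o "$out" -r "$REF" -d "$DATA_DIR" \
            -p "${CORES:-8}" || return 1
    done
}

# Example:
# REF=ref.fa; DATA_DIR=data_dir
# predict_each sample1.bam sample2.bam
```

The loop stops at the first failing sample so that an error is not buried under later output.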

Ensemble calling

Indels in the external VCF (supplied by -v) are integrated into the callset from the built-in caller to boost performance.
See demo.

> rnaindel PredictIndels -i input.bam -o output.vcf -r ref.fa -d data_dir -v gatk.vcf.gz -p 8
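
The file given to -v must be a .vcf.gz with an index. A minimal pre-flight check might look like the following, assuming that a tabix-style index (.tbi or .csi) next to the file is what RNAIndel expects; the function name is hypothetical.

```sh
# Succeeds if $1 ends in .vcf.gz, exists, and has a .tbi or .csi index beside it.
# The accepted index types are an assumption, not taken from the RNAIndel docs.
external_vcf_is_ready() {
    case "$1" in
        *.vcf.gz) ;;
        *) return 1 ;;
    esac
    [ -f "$1" ] || return 1
    [ -f "$1.tbi" ] || [ -f "$1.csi" ]
}

# To compress and index a coordinate-sorted VCF with the htslib tools:
# bgzip -c gatk.vcf > gatk.vcf.gz
# tabix -p vcf gatk.vcf.gz

# Example:
# external_vcf_is_ready gatk.vcf.gz || echo "compress and index the VCF first" >&2
```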

With DNA-Seq

Somatic predictions from RNA-Seq are validated against DNA-Seq on the fly.

> rnaindel PredictIndels -i input.bam -o output.vcf -r ref.fa -d data_dir -t tumor.dna.bam -n normal.dna.bam -p 8

Extravaganza

Leverage all resources for best performance.

> rnaindel PredictIndels -i input.bam -o output.vcf -r ref.fa -d data_dir -v mutect2.vcf.gz -t tumor.dna.bam -n normal.dna.bam -p 8

Options

  • -i input STAR-mapped BAM file (required)

  • -o output VCF file (required)

  • -r reference genome FASTA file (required)

  • -d data directory containing trained models and databases (required)

  • -v VCF file (must be .vcf.gz + index) from user's caller. (default: None)

  • -p number of cores (default: 1)

  • other options:

    • -t Tumor DNA-Seq BAM file (default: None)
    • -n Normal DNA-Seq BAM file (default: None)
    • -q STAR mapping quality MAPQ for unique mappers (default: 255)
    • -m maximum heap space (default: 6000m)
    • --region target genomic region. specify by chrN:start-stop (default: None)
    • --pon user's defined list of non-somatic calls such as PanelOfNormals. Supply as .vcf.gz with index (default: None)
    • --include-all-external-calls set to include all indels in VCF file supplied by -v. (default: False. Use only calls with PASS in FILTER)
    • --skip-homopolyer-outlier-analysis no outlier analysis for homopolymer indels (repeat > 4) performed if set. (default: False)
    • --safety-mode deactivate parallelism at realignment step. may be required to run with -p > 1 on some platforms. (default: False)
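
The --region value follows the chrN:start-stop pattern. As an illustration, a string can be checked against that pattern before it is passed on. This sketch accepts any contig name without whitespace or colons, on the assumption that the name must match the reference FASTA, and it requires start to be no greater than stop. The function name is hypothetical.

```sh
# Succeeds if $1 looks like contig:start-stop with integer coordinates
# and start <= stop. Sets the plain variables range, start, and stop.
is_valid_region() {
    printf '%s\n' "$1" | grep -Eq '^[^:[:space:]]+:[0-9]+-[0-9]+$' || return 1
    range=${1##*:}        # text after the colon, e.g. 100-200
    start=${range%-*}     # text before the dash
    stop=${range#*-}      # text after the dash
    [ "$start" -le "$stop" ]
}

# Example:
# is_valid_region chr1:100-200 || echo "region must look like chrN:start-stop" >&2
```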

Benchmarking

Using pediatric tumor RNA-Seq samples (SJC-DS-1003, n=77), run time and memory consumption were benchmarked for ensemble calling with 8 cores (i.e., -p 8) on a server with a 32-core AMD EPYC 7542 CPU @ 2.90 GHz.

          Run time (wall)   Max memory
median    374 sec           18.6 GB
max       1388 sec          23.5 GB

Train RNAIndel

Users can train RNAIndel with their own training set.

Annotate over-represented putative somatic indels

Check occurrence to filter probable false positives.

Contact

  • kohei.hagiwara[AT]stjude.org

Citation

Published in Bioinformatics