10x single cell short- and long-read RNA sequencing

vpc-ccg/scTagger

scTagger

scTagger matches cellular barcodes between the short and long reads of single-cell RNA-seq experiments, so that gene expression (from the short reads) and RNA splicing (from the long reads) can be related at the cell level.

Installation

Conda

scTagger is available as a Conda package:

conda create -n sctagger-env -c bioconda sctagger 
conda activate sctagger-env
scTagger.py -h

Running with Snakemake

We provide a simple Snakefile alongside a config.yaml file. Together they run the three stages of scTagger as well as Cell Ranger (this assumes Cell Ranger is on your PATH).

Running manually

scTagger is a single Python script with several subcommands for matching long-read and short-read barcodes.

The pipeline consists of three steps, each of which can be run separately:

1) Extract long-reads segment

The first step of the scTagger pipeline is to extract, from each long read, the segment that is most likely to contain the cellular barcode. To run this step, use the following command:

./scTagger.py extract_lr_bc -r "path/to/long/read/fastq" -o "path/to/output/file" -p "path/to/output/plots"

Arguments

  • -r: Space separated paths to reads in FASTQ
  • -g: Space-separated list of ranges in which the SR adapter should be found on the LRs (Optional, Default: Detect from data)
  • -z: Indicate input is gzipped (Optional, Default: Assume input is gzipped if it ends with ".gz")
  • -t: Number of threads (Optional, Default: 1)
  • -sa: Short-read adapter (Optional, Default: CTACACGACGCTCTTCCGATCT)
  • --num-bp-after: Number of bases after the end of the SR adapter alignment to extract (Optional, Default: 20)
  • -o: Path to output file
  • -p: Path to plot file (Optional, Default: No plotting)

Inputs

  • A list of FASTQ files of long-reads

Outputs

  • A TSV file
    • First column is the read id
    • Second column is the best edit distance to the short-read adapter
    • Third column is the position on the long read where the adapter match starts
    • Fourth column is the extracted long-read segment
  • A plot of optimal alignment locations of the short read adapter to the long-reads.
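
The adapter search in this step can be illustrated with a small semi-global edit-distance alignment. The sketch below is not scTagger's implementation: the function names `best_adapter_hit` and `extract_segment` are hypothetical, only the forward strand is scanned, and it reports where the adapter match ends (the TSV's third column reports where it starts). It assumes the default adapter and the default of 20 bases after the adapter.

```python
SR_ADAPTER = "CTACACGACGCTCTTCCGATCT"  # default value of -sa


def best_adapter_hit(read: str, adapter: str = SR_ADAPTER):
    """Return (edit_distance, end) for the best adapter match in `read`.

    The whole adapter must be aligned, but the match may start anywhere in
    the read (semi-global alignment). `end` is the read index just past the
    last aligned base; ties are broken in favour of the leftmost end.
    """
    # prev[j]: fewest edits needed to align adapter[:i] so that it ends at read[:j]
    prev = [0] * (len(read) + 1)  # an empty adapter prefix matches anywhere for free
    for i in range(1, len(adapter) + 1):
        curr = [i] + [0] * len(read)  # aligning adapter[:i] to nothing costs i edits
        for j in range(1, len(read) + 1):
            mismatch = 0 if adapter[i - 1] == read[j - 1] else 1
            curr[j] = min(
                prev[j - 1] + mismatch,  # match or substitution
                prev[j] + 1,             # adapter base missing from the read
                curr[j - 1] + 1,         # extra base in the read
            )
        prev = curr
    best_end = min(range(len(read) + 1), key=lambda j: prev[j])
    return prev[best_end], best_end


def extract_segment(read: str, adapter: str = SR_ADAPTER, num_bp_after: int = 20):
    """Return (edit_distance, adapter_end, segment following the adapter)."""
    dist, end = best_adapter_hit(read, adapter)
    return dist, end, read[end:end + num_bp_after]
```

The quadratic dynamic program is chosen for readability only; a production tool would normally use an optimized alignment library.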

2) Extract short-reads barcodes

The second step is to extract the top short-read barcodes, that is, the most frequent barcodes, which together cover most of the short reads.

./scTagger.py extract_sr_bc -i "path/to/bam/file" -o "path/to/output/file" -p "path/to/output/plot"

Arguments

  • -i: Input BAM file
  • -o: Path to output file.
  • -p: Path to plot file (Optional, Default: No plotting)
  • --thresh: Percentage threshold required per step to continue adding read barcodes (Optional, Default: 0.005)
  • --step-size: Number of barcodes processed at a time and whose sum is checked against the threshold (Optional, Default: 1000)
  • --max-barcode-cnt: Max number of barcodes to keep (Optional, Default: 25000)

Input

  • A BAM file of short-read data

Output

  • A TSV file
    • First column is barcodes
    • Second column is the number of appearances of the barcode
  • A cumulative plot of SR coverage with batches of 1,000 barcodes
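
One way to read the `--thresh`, `--step-size`, and `--max-barcode-cnt` arguments is as a batched cut-off on the frequency-sorted barcode list. The sketch below follows that reading and is an interpretation, not the tool's code: the function name `select_top_barcodes` is hypothetical, and treating the threshold as a fraction of all barcode observations is an assumption.

```python
def select_top_barcodes(counts, thresh=0.005, step_size=1000, max_barcode_cnt=25000):
    """Return the kept (barcode, count) pairs, most frequent first.

    counts: dict mapping each barcode to its number of observations.
    Barcodes are added one batch of `step_size` at a time. A batch is kept
    only if its summed count is at least `thresh` of all observations
    (assumed meaning of --thresh); the first failing batch stops the scan.
    """
    total = sum(counts.values())
    if total == 0:
        return []
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    kept = []
    for start in range(0, len(ranked), step_size):
        batch = ranked[start:start + step_size]
        batch_fraction = sum(count for _, count in batch) / total
        if batch_fraction < thresh:
            break  # this batch adds too little coverage
        kept.extend(batch)
        if len(kept) >= max_barcode_cnt:
            break  # hard cap on the number of barcodes
    return kept[:max_barcode_cnt]
```

With the defaults, this scan can stop after at most 25 batches of 1,000 barcodes, since 25,000 is the cap.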

Alt. 2) Extract short-reads barcodes directly from long-reads

This is an alternative to the second step which avoids using the short reads altogether and instead builds a whitelist of cellular barcodes from the long reads directly. This is done by looking for exact matches of the 10x Chromium list of cellular barcodes on the long-read barcode segments. The barcodes are sorted by frequency, and the most frequent barcodes are kept using the same strategy as the extract_sr_bc module.

./scTagger.py extract_sr_bc_from_lr -i "path/to/long-read-segments" -wl "/path/to/10x-barcode-list.txt" -o "path/to/output.txt"

Arguments

  • -i: Input TSV file of long-read segments generated by the extract_lr_bc step
  • -o: Path to output file.
  • -wl: Path to 10x Genomics cellular barcode whitelist (e.g. 3M-february-2018.txt.gz). Accepts both .txt.gz and .txt files.
  • --thresh: Percentage threshold required per step to continue adding read barcodes (Optional, Default: 0.005)
  • --step-size: Number of barcodes processed at a time and whose sum is checked against the threshold (Optional, Default: 1000)
  • --max-barcode-cnt: Max number of barcodes to keep (Optional, Default: 25000)

Input

  • The output file of the extract_lr_bc step
  • 10x Genomics cellular barcode whitelist (e.g. 3M-february-2018.txt.gz)

Output

  • A TSV file
    • First column is barcodes
    • Second column is the number of appearances of the barcode
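
The exact-match counting described for this alternative step can be sketched as a sliding window over each long-read segment. This is an illustration under stated assumptions rather than the module's code: the function name `count_whitelist_hits` is hypothetical, barcodes are assumed to be 16 bases long, each barcode is counted at most once per segment, and reverse complements are not considered.

```python
from collections import Counter


def count_whitelist_hits(segments, whitelist, barcode_len=16):
    """Count how many segments contain each whitelisted barcode exactly.

    segments: iterable of long-read segment strings (fourth column of the
              extract_lr_bc TSV).
    whitelist: iterable of known cellular barcodes.
    Returns (barcode, count) pairs, most frequent first.
    """
    allowed = set(whitelist)
    hits = Counter()
    for segment in segments:
        # Every window of barcode_len bases in this segment
        windows = {
            segment[i:i + barcode_len]
            for i in range(len(segment) - barcode_len + 1)
        }
        # Keep only windows that are known barcodes; count once per segment
        hits.update(windows & allowed)
    return hits.most_common()
```

The resulting frequency-sorted list is the kind of input that the threshold-based selection of the extract_sr_bc step operates on.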

3) Match long-reads segment with short-reads barcodes

The last step is to match the long-read segments with the barcodes selected from the short reads.

./scTagger.py match_trie -lr "path/to/output/extract/long-read/segment" -sr "path/to/output/extract/top/short-read" -o "path/to/output/file" -t "number of threads"

Arguments

  • -lr: Long-read segments TSV file
  • -sr: Short-read barcode list TSV file
  • -mr: Maximum number of errors allowed for barcode matching (Optional, Default: 2)
  • -m: Maximum number of GB of RAM to be used (Optional, Default: 16.0)
  • -bl: Length of barcodes (Optional, Default: 16)
  • -t: Number of threads to use for searching (Optional, Default: 16)
  • -p: Path of plot file
  • -o: Path to output file. Output file is gzipped

Inputs

  • The long-read segments file from step 1 and the top barcodes file from step 2 (or its alternative)

Outputs

  • A TSV file
    • First column is the read id
    • Second column is the minimum edit distance
    • Third column is the number of short-read barcodes that match the long read
    • Fourth column is the long-read segment
    • Fifth column is a list of all short-read barcodes with the minimum edit distance
  • A bar plot that shows the number of long-reads by the minimum edit distance of their match barcode
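
The matching rule behind these output columns can be shown with a brute-force stand-in for the trie search. The sketch below is not the `match_trie` implementation: the function names are hypothetical, every barcode is compared against every segment (far slower than a trie), and returning `None` with an empty list for unmatched reads is an assumption about how such reads might be represented.

```python
def min_dist_to_substring(barcode: str, segment: str) -> int:
    """Smallest edit distance between `barcode` and any substring of `segment`."""
    # prev[j]: fewest edits to align barcode[:i] so that it ends at segment[:j]
    prev = [0] * (len(segment) + 1)  # the match may start anywhere in the segment
    for i in range(1, len(barcode) + 1):
        curr = [i] + [0] * len(segment)
        for j in range(1, len(segment) + 1):
            mismatch = 0 if barcode[i - 1] == segment[j - 1] else 1
            curr[j] = min(prev[j - 1] + mismatch, prev[j] + 1, curr[j - 1] + 1)
        prev = curr
    return min(prev)  # the match may also end anywhere in the segment


def match_segment(read_id, segment, barcodes, max_error=2):
    """Return (read id, min edit distance, number of hits, segment, hits).

    Only barcodes at the minimum edit distance are reported, and only if
    that distance is within `max_error` (the -mr argument, default 2).
    """
    dists = {bc: min_dist_to_substring(bc, segment) for bc in barcodes}
    best = min(dists.values(), default=None)
    if best is None or best > max_error:
        return (read_id, None, 0, segment, [])  # assumed form for "no match"
    hits = sorted(bc for bc, dist in dists.items() if dist == best)
    return (read_id, best, len(hits), segment, hits)
```

A read whose third column is 1 is assigned to a single barcode; larger values mean several barcodes tie at the minimum edit distance.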

Citing scTagger

scTagger was first accepted to RECOMB-seq 2022 and is now published in iScience:

Ghazal Ebrahimi, Baraa Orabi, Meghan Robinson, Cedric Chauve, Ryan Flannigan, and Faraz Hach. "Fast and accurate matching of cellular barcodes across short-and long-reads of single-cell RNA-seq experiments." iScience (2022). DOI:10.1016/j.isci.2022.104530

Please check the paper branch of this repository for the archived paper experiments and implementation.