scTagger

scTagger matches cellular barcodes between the short and long reads of single-cell RNA-seq experiments. This makes it possible to relate, at the cell level, gene expression (from the short reads) to RNA splicing (from the long reads).

Installation

Conda

scTagger is available as a Conda package:

conda create -n sctagger-env -c bioconda sctagger 
conda activate sctagger-env
scTagger.py -h

Running with Snakemake

We provide a simple Snakefile alongside a config.yaml file that runs the three stages of scTagger as well as Cell Ranger (this assumes Cell Ranger is on your PATH).

Running manually

scTagger is a single Python script with several subcommands for matching long-read and short-read barcodes.

The pipeline consists of three steps, each of which can be run separately:

1) Extract long-reads segment

The first step of the scTagger pipeline extracts, from each long read, the segment that is most likely to contain the barcode. To run this step, use the following command:

./scTagger.py extract_lr_bc -r "path/to/long/read/fastq" -o "path/to/output/file" -p "path/to/output/plots"

Arguments

-r: Space-separated paths to the reads in FASTQ format
-g: Space-separated ranges in which the SR adapter should be found on the LRs (Optional, Default: Detect from data)
-z: Indicate input is gzipped (Optional, Default: Assume input is gzipped if it ends with ".gz")
-t: Number of threads (Optional, Default: 1)
-sa: Short-read adapter (Optional, Default: CTACACGACGCTCTTCCGATCT)
--num-bp-afte: Number of bases after the end of the SR adapter alignment to generate (Optional, Default: 20)
-o: Path to output file
-p: Path to plot file (Optional, Default: No plotting)

Inputs

A list of FASTQ files of long-reads

Outputs

A TSV file:
- First column is the read ID
- Second column is the best edit distance to the short-read adapter
- Third column is the start position of the adapter match on the long read
- Fourth column is the extracted long-read segment
A plot of the optimal alignment locations of the short-read adapter on the long reads.
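
As an illustration of what this step computes, the sketch below locates the best approximate match of the short-read adapter in a read with a plain dynamic-programming scan, then takes the bases that follow it. It is a minimal sketch under stated assumptions, not scTagger's implementation: the helper names `find_adapter_end` and `extract_segment` are hypothetical, only the forward strand is considered, and the segment is assumed to be simply the `num_bp_after` bases after the end of the adapter alignment.

```python
def find_adapter_end(read, adapter):
    """Return (edit_distance, end_index) of the best adapter match in the read.

    Semi-global alignment: the adapter must be aligned in full, but the match
    may start and end anywhere in the read.
    """
    m = len(adapter)
    # prev[i] = cost of aligning adapter[:i] to a read substring ending at the
    # previous read position. Before any read base is consumed, that cost is i.
    prev = list(range(m + 1))
    best_dist, best_end = prev[m], 0
    for j, base in enumerate(read, start=1):
        cur = [0] * (m + 1)  # cur[0] = 0: a match may start at any position
        for i in range(1, m + 1):
            mismatch = 0 if adapter[i - 1] == base else 1
            cur[i] = min(prev[i - 1] + mismatch,  # match or substitution
                         prev[i] + 1,             # extra base in the read
                         cur[i - 1] + 1)          # adapter base missing
        if cur[m] < best_dist:
            best_dist, best_end = cur[m], j
        prev = cur
    return best_dist, best_end


def extract_segment(read, adapter="CTACACGACGCTCTTCCGATCT", num_bp_after=20):
    """Return (edit_distance, adapter_end, segment following the adapter)."""
    dist, end = find_adapter_end(read, adapter)
    return dist, end, read[end:end + num_bp_after]


if __name__ == "__main__":
    toy_read = "GGGG" + "CTACACGACGCTCTTCCGATCT" + "ACGTACGTACGTACGTTTTTAAAA"
    print(extract_segment(toy_read))
```

The quadratic scan is only meant to make the idea concrete; it would be far too slow for real long-read data sets.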

2) Extract short-reads barcodes

The second step extracts the most frequent short-read barcodes, which together cover most of the reads.

./scTagger.py extract_sr_bc -i "path/to/bam/file" -o "path/to/output/file" -p "path/to/output/plot"

Arguments

-i: Input file
-o: Path to output file.
-p: Path to plot file (Optional, Default: No plotting)
--thresh: Percentage threshold required per step to continue adding read barcodes (Optional, Default: 0.005)
--step-size: Number of barcodes processed at a time and whose sum is used to check against the threshold (Optional, Default: 1000)
--max-barcode-cnt: Max number of barcodes to keep (Optional, Default: 25000)

Input

A BAM file of short-read data

Output

A TSV file
- First column is barcodes
- Second column is the number of appearances of the barcode
A cumulative plot of SR coverage with batches of 1,000 barcodes
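
One plausible reading of how `--thresh`, `--step-size`, and `--max-barcode-cnt` interact is sketched below. The function `select_top_barcodes` is hypothetical and the stopping rule is an assumption inferred from the argument descriptions above, not taken from the scTagger source: barcodes are ranked by read count and added one batch at a time, stopping when a batch contributes less than `thresh` of all reads or when the cap is reached.

```python
def select_top_barcodes(counts, thresh=0.005, step_size=1000,
                        max_barcode_cnt=25000):
    """Return the kept (barcode, count) pairs, most frequent first.

    counts: dict mapping barcode -> number of reads carrying it.
    """
    total = sum(counts.values())
    if total == 0:
        return []
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    kept = []
    for start in range(0, len(ranked), step_size):
        batch = ranked[start:start + step_size]
        # Fraction of all reads covered by this batch of barcodes.
        batch_fraction = sum(count for _, count in batch) / total
        if batch_fraction < thresh:
            break  # assumed rule: this batch adds too little coverage
        kept.extend(batch)
        if len(kept) >= max_barcode_cnt:
            return kept[:max_barcode_cnt]
    return kept


if __name__ == "__main__":
    toy_counts = {"A": 50, "B": 30, "C": 15, "D": 4, "E": 1}
    print(select_top_barcodes(toy_counts, thresh=0.1, step_size=2))
```

In the toy call, the first two batches cover 80% and 19% of the reads and are kept, while the last batch covers 1% and ends the selection.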

Alt. 2) Extract short-reads barcodes directly from long-reads

This is an alternative to the second step that avoids using the short reads altogether and instead builds a whitelist of cellular barcodes directly from the long reads. This is done by looking for exact matches of the 10x Chromium list of cellular barcodes on the long-read barcode segments. The barcodes are sorted by frequency, and the most frequent barcodes are kept using the same strategy as the extract_sr_bc module.

./scTagger.py extract_sr_bc_from_lr -i "path/to/long-read-segments" -wl "/path/to/10x-barcode-list.txt" -o "path/to/output.txt"

Arguments

-i: Input TSV file containing the long-read segments generated by the extract_lr_bc step
-o: Path to output file.
-wl: Path to the 10x Genomics cellular barcode whitelist (e.g. 3M-february-2018.txt.gz). Accepts both .txt.gz and .txt files.
--thresh: Percentage threshold required per step to continue adding read barcodes (Optional, Default: 0.005)
--step-size: Number of barcodes processed at a time and whose sum is used to check against the threshold (Optional, Default: 1000)
--max-barcode-cnt: Max number of barcodes to keep (Optional, Default: 25000)

Input

The output file of the extract_lr_bc step
The 10x Genomics cellular barcode whitelist (e.g. 3M-february-2018.txt.gz)

Output

A TSV file
- First column is barcodes
- Second column is the number of appearances of the barcode
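
The exact-match counting described for this step can be sketched as follows. This is an illustrative sketch only: `count_whitelist_hits` is a hypothetical helper, the 16-base barcode length is an assumption (it matches the default of `-bl` in the matching step), and counting each barcode at most once per segment is a design choice of this sketch that may differ from what scTagger does.

```python
from collections import Counter


def count_whitelist_hits(segments, whitelist, barcode_length=16):
    """Count, per whitelist barcode, the segments containing it exactly.

    segments: iterable of long-read segment strings.
    whitelist: set of known barcodes, all of length barcode_length.
    """
    counts = Counter()
    for segment in segments:
        found = set()
        # Slide a barcode-sized window along the segment.
        for i in range(len(segment) - barcode_length + 1):
            window = segment[i:i + barcode_length]
            if window in whitelist:
                found.add(window)
        counts.update(found)  # each barcode counted once per segment
    return counts.most_common()  # most frequent first


if __name__ == "__main__":
    toy_whitelist = {"ACGT", "TTTT"}
    toy_segments = ["GGACGTGG", "ACGTACGT", "CCCCCCCC", "TTTTT"]
    print(count_whitelist_hits(toy_segments, toy_whitelist, barcode_length=4))
```

Storing the whitelist as a Python `set` keeps each window lookup at constant expected time, which matters because the 10x whitelists contain millions of entries.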

3) Match long-reads segment with short-reads barcodes

The last step matches the long-read segments against the barcodes selected from the short reads.

./scTagger.py match_trie -lr "path/to/output/extract/long-read/segment" -sr "path/to/output/extract/top/short-read" -o "path/to/output/file" -t "number of threads"

Arguments

-lr: Long-read segments TSV file
-sr: Short-read barcode list TSV file
-mr: Maximum number of errors allowed for barcode matching (Optional, Default: 2)
-m: Maximum number of GB of RAM to be used (Optional, Default: 16.0)
-bl: Length of barcodes (Optional, Default: 16)
-t: Number of threads to use for searching (Optional, Default: 16)
-p: Path of plot file
-o: Path to output file. Output file is gzipped

Inputs

The inputs to this step are the outputs of the long-read segment extraction step and of the top-barcode selection step.

Outputs

A TSV file
- First column is the read ID
- Second column is the minimum edit distance
- Third column is the number of short-read barcodes that match the long read
- Fourth column is the long-read segment
- Fifth column is a list of all short-read barcodes at the minimum edit distance
A bar plot that shows the number of long reads by the minimum edit distance of their matched barcode
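
The subcommand name suggests a trie-based search. The sketch below shows the generic textbook technique of walking a trie of barcodes while carrying one row of the edit-distance table, pruning any branch whose row already exceeds the error budget. It is not scTagger's implementation: `build_trie`, `search`, and `best_match` are hypothetical names, the query is assumed to be a single barcode-sized candidate string, and how candidates are taken from the long-read segment is not covered here.

```python
END = "$"  # sentinel key marking the end of a complete barcode


def build_trie(barcodes):
    """Build a nested-dict trie; each leaf stores its barcode under END."""
    root = {}
    for barcode in barcodes:
        node = root
        for base in barcode:
            node = node.setdefault(base, {})
        node[END] = barcode
    return root


def search(trie, query, max_error=2):
    """Return {barcode: edit_distance} for barcodes within max_error edits."""
    hits = {}

    def walk(node, base, prev_row):
        # Extend the edit-distance table by one row for this trie edge.
        row = [prev_row[0] + 1]
        for col in range(1, len(query) + 1):
            row.append(min(row[col - 1] + 1,
                           prev_row[col] + 1,
                           prev_row[col - 1] + (query[col - 1] != base)))
        if END in node and row[-1] <= max_error:
            hits[node[END]] = row[-1]
        if min(row) <= max_error:  # otherwise no descendant can qualify
            for nxt, child in node.items():
                if nxt != END:
                    walk(child, nxt, row)

    first_row = list(range(len(query) + 1))
    for base, child in trie.items():
        if base != END:
            walk(child, base, first_row)
    return hits


def best_match(trie, query, max_error=2):
    """Return (min_distance, sorted barcodes at that distance), or None."""
    hits = search(trie, query, max_error)
    if not hits:
        return None
    best = min(hits.values())
    return best, sorted(bc for bc, dist in hits.items() if dist == best)


if __name__ == "__main__":
    toy_trie = build_trie(["ACGT", "ACGA", "TTTT"])
    print(best_match(toy_trie, "ACGC", max_error=1))
```

The pair returned by `best_match` mirrors the minimum edit distance and the list of barcodes at that distance reported in the output file, with `max_error` playing the role of `-mr`.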

Citing scTagger

scTagger was first accepted to RECOMB-seq 2022 and is now published in iScience:

Ghazal Ebrahimi, Baraa Orabi, Meghan Robinson, Cedric Chauve, Ryan Flannigan, and Faraz Hach. "Fast and accurate matching of cellular barcodes across short-and long-reads of single-cell RNA-seq experiments." iScience (2022). DOI:10.1016/j.isci.2022.104530

Please check the paper branch of this repository for the archived paper experiments and implementation.