martinjzhang/scDRS

scDRS (single-cell disease-relevance score) is a method for associating individual cells in single-cell RNA-seq data with disease GWASs, built on top of AnnData and Scanpy.

Read the documentation: installation, usage, command-line interface (CLI), file formats, etc.

Check out instructions for making customized gene sets using MAGMA.

Reference

Zhang*, Hou*, et al. "Polygenic enrichment distinguishes disease associations of individual cells in single-cell RNA-seq data", Nature Genetics, 2022.

Versions

  • v1.0.3: development version. Fixes a bug that produced negative ct_mean values when --adj-prop and --cov are both on and some genes have extremely low expression; prints --adj-prop info in scdrs compute-score; checks that the gene column in p-value and z-score files has the header GENE; forces the index of df_cov and df_score to be str; adds --min-genes and --min-cells to the CLI for customized filtering; adds an adjustable FDR threshold for plot_group_stats (#75).
  • v1.0.2: latest stable version. Bug fixes in scdrs.util.plot_group_stats; input checks in scdrs munge-gs and scdrs.util.load_h5ad.

Older versions

  • v1.0.1: stable version used in publication. Identical to v1.0.0 except documentation.
  • v1.0.0: stable version used in revision 1. Results are identical to v0.1 for binary gene sets. Changes with respect to v0.1:
    • scDRS command-line interface (CLI) instead of .py scripts for calling scDRS in bash, including scdrs munge-gs, scdrs compute-score, and scdrs perform-downstream.
    • Lower memory use, because sparse matrices are used throughout the computation.
    • Support for quantitative gene weights.
    • New feature --adj-prop for adjusting for cell-type proportions.
  • v0.1: stable version used in the initial submission.

Code and data to reproduce results of the paper

See scDRS_paper for more details (experiments folder is deprecated). Data are at figshare.

  • Download GWAS gene sets (.gs files) for 74 diseases and complex traits.
  • Download scDRS results (.score.gz and .full_score.gz files) for TMS FACS + 74 diseases/traits.

Older versions

Explore scDRS results via CELLxGENE

  • 110,096 cells from 120 cell types in TMS FACS
  • IBD-associated cells

scDRS scripts (deprecated)


NOTE: scDRS scripts are still maintained but deprecated. Consider using scDRS command-line interface instead.


scDRS script for score calculation

Input: scRNA-seq data (.h5ad file) and gene set file (.gs file)

Output: scDRS score file ({trait}.score.gz file) and full score file ({trait}.full_score.gz file) for each trait in the .gs file
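
To make the "one pair of output files per trait" rule concrete, here is a minimal sketch that lists the expected output paths for a gene set file. It assumes the .gs file is tab-separated with header columns TRAIT and GENESET (check the file format documentation). The helper name `expected_outputs` and the toy trait names are hypothetical and not part of scDRS.

```python
import csv
import io


def expected_outputs(gs_text, out_folder):
    """Map each trait in a .gs file to its (score, full_score) output paths.

    Assumes a tab-separated layout with a TRAIT column; this layout is an
    assumption for illustration, not something stated in this README.
    """
    reader = csv.DictReader(io.StringIO(gs_text), delimiter="\t")
    outputs = {}
    for row in reader:
        trait = row["TRAIT"]
        outputs[trait] = (
            f"{out_folder}/{trait}.score.gz",
            f"{out_folder}/{trait}.full_score.gz",
        )
    return outputs


# Toy gene set file with two traits (hypothetical names and genes).
gs_text = (
    "TRAIT\tGENESET\n"
    "toy_trait1\tGene1,Gene2\n"
    "toy_trait2\tGene3,Gene4\n"
)

for trait, (score, full_score) in expected_outputs(gs_text, "out").items():
    print(trait, score, full_score)
```

Running the script on a .gs file with two traits should therefore leave four files in the output folder.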

h5ad_file=your_scrnaseq_data
cov_file=your_covariate_file
gs_file=your_gene_set_file
out_dir=your_output_folder

python compute_score.py \
    --h5ad_file ${h5ad_file}.h5ad \
    --h5ad_species mouse \
    --cov_file ${cov_file}.cov \
    --gs_file ${gs_file}.gs \
    --gs_species human \
    --flag_filter True \
    --flag_raw_count True \
    --n_ctrl 1000 \
    --flag_return_ctrl_raw_score False \
    --flag_return_ctrl_norm_score True \
    --out_folder ${out_dir}
  • --h5ad_file (.h5ad file) : scRNA-seq data
  • --h5ad_species ("hsapiens"/"human"/"mmusculus"/"mouse") : species of the scRNA-seq data samples
  • --cov_file (.cov file) : covariate file (optional, .tsv file, see file format)
  • --gs_file (.gs file) : gene set file (see file format)
  • --gs_species ("hsapiens"/"human"/"mmusculus"/"mouse") : species for genes in the gene set file
  • --flag_filter ("True"/"False") : whether to apply minimal filtering of cells and genes
  • --flag_raw_count ("True"/"False") : whether the input is raw counts; if "True", normalization (size-factor + log1p) is applied
  • --n_ctrl (int) : number of control gene sets (default 1,000)
  • --flag_return_ctrl_raw_score ("True"/"False") : whether to return raw control scores
  • --flag_return_ctrl_norm_score ("True"/"False") : whether to return normalized control scores
  • --out_folder : output folder. Score files will be saved as {out_folder}/{trait}.score.gz (see file format)
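
The "size-factor + log1p" normalization mentioned for --flag_raw_count can be sketched in plain Python as follows. This is an illustration of the general technique, not the script's actual code. The target total of 10,000 counts per cell is an assumed value, and `normalize_log1p` is a hypothetical helper name.

```python
import math


def normalize_log1p(counts, target_sum=1e4):
    """Scale each cell to `target_sum` total counts, then apply log1p.

    `counts` is a list of per-cell lists of raw counts (cells x genes).
    `target_sum` is an assumption made for this sketch.
    """
    normalized = []
    for cell in counts:
        total = sum(cell)
        # Cells with no counts stay all-zero instead of dividing by zero.
        scale = target_sum / total if total > 0 else 0.0
        normalized.append([math.log1p(c * scale) for c in cell])
    return normalized


# Two toy cells with three genes each.
raw = [[1, 3, 0], [10, 0, 10]]
norm = normalize_log1p(raw)
print(norm)
```

If the data in the .h5ad file are already log-normalized, applying this step a second time would distort the expression values, which is why the flag should then be set to "False".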

scDRS script for downstream applications

Input: scRNA-seq data (.h5ad file), gene set file (.gs file), and scDRS full score files (.full_score.gz files)

Output: {trait}.scdrs_ct.{cell_type} file (same as the new {trait}.scdrs_group.{cell_type} file) for cell type-level analyses (association and heterogeneity); {trait}.scdrs_var file (same as the new {trait}.scdrs_cell_corr file) for cell variable-disease association; {trait}.scdrs_gene file for disease gene prioritization.

h5ad_file=your_scrnaseq_data
out_dir=your_output_folder

# --flag_raw_count is "False" here because the toy data is already log-normalized;
# set it to "True" if your data is not log-normalized.
python compute_downstream.py \
    --h5ad_file ${h5ad_file}.h5ad \
    --score_file @.full_score.gz \
    --cell_type cell_type \
    --cell_variable causal_variable,non_causal_variable,covariate \
    --flag_gene True \
    --flag_filter False \
    --flag_raw_count False \
    --out_folder ${out_dir}
  • --h5ad_file (.h5ad file) : scRNA-seq data
  • --score_file (.full_score.gz files) : scDRS full score files; supporting use of "@" to match strings
  • --cell_type (str) : cell type column (supporting multiple columns separated by comma); must be present in adata.obs.columns; used for cell type-disease association analyses (5% quantile as test statistic) and detecting association heterogeneity within cell type (Geary's C as test statistic)
  • --cell_variable (str) : cell-level variable columns (supporting multiple columns separated by comma); must be present in adata.obs.columns; used for cell variable-disease association analyses (Pearson's correlation as test statistic)
  • --flag_gene ("True"/"False") : whether to correlate scDRS disease scores with gene expression
  • --flag_filter ("True"/"False") : whether to apply minimal filtering of cells and genes
  • --flag_raw_count ("True"/"False") : whether the input is raw counts; if "True", normalization (size-factor + log1p) is applied
  • --out_folder : output folder. Result files will be saved as {out_folder}/{trait}.scdrs_ct.{cell_type} for cell type-level analyses (association and heterogeneity); {out_folder}/{trait}.scdrs_var for cell variable-disease association; {out_folder}/{trait}.scdrs_gene for disease gene prioritization. (see file format)
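
As an illustration of how a test statistic such as Pearson's correlation can be combined with the control scores stored in a full score file, the sketch below runs a Monte Carlo test. It compares the observed correlation between a cell-level variable and the disease score with the correlations obtained from each set of control scores. The one-sided p-value formula, the simulated data, and the helper names `pearson` and `mc_pvalue` are assumptions made for this example, not the script's actual implementation.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


def mc_pvalue(variable, disease_score, control_scores):
    """One-sided Monte Carlo p-value for a positive association.

    The observed correlation is compared with the correlation obtained
    from each set of control scores (assumed formula for this sketch).
    """
    observed = pearson(variable, disease_score)
    n_ge = sum(1 for ctrl in control_scores if pearson(variable, ctrl) >= observed)
    return observed, (1 + n_ge) / (1 + len(control_scores))


# Simulated data: 200 cells, 99 control score sets (all values are synthetic).
rng = random.Random(0)
n_cells, n_ctrl = 200, 99
variable = [rng.gauss(0, 1) for _ in range(n_cells)]
disease_score = [v + rng.gauss(0, 0.5) for v in variable]
control_scores = [[rng.gauss(0, 1) for _ in range(n_cells)] for _ in range(n_ctrl)]

observed, pval = mc_pvalue(variable, disease_score, control_scores)
print(f"correlation={observed:.3f}, MC p-value={pval:.3f}")
```

With this scheme the smallest attainable p-value is 1 / (1 + number of control sets), so a larger number of control gene sets gives finer resolution.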