ToryDeng/scRNA-FeatureSelection

scRNA-FeatureSelection

Evaluation of several gene selection methods (including ensemble gene selection methods).

‼️This repo is no longer being maintained. Please refer to the new repo, which includes benchmarks of feature selection methods for both scRNA-seq and SRT.

Program Structure

│  main.py
│          
├─cache
│  │  
│  ├─geneData       # store selected genes
│  └─preprocessedData  # store preprocessed datasets
│                   
├─common_utils
│      __init__.py
│      utils.py           #  common utils
│ 
├─config
│      __init__.py                  
│      datasets_config.py   
│      experiments_config.py 
│      methods_config.py
│      
├─data_loader
│      __init__.py
│      dataset.py         # load and preprocess datasets
│      utils.py           # utils used in loading and preprocessing data
│      
├─experiments
│      __init__.py
│      metrics.py         # metrics used in batch correction, cell classification and cell clustering
│      recorders.py       # record the evaluation results and sink them to disk
│      run_experiments.py # run each experiment by calling the corresponding function
│      
├─figures                 # store the umap and t-sne figures
│      
├─other_steps
│      __init__.py
│      classification.py  # cell classification algorithms
│      clustering.py      # cell clustering algorithms
│      correction.py      # batch correction algorithms
│      
├─records                 # store the evaluation results and recorders
└─selection
        __init__.py
        fisher_score.py
        methods.py        # all feature selection algorithms
        nearest_centroid.py
        utils.py          # utils used in feature selection

Included Feature Selection Methods

Supervised Methods

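| Method                    | Language | Reference |
|---------------------------|----------|-----------|
| Random Forest             | Python   | [1]       |
| XGBoost                   | Python   | [2]       |
| LightGBM                  | Python   | [3]       |
| Nearest Shrunken Centroid | Python   | [4]       |
| scGeneFit                 | Python   | [5]       |
| CellRanger                | Python   | [6]       |
| Fisher Score              | Python   | [7]       |
| Mutual Information        | Python   | [8]       |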

Unsupervised Methods

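| Method      | Language | Reference |
|-------------|----------|-----------|
| Variance    | Python   | [9]       |
| CV          | Python   | [10]      |
| Seurat      | Python   | [11]      |
| Deviance    | R        | [12]      |
| M3Drop      | R        | [13]      |
| scmap       | R        | [14]      |
| FEAST       | R        | [15]      |
| scran       | R        | [16]      |
| triku       | Python   | [17]      |
| sctransform | R        | [18]      |
| GiniClust3  | Python   | [19]      |
| pagest      | Python   |           |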

Quality Control

Quality control uses the function that detects outliers in Besca.

Normalization

Normalization follows the method in Seurat and the implementation in Scanpy.

Reproduce Our Results

Before running the evaluation, specify the paths to the data (and to the marker genes, if you want to run the marker discovery experiment) in config/datasets_config.py:

class DatasetConfig:
    def __init__(self):
        self.data_path = "/path/to/datasets/"
        self.marker_path = "/path/to/marker/genes/"  # optional
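
A mistyped path will usually only surface once an experiment starts loading data, so it can help to check the configured directories up front. The sketch below is not part of the repository: `check_paths` is a hypothetical helper, and the `DatasetConfig` here is a stand-in that only mirrors the two attributes shown above (its constructor arguments are added for illustration).

```python
from pathlib import Path


class DatasetConfig:
    """Stand-in mirroring the two attributes shown above (not the repo's full class)."""

    def __init__(self, data_path="/path/to/datasets/", marker_path="/path/to/marker/genes/"):
        self.data_path = data_path
        self.marker_path = marker_path  # optional


def check_paths(cfg, need_markers=False):
    """Hypothetical helper: return the config attributes that are not existing directories."""
    missing = []
    if not Path(cfg.data_path).is_dir():
        missing.append("data_path")
    # The marker path only matters for the marker discovery experiment.
    if need_markers and not Path(cfg.marker_path).is_dir():
        missing.append("marker_path")
    return missing


if __name__ == "__main__":
    problems = check_paths(DatasetConfig(), need_markers=True)
    print("missing directories:", problems or "none")
```

Passing `need_markers=False` (the default) skips the marker directory, matching its optional status in the configuration.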

Then you can run a given experiment with a single line of code:

from experiments.run_experiments import run_cell_clustering, run_cell_classification

run_cell_clustering(fs_methods=['var', 'feast'])  # single FS methods
run_cell_classification(fs_methods=['lgb+rf'])  # ensemble FS method

All the records will be stored in the directory records/. The recorders in .pkl format are in records/pkl/, and the tables are in records/xlsx/.
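
To see what a run produced, the two output directories can be listed directly. This is a minimal sketch that relies only on the directory names stated above; `list_records` is a hypothetical helper, not a function from the repository. It does not try to unpickle the recorders, since that would likely require the repository's own classes to be importable.

```python
from pathlib import Path


def list_records(records_dir="records"):
    """Hypothetical helper: map each output format to the sorted file names found for it."""
    root = Path(records_dir)
    return {
        "pkl": sorted(p.name for p in (root / "pkl").glob("*.pkl")),
        "xlsx": sorted(p.name for p in (root / "xlsx").glob("*.xlsx")),
    }


if __name__ == "__main__":
    for fmt, names in list_records().items():
        print(f"{fmt}: {len(names)} file(s)")
        for name in names:
            print("  ", name)
```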

Evaluating new feature selection methods step by step

Here we present an easy way to evaluate new feature selection methods on all the datasets we used. If you want to test on only a few datasets, please check the notebook for examples.

  1. Add new methods to the function single_select_by_batch() in selection/methods.py:

    elif method == 'deviance':
        selected_genes_df = deviance_compute_importance(adata)
    elif method == 'abbreviation_1':
        selected_genes_df = your_new_function_1(adata)
    elif method == 'abbreviation_2':
        selected_genes_df = your_new_function_2(adata)
    else:
        raise NotImplementedError(f"No implementation of {method}!")
    • input of your new functions: an AnnData object in which adata.X is the scaled data after log-normalization and adata.raw is the data after quality control but before normalization. The log-normalized data is in adata.layers['log-normalized'], and the normalized data is in adata.layers['normalized'].
    • output of your new functions: a dataframe. The first column, named Gene, contains the gene names. The second column is optional and contains a score for each gene (if scores exist); the higher the score, the more important the gene.
  2. Modify the method configuration in config/methods_config.py:

    • in self.formal_names
    'feast': 'FEAST',
    'abbreviation_1': 'formal_name_1',
    'abbreviation_2': 'formal_name_2',
    'rf+fisher_score': 'RF+\nFisher Score',
    • unsupervised methods should be added to self.unsupervised, and supervised methods should be added to self.supervised
    self.unsupervised = ['abbreviation_1', 'var', 'cv2', ...]
    self.supervised = ['abbreviation_2', 'rf', 'lgb', 'xgb', ...]
  3. Then you can run an experiment as shown in the examples above:

    from experiments.run_experiments import run_cell_clustering
    
    run_cell_clustering(fs_methods=['abbreviation_1', 'abbreviation_2'])
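
As an illustration of the input and output contract from step 1, here is a minimal sketch of a variance-based scoring function. It is not one of the repository's implementations: the function name `variance_compute_importance` and the score column name `Importance` are assumptions made for this example (the text above only fixes the name of the first column, `Gene`). It reads the `log-normalized` layer and takes the gene names from `adata.var_names`.

```python
import numpy as np
import pandas as pd


def variance_compute_importance(adata):
    """Score genes by their variance across cells in the log-normalized layer.

    `adata` is expected to behave like an AnnData object: cells are rows, genes are
    columns, `adata.layers['log-normalized']` holds the matrix and `adata.var_names`
    holds the gene names.
    """
    layer = adata.layers["log-normalized"]
    # Layers may be SciPy sparse matrices; densify them for this small sketch.
    dense = layer.toarray() if hasattr(layer, "toarray") else np.asarray(layer)
    scores = dense.var(axis=0)  # one variance per gene (column)
    result = pd.DataFrame({"Gene": list(adata.var_names), "Importance": scores})
    # Higher score means a more important gene, so the most variable genes come first.
    return result.sort_values("Importance", ascending=False).reset_index(drop=True)


if __name__ == "__main__":
    from types import SimpleNamespace

    # Tiny stand-in for an AnnData object: 3 cells x 3 genes.
    toy = SimpleNamespace(
        layers={"log-normalized": np.array([[0.0, 1.0, 2.0],
                                            [0.0, 1.0, 0.0],
                                            [0.0, 2.0, 4.0]])},
        var_names=["g1", "g2", "g3"],
    )
    print(variance_compute_importance(toy))
```

Once a function of this shape is registered in single_select_by_batch() under an abbreviation and added to the method configuration, it can be passed to fs_methods like any other method.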
