scRNA-FeatureSelection

Evaluation of several gene selection methods (including ensemble gene selection methods).

‼️This repo is no longer being maintained. Please refer to the new repo, which includes benchmarks of feature selection methods for both scRNA-seq and SRT.

Program Structure

│  main.py
│          
├─cache
│  │  
│  ├─geneData       # store selected genes
│  └─preprocessedData  # store preprocessed datasets
│                   
├─common_utils
│      __init__.py
│      utils.py           #  common utils
│ 
├─config
│      __init__.py                  
│      datasets_config.py   
│      experiments_config.py 
│      methods_config.py
│      
├─data_loader
│      __init__.py
│      dataset.py         # load and preprocess datasets
│      utils.py           # utils used in loading and preprocessing data
│      
├─experiments
│      __init__.py
│      metrics.py         # metrics used in batch correction, cell classification and cell clustering
│      recorders.py       # record the evaluation results and sink them to disk
│      run_experiments.py # run each experiment by calling the corresponding function
│      
├─figures                 # store the umap and t-sne figures
│      
├─other_steps
│      __init__.py
│      classification.py  # cell classification algorithms
│      clustering.py      # cell clustering algorithms
│      correction.py      # batch correction algorithms
│      
├─records                 # store the evaluation results and recorders
└─selection
        __init__.py
        fisher_score.py
        methods.py        # all feature selection algorithms
        nearest_centroid.py
        utils.py          # utils used in feature selection

Included Feature Selection Methods

Supervised Methods

| Method | Language | Reference |
| --- | --- | --- |
| Random Forest | Python | [1] |
| XGBoost | Python | [2] |
| LightGBM | Python | [3] |
| Nearest Shrunken Centroid | Python | [4] |
| scGeneFit | Python | [5] |
| CellRanger | Python | [6] |
| Fisher Score | Python | [7] |
| Mutual Information | Python | [8] |

Unsupervised Methods

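| Method | Language | Reference |
| --- | --- | --- |
| Variance | Python | [9] |
| CV | Python | [10] |
| Seurat | Python | [11] |
| Deviance | R | [12] |
| M3Drop | R | [13] |
| scmap | R | [14] |
| FEAST | R | [15] |
| scran | R | [16] |
| triku | Python | [17] |
| sctransform | R | [18] |
| GiniClust3 | Python | [19] |
| pagest | Python | |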

Quality Control

Quality control uses the outlier detection function in Besca.

Normalization

Normalization uses the method from Seurat, as implemented in Scanpy.

Reproduce Our Results

Before running the evaluation, specify the paths to the data (and to the marker genes, if you want to run the marker discovery experiment) in config/datasets_config.py:

class DatasetConfig:
    def __init__(self):
        self.data_path = "/path/to/datasets/"
        self.marker_path = "/path/to/marker/genes/"  # optional
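
Placeholder paths left unchanged are an easy mistake to make, so it can help to check the configured paths before starting a long run. The following is only a minimal sketch: `find_missing_paths` is a hypothetical helper that is not part of this repo, and the `DatasetConfig` class below is a stand-in that mirrors the two attributes shown above.

```python
from pathlib import Path


def find_missing_paths(config, attrs=("data_path", "marker_path")):
    """Return a description of every configured path that does not exist.

    Hypothetical helper, not part of the repo. Attributes that are absent
    or set to None are skipped, since the marker path is optional.
    """
    missing = []
    for name in attrs:
        value = getattr(config, name, None)
        if value is None:
            continue
        if not Path(value).exists():
            missing.append(f"{name}: {value}")
    return missing


class DatasetConfig:
    """Stand-in with the same two attributes as config/datasets_config.py."""

    def __init__(self):
        self.data_path = "/path/to/datasets/"
        self.marker_path = "/path/to/marker/genes/"  # optional


if __name__ == "__main__":
    for entry in find_missing_paths(DatasetConfig()):
        print("Missing path ->", entry)
```

In the repo itself you would import the real `DatasetConfig` from `config.datasets_config` rather than redefining it.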

Then you can run a given experiment with a single line of code:

from experiments.run_experiments import run_cell_clustering, run_cell_classification

run_cell_clustering(fs_methods=['var', 'feast'])  # single FS methods
run_cell_classification(fs_methods=['lgb+rf'])  # ensemble FS method

All the records will be stored in the directory records/. The recorders in .pkl format are in records/pkl/, and the tables are in records/xlsx/.
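
To see what a run produced, you can list the files in those two directories. This is a minimal sketch that relies only on the directory layout described above; `list_records` is a hypothetical helper, not a function provided by this repo, and it makes no assumption about what is inside each file.

```python
from pathlib import Path


def list_records(records_dir="records"):
    """List recorder (.pkl) and table (.xlsx) file names under records_dir.

    Hypothetical helper: it only inspects records/pkl/ and records/xlsx/
    and does not open or parse any file.
    """
    root = Path(records_dir)
    return {
        "pkl": sorted(p.name for p in (root / "pkl").glob("*.pkl")),
        "xlsx": sorted(p.name for p in (root / "xlsx").glob("*.xlsx")),
    }


if __name__ == "__main__":
    for kind, names in list_records().items():
        print(f"{kind}: {len(names)} file(s)")
        for name in names:
            print("  ", name)
```

Unpickling a recorder with the standard `pickle` module will probably require running from the repo root so that the recorder classes can be imported.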

Evaluating new feature selection methods step by step

Here we present an easy way to evaluate new feature selection methods on all the datasets we used. If you want to test on only a few datasets, please check the notebook for examples.

Add new methods to the function single_select_by_batch() in selection/methods.py:
```
elif method == 'deviance':
    selected_genes_df = deviance_compute_importance(adata)
elif method == 'abbreviation_1':
    selected_genes_df = your_new_function_1(adata)
elif method == 'abbreviation_2':
    selected_genes_df = your_new_function_2(adata)
else:
    raise NotImplementedError(f"No implementation of {method}!")
```
- Input of your new functions: an AnnData object. adata.X holds the scaled data after log-normalization, and adata.raw holds the data after quality control but before normalization. The log-normalized data is in adata.layers['log-normalized'], and the normalized data is in adata.layers['normalized'].
- Output of your new functions: a dataframe. The first column, named Gene, contains the gene names. The second column is optional and contains a score for each gene (if scores exist). The higher the score, the more important the gene.
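
As an illustration of this input/output contract, here is a minimal sketch of a new method that scores each gene by its variance in the log-normalized layer. The function name `variance_compute_importance` and the score column name `Importance` are assumptions made for this example (the README only fixes the name of the `Gene` column), and sorting the result is optional.

```python
import numpy as np
import pandas as pd


def variance_compute_importance(adata, layer="log-normalized"):
    """Hypothetical example: score genes by variance of the log-normalized data.

    Only adata.layers[layer] and adata.var_names are used, so any object
    exposing these two attributes works.
    """
    X = adata.layers[layer]
    # Sparse matrices expose toarray(); dense arrays are used as-is.
    if hasattr(X, "toarray"):
        X = X.toarray()
    X = np.asarray(X, dtype=float)

    scores = X.var(axis=0)  # rows are cells, columns are genes

    result = pd.DataFrame({
        "Gene": np.asarray(adata.var_names),  # required first column
        "Importance": scores,                 # optional score column (assumed name)
    })
    # Higher score means more important, so put the top genes first.
    return result.sort_values("Importance", ascending=False).reset_index(drop=True)
```

Such a function could then be wired into single_select_by_batch() under an abbreviation of your choice, in the same way as the branches shown above.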

Modify the method configuration config/methods_config.py:

In self.formal_names, add the formal name for each new abbreviation:

'feast': 'FEAST',
'abbreviation_1': 'formal_name_1',
'abbreviation_2': 'formal_name_2',
'rf+fisher_score': 'RF+\nFisher Score',

Unsupervised methods should be added to self.unsupervised, and supervised methods should be added to self.supervised:

self.unsupervised = ['abbreviation_1', 'var', 'cv2', ...]
self.supervised = ['abbreviation_2', 'rf', 'lgb', 'xgb', ...]

Then you can run the experiment in the same way as in the examples above:

from experiments.run_experiments import run_cell_clustering

run_cell_clustering(fs_methods=['abbreviation_1', 'abbreviation_2'])
