This repository demonstrates the feasibility of using the OpenAI API and GPT models to perform bioinformatics analyses, with a focus on single-cell RNA sequencing (scRNA-seq) analysis. Utilizing simple prompts, users can execute major steps of scRNA-seq analysis and obtain insights with minimal to no background in bioinformatics. This project also illustrates how to connect to the OpenAI API using the openai
Python package.
The scAnalysis
package includes the following modules:
scAnalysis
├── init.py
├── read_and_merge_data.py
├── preprocess_and_normalize.py
├── identify_top_genes.py
├── cluster_top_genes.py
├── rank_and_visualize_markers.py
├── annotate_clusters.py
├── validate_canonical_markers.py
├── analyze_differential_expression.py
The Jupyter Notebook in this repository performs the following steps for scRNA-seq analysis, with all Python code obtained from the OpenAI API outputs:
- OpenAI API Integration Setup
- Load required packages
- Load scRNA-Seq data and Merge the datasets
- Remove cells with too few genes or too many genes, and genes detected in too few cells
- Normalize the gene expression measurements to account for differences in sequencing depth
- Log-transform the data for downstream analysis (data processing)
- Select genes that show high variation across cells (most informative for clustering)
- Scale the data to have zero mean and unit variance
- Perform Principal Component Analysis (PCA)
- Run clustering algorithms to identify distinct groups of cells
- Identifying marker genes to enhance visualization of distinct cellular populations
- Identify and confirm the cell types associated with each cluster (validation of cellular identities)
- Create mappings from clusters to identified cell types
- Validate the cell types by plotting expression levels of known canonical markers
- Identify genes that are differentially expressed between different cell populations or conditions
To set up your environment for running the analysis, follow these steps:
- Create a new Conda environment:
conda create -n scrna_analysis python=3.11.5
- Activate the environment:
conda activate scrna_analysis
- Install the required packages:
The required packages and versions are also listed in
conda install jupyter=7.0.6 leidenalg=0.10.1 openai=1.6.1 re=2.2.1 scanpy=1.9.6 numpy=1.26.3 pandas=2.1.4 matplotlib=3.8.2 seaborn=0.13.1 scipy=1.11.4 scikit-learn=1.3.2 scikit-misc=0.3.1
requirements.txt
Store your OpenAI API key in a .env file at the base directory of the repository in the format:
OPENAI_API_KEY="[INSERT KEY HERE]"
You can run the analysis either by executing the Jupyter Notebook openai_jupyter_integration.ipynb or by running the Python code output by GPT through main.py.
Perform scRNA-seq analysis.
positional arguments:
treatment Path to the treated sample for analysis.
control Path to the control (untreated) sample.
options:
-h, --help show this help message and exit
--save_table Save tables produced in the analysis.
--save_plot Save plots produced in the analysis.
- The data were obtained from 10x Genomics and produce by CellRanger. It comprises A549 lung carcinoma cells that expressed dCas9-KRAB and were transduced with a pool containing 93 total sgRNAs. Selected cells for each condition were individually frozen, then thawed and counted for analysis.
- The dataset can be obtained by searching
5k A549, Lung Carcinoma Cells, No Treatment Transduced with a CRISPR Pool
on the 10x Genomics website. The treatment sampleGene Expression - Feature / cell matrix (raw)
is under the 'Inputs/Library' tab, and the control sampleGene Expression - Feature / cell matrix (per-sample)
is under the 'No Treatment' tab. - Included in this repo are the corresponding HDF5 files, stored in the
data/lung_control
anddata/lung_treatment
folders.
- The Jupyter Notebook openai_jupyter_integration.ipynb is not guaranteed to be reproducible as intended since GPT outputs may vary, providing different analysis methods each time.
- To reproduce the analysis in this repository, please use main.py.
- Note that the analysis is not comprehensive and has not undergone rigorous validation of the identified cell types. This project mainly serves as a demonstration of using OpenAI's API for bioinformatics analyses.
Tarsus Lam