TeeSnee is an algorithm designed to take high-dimensional gene expression data and represent it in low-dimensional space.

CSE185 The Super Naïve Low Dimensional Embedding Clustering Method

This is the Group 16 final project for CSE185. It implements a t-SNE tool by following the algorithm description in the original t-SNE paper. The main application of TeeSnee is to cluster cells by their gene expression data to reveal cell types.

Built with: Python, NumPy, pandas, Matplotlib, scikit-learn, AnnData, Scanpy

Sample Data Download

Sample data (count matrix) taken from: https://www.10xgenomics.com/resources/datasets/human-brain-cancer-11-mm-capture-area-ffpe-2-standard

See data_processing.py for the commands needed to process the data into a gene-by-cell matrix.

Installation Instructions

  • Clone the teesnee repository and make sure pip is up to date with the following commands:
git clone https://github.com/m1ma0314/CSE185Group16_TeeSnee.git
cd CSE185Group16_TeeSnee
python -m ensurepip --upgrade
python -m pip install --upgrade pip
  • teesnee requires the numpy, pandas, matplotlib, and scikit-learn libraries. You can install them with pip:
pip install -r requirements.txt
  • Change the permissions of teesnee.py so that it can be executed:
chmod 777 teesnee.py

Basic Usage

The basic usage of teesnee is:

python teesnee.py [-p target_perplexity] [-z ifzipped] [-o output] filename

To run teesnee on a small test example (using files in this repo):

python teesnee.py -p 100 -o ./ minimal_dataset.csv
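
A command line of this shape could be parsed with Python's standard argparse module. The following is a minimal sketch, not the code in teesnee.py: the function name build_parser is hypothetical, and it assumes that -z is a plain on/off flag and that the perplexity is read as a float. The defaults follow the values documented in this README.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented teesnee options (illustrative only)."""
    parser = argparse.ArgumentParser(
        prog="teesnee.py",
        description="Embed a cell x gene matrix in low-dimensional space with t-SNE.",
    )
    # Required positional argument: the data file.
    parser.add_argument("filename", help="cell x gene matrix data file")
    # Assumption: perplexity is parsed as a float; the README default is 100.
    parser.add_argument(
        "-p", "--target_perplexity",
        type=float, default=100.0, metavar="PERPLEXITY",
        help="target perplexity (default: 100)",
    )
    # Assumption: -z is a boolean switch that takes no value.
    parser.add_argument(
        "-z", "--zipped", action="store_true",
        help="unzip the dataset file before reading it",
    )
    parser.add_argument(
        "-o", "--output", default="tsneplot.png", metavar="FILE",
        help="where to write the plot (default: tsneplot.png)",
    )
    return parser


# Parse the same arguments as the small test example.
args = build_parser().parse_args(["-p", "100", "-o", "./", "minimal_dataset.csv"])
print(args.filename, args.target_perplexity, args.zipped, args.output)
```
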

Complete usage instructions

The only required input to teesnee is a cell x gene matrix data file. Users may additionally specify the options below:

  • -p PERPLEXITY, --target_perplexity PERPLEXITY: Specify the target perplexity. The tsne function calculates the similarity matrix from this value and then generates the t-SNE plot. A higher perplexity value is associated with tighter clusters in the final output plot. If this option is not specified, the tsne function uses perplexity=100 by default.
  • -z ifzipped, --zipped: Unzip the dataset file if this argument is specified. By default, the data file is treated as unzipped and is converted directly into a matrix for further processing.
  • -o FILE, --output FILE: Write output to FILE. By default, output is written to tsneplot.png.
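
To illustrate how a target perplexity determines the similarity matrix, here is a minimal NumPy sketch of the standard approach from the t-SNE paper: for each point, a binary search finds the Gaussian precision whose conditional distribution has the requested perplexity. This is an illustration under stated assumptions, not the code in teesnee.py; the function name conditional_probabilities and its parameters tol and max_iter are hypothetical.

```python
import numpy as np


def conditional_probabilities(X, target_perplexity, tol=1e-5, max_iter=50):
    """Return the matrix of conditional similarities p_{j|i} for the rows of X."""
    n = X.shape[0]
    # Pairwise squared Euclidean distances.
    sq_norms = np.sum(X ** 2, axis=1)
    D = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    D = np.maximum(D, 0.0)

    P = np.zeros((n, n))
    log_target = np.log(target_perplexity)  # target entropy in nats
    for i in range(n):
        d_i = np.delete(D[i], i)  # distances to every other point
        beta, beta_lo, beta_hi = 1.0, 0.0, np.inf  # beta = 1 / (2 * sigma^2)
        for _ in range(max_iter):
            # Subtract the minimum distance for numerical stability.
            w = np.exp(-(d_i - d_i.min()) * beta)
            p = w / w.sum()
            entropy = -np.sum(p * np.log(p + 1e-12))
            diff = entropy - log_target
            if abs(diff) < tol:
                break
            if diff > 0:
                # Distribution too flat: narrow the Gaussian (raise beta).
                beta_lo = beta
                beta = beta * 2.0 if np.isinf(beta_hi) else (beta + beta_hi) / 2.0
            else:
                # Distribution too peaked: widen the Gaussian (lower beta).
                beta_hi = beta
                beta = (beta + beta_lo) / 2.0
        P[i, np.arange(n) != i] = p
    return P


rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))            # 30 synthetic "cells", 4 features
P_cond = conditional_probabilities(X, target_perplexity=5.0)
P_joint = (P_cond + P_cond.T) / (2 * X.shape[0])  # symmetrized similarities
print(P_joint.shape, P_joint.sum())
```

Note that the perplexity of a distribution over n - 1 neighbors cannot exceed n - 1, so a target of 100 is only reachable when the dataset contains more than 101 cells.
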

File Format

The output t-SNE plot is in png format.

[Figure: example t-SNE output plot (tsne_plot)]

Benchmarking Information

The benchmarking analysis is recorded in the benchmarking folder. Here is a plot comparing the time complexity of TeeSnee with that of Scanpy's t-SNE.

[Figure: time complexity comparison between TeeSnee and Scanpy's t-SNE]

P.S. TeeSnee's runtime still needs optimization TAT
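
A runtime comparison of this kind can be produced by timing an embedding function on datasets of increasing size. The sketch below shows one way to do that with the standard library; it is not the script in the benchmarking folder. The names time_embedding and quadratic_stand_in are hypothetical, and the stand-in function only mimics an all-pairs computation rather than running t-SNE.

```python
import time


def time_embedding(embed, sizes, make_data):
    """Time embed(make_data(n)) for each n in sizes; return (n, seconds) pairs."""
    results = []
    for n in sizes:
        data = make_data(n)
        start = time.perf_counter()
        embed(data)
        results.append((n, time.perf_counter() - start))
    return results


def quadratic_stand_in(data):
    """Placeholder workload that touches every pair of points, like a distance matrix."""
    total = 0.0
    for a in data:
        for b in data:
            total += (a - b) ** 2
    return total


timings = time_embedding(
    quadratic_stand_in,
    sizes=[100, 200, 400],
    make_data=lambda n: [float(i) for i in range(n)],
)
for n, seconds in timings:
    print(f"n={n}: {seconds:.4f} s")
```

To compare two tools, the same harness can be called once per embedding function with identical sizes, and the two lists of timings plotted against n.
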

Contributors

This repository was generated by Annabelle Coles and Mijia Ma, under the guidance of the original t-SNE paper and with inspiration from the article t-SNE from scratch.

Special thanks to CSE185 Professor Dr. Melissa Gymrek, TA Ryan Eveloff, TA Luisa Amaral, TA Himanshu.

Please submit a pull request with any corrections or suggestions. Your suggestions matter a lot to us <3
