Code to generate a KEGG pathway visualization for differentially expressed genes for iPSC-derived and primary cardiomyocytes

alanamer/20.440PSET6

iPSC-derived vs. Primary Cardiomyocytes Transcriptome Pathway Analysis

This repo contains the code to generate two graphs of differentially expressed genes in iPSC-derived vs. primary cardiomyocytes using transcriptome data from GSE146096 (Primary cardiomyocytes, https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE146096) and GSE226159 (iPSC-derived cardiomyocytes, https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM7065927) from the NIH Gene Expression Omnibus (GEO) database.

RNA count normalization and differential expression analysis were performed using the DESeq2 implementation in Python (PyDESeq2 https://bioconductor.org/packages/release/bioc/html/DESeq2.html), and we identified differentially expressed genes based on the criteria of adjusted p-value (FDR) < 0.05 and |log2 fold change| > 2. These differentially expressed genes were then passed to the package PyKEGG (https://pypi.org/project/pykegg/), which allows visualization of KEGG information using a network approach.
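Significance filtering of this kind can be sketched with pandas. This is a minimal illustration rather than the repository's own code: the column names `padj` and `log2FoldChange` follow the PyDESeq2 results table, while the function name `select_de_genes`, its parameters, and the toy values are assumptions made for the example.

```python
import pandas as pd


def select_de_genes(
    results: pd.DataFrame, padj_max: float = 0.05, lfc_min: float = 2.0
) -> pd.DataFrame:
    """Return rows that pass both the FDR cutoff and the fold-change cutoff."""
    # Rows with a missing adjusted p-value compare as False and are dropped.
    significant = results["padj"] < padj_max
    large_effect = results["log2FoldChange"].abs() > lfc_min
    return results[significant & large_effect]


# Toy results table (made-up values, for illustration only).
toy = pd.DataFrame(
    {
        "log2FoldChange": [3.1, -2.7, 0.4, 5.0],
        "padj": [0.001, 0.03, 0.0001, 0.2],
    },
    index=["GENE_A", "GENE_B", "GENE_C", "GENE_D"],
)
de_genes = select_de_genes(toy)
print(de_genes.index.tolist())
```

Taking the absolute value of the fold change keeps both up- and down-regulated genes, which is what a |log2 fold change| threshold implies.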

Motivation

In vitro human organ models offer the potential for a more cost-effective and efficient drug development pipeline, yet there is still a critical lack of understanding of how cells derived from induced pluripotent stem cells (iPSCs) differ from the primary cells they stand in for in these models. Elucidating the transcriptomic distinctions between cells of different developmental origins is imperative to identify the optimal method for mimicking primary cell responses, substantiate the equivalence of stem cell-derived cells, and inform future in vitro model development. There are few in-depth means of verifying the fidelity of stem cell-derived lineages to primary tissues. Current validation methods include expensive functional assays and/or checking for expression of a few select markers, which can be co-expressed across cell types of similar lineages. We seek to provide an open-source, standardized framework for transcriptomic analysis to allow researchers to validate their differentiation protocols more comprehensively against primary cell types. This repo offers one step in the analysis process, where differentially expressed genes can be viewed within a pathway relevant to the cell type.

Installing

To re-run all of the analyses, you'll first need to install the required modules.

Please do this within a Python 3 environment. The %pip lines below are written for a notebook (Jupyter/IPython); from a terminal, run the same commands with plain pip instead. You will also need the imports that follow in your Python file.

%pip install pydeseq2
%pip install scanpy
%pip install sanbomics
%pip install bioinfokit
%pip install pykegg
%pip install requests-cache
%pip install seaborn
%pip install pillow

from pydeseq2.dds import DeseqDataSet
from pydeseq2.ds import DeseqStats

import pandas as pd
import seaborn as sns
import scanpy as sc
import numpy as np
import pykegg
import requests_cache
from PIL import Image

Directory Structure and Reproducing analyses

The repo contains three folders within main: data_files, code, and figures.

data_files contains the TSV files of expression data used to generate the figures. code contains the Python file that creates the figures. figures contains the generated figures.

To reproduce the analyses, download the data_files folder and the code file. In the Python file, edit the file paths to match the locations where you placed the data. Install the required packages described above. You can then run the code to generate the three figures. The first two figures use the transcriptome data from "MetaData_iPSCvsPrimary.tsv" (described below), which is processed with PyDESeq2 to obtain differential expression metrics between the groups. These results are stored in "res.tsv", which PyKEGG uses to plot the pathway map highlighting which genes are differentially expressed. The pathway ID can be changed to any KEGG pathway you wish; here I have chosen cardiac muscle contraction and vascular smooth muscle contraction for their relevance to the cell type.
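The differential expression step can be sketched roughly as follows. This is an outline under stated assumptions, not the repository's script: the helper names `build_metadata` and `run_deseq`, the sample-naming rule, and the layout of the TSV (genes as rows, samples as columns) are all hypothetical. The PyDESeq2 calls use the `design_factors` argument; newer PyDESeq2 releases prefer a `design` formula, so check the version you have installed.

```python
import pandas as pd


def build_metadata(sample_names, ipsc_prefix="iPSC"):
    """Label each sample 'iPSC' or 'primary' from its name (hypothetical naming rule)."""
    conditions = [
        "iPSC" if name.startswith(ipsc_prefix) else "primary"
        for name in sample_names
    ]
    return pd.DataFrame({"condition": conditions}, index=list(sample_names))


def run_deseq(counts: pd.DataFrame, metadata: pd.DataFrame) -> pd.DataFrame:
    """Fit PyDESeq2 on a samples x genes count table and return the results table."""
    # Imported here so build_metadata can be used without PyDESeq2 installed.
    from pydeseq2.dds import DeseqDataSet
    from pydeseq2.ds import DeseqStats

    dds = DeseqDataSet(counts=counts, metadata=metadata, design_factors="condition")
    dds.deseq2()  # estimates size factors, dispersions, and log fold changes
    stats = DeseqStats(dds, contrast=["condition", "iPSC", "primary"])
    stats.summary()  # runs the Wald tests and fills stats.results_df
    return stats.results_df


# Example wiring (paths and file layout are assumptions; adjust to your copy):
# raw = pd.read_csv("data_files/MetaData_iPSCvsPrimary.tsv", sep="\t", index_col=0)
# counts = raw.T  # PyDESeq2 expects samples as rows and genes as columns
# results = run_deseq(counts, build_metadata(counts.index))
# results.to_csv("data_files/res.tsv", sep="\t")

metadata = build_metadata(["iPSC_1", "iPSC_2", "Primary_1"])
print(metadata["condition"].tolist())
```

Writing the results table to a TSV keeps the slow DESeq2 fit separate from the plotting step, so the pathway figures can be regenerated without refitting.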

The last figure, a plot of the top 20 GO pathways associated with differentially expressed genes in iPSC-derived vs. primary cardiomyocytes, is produced from the data file "OG_primary.tsv", which already contains the GO pathways. This analysis was also completed in DESeq2. All you need to do is modify the file path if necessary.
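Selecting the top pathways from a GO table can be sketched as below. The column names `Term` and `NES`, the function name `top_terms`, and the choice to rank by absolute score are assumptions for illustration; the columns in "OG_primary.tsv" and the ranking used in the repository may differ.

```python
import pandas as pd


def top_terms(go_table: pd.DataFrame, score_col: str = "NES", n: int = 20) -> pd.DataFrame:
    """Return the n rows with the largest absolute score, strongest first."""
    return go_table.sort_values(
        score_col, key=lambda s: s.abs(), ascending=False
    ).head(n)


# Toy GO table (made-up values, for illustration only).
toy = pd.DataFrame(
    {
        "Term": ["term_a", "term_b", "term_c", "term_d"],
        "NES": [1.2, -2.5, 0.3, 2.1],
    }
)
top = top_terms(toy, n=2)
print(top["Term"].tolist())
```

The returned table can then be handed to a plotting call such as seaborn's `barplot(data=top, x="NES", y="Term")` to draw a horizontal bar chart.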

data

All data-related files are in data_files/:

"MetaData_iPSCvsPrimary.tsv" contains compiled transcriptome data from GSE146096 (Primary cardiomyocytes, https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE146096) and GSE226159 (iPSC-derived cardiomyocytes, https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM7065927) from the NIH Gene Expression Omnibus (GEO) database.

"res.tsv" contains the input for the PyKEGG figure generation. This file is generated by running DESeq2 on the primary and iPSC-derived cardiomyocyte transcriptome data.

"OG_primary.tsv" contains the GO pathways associated with differentially expressed genes in iPSC-derived vs. primary cardiomyocytes and is used to produce the figure of the top 20 pathways. This file was generated by running DESeq2 on the primary and iPSC-derived cardiomyocyte transcriptome data described above.

code

All of the code is in the code/ folder:

"alana_GO_primary_vs._diff.py" contains all the Python code needed to generate the three figures.

figures

The generated figures are in the figures/ folder:

"Cardiac_muscle_contraction.png" displays the differentially expressed genes between iPSC-derived and primary cardiomyocytes in the context of the cardiac muscle contraction KEGG pathway.

"Vascular_smooth_muscle_contaction" displays the differentially expressed genes between iPSC-derived and primary cardiomyocytes in the context of the vascular smooth muscle contraction KEGG pathway.

"Top_20_GO" displays the top 20 differentially expressed GO pathways based on normalized enrichment score for iPSC-derived and primary cardiomyocytes.

*Currently, only the cardiac muscle contraction figure is saved as a PNG, but you can easily change the KEGG pathway ID (pid) at the commented location in the code to generate the vascular smooth muscle contraction figure instead.
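One way to keep the pathway choice in a single place is a small lookup, sketched below. The identifiers are the human (hsa) KEGG IDs for these two pathways; the names `KEGG_PATHWAYS` and `output_name` and the output filenames are hypothetical, and the repository's script may name its setting differently.

```python
# Human KEGG pathway IDs mapped to illustrative output file stems.
KEGG_PATHWAYS = {
    "hsa04260": "Cardiac_muscle_contraction",
    "hsa04270": "Vascular_smooth_muscle_contraction",
}


def output_name(pid: str) -> str:
    """Build the PNG filename for a pathway ID, failing early on unknown IDs."""
    if pid not in KEGG_PATHWAYS:
        raise KeyError(f"Unknown pathway ID: {pid}")
    return f"{KEGG_PATHWAYS[pid]}.png"


print(output_name("hsa04270"))
```

Failing early on an unknown ID avoids silently writing a figure under the wrong name when the pid is changed by hand.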
