Skip to content

Commit 23e6788

Browse files
ajlee21danich1Jake Crawford
authored
Add new Pa results and Redo human analysis (#11)
* add start of gsea analysis * mv configs and update nbs * remove unused metadata files * add new template experiment run * add files for new Pa template experiment runs * add pre-processing script to human nb and rerun * re-run human analysis with dif num sim exp * update scripts * add viz of experiment * update modules to use read counts * update scripts to use DESeq * update readme to reflect new DESEq params * update human nb using DESeq and rerun * update gsea analysis and add validation * update viz nbs * add df for Deb * add newline Co-authored-by: David Nicholson <dnicholson329@gmail.com> * add newline Co-authored-by: David Nicholson <dnicholson329@gmail.com> * Update configs/config_pseudomonas_33245.tsv Co-authored-by: David Nicholson <dnicholson329@gmail.com> * Update configs/config_pseudomonas_7704.tsv Co-authored-by: David Nicholson <dnicholson329@gmail.com> * Update generic_expression_patterns_modules/GSEA_analysis.R Co-authored-by: David Nicholson <dnicholson329@gmail.com> * fix spacing * fix spacing * Update generic_expression_patterns_modules/GSEA_analysis.R Co-authored-by: Jake Crawford <kitttttens@yahoo.com> * Update generic_expression_patterns_modules/GSEA_analysis.R Co-authored-by: Jake Crawford <kitttttens@yahoo.com> * Update generic_expression_patterns_modules/DE_analysis.R Co-authored-by: Jake Crawford <kitttttens@yahoo.com> * Update human_analysis/nbconverted/2_identify_generic_genes_pathways.py Co-authored-by: Jake Crawford <kitttttens@yahoo.com> * update based on PR Co-authored-by: David Nicholson <dnicholson329@gmail.com> Co-authored-by: Jake Crawford <kitttttens@yahoo.com>
1 parent 5772884 commit 23e6788

File tree

58 files changed

+96972
-25993
lines changed

Some content is hidden

Large Commits have some content hidden by default. Use the searchbox below for content that may be hidden.

58 files changed

+96972
-25993
lines changed

README.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -103,6 +103,6 @@ Note: Some of these parameters are required by the imported [ponyo](https://gith
103103
| epsilon_std | float: Standard deviation of Normal distribution to sample latent space|
104104
| num_simulated| int: Simulate a compendia with these many experiments, created by shifting the template experiment these many times|
105105
| project_id | str: Experiment id to use as a template experiment|
106-
| col_to_rank | str: Name of column header from DE association statistic results. This column will be use to rank genes. Select `logFC`, `P.Value`, `adj.P.Val`, `t`|
106+
| col_to_rank | str: Name of column header from DE association statistic results. This column will be use to rank genes. Select `logFC`, `P.Value`, `adj.P.Val`, `t` if using Limma. Select Select `log2FoldChange`, `pvalue`, `padj` if using DESeq.|
107107
| num_recount2_experiments | int: Number of recount2 experiments to download. Note this will not be needed when we update the training to use all of recount2|
108108
| compare_genes | bool: 1 if comparing gene ranks with reference gene ranks. 0 if just identifying generic genes and gene sets but not comparing against a reference.|

config_human.tsv renamed to configs/config_human.tsv

Lines changed: 3 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -20,6 +20,7 @@ validation_frac 0.25
2020
project_id "SRP012656"
2121
metadata_colname 'run'
2222
num_simulated 25
23-
col_to_rank "logFC"
23+
col_to_rank "log2FoldChange"
2424
num_recount2_experiments 200
25-
compare_genes 1
25+
gsea_statistic 'log2FoldChange'
26+
compare_genes 1

configs/config_pseudomonas_1183.tsv

Lines changed: 21 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,21 @@
1+
local_dir "/home/alexandra/Documents/Data/Generic_expression_patterns/"
2+
dataset_name "pseudomonas_analysis"
3+
template_data_file "/home/alexandra/Documents/Data/Generic_expression_patterns/pseudomonas_template_data.tsv"
4+
compendium_data_file "/home/alexandra/Documents/Data/Generic_expression_patterns/Pa_compendium_02.22.2014.pcl"
5+
normalized_compendium_data_file "/home/alexandra/Documents/Data/Generic_expression_patterns/normalized_pseudomonas_compendium_data.tsv"
6+
shared_genes_file "/home/alexandra/Documents/Data/Generic_expression_patterns/shared_genes_pseudomonas.pickle"
7+
scaler_transform_file "/home/alexandra/Documents/Data/Generic_expression_patterns/scaler_transform_pseudomonas.pickle"
8+
NN_architecture "NN_2500_30"
9+
learning_rate 0.001
10+
batch_size 10
11+
epochs 100
12+
kappa 0.01
13+
intermediate_dim 2500
14+
latent_dim 30
15+
epsilon_std 1.0
16+
validation_frac 0.25
17+
project_id "E-MEXP-1183"
18+
metadata_colname 'ml_data_source'
19+
num_simulated 25
20+
col_to_rank "logFC"
21+
compare_genes 0

configs/config_pseudomonas_33245.tsv

Lines changed: 21 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,21 @@
1+
local_dir "/home/alexandra/Documents/Data/Generic_expression_patterns/"
2+
dataset_name "pseudomonas_analysis"
3+
template_data_file "/home/alexandra/Documents/Data/Generic_expression_patterns/pseudomonas_template_data.tsv"
4+
compendium_data_file "/home/alexandra/Documents/Data/Generic_expression_patterns/Pa_compendium_02.22.2014.pcl"
5+
normalized_compendium_data_file "/home/alexandra/Documents/Data/Generic_expression_patterns/normalized_pseudomonas_compendium_data.tsv"
6+
shared_genes_file "/home/alexandra/Documents/Data/Generic_expression_patterns/shared_genes_pseudomonas.pickle"
7+
scaler_transform_file "/home/alexandra/Documents/Data/Generic_expression_patterns/scaler_transform_pseudomonas.pickle"
8+
NN_architecture "NN_2500_30"
9+
learning_rate 0.001
10+
batch_size 10
11+
epochs 100
12+
kappa 0.01
13+
intermediate_dim 2500
14+
latent_dim 30
15+
epsilon_std 1.0
16+
validation_frac 0.25
17+
project_id "E-GEOD-33245"
18+
metadata_colname 'ml_data_source'
19+
num_simulated 25
20+
col_to_rank "logFC"
21+
compare_genes 0

configs/config_pseudomonas_7704.tsv

Lines changed: 21 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,21 @@
1+
local_dir "/home/alexandra/Documents/Data/Generic_expression_patterns/"
2+
dataset_name "pseudomonas_analysis"
3+
template_data_file "/home/alexandra/Documents/Data/Generic_expression_patterns/pseudomonas_template_data.tsv"
4+
compendium_data_file "/home/alexandra/Documents/Data/Generic_expression_patterns/Pa_compendium_02.22.2014.pcl"
5+
normalized_compendium_data_file "/home/alexandra/Documents/Data/Generic_expression_patterns/normalized_pseudomonas_compendium_data.tsv"
6+
shared_genes_file "/home/alexandra/Documents/Data/Generic_expression_patterns/shared_genes_pseudomonas.pickle"
7+
scaler_transform_file "/home/alexandra/Documents/Data/Generic_expression_patterns/scaler_transform_pseudomonas.pickle"
8+
NN_architecture "NN_2500_30"
9+
learning_rate 0.001
10+
batch_size 10
11+
epochs 100
12+
kappa 0.01
13+
intermediate_dim 2500
14+
latent_dim 30
15+
epsilon_std 1.0
16+
validation_frac 0.25
17+
project_id "E-GEOD-7704"
18+
metadata_colname 'ml_data_source'
19+
num_simulated 25
20+
col_to_rank "logFC"
21+
compare_genes 0
File renamed without changes.
File renamed without changes.

explore_data/viz_template_experiment.ipynb

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -763,7 +763,7 @@
763763
"name": "python",
764764
"nbconvert_exporter": "python",
765765
"pygments_lexer": "ipython3",
766-
"version": "3.7.6"
766+
"version": "3.7.8"
767767
}
768768
},
769769
"nbformat": 4,

generic_expression_patterns_modules/DE_analysis.R

Lines changed: 61 additions & 32 deletions
Original file line numberDiff line numberDiff line change
@@ -7,12 +7,12 @@
77

88
# library('limma')
99

10-
get_DE_stats <- function(metadata_file,
11-
experiment_id,
12-
expression_file,
13-
data_type,
14-
local_dir,
15-
run) {
10+
get_DE_stats_limma <- function(metadata_file,
11+
experiment_id,
12+
expression_file,
13+
data_type,
14+
local_dir,
15+
run) {
1616

1717
# This function performs DE analysis using expression data in expression_file
1818
# where samples are grouped based on metadata_file
@@ -27,6 +27,7 @@ get_DE_stats <- function(metadata_file,
2727
#
2828
# expression_file: str
2929
# File containing gene expression data
30+
# Expression data should be of the form sample x gene
3031
#
3132
# data_type: str
3233
# Either 'template' or 'simulated' to label saved output file
@@ -38,6 +39,7 @@ get_DE_stats <- function(metadata_file,
3839
# Used as identifier for different simulated experiments
3940

4041
# Read in data
42+
# Note the expression data is transposed to gene x sample in order to run Limma
4143
expression_data <- t(as.matrix(read.csv(expression_file, sep="\t", header=TRUE, row.names=1)))
4244
metadata <- as.matrix(read.csv(metadata_file, sep="\t", header=TRUE, row.names=1))
4345

@@ -84,34 +86,61 @@ get_DE_stats <- function(metadata_file,
8486

8587
}
8688

87-
create_volcano <- function(expression_file,
88-
experiment_id,
89-
pval,
90-
local_dir) {
89+
get_DE_stats_DESeq <- function(metadata_file,
90+
experiment_id,
91+
expression_file,
92+
data_type,
93+
local_dir,
94+
run) {
9195

92-
# This functioni generates a volcano plot using the output from
93-
# the DE analysis script 'get_DE_stats' and output it to local_dir
96+
# This function performs DE analysis using DESeq.
97+
# Expression data in expression_file are grouped based on metadata_file
98+
#
99+
# Arguments
100+
# ---------
101+
# metadata_file: str
102+
# File containing mapping between sample id and group
103+
#
104+
# experiment_id: str
105+
# Experiment id used to label saved output filee
106+
#
107+
# expression_file: str
108+
# File containing gene expression data
109+
#
110+
# data_type: str
111+
# Either 'template' or 'simulated' to label saved output file
112+
#
113+
# local_dir: str
114+
# Directory to save output files to
115+
#
116+
# run: str
117+
# Used as identifier for different simulated experiments
118+
119+
expression_data <- t(as.matrix(read.csv(expression_file, sep="\t", header=TRUE, row.names=1)))
120+
metadata <- as.matrix(read.csv(metadata_file, sep="\t", header=TRUE, row.names=1))
121+
122+
print("Checking sample ordering...")
123+
print(all.equal(colnames(expression_data), rownames(metadata)))
124+
125+
group <- interaction(metadata[,1])
126+
127+
mm <- model.matrix(~0 + group)
94128

95-
# Read in expression data
96-
res <- read.table(expression_file, header=TRUE)
129+
#print(head(expression_data))
97130

98-
threshold <- 0.05
131+
ddset <- DESeqDataSetFromMatrix(expression_data, colData=metadata, design = ~group)
132+
133+
deseq_object <- DESeq(ddset)
134+
135+
deseq_results <- results(deseq_object)
99136

100-
# Make a basic volcano plot
101-
f <- EnhancedVolcano(res,
102-
lab = rownames(res),
103-
x = 'logFC',
104-
y = 'adj.P.Val',
105-
xlim = c(-2,2),
106-
pCutoff = threshold,
107-
FCcutoff = 1,
108-
pointSize = 1.0,
109-
labSize = 3.0,
110-
xlab=bquote(~Log[2]~ 'fold change'),
111-
ylab=bquote(-~Log[10]~ 'FDR adj p-value')
112-
)
137+
deseq_results_df <- as.data.frame(deseq_results)
113138

114-
# Save
115-
out_file = paste(local_dir, "volcano_template_data_", experiment_id,".png", sep="")
116-
ggsave(out_file, plot = f, dpi=300)
117-
}
139+
# Save summary statistics of DEGs
140+
if (data_type == "template") {
141+
out_file = paste(local_dir, "DE_stats/DE_stats_template_data_", experiment_id,"_", run, ".txt", sep="")
142+
} else if (data_type == "simulated") {
143+
out_file = paste(local_dir, "DE_stats/DE_stats_simulated_data_", experiment_id,"_", run, ".txt", sep="")
144+
}
145+
write.table(deseq_results_df, file = out_file, row.names = T, sep = "\t", quote = F)
146+
}
Lines changed: 52 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,52 @@
1+
## Run this once to setup environment
2+
## Used R 3.6.3
3+
#if (!requireNamespace("BiocManager", quietly = TRUE))
4+
# install.packages("BiocManager")
5+
6+
#BiocManager::install("clusterProfiler")
7+
8+
#library(clusterProfiler)
9+
10+
find_enriched_pathways <- function(DE_stats_file,
11+
pathway_DB,
12+
statistic){
13+
# Read in data
14+
DE_stats_data <- read.table(DE_stats_file, sep="\t", header=TRUE, row.names=NULL)
15+
16+
# Sort genes by feature 1
17+
18+
# feature 1: numeric vector
19+
if (statistic == 'logFC'){
20+
col_num = 2
21+
} else if (statistic == 'log2FoldChange'){
22+
col_num = 3
23+
} else if (statistic == 't'){
24+
col_num = 4
25+
} else if (statistic == 'p-value'){
26+
col_num = 5
27+
} else if (statistic == 'adj p-value' || statistic == 'pvalue'){
28+
col_num = 6
29+
} else if ( statistic == 'padj'){
30+
col_num = 7
31+
}
32+
rank_genes <- as.numeric(as.character(DE_stats_data[,col_num]))
33+
34+
# feature 2: named vector of gene ids
35+
names(rank_genes) <- as.character(DE_stats_data[,1])
36+
37+
## feature 3: decreasing order
38+
rank_genes <- sort(rank_genes, decreasing = TRUE)
39+
40+
pathway_DB_data <- gmtPathways(hallmark_DB_file)
41+
42+
#enrich_pathways <- GSEA(geneList=rank_genes,
43+
# TERM2GENE=pathway_DB_data,
44+
# nPerm=100000,
45+
# by='fgsea',
46+
# verbose=T)
47+
enrich_pathways <- fgsea(pathways=pathway_DB_data,
48+
stats=rank_genes,
49+
nperm=10000)
50+
51+
return(as.data.frame(enrich_pathways))
52+
}

0 commit comments

Comments
 (0)