GitHub - j-g-b/BTF: Bayesian Tensor Factorization

Bayesian Tensor Factorization for distilling historical information

Code for Bayesian inference in the tensor probability model described by "Smaller p-values in genomics studies using distilled historical information".

Installation

To install, run the following command from the R console:

devtools::install_github("j-g-b/BTF")

Running the simulation studies

The scripts in scripts/simulation_studies should run without the need for additional data files. To save the resulting figures, make sure to create a directory called figs before running.

Running the genomics data examples

(1) Download the samples

From this figshare link download the .zip file in the dataset called "Posterior samples." Unzip the file to extract all of the .csv files and put these in a separate directory. For example these could be stored in a directory called "samps".

(2) Choose a script

Select an example dataset from the scripts in this repo under scripts/genomics_examples. Then follow the figshare link and download the corresponding data file. For example, if running fabp_corsello_et_al.R, download the "Corsello et al. example" dataset. Place the script in the same directory as the downloaded dataset.

(3) Modify the `samps_dir`

At the top of the example script, there is a line that sets the name of the directory where the posterior samples are saved. Modify the example script to point to the place where the posterior samples were saved in step (1).

For example, the top lines of the fabp_kory_et_al.R script look like this

#
samps_dir <- "../../alt_samps"
#
library(tidyverse)
library(magrittr)
#
K562_table <- readr::read_csv("K562 SHMT1-null.csv") %>%
  dplyr::filter(!grepl("INTERGENIC", sgRNA)) %>%
  dplyr::mutate(gene = gsub("_.*", "", gsub("sg", "", sgRNA)),
                replicate = as.integer(gsub(".*_", "", sgRNA)),
                full_lfc = log2(((`full media` + 1)/sum(`full media` + 1, na.rm = T)) / ((`initial` + 1)/sum(`initial` + 1, na.rm = T))),
                minus_lfc = log2(((`minus serine` + 1)/sum(`minus serine` + 1, na.rm = T)) / ((`initial` + 1)/sum(`initial` + 1, na.rm = T)))) %>%
  dplyr::select(-sgRNA) %>%
  dplyr::group_by(gene) %>%
  dplyr::summarise(r = sum(!is.na(minus_lfc)),
                   s = sd(minus_lfc, na.rm = T),
                   m = mean(minus_lfc, na.rm = T),
                   r1 = sum(!is.na(full_lfc)),
                   s1 = sd(full_lfc, na.rm = T),
                   m1 = mean(full_lfc, na.rm = T)) %>%
  dplyr::filter(r > 3, r1 > 3)

Simply change the name of the directory to stored in samps_dir by modifying that line to read, for example

samps_dir <- "my_samps"

(4) Run the script

The example script is ready to run. Make sure to create a directory called figs in order to save the resulting figure properly.

Running Gibbs sampling on the Cancer Dependency Map data

(1) Download the data from DepMap

Visit the depmap.org downloads page to find the gene expression, mutation, CRISPR, and RNAi datasets. Download the following files:

CCLE_expression.csv
CCLE_mutations.csv
Achilles_gene_effect.csv
D2_combined_gene_dep_scores.csv
sample_info.csv

and put them into a directory called data.

(2) Download the gene groups from figshare

Download the files in the "Gene groups" dataset found at this figshare project and place them in the data directory.

(3) Prepare the data

Run the script PrepareInputs.R. This will create .rds objects of the DepMap data files.

(4) Run Gibbs sampling

Run the script generate_posterior_samples.R, perhaps first modifying the last command to change the name of the directory in which posterior samples are stored:

X <- BTF::run_btf(n_samples = S, tensor_list = TensorList,
                  d1 = d1, d2 = d2, save_dir = "my_choice_of_directory_name",
                  burn = 1000, thin = 40, matrix_type = c(2, 0, 0, 1))

Demo for running the Gibbs sampling algorithm on generic three-way tensor data

First, define a function to simulate a three-way array of data with dimensions N, M, and K.

simulate_tensor_data <- function(N, M, D, K){
  #
  U <- rnorm(N*D) %>%
        matrix(nrow = N)
  V <- rnorm(M*D) %>%
        matrix(nrow = M)
  R <- plyr::alply(1:K, 1, function(d){
        rnorm(D*D) %>% matrix(nrow = D)
       })
  SigmaSq <- 1/rgamma(K, 5/2, 1/2)
  #
  X <- plyr::llply(1:K, function(k){
        (U%*%R[[k]]%*%t(V)) %>%
          magrittr::set_rownames(paste0("CL", 1:N)) %>%
          magrittr::set_colnames(paste0("GN", 1:M)) %>%
          magrittr::add(matrix(rnorm(N*M, sd = sqrt(SigmaSq[k])), nrow = N))
       })
  #
  return(list(U = U, V = V, R = R, SigmaSq = SigmaSq, X = X))
}

Next choose dimensions and extract the components of the simulated three-way array X.

#
require(plyr)
require(magrittr)
require(tidyverse)
#
N <- 100
M <- 500
D <- 5
K <- 5
#
dataset <- simulate_tensor_data(N, M, D, K)
#
U <- dataset[["U"]]
V <- dataset[["V"]]
R <- dataset[["R"]]
SigmaSq <- dataset[["SigmaSq"]]
X <- dataset[["X"]]

Next, run the tensor factorization to get samples of the embeddings U and V:

#
Result <- BTF::run_btf(n_samples = 500, tensor_list = X, 
                       d1 = D, d2 = D, save_dir = "samps", 
                       burn = 25, thin = 5)
#
Samples <- BTF::collect_samples("samps/")

Getting posterior means from posterior samples

Due to rotation and scale ambiguity, the posterior samples of the model parameters need to be aligned before taking posterior summaries.

Suppose that the name of the directory where the posterior sample files have been written is stored in the variable samps_dir. Then matrices U and V containing the posterior means of the row and column features can be obtained using the following commands:

samps <- BTF::collect_samples(samps_dir, last = 100) %>% BTF::align_samples()
V <- apply(samps$V, c(2,3), mean)
U <- apply(samps$U, c(2,3), mean)

Demo for running FAB p-value procedure

The following code produces FAB p-values for data simulated from null effects.

#
N <- 100
M <- 5000
r <- 5
D <- 5
K <- 5
matrix_type <- c(0, 1, 0, 0, 2)
#
TensorList <- BTF::simulate_tensor_data(N, M, D, K, MatrixType = matrix_type, stn = 2)
#
sigmasq <- 1
R_true <- matrix(rnorm(D^2, sd = 1), ncol = D)
#
n <- 10
m <- 25
#
theta_true <- rep(0, n*m)
#
Y <- array(dim = c(n, m, r))
for(i in 1:r){
  Y[, , i] <- rnorm(length(theta_true), mean = theta_true, sd = sqrt(sigmasq))
}
S <- apply(Y, c(1, 2), sd)
Y <- apply(Y, c(1, 2), mean) %>%
  magrittr::set_rownames(paste0("", 1:n)) %>%
  magrittr::set_colnames(paste0("", 1:m))
#
pval_df <- BTF::fabp_lin_reg(Y = Y, S = S, R = matrix(r, n, m), U = TensorList$U[1:n, ], V = TensorList$V[1:m, ])

Name		Name	Last commit message	Last commit date
Latest commit History 32 Commits
R		R
man		man
scripts		scripts
src		src
DESCRIPTION		DESCRIPTION
NAMESPACE		NAMESPACE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

R

R

man

man

scripts

scripts

src

src

DESCRIPTION

DESCRIPTION

NAMESPACE

NAMESPACE

README.md

README.md

Repository files navigation

Bayesian Tensor Factorization for distilling historical information

Installation

Running the simulation studies

Running the genomics data examples

(1) Download the samples

(2) Choose a script

(3) Modify the `samps_dir`

(4) Run the script

Running Gibbs sampling on the Cancer Dependency Map data

(1) Download the data from DepMap

(2) Download the gene groups from figshare

(3) Prepare the data

(4) Run Gibbs sampling

Demo for running the Gibbs sampling algorithm on generic three-way tensor data

Getting posterior means from posterior samples

Demo for running FAB p-value procedure

About

Releases

Packages

Languages

j-g-b/BTF

Folders and files

Latest commit

History

Repository files navigation

Bayesian Tensor Factorization for distilling historical information

Installation

Running the simulation studies

Running the genomics data examples

(1) Download the samples

(2) Choose a script

(3) Modify the samps_dir

(4) Run the script

Running Gibbs sampling on the Cancer Dependency Map data

(1) Download the data from DepMap

(2) Download the gene groups from figshare

(3) Prepare the data

(4) Run Gibbs sampling

Demo for running the Gibbs sampling algorithm on generic three-way tensor data

Getting posterior means from posterior samples

Demo for running FAB p-value procedure

About

Resources

Stars

Watchers

Forks

Languages

(3) Modify the `samps_dir`