
An optimal experimental design framework for accelerating knowledge discovery using gene expression data


IBPA/OPEX


What is OPEX?

OPEX is an optimal experimental design framework, written in R, that helps biologists select the most informative experiments to conduct next, given the experiments conducted so far. This repo demonstrates OPEX on the collection of gene expression data from E. coli under stress from various antibiotics and biocides.

Dependencies

Code architecture

The structure of the code is shown below. The entry point of the project is run.sh, which runs main.R. The src folder stores the implementation of the functions and classes used in main.R. There are seven R scripts in src. generate_setting.R generates the settings for running a simulation. Simulator.R defines a class named Simulate, which is the workhorse of the simulation. The other scripts are helper modules for the Simulate class. For details on each script, see the header comment of each file.

├── main.R
├── run_OPEX_on_your_dataset.R
├── run.sh
└── src
    ├── add_noise.R
    ├── generate_setting.R
    ├── max_dist.R
    ├── prepare_data.R
    ├── screen_index_helper.R
    ├── Simulator.R
    └── update_train_pool.R

Input data

The input data is a table in which the first 14 columns define the culture condition of each row and the remaining 1123 columns represent the gene expression profile for that condition. (Genes that did not have sufficient sequencing depth were excluded.)

A culture condition is defined by a binary vector, representing the presence (with 0) or absence (with 1) of 10 biocides and 4 antibiotics: Chlorexidine, Phenol, H2O2, Isopropanol, Bezalkonium_chloride, Ethanol, Glutaraldehyde, Percetic_acid, Sodium_hypochlorite, Povidone_iodine, Kanamycin, Rifampicin, Norfloxacin, Ampicillin.
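The column layout described above can be illustrated with a minimal R sketch. It builds a small random stand-in for the real table and splits it by column position. The column names (`compound_1`, `gene_1`, ...), the row count, and the values are hypothetical; only the 14 + 1123 column layout comes from the description.

```r
# Toy stand-in for the real table: 5 rows, 14 condition columns,
# 1123 expression columns. Names and values are hypothetical; only the
# 14 + 1123 column layout is taken from the README.
set.seed(1)
n_conditions <- 14
n_genes <- 1123
n_rows <- 5

conditions <- matrix(
  rbinom(n_rows * n_conditions, size = 1, prob = 0.5),
  nrow = n_rows,
  dimnames = list(NULL, paste0("compound_", seq_len(n_conditions)))
)
expression <- matrix(
  rnorm(n_rows * n_genes),
  nrow = n_rows,
  dimnames = list(NULL, paste0("gene_", seq_len(n_genes)))
)
input_table <- data.frame(conditions, expression)

# Split the table back into its two parts by column position.
culture_conditions <- input_table[, seq_len(n_conditions)]
expression_profile <- input_table[, (n_conditions + 1):ncol(input_table)]

# Sanity checks on the layout: binary conditions, expected column counts.
stopifnot(ncol(culture_conditions) == 14)
stopifnot(ncol(expression_profile) == 1123)
stopifnot(all(as.matrix(culture_conditions) %in% c(0, 1)))
```

The same positional split should apply to the real table once it is loaded, provided the condition columns come first as described.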

How to reproduce

  • Step 1: Generate a file that includes the settings for running OPEX. The setting file is named after the sampling method; expert sampling is used in the following example.

    cd ./R/src
    Rscript generate_setting.R setting
    

    After running the above commands, a file named setting.csv is generated in ./output. The setting.csv file specifies the values of the hyper-parameters: random_seed, exploration frequency, adaptive, start size, add, dataset id, noise, iter_num, and sampling method. For the meaning of these hyper-parameters, see the comments in generate_setting.R.

  • Step 2: Run the simulation using one of the settings in the file generated in Step 1. The first setting is used in the following example.

    cd ./R
    bash run.sh setting.csv 1
    

    To run all the settings, we used a high-performance computing cluster. The Slurm script for submitting all the simulations as a job array is as follows:

#!/bin/bash
#SBATCH -p low
#SBATCH -N 1
#SBATCH -n 1
#SBATCH --mem-per-cpu 1000
#SBATCH -t 1:00:00
#SBATCH -o output/slurm.%N.%j.out
#SBATCH -e output/slurm.%N.%j.err
#SBATCH --array=1-1800
Rscript main.R setting.csv $SLURM_ARRAY_TASK_ID 

Upon completion, a folder named setting will be created in ./output. The results generated by this simulation run are stored in the folder expert_sample.

The result is a CSV file named after the hyper-parameter values in the setting. It contains the order in which the culture conditions were selected by expert sampling.
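Both the single run and the job array pass a setting file and a 1-based row index to the script. The R sketch below shows one way such a row could be selected. It is not the code in main.R: the function `select_setting`, the column names, and the values are hypothetical, and the real setting.csv may be laid out differently.

```r
# Hypothetical settings table; the column names and values are illustrative
# and are not taken from generate_setting.R.
settings <- expand.grid(
  random_seed = 1:3,
  noise = c(0, 0.1),
  sampling_method = "expert",
  stringsAsFactors = FALSE
)
setting_file <- tempfile(fileext = ".csv")
write.csv(settings, setting_file, row.names = FALSE)

# Select one setting by its 1-based row index. A script invoked as
#   Rscript main.R setting.csv 1
# could obtain both values from commandArgs(trailingOnly = TRUE),
# which returns them as character strings.
select_setting <- function(path, index) {
  all_settings <- read.csv(path, stringsAsFactors = FALSE)
  index <- as.integer(index)
  stopifnot(!is.na(index), index >= 1, index <= nrow(all_settings))
  all_settings[index, , drop = FALSE]
}

first_setting <- select_setting(setting_file, "1")
print(first_setting)
```

With a job array, each task would pass its own `$SLURM_ARRAY_TASK_ID` as the index, so every row of the setting file is run exactly once when the array range matches the number of rows.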

How to run OPEX on your own tabular dataset

To run OPEX on your own biological problem, two tabular datasets are needed. One is a dataset for training a model. The other is a pool of candidate experiments to run. Both datasets are matrices in which each row denotes one data point. In the training dataset, the last column is the output and the other columns are inputs. The pool dataset has one fewer column than the training dataset because the output column is missing.

The command to run OPEX is as follows:

Rscript run_OPEX_on_your_dataset.R <training_path> <pool_path> <batch_size>

training_path and pool_path are strings giving the paths of the two CSV files.

batch_size is an integer.
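As a rough illustration of the two input files, the following R sketch writes a toy training set and a toy pool to CSV and checks that their shapes match the description above. The column names (`x1`, `x2`, `x3`, `y`), the values, and the file names are hypothetical. Whether run_OPEX_on_your_dataset.R expects a header row is an assumption here, so check the script before reusing this layout.

```r
# Toy data: three input columns (x1..x3) and one output column (y).
# All names and values are hypothetical.
set.seed(42)
training <- data.frame(x1 = rnorm(6), x2 = rnorm(6), x3 = rnorm(6))
# The output goes in the LAST column of the training set.
training$y <- rowSums(training) + rnorm(6, sd = 0.1)

# Candidate experiments: same input columns, no output column.
pool <- data.frame(x1 = rnorm(10), x2 = rnorm(10), x3 = rnorm(10))

# Checks that mirror the layout described above.
stopifnot(ncol(pool) == ncol(training) - 1)
stopifnot(identical(names(pool), names(training)[-ncol(training)]))

# Write both tables as CSV (with a header row, which is an assumption).
training_path <- file.path(tempdir(), "training.csv")
pool_path <- file.path(tempdir(), "pool.csv")
write.csv(training, training_path, row.names = FALSE)
write.csv(pool, pool_path, row.names = FALSE)
```

The two resulting paths could then be supplied as `<training_path>` and `<pool_path>`, together with an integer `<batch_size>`.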

Support

If you have any questions about this project, please contact us at tagkopouloslab@ucdavis.edu

License

See the LICENSE file for license rights and limitations (Apache 2.0).

Acknowledgement

This work was supported by an NSF award (#1743101).