
Scalable Gaussian Processes for Economic Models

This repository contains the code for my master's thesis, entitled "Scalable Gaussian Processes for Economic Models".

There are three ways to dig into the repository:

  • A Demo Notebook: illustrates how to run an experiment locally or on a High Performance Computing (HPC) environment, and how to aggregate and inspect the results afterwards.

    There is also a Google Colab variant of the notebook, which provides free access to a GPU-enabled environment and automates the setup of the code base.

  • A Results Notebook: inspects and reproduces the results and plots from the thesis. All experiments on which the thesis is based are stored in a publicly accessible MongoDB database, from which this notebook assembles the figures and tables.

  • The thesis: describes the models and results.

Note: we strongly recommend using the Table of Contents Jupyter notebook extension to navigate the files.

Workflow

There are roughly three steps to executing an experiment.

  • Define an experiment as a JSON-serializable Python dictionary which specifies the model and environment.
    execute({
      'tag': 'demo',
      'obj_func': {'name': 'Sinc'},
      'model': {'name': 'GPModel',
                'kwargs': {'learning_rate': 0.1}},
    })
  • The experiment is executed on an HPC environment and the results are collected in a centralized MongoDB database.
  • Inspect the results from the MongoDB database in a notebook as a Pandas DataFrame (an aggregation sketch follows this list).
    get_df(**{'config.tag': 'demo'})
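
For example, once several runs share a tag, the returned DataFrame can be aggregated directly. The sketch below assumes that get_df flattens the MongoDB documents into dotted column names (e.g. config.model.name) and that each run stores an error metric; result.rmse is a hypothetical column used purely for illustration.

df = get_df(**{'config.tag': 'demo'})

# Compare models on a hypothetical error metric across runs:
print(df.groupby('config.model.name')['result.rmse'].describe())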

Code Outline

The code (i.e. everything in src/) is roughly divided into three parts:

  • Experiment: This is where most of the thesis-specific code resides. It includes a Runner which defines how test and training data are drawn and which plots to generate.
  • Models: Various models that fit to training examples and then predict based on them. Most are probabilistic and yield a predictive mean and variance at test locations (a minimal sketch of this interface follows the list).
  • Environments: The environments on which the models train and test. These can be synthetic functions for which the ground truth is known, simulations where a point evaluation is generated on the fly, or datasets with fixed evaluation locations.
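
To make the predictive mean and variance concrete, below is a minimal exact GP regression sketch in plain NumPy. It only illustrates the kind of quantities the models return; it is not the repository's implementation, and the kernel and its hyperparameters are placeholder choices.

import numpy as np

def rbf(A, B, lengthscale=1.0, variance=1.0):
    # Squared-exponential kernel: k(a, b) = variance * exp(-|a - b|^2 / (2 * lengthscale^2)).
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

def gp_predict(X, y, X_test, noise=1e-2):
    # Exact GP posterior via a Cholesky factorization of K + noise * I.
    K = rbf(X, X) + noise * np.eye(len(X))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))   # (K + noise I)^{-1} y
    K_s = rbf(X, X_test)
    v = np.linalg.solve(L, K_s)
    mean = K_s.T @ alpha                                  # predictive mean
    var = np.diag(rbf(X_test, X_test)) - np.sum(v**2, 0)  # predictive variance
    return mean, var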

Environments

The following provides a high-level overview of the available environments:

  • Non-stationary: Sinusoids with varying amplitude and length-scale.
  • Discontinuous: Step functions and kinks.
  • Financial/Economic: Simulated models such as the growth model and option pricing, as well as cleaned stock market datasets.
  • UCI: Various (normalized) machine learning datasets ported from https://people.orie.cornell.edu/andrew/code/.
  • Natural Sound and Precipitation: Low-dimensional datasets with many observations, ported from https://github.com/kd383/GPML_SLD.
  • Genz 1984: Integrand families that scale to arbitrary dimensionality (a standalone sketch of one family follows this list).
  • Optimization benchmarks: Synthetic functions such as Branin and Rosenbrock ported from GPyOpt.
  • Helpers: For automatically normalizing and creating embeddings.
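
The Genz (1984) families are simple enough to state inline. This sketch implements the "oscillatory" member; the exact parametrization used in the repository may differ, and the coefficient vector c is a free difficulty parameter.

import numpy as np

def genz_oscillatory(x, c, w1=0.5):
    # Genz (1984) oscillatory integrand on [0, 1]^d:
    #   f(x) = cos(2 * pi * w1 + sum_i c_i * x_i)
    # The length of c fixes the dimensionality d.
    return np.cos(2 * np.pi * w1 + np.atleast_2d(x) @ np.asarray(c))

rng = np.random.default_rng(0)
print(genz_oscillatory(rng.random((3, 5)), c=np.ones(5)))  # a 5-dimensional instance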

Installation

Note: Because data.zip is a large file (~2 GB), you will need to install Git LFS to clone the repository.

conda create -n sgp python=3.6
source activate sgp
conda env update -f environment.yml
echo "MONGO_DB_PASSWORD = None" > src/env.py
unzip data.zip

(Note: We create the environment before populating it because of a conda bug where Python 3.6 is otherwise not accessible during installation.)

There are two additional requirements for the thesis-related experiments:

  • To record an experiment in the MongoDB database you need to populate src/env.py with:
    MONGO_DB_PASSWORD = None
    
    (Replace None with the admin password to enable write access.)
  • For the Sparse Grid requirements, see Adaptive Sparse Grid installation below.

Optional installations

  • Notebook requirements:

    conda install -y notebook
    pip install ipywidgets
    jupyter nbextension enable --py widgetsnbextension
    
    pip install addict
    pip install jupyter_contrib_nbextensions
    jupyter contrib nbextension install --user
    pip install jupyter_nbextensions_configurator
    jupyter nbextensions_configurator enable --user

    We recommend enabling the table of content plugin from jupyter_contrib_nbextensions to navigate the included notebooks.

  • Adaptive Sparse Grid installation

    cd SparseGridCode/TasmanianSparseGrids
    make
    cd ../pyipopt
    ./install.sh
    echo " IPOPT and PYIPOPT is installed "
    

    Note: We replaced basestring with str in SparseGridCode/TasmanianSparseGrids/InterfacePython/TasmanianSG.py to make the library Python 3 compatible.
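
    For reference, the generic shape of such a port (not the exact patch applied here) is:

    import sys

    # Python 2's basestring covers both str and unicode but raises a
    # NameError under Python 3, where plain str is the correct check.
    string_types = (str,) if sys.version_info[0] >= 3 else (basestring,)  # noqa: F821

    def is_string(obj):
        return isinstance(obj, string_types)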

  • For LocalLengthScaleGPModel

    git clone https://github.com/jmetzen/gp_extras.git
    cd gp_extras
    python setup.py install 
    
  • PyTorch on a GPU-enabled Linux machine:

    source $HOME/miniconda/bin/activate
    # CPU-only build (machines without a GPU):
    conda install -y pytorch-cpu torchvision-cpu -c pytorch
    # CUDA 9.0 build for GPU machines:
    conda install pytorch torchvision cudatoolkit=9.0 -c pytorch
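
    To verify afterwards which build is active (standard PyTorch calls):

    import torch

    print(torch.__version__)
    print(torch.cuda.is_available())          # True if the CUDA build sees a GPU
    if torch.cuda.is_available():
        print(torch.cuda.get_device_name(0))  # name of the first GPU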
    

Server setup

To set up on the cluster:

  • Run either 1) make push for EPFL, or 2) make sync-dtu for DTU.
  • ssh in and install the Anaconda environment as described in Installation.
  • Remember to specify the MongoDB password, also described in Installation.

Growth Model

The growth model has its own workflow since its requirements and output are so different. It is a value iteration scheme that usually runs for several days and can therefore be restarted from a previous iteration.
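
The following is a schematic-only sketch of such a restartable loop; the real driver lives in run_growth_model.py, and the names bellman_update and CHECKPOINT are hypothetical illustrations rather than the repository's actual code.

import os
import pickle

CHECKPOINT = 'growth_checkpoint.pkl'  # hypothetical checkpoint location

def run(bellman_update, V0, n_iters):
    V, start = V0, 0
    if os.path.exists(CHECKPOINT):            # resume a previous run
        with open(CHECKPOINT, 'rb') as f:
            V, start = pickle.load(f)
    for it in range(start, n_iters):
        V = bellman_update(V)                 # one value-iteration sweep
        with open(CHECKPOINT, 'wb') as f:     # persist after every sweep
            pickle.dump((V, it + 1), f)
    return V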

To install the additional requirements:

source activate sgp
source setup_env.sh
sh install_growth.sh

To run the model, modify run_growth_model.py and run_growth_model.sh as needed. Then execute the script on the server with:

make run-growth

Omniboard

To view the MongoDB records of the experiments in a browser interface, install Omniboard and use one of the following:

npm install -g omniboard

# Against a local MongoDB instance:
omniboard -m localhost:27017:lions

# Against the hosted database (replace <password>):
omniboard --mu "mongodb+srv://admin:<password>@lions-rbvzc.mongodb.net/test?retryWrites=true"

# Or read the password from src/env.py:
PASS=$(python -c 'from src import env; print(env.MONGO_DB_PASSWORD)'); omniboard --mu "mongodb+srv://admin:${PASS}@lions-rbvzc.mongodb.net/test?retryWrites=true"

Remember to replace <password>.

If MongoDB lives on a firewalled server (not currently the case), tunnel through ssh first:

# Forward the remote MongoDB port to a local port:
ssh -fN -l root -i path/to/id_rsa -L 9999:localhost:27017 host.com
# Generic variant for forwarding any remote port:
ssh -N -f -L localhost:8889:localhost:7000 user@server

Jupyter notebook

To start a local notebook server, run:

jupyter notebook

HPC environments

Currently we support the EPFL and DTU clusters.

EPFL HPC

This requires having simba configured in ~/.ssh/config:

Host simba.epfl.ch simba simba-fe
     Hostname simba.epfl.ch
     User <username>
     ForwardAgent yes
     ForwardX11 yes
     ForwardX11Timeout 596h
     DynamicForward 3333
Host simba-compute-01 simba-compute-02 simba-compute-03 simba-compute-04 simba-compute-05 simba-compute-06 simba-compute-07 simba-compute-08 simba-compute-09 simba-compute-10 simba-compute-11 simba-compute-12 simba-compute-13 simba-compute-14 simba-compute-15 simba-compute-16 simba-compute-17 simba-compute-18 simba-compute-gpu-1 simba-compute-gpu-2 simba-compute-gpu-3
    User <username>
    ForwardAgent yes
    ForwardX11 yes
    ForwardX11Timeout 596h
    DynamicForward 3333
    ServerAliveInterval    60
    TCPKeepAlive           yes
    ProxyJump              simba
Host *
    XAuthLocation /opt/X11/bin/xauth

Remember to replace <username>.

For debugging purposes you can submit a no-op script directly from the server:

ssh simba
sbatch path/to/hpc.sh 'python' 'runner.py' 'print_config' 'with' 'obj_func={"name": "Sinc"}'

DTU HPC

ssh <username>@login2.hpc.dtu.dk

Available commands:

  • bqueues: list available server queues.
  • bstat: list jobs.
  • qrsh: run an interactive job.

A trick to view plots directly on the server (via X11 forwarding):

ssh -Y <username>@login2.hpc.dtu.dk
eog path/to/file.png

Profiling notes

To plot the memory use:

pip install memory_profiler
sudo mprof run --include-children python debug_notebook.py
mprof plot --backend TkAgg

To print memory use per object type (note that this requires modifying the source code):

pip install pympler

Then, inside the code you want to profile:

from pympler import muppy, summary

all_objects = muppy.get_objects()   # all objects currently in memory
sum1 = summary.summarize(all_objects)
summary.print_(sum1)                # memory use broken down per type

Acknowledgement

Thanks to Simon Scheidegger for providing most of the code for the Heston-based option pricing and the growth model.
