remove pipenv #52

Merged · 7 commits · Feb 14, 2020
3 changes: 2 additions & 1 deletion .gitignore
@@ -7,4 +7,5 @@ terraform.tfstate
terraform.tfstate.backup
.terraform.tfstate.lock.info
.terraform
.DS_Store
.DS_Store
.venv
31 changes: 0 additions & 31 deletions Pipfile

This file was deleted.

1 change: 1 addition & 0 deletions docs/assumed_knowledge.md
@@ -8,3 +8,4 @@ The workflows contained in this repository assume:
* You are familiar with image annotations and how they are used in image segmentation. If you are unfamiliar with this, see [here]() for more information. TODO: add link/link content
* You are familiar with how datasets are used in Machine Learning (for example, splitting your data into train, validation, and test). If you are unfamiliar with this, see [here]() for more information. TODO: add link/link content
* You are familiar with how to use tmux on a remote machine and how we will use it to keep processes running even if the SSH window is closed or disconnected. If you are unfamiliar with this, see [here]() for more information. TODO: add link/link content
* The codebase is meant to be run on a virtual machine, so it installs its Python packages user-wide. If you wish to run the code locally, we suggest using `virtualenv` (see [here](virtual_environment.md) for instructions).
2 changes: 1 addition & 1 deletion docs/data_ingestion.md
@@ -19,7 +19,7 @@ Infrastructure that will be used:
1. When this completes, you should see your stack in `gs://<gcp_bucket_name>/raw-data/<zip_file>`.
1. Use Terraform to start the appropriate GCP virtual machine (`terraform apply` or `terraform apply -lock=false`).
1. Once Terraform finishes, you can check the GCP virtual machine console to ensure a virtual machine has been created named `<project_name>-<user_name>` where `<project_name>` is the name of your GCP project and `<user_name>` is your GCP user name.
1. SSH into the GCP virtual machine, start tmux (`tmux`), `cd` into the code directory (`cd necstlab-damage-segmentation`), and process a single zip file by running the command: `pipenv run python3 ingest_raw_data.py --gcp-bucket gs://<gcp_bucket_name> --zipped-stack gs://<gcp_bucket_name>/raw-data/<zip_file>`. Alternatively, to process an entire folder of zipped stacks, use `pipenv run python3 ingest_raw_data.py --gcp-bucket gs://<gcp_bucket_name>` (excluding the `--zipped-stack` argument), which will process all of the files in `gs://<gcp_bucket_name>/raw-data` (`ingest_raw_data.py` knows to process only `<gcp_bucket_name>/raw-data`).
1. SSH into the GCP virtual machine, start tmux (`tmux`), `cd` into the code directory (`cd necstlab-damage-segmentation`), and process a single zip file by running the command: `python3 ingest_raw_data.py --gcp-bucket gs://<gcp_bucket_name> --zipped-stack gs://<gcp_bucket_name>/raw-data/<zip_file>`. Alternatively, to process an entire folder of zipped stacks, use `python3 ingest_raw_data.py --gcp-bucket gs://<gcp_bucket_name>` (excluding the `--zipped-stack` argument), which will process all of the files in `gs://<gcp_bucket_name>/raw-data` (`ingest_raw_data.py` knows to process only `<gcp_bucket_name>/raw-data`).
1. When this completes, you should see your stack in `gs://<gcp_bucket_name>/processed-data/<stack_ID>`.
1. Use Terraform to terminate the appropriate GCP virtual machine (`terraform destroy`). Once Terraform finishes, you can check the GCP virtual machine console to ensure a virtual machine has been destroyed.
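For reference, a minimal end-to-end session for the ingestion workflow above might look like the sketch below. The bucket, zip file, VM name, and zone are placeholders, and `gcloud compute ssh` is only one of several ways to reach the machine:

```bash
# Hypothetical ingestion session; substitute your own bucket, zip file, VM name, and zone.
terraform apply                                           # start the GCP virtual machine

gcloud compute ssh my-project-my-user --zone us-east1-b   # SSH into the VM
tmux                                                      # keeps the job alive if SSH drops
cd necstlab-damage-segmentation
python3 ingest_raw_data.py \
    --gcp-bucket gs://my-gcp-bucket \
    --zipped-stack gs://my-gcp-bucket/raw-data/my-stack.zip

# back on the local machine, once the stack appears under processed-data/
terraform destroy                                         # terminate the VM
```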

2 changes: 1 addition & 1 deletion docs/dataset_preparation.md
@@ -15,6 +15,6 @@ Infrastructure that will be used:
1. Either edit the configuration file `configs/data_preparation.yaml` or create your own configuration file and place it in the `configs` folder.
1. Use Terraform to start the appropriate GCP virtual machine (`terraform apply`). This will copy the current code base from your local machine to the GCP machine, so make sure any changes to the configuration file are saved before this step is run.
1. Once Terraform finishes, you can check the GCP virtual machine console to ensure a virtual machine has been created named `<project_name>-<user_name>` where `<project_name>` is the name of your GCP project and `<user_name>` is your GCP user name.
1. To create a dataset, SSH into the virtual machine `<project_name>-<user_name>`, start tmux (`tmux`), `cd` into the code directory (`cd necstlab-damage-segmentation`), and run `pipenv run python3 prepare_dataset.py --gcp-bucket <gcp_bucket> --config-file configs/<config_filename>.yaml`.
1. To create a dataset, SSH into the virtual machine `<project_name>-<user_name>`, start tmux (`tmux`), `cd` into the code directory (`cd necstlab-damage-segmentation`), and run `python3 prepare_dataset.py --gcp-bucket <gcp_bucket> --config-file configs/<config_filename>.yaml`.
1. Once dataset preparation has finished, you should see the folder `<gcp_bucket>/datasets/<dataset_ID>` has been created and populated, where `<dataset_ID>` was defined in `configs/data_preparation.yaml`.
1. Use Terraform to terminate the appropriate GCP virtual machine (`terraform destroy`). Once Terraform finishes, you can check the GCP virtual machine console to ensure a virtual machine has been destroyed.
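A quick way to confirm the output described above, assuming `gsutil` is configured on your local machine, is to list the bucket prefix (the bucket name here is a placeholder):

```bash
gsutil ls gs://my-gcp-bucket/datasets/
```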
2 changes: 1 addition & 1 deletion docs/inference.md
@@ -14,7 +14,7 @@ Infrastructure that will be used:
1. If the unsegmented stacks are not in a GCP bucket, see the previous workflow `Copying the raw data into the cloud for storage and usage`.
1. Use Terraform to start the appropriate GCP virtual machine (`terraform apply`).
1. Once Terraform finishes, you can check the GCP virtual machine console to ensure a virtual machine has been created named `<project_name>-<user_name>` where `<project_name>` is the name of your GCP project and `<user_name>` is your GCP user name.
1. To infer (segment) the damage of the stacks, SSH into the virtual machine `<project_name>-<user_name>`, start tmux (`tmux`), `cd` into the code directory (`cd necstlab-damage-segmentation`), and run `pipenv run python3 infer_segmentation.py --gcp-bucket <gcp_bucket> --stack-id <stack_id> --model-id <model_id>`.
1. To infer (segment) the damage of the stacks, SSH into the virtual machine `<project_name>-<user_name>`, start tmux (`tmux`), `cd` into the code directory (`cd necstlab-damage-segmentation`), and run `python3 infer_segmentation.py --gcp-bucket <gcp_bucket> --stack-id <stack_id> --model-id <model_id>`.
1. Once inference has finished, you should see the folder `<gcp_bucket>/inferences/<inference_ID>` has been created and populated, where `<inference_ID>` is `<stack_id>_<model_id>`.
1. Use Terraform to terminate the appropriate GCP virtual machine (`terraform destroy`). Once Terraform finishes, you can check the GCP virtual machine console to ensure a virtual machine has been destroyed.

2 changes: 1 addition & 1 deletion docs/testing.md
@@ -13,6 +13,6 @@ Infrastructure that will be used:
1. If the stacks are not in a GCP bucket, see the previous workflow `Copying the raw data into the cloud for storage and usage`.
1. Use Terraform to start the appropriate GCP virtual machine (`terraform apply`).
1. Once Terraform finishes, you can check the GCP virtual machine console to ensure a virtual machine has been created named `<project_name>-<user_name>` where `<project_name>` is the name of your GCP project and `<user_name>` is your GCP user name.
1. To test a model, SSH into the virtual machine `<project_name>-<user_name>`, start tmux (`tmux`), `cd` into the code directory (`cd necstlab-damage-segmentation`), and run `pipenv run python3 test_segmentation_model.py --gcp-bucket <gcp_bucket> --dataset-id <dataset_id> --model-id <model_id>`.
1. To test a model, SSH into the virtual machine `<project_name>-<user_name>`, start tmux (`tmux`), `cd` into the code directory (`cd necstlab-damage-segmentation`), and run `python3 test_segmentation_model.py --gcp-bucket <gcp_bucket> --dataset-id <dataset_id> --model-id <model_id>`.
1. Once testing has finished, you should see the folder `<gcp_bucket>/tests/<test_ID>` has been created and populated, where `<test_ID>` is `<dataset_id>_<model_id>`.
1. Use Terraform to terminate the appropriate GCP virtual machine (`terraform destroy`). Once Terraform finishes, you can check the GCP virtual machine console to ensure a virtual machine has been destroyed.
2 changes: 1 addition & 1 deletion docs/training.md
@@ -14,7 +14,7 @@ Infrastructure that will be used:
1. Either edit the configuration file `configs/train_config.yaml` or create your own configuration file and place it in the `configs` folder.
1. Use Terraform to start the appropriate GCP virtual machine (`terraform apply`). This will copy the current code base from your local machine to the GCP machine, so make sure any changes to the configuration file are saved before this step is run.
1. Once Terraform finishes, you can check the GCP virtual machine console to ensure a virtual machine has been created named `<project_name>-<user_name>` where `<project_name>` is the name of your GCP project and `<user_name>` is your GCP user name.
1. To train a model, SSH into the virtual machine `<project_name>-<user_name>`, start tmux (`tmux`), `cd` into the code directory (`cd necstlab-damage-segmentation`), and run `pipenv run python3 train_segmentation_model.py --gcp-bucket <gcp_bucket> --config-file configs/<config_filename>.yaml`.
1. To train a model, SSH into the virtual machine `<project_name>-<user_name>`, start tmux (`tmux`), `cd` into the code directory (`cd necstlab-damage-segmentation`), and run `python3 train_segmentation_model.py --gcp-bucket <gcp_bucket> --config-file configs/<config_filename>.yaml`.
1. Once training has finished, you should see the folder `<gcp_bucket>/models/<model_ID>-<timestamp>` has been created and populated, where `<model_ID>` was defined in `configs/train_config.yaml`.
1. Use Terraform to terminate the appropriate GCP virtual machine (`terraform destroy`). Once Terraform finishes, you can check the GCP virtual machine console to ensure a virtual machine has been destroyed.

10 changes: 10 additions & 0 deletions docs/virtual_environment.md
@@ -0,0 +1,10 @@
To set up a virtual environment:
- Install the `virtualenv` package: `pip install virtualenv`
- Create the virtual environment: `virtualenv --always-copy --system-site-packages --python=python3 .venv`
- Install the needed packages: `.venv/bin/pip install -q -r requirements.txt`

To use the virtual environment, enter it: `source .venv/bin/activate`

To exit the virtual environment, use: `deactivate`

To delete the virtual environment, just delete the `.venv` folder: `rm -r .venv`
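As a quick sanity check (not part of the instructions above), you can confirm which interpreter is active by printing its prefix, which should point at the `.venv` folder:

```bash
source .venv/bin/activate
python -c "import sys; print(sys.prefix)"   # expect a path ending in .venv
deactivate
```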
5 changes: 4 additions & 1 deletion gcp.tf
@@ -71,7 +71,10 @@ resource "google_compute_instance" "vm" {
}

provisioner "remote-exec" {
script = "./scripts/resource-creation.sh"
inline = [
"echo 'Running resource creation script... (this may take 10+ minutes)'",
"bash ~/${var.repository_name}/scripts/resource-creation.sh > resource-creation.log"
]
connection {
user = "${var.username}"
type = "ssh"
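Because the provisioner now pipes the setup script's output into `resource-creation.log` (presumably in the SSH user's home directory, since a relative path is used), the 10+ minute setup can be followed from a second terminal while `terraform apply` is still running. A sketch, with a placeholder VM name and zone:

```bash
# Watch the provisioning log on the new VM (hypothetical VM name and zone).
gcloud compute ssh my-project-my-user --zone us-east1-b \
    --command "tail -f ~/resource-creation.log"
```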
20 changes: 20 additions & 0 deletions requirements.txt
@@ -0,0 +1,20 @@
numpy
tensorflow-gpu
opencv-python
scikit-image
sklearn
progress
Keras
ipython
segmentation-models
pytz
tensorboard
pillow
pandas
google-cloud-storage
pyyaml
jupyter
crcmod
gitpython
matplotlib
ipykernel
12 changes: 6 additions & 6 deletions scripts/resource-creation.sh
@@ -21,7 +21,6 @@ sudo dpkg -i libcudnn7_7.6.5.32-1+cuda10.1_amd64.deb
wget http://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1604/x86_64/libcudnn7-dev_7.6.5.32-1+cuda10.1_amd64.deb
sudo dpkg -i libcudnn7-dev_7.6.5.32-1+cuda10.1_amd64.deb


# install needed packages
sudo apt-get install -y cmake \
git \
@@ -34,8 +33,9 @@ sudo apt-get install -y cmake \
tree \
p7zip-full

sudo pip3 uninstall crcmod
sudo pip3 install pipenv
sudo pip3 install --no-cache-dir -U crcmod

cd necstlab-damage-segmentation && pipenv install
pip3 install --upgrade pip
pip3 install --upgrade setuptools
pip3 uninstall crcmod -y
pip3 install --no-cache-dir crcmod
pip3 install --upgrade pyasn1
cd necstlab-damage-segmentation && pip3 install -r requirements.txt
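The crcmod uninstall/reinstall above is presumably there so that `gsutil` uses the compiled (C extension) build of crcmod, which makes checksumming of bucket transfers much faster. One way to verify that it took effect on the VM:

```bash
gsutil version -l | grep crcmod   # expect "compiled crcmod: True"
```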
12 changes: 6 additions & 6 deletions scripts/run_all_large.sh
@@ -1,9 +1,9 @@
#!/bin/bash


pipenv run python3 ingest_raw_data.py --gcp-bucket gs://necstlab-sandbox
pipenv run python3 prepare_dataset.py --gcp-bucket gs://necstlab-sandbox --config-file configs/dataset-large.yaml
pipenv run python3 train_segmentation_model.py --gcp-bucket gs://necstlab-sandbox --config-file configs/train-large.yaml
pipenv run python3 test_segmentation_model.py --gcp-bucket gs://necstlab-sandbox --dataset-id dataset-large --model-id segmentation-model-large_20190924T180419Z
pipenv run python3 infer_segmentation.py --gcp-bucket gs://necstlab-sandbox --model-id segmentation-model-large_20190924T180419Z --stack-id THIN_REF_S2_P1_L3_2496_1563_2159
pipenv run python3 infer_segmentation.py --gcp-bucket gs://necstlab-sandbox --model-id segmentation-model-large_20190924T180419Z --stack-id 8bit_AS4_S2_P1_L6_2560_1750_2160
python3 ingest_raw_data.py --gcp-bucket gs://necstlab-sandbox
python3 prepare_dataset.py --gcp-bucket gs://necstlab-sandbox --config-file configs/dataset-large.yaml
python3 train_segmentation_model.py --gcp-bucket gs://necstlab-sandbox --config-file configs/train-large.yaml
python3 test_segmentation_model.py --gcp-bucket gs://necstlab-sandbox --dataset-id dataset-large --model-id segmentation-model-large_20190924T180419Z
python3 infer_segmentation.py --gcp-bucket gs://necstlab-sandbox --model-id segmentation-model-large_20190924T180419Z --stack-id THIN_REF_S2_P1_L3_2496_1563_2159
python3 infer_segmentation.py --gcp-bucket gs://necstlab-sandbox --model-id segmentation-model-large_20190924T180419Z --stack-id 8bit_AS4_S2_P1_L6_2560_1750_2160
12 changes: 6 additions & 6 deletions scripts/run_all_small.sh
@@ -1,9 +1,9 @@
#!/bin/bash


pipenv run python3 ingest_raw_data.py --gcp-bucket gs://necstlab-sandbox
pipenv run python3 prepare_dataset.py --gcp-bucket gs://necstlab-sandbox --config-file configs/dataset-small.yaml
pipenv run python3 train_segmentation_model.py --gcp-bucket gs://necstlab-sandbox --config-file configs/train-small.yaml
pipenv run python3 test_segmentation_model.py --gcp-bucket gs://necstlab-sandbox --dataset-id dataset-small --model-id segmentation-model-small_20190924T191717Z
pipenv run python3 infer_segmentation.py --gcp-bucket gs://necstlab-sandbox --model-id segmentation-model-small_20190924T191717Z --stack-id THIN_REF_S2_P1_L3_2496_1563_2159
pipenv run python3 infer_segmentation.py --gcp-bucket gs://necstlab-sandbox --model-id segmentation-model-small_20190924T191717Z --stack-id 8bit_AS4_S2_P1_L6_2560_1750_2160
python3 ingest_raw_data.py --gcp-bucket gs://necstlab-sandbox
python3 prepare_dataset.py --gcp-bucket gs://necstlab-sandbox --config-file configs/dataset-small.yaml
python3 train_segmentation_model.py --gcp-bucket gs://necstlab-sandbox --config-file configs/train-small.yaml
python3 test_segmentation_model.py --gcp-bucket gs://necstlab-sandbox --dataset-id dataset-small --model-id segmentation-model-small_20190924T191717Z
python3 infer_segmentation.py --gcp-bucket gs://necstlab-sandbox --model-id segmentation-model-small_20190924T191717Z --stack-id THIN_REF_S2_P1_L3_2496_1563_2159
python3 infer_segmentation.py --gcp-bucket gs://necstlab-sandbox --model-id segmentation-model-small_20190924T191717Z --stack-id 8bit_AS4_S2_P1_L6_2560_1750_2160