Columbus: A Cloud based Distributed Scientific Workflow Engine for Large Multi-Dimensional Spatiotemporal Datasets
Columbus is a workflow engine written in Python 2.7. It executes workflows that can be represented as a Directed Acyclic Graph (DAG): nodes are components or combiners, and edges represent the flow of data between them. Key features of the system include:
- Creation/deletion of components, combiners, and workflows
- Sharing workflows with other users in the system
- Data source support for Google BigQuery, Google Drive, and Galileo Spacetime
- Creation of constraints for supported data sources
- Web-mapping visualizations for supported output types
- Charting visualizations for supported output types
- Dashboard showing workflow and user summaries
- Sharing web-mapping visualizations
- Support for drawing geometries and saving them
- Support for importing Polygons from Google Fusion Tables
- Integrated data selection mechanism per workflow
- Downloading results for supported output types
- Administration panel to control distributed workers
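The DAG model described above can be sketched in a few lines. This is an illustrative sketch only; the node names and the `topological_order` helper below are not part of Columbus's actual API.

```python
from collections import deque

def topological_order(nodes, edges):
    """Return an execution order for a DAG given a node list and
    (upstream, downstream) edge pairs; raises ValueError on cycles."""
    indegree = {n: 0 for n in nodes}
    downstream = {n: [] for n in nodes}
    for u, v in edges:
        indegree[v] += 1
        downstream[u].append(v)
    ready = deque(n for n in nodes if indegree[n] == 0)
    order = []
    while ready:
        n = ready.popleft()
        order.append(n)
        for v in downstream[n]:
            indegree[v] -= 1
            if indegree[v] == 0:
                ready.append(v)
    if len(order) != len(nodes):
        raise ValueError('workflow graph contains a cycle')
    return order

# Two components feeding a combiner, whose output feeds a final component
nodes = ['load_a', 'load_b', 'combine', 'render']
edges = [('load_a', 'combine'), ('load_b', 'combine'), ('combine', 'render')]
print(topological_order(nodes, edges))  # ['load_a', 'load_b', 'combine', 'render']
```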
For detailed application usage, click here. To understand the output types and script usage, click here. To see the full list of API calls that can be made in the scripts, click here. To understand the architecture, design details, and execution of workflows, refer to the publication*.
* Publication material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author's copyright. In most cases, these works may not be reposted without the explicit permission of the copyright holder.
- Linux based OS (preferably Debian Jessie)
- Python 2.7
- Database (preferably MySQL)
- Google cloud storage (read-write access service account required)
- Storage systems (at least one)
- Google Bigquery (read access service account required)
- Google Drive
- Galileo Spacetime
- GIS Computations (at least one)
- Google Earth Engine (full-access service account required)
- Any software stack installed on machines acting as workers
- Columbus Worker (included in the requires subdirectory)
- Apache HTTP server
This section describes the installation of the application on a Google Compute Engine instance running Debian 8 and its deployment to the Apache HTTP Server using mod_wsgi. The steps assume that Python 2.7 is installed on the machine and that the database is MySQL.
-
Update apt-get
$ sudo apt-get update
-
Install pip
$ curl -O https://bootstrap.pypa.io/get-pip.py
$ sudo python get-pip.py
-
Install virtualenv
$ sudo pip install virtualenv
-
Install build-essentials
$ sudo apt-get install build-essential
-
Install mysql-dev
$ sudo apt-get install libmysqlclient-dev
-
Install python-dev
$ sudo apt-get install python-dev
-
Install libssl-dev and libffi-dev
$ sudo apt-get install libssl-dev libffi-dev
-
Install libxml and libxslt
$ sudo apt-get install libxml2-dev libxslt-dev
-
Create a directory for the virtual env and change its owner to the current user
$ sudo mkdir /home/prod
$ sudo chown $USER /home/prod
-
Create a virtual environment (you should be able to create it without sudo; otherwise, navigate to /home/prod and try again)
$ virtualenv --no-site-packages -p /usr/bin/python2.7 /home/prod/venv
-
Copy the Columbus source code to /home/prod/venv. Your directory structure should be as follows:
/home/prod/venv/columbus
├─── apache
├─── columbus
├─── pyedf
├─── requires
├─── secured
├─── static
├─── templates
├─── manage.py
├─── requirements.txt
-
Navigate to the virtual env, activate it, and install the Columbus Worker package. The prompt should change to (venv).
$ cd /home/prod/venv
$ source bin/activate
(venv) /home/prod/venv$ pip install columbus/requires/columbusworker-0.1.0.tar.gz
-
Install all the requirements using the following command in (venv) prompt
(venv) $ cd /home/prod/venv/columbus
(venv) $ pip install -r requirements.txt
-
Deactivate virtualenv using the following command in venv prompt
(venv) $ deactivate
-
Install and start apache outside virtual env
$ sudo apt-get install apache2
$ sudo apache2ctl start
-
Install mod-wsgi
$ sudo apt-get install libapache2-mod-wsgi
-
Test mod-wsgi installation
-
Make a directory "test" in /home/prod/venv
$ cd /home/prod/venv
$ mkdir test
-
Create a file named "testapp.wsgi" in /home/prod/venv/test and paste the following contents in it
def application(environ, start_response):
    status = '200 OK'
    output = 'Hello World!'
    response_headers = [('Content-type', 'text/plain'),
                        ('Content-Length', str(len(output)))]
    start_response(status, response_headers)
    return [output]
-
Make note of the user and group of the directory /home/prod/venv
$ ls -l /home/prod
Output:
drwxr-xr-x 8 johnsoncharles26 johnsoncharles26 4096 Jun 15 22:38 venv
The first occurrence of johnsoncharles26 is the user and the second occurrence is the group.
-
Open the apache2 conf file:
$ cd /etc/apache2
$ sudo vi apache2.conf
-
Paste the following contents after the last </Directory> tag. You can look up the WSGIDaemonProcess documentation here. Replace $USER with the actual username.
WSGIDaemonProcess columbus-daemon user=$USER group=$USER threads=25 python-path=/home/prod/venv/lib/python2.7/site-packages
WSGIProcessGroup columbus-daemon
WSGIScriptAlias /testapp "/home/prod/venv/test/testapp.wsgi"
<Directory /home/prod/venv/test>
    AllowOverride none
    Order deny,allow
    Allow from all
    Require all granted
</Directory>
-
Restart the server.
$ sudo apache2ctl restart
-
Navigate to compute-engine-instance-ip/testapp in a browser. You should see Hello World!
-
-
If using Galileo Spacetime, deploy the Galileo Webservice by following the instructions here. If you wish to deploy the service on the same host as Columbus and would like to access it from the same domain, then:
-
Enable mod_proxy
$ sudo a2enmod proxy
$ sudo apache2ctl restart
$ sudo a2enmod proxy_http
$ sudo apache2ctl restart
-
Make the following changes to your apache2.conf file.
$ cd /etc/apache2
$ sudo vi apache2.conf
Assuming columbus.xyz is the domain at which Columbus will be deployed, tomcat.columbus.xyz is the domain at which the Galileo webservice will be accessed, and the Galileo webservice is deployed on Tomcat on port 8080:
<VirtualHost *:80>
    ServerName tomcat.columbus.xyz
    ProxyPass / http://www.columbus.xyz:8080/
    ProxyPassReverse / http://www.columbus.xyz:8080/
</VirtualHost>
-
-
If the mod_wsgi installation was successful, follow the steps listed in the application setup and update the columbus/prod_settings.py and apache/django.wsgi files appropriately.
-
Finally, make the following changes to the apache2.conf file to deploy the Columbus application. Replace $USER with the actual username.
<VirtualHost *:80>
    ServerName www.columbus.xyz
    ServerAlias columbus.xyz
    # Use an appropriate email address
    ServerAdmin username@columbus.xyz
    DocumentRoot /home/prod/venv/columbus
    <Directory /home/prod/venv/columbus>
        AllowOverride none
        Order deny,allow
        Allow from all
        Require all granted
    </Directory>
    Alias /static/ /home/prod/venv/columbus/static/
    <Directory /home/prod/venv/columbus/static>
        AllowOverride none
        Order deny,allow
        Allow from all
        Require all granted
    </Directory>
    WSGIDaemonProcess columbus-daemon user=$USER group=$USER processes=1 threads=1000 python-path=/home/prod/venv/lib/python2.7/site-packages
    WSGIProcessGroup columbus-daemon
    WSGIScriptAlias / "/home/prod/venv/columbus/apache/django.wsgi"
    <Directory /home/prod/venv/columbus/apache>
        AllowOverride none
        Order deny,allow
        Allow from all
        Require all granted
    </Directory>
</VirtualHost>
-
Navigate to www.columbus.xyz and you should be able to access the Columbus application.
This section lists the steps needed to set up the application. Instructions are also included as comments in the prod_settings.py file.
-
Open the apache/django.wsgi file in the distribution and make sure the paths listed for the following are correct; otherwise, update them appropriately.
site.addsitedir('/home/prod/venv/lib/python2.7/site-packages')
sys.path.append('/home/prod/venv/columbus')
sys.path.append('/home/prod/venv/columbus/columbus')
activate_this = '/home/prod/venv/bin/activate_this.py'
-
Open the columbus/prod_settings.py file and make the following changes.
-
If using Galileo Spacetime, update the following variable to point to the Galileo Webservice.
WEBSERVICE_HOST = 'http://tomcat.columbus.xyz/galileo-web-service'
-
Update the supervisor port number if the default is already in use. This is the port on which Columbus master listens for workers to connect and communicate with them.
SUPERVISOR_PORT = 56789
-
Update the container size per your requirements. The value is in MB. This is the maximum memory allowed for any process executing a Target of a workflow; set it to an optimal value for your needs. When a Target requires more memory than is set here, the worker retries the Target, doubling the container size each time.
CONTAINER_SIZE_MB = 1024 # 1024 MB containers for any target
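The retry policy described above can be sketched as follows. This is illustrative only; the `container_sizes` helper is not the Columbus worker's actual API, but it shows how the memory limit grows across attempts.

```python
CONTAINER_SIZE_MB = 1024  # matches the setting above

def container_sizes(initial_mb=CONTAINER_SIZE_MB, retries=3):
    """Container size (MB) used for the initial attempt and each retry,
    doubling after every out-of-memory failure."""
    sizes = []
    size = initial_mb
    for _ in range(retries + 1):
        sizes.append(size)
        size *= 2
    return sizes

print(container_sizes())  # [1024, 2048, 4096, 8192]
```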
-
Update the user directory path (without a trailing slash) where the serialization files or pickles are stored on both master and workers. This path must exist on both master and worker machines with read and write permissions to the user running the application.
USER_DIRPATH = '/home/prod/storage'
-
Update the Google cloud storage bucket name (without gs://) for fault tolerance and data transfers between master and workers. This is required for Columbus to execute workflows.
USER_GCSPATH = 'gcs-bucket-name'
-
Update the temporary directory path (without a trailing slash). This is where files are stored for temporary purposes, such as while uploading data to the cloud or during the creation of fusion tables. The application does not clear files from this path, so create a system task to clear them periodically.
TEMP_DIRPATH = '/tmp'
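The periodic cleanup task suggested above could look like the sketch below, run from cron or a systemd timer. The `clear_stale_files` helper and the one-day cutoff are illustrative assumptions, not part of Columbus.

```python
import os
import time

TEMP_DIRPATH = '/tmp'        # must match the setting above
MAX_AGE_SECONDS = 24 * 3600  # e.g. remove anything older than a day

def clear_stale_files(dirpath, max_age_seconds, now=None):
    """Delete regular files in dirpath whose mtime is older than the cutoff."""
    now = time.time() if now is None else now
    removed = []
    for name in os.listdir(dirpath):
        path = os.path.join(dirpath, name)
        try:
            if os.path.isfile(path) and now - os.path.getmtime(path) > max_age_seconds:
                os.remove(path)
                removed.append(path)
        except OSError:
            pass  # file vanished or permissions changed; skip it
    return removed
```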
-
Place all the service account files in the directory named secured in the distribution. Service accounts are mandatory for Google Cloud Storage, Google Fusion Tables, and Google Drive. Additionally, you will need a client secret file to allow the application to access end users' Google Drive. If you use Google BigQuery or Google Earth Engine, you will need service accounts for those as well.
# Place all the service account files in this directory
SECURED_DIR = os.path.join(BASE_DIR, 'secured')

# Do not change. Used for internal purposes.
REQUIRES_DIR = os.path.join(BASE_DIR, 'requires')

# Service account credentials from the Google dev console for Google Earth Engine. Required if GIS computations
# are performed in the Targets.
EE_CREDENTIALS = os.path.join(SECURED_DIR, 'columbus-earth-engine.json')

# Service account credentials from the Google dev console for Google BigQuery. Required if BigQuery serves as
# one of the data source options.
BQ_CREDENTIALS = os.path.join(SECURED_DIR, 'earth-outreach-bigquery.json')

# Service account credentials from the Google dev console for Google Cloud Storage. Required for fault tolerance
# and data transfers. The service account listed here must have full permissions to the bucket listed for the
# property USER_GCSPATH above.
CS_CREDENTIALS = os.path.join(SECURED_DIR, 'columbus-earth-engine.json')

# Service account credentials from the Google dev console for Google Fusion Tables and Google Drive. Required to
# enable web-mapping visualizations.
FT_CREDENTIALS = os.path.join(SECURED_DIR, 'columbus-earth-engine.json')

# Client secret to gain access to end users' Google Drive. Required to obtain permission to a client's Google
# Drive if the same is serving as one of the data source options.
GD_CREDENTIALS = os.path.join(SECURED_DIR, 'columbus-client-secret.json')
-
Update the list of worker host names. Hostnames should be fully qualified and master must be able to login to all those workers using passwordless SSH.
# WORKERS CONFIGURATION
WORKERS = [socket.getfqdn()]

# Default is ~/columbus. If specified, the path must be fully qualified.
WORKER_VIRTUAL_ENV = None

# Port number to SSH into the worker machines
WORKER_SSH_PORT = 22

# Username to SSH into the worker machines
WORKER_SSH_USER = 'johnsoncharles26'

# Password, if any, to SSH into the worker machines. If passwordless SSH is enabled and the private key has a
# passphrase, this must be that passphrase.
WORKER_SSH_PASSWORD = None

# Fully qualified path to the private key file. If not specified, ~/.ssh/id_rsa is tried.
WORKER_SSH_PRIVATE_KEY = None
-
Update the scheduling strategy for the execution of workflows. It must be one of local, remote, or hybrid. If you are unsure, use the defaults.
# Scheduler Configuration
# Learn about the scheduling strategies from the Columbus paper. If not sure, leave the defaults.
PIPELINE_SCHEDULING_STRATEGY = "hybrid"

# Waiting-to-running target ratio, used only for the hybrid scheduling strategy.
# Default is 1, meaning targets are sent to the same worker as long as the number
# of targets waiting is less than or equal to the number of running targets of any user.
HYBRID_SCHEDULING_WR_RATIO = 1
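The waiting-to-running ratio check described above can be sketched as follows. This is an illustrative reading of the setting, not Columbus's actual scheduler code: a user's next target stays on the same worker while the waiting count does not exceed ratio times the running count.

```python
HYBRID_SCHEDULING_WR_RATIO = 1  # matches the default above

def keep_on_same_worker(waiting, running, ratio=HYBRID_SCHEDULING_WR_RATIO):
    """True if the next target should go to the same worker under the
    hybrid strategy; False means it spills to another worker."""
    return waiting <= ratio * running

print(keep_on_same_worker(waiting=3, running=3))  # True: 3 <= 1 * 3
print(keep_on_same_worker(waiting=4, running=3))  # False: 4 > 1 * 3
```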
-
Update the temporary or staging cloud storage bucket to use while communicating with Google Earth Engine. Required only if GEE will be used.
# Cloud Storage bucket to use for temporary file storage while communicating with Google Earth Engine.
# The service account key specified for EE_CREDENTIALS must have full access to this bucket.
CS_TEMP_BUCKET = 'staging.columbus-csu.appspot.com'
-
Update the SECRET_KEY for security purposes and remember to set the DEBUG parameter to False after deploying the application successfully.
# SECURITY WARNING: keep the secret key used in production secret!
# Change the key to something else after deployment
SECRET_KEY = '3bg_5!omle5)+60!(qndj2!#yi+d%2oug2ydo(*^nup+9if0$k'

# Remove the following debug params after successful deployment
DEBUG = True
-
Update the database information. If using Google cloud SQL, do not forget to whitelist the IP of the host that will access this database.
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.mysql',  # assuming the database is MySQL
        'NAME': 'database-name',               # name of the database to use
        'USER': 'user-name',                   # username to use while connecting to the database
        'PASSWORD': 'password',                # password to use while connecting to the database
        'HOST': 'mysql-ip-address',            # public IP of the database server
        'PORT': '3306'                         # port number of the database server
    }
}
-
Update the list of domain names that point to this server.
# List of domain names the Django server should serve. Must be specified when DEBUG = False.
ALLOWED_HOSTS = ['www.columbus.xyz', 'columbus.xyz']
-
Update the time zone to the time zone of the server if UTC is not preferred.
# Internationalization
# https://docs.djangoproject.com/en/1.8/topics/i18n/
LANGUAGE_CODE = 'en-us'
TIME_ZONE = 'UTC'
USE_I18N = True
USE_L10N = True
USE_TZ = True
-
Update email settings. The defaults assume that the SendGrid service is used.
# Change the admin name and email address
ADMINS = [
    ('Johnson Kachikaran', 'jcharles@cs.colostate.edu'),
]

# Refer to configuring SendGrid using Postfix on Google Compute Engine here:
# https://cloud.google.com/compute/docs/tutorials/sending-mail/using-sendgrid
EMAIL_HOST = 'smtp.sendgrid.net'
EMAIL_HOST_USER = 'sendgrid-username'
EMAIL_HOST_PASSWORD = 'sendgrid-password'
EMAIL_PORT = 2525
EMAIL_USE_TLS = True

# Use an appropriate prefix to add to the subject line of all the emails sent from the application
EMAIL_SUBJECT_PREFIX = '[Columbus] '

# Email address of the sender to use while sending emails from the application.
# Typically, noreply@columbus.xyz
EMAIL_SENDER = 'Sender Name <senders email address including angular brackets>'

# Change the manager name and email address
MANAGERS = (
    ('Johnson Kachikaran', 'jcharles@cs.colostate.edu'),
)

SEND_BROKEN_LINK_EMAILS = True
-
Update the logger settings to meet your needs. Default file sizes are 10MB and a maximum of 10 files are stored as backup.
# Logger settings
LOGGING = {
    'version': 1,
    'disable_existing_loggers': False,
    'formatters': {
        'verbose': {
            'format': "[%(asctime)s] %(levelname)s [%(name)s:%(lineno)s] %(message)s",
            'datefmt': "%d/%b/%Y %H:%M:%S"
        },
        'simple': {
            'format': '%(levelname)s %(message)s'
        },
    },
    'handlers': {
        'pyedf_handler': {
            'class': 'logging.handlers.RotatingFileHandler',
            'filename': os.path.join(BASE_DIR, 'pyedf.log'),
            'maxBytes': 1024 * 1024 * 10,  # 10MB default
            'backupCount': 10,
            'formatter': 'verbose'
        },
        'django_handler': {
            'class': 'logging.handlers.RotatingFileHandler',
            'filename': os.path.join(BASE_DIR, 'django.log'),
            'maxBytes': 1024 * 1024 * 10,  # 10MB default
            'backupCount': 10,
            'formatter': 'verbose'
        }
    },
    'loggers': {
        'django': {
            'handlers': ['django_handler'],
            'propagate': True,
            'level': 'ERROR',
        },
        'pyedf': {
            'handlers': ['pyedf_handler'],
            'level': 'INFO',
        },
    }
}
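The rotation behaviour configured above can be observed with a small standalone sketch. The tiny maxBytes, the temporary directory, and the logger name here are illustrative only; the application itself uses 10MB files with 10 backups.

```python
import logging
import logging.handlers
import os
import tempfile

# Write to a throwaway directory with a tiny maxBytes so rollover is visible.
log_dir = tempfile.mkdtemp()
log_path = os.path.join(log_dir, 'pyedf.log')

handler = logging.handlers.RotatingFileHandler(log_path, maxBytes=200, backupCount=2)
handler.setFormatter(logging.Formatter("[%(asctime)s] %(levelname)s %(message)s"))
logger = logging.getLogger('pyedf-demo')
logger.setLevel(logging.INFO)
logger.addHandler(handler)

for i in range(20):
    logger.info('message %d', i)

# The active file rolls over to pyedf.log.1, then pyedf.log.2; backups
# beyond backupCount are dropped.
print(sorted(os.listdir(log_dir)))
```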
-
-
Set up the database and load the data.
$ cd /home/prod/venv
$ source bin/activate
(venv) /home/prod/venv$ cd columbus
(venv) /home/prod/venv/columbus$ python manage.py makemigrations
(venv) /home/prod/venv/columbus$ python manage.py migrate
(venv) /home/prod/venv/columbus$ python manage.py loaddata typemodel.json
-
Create a superuser to have administrator access to the Django admin portal. After successful deployment, you can access the admin portal at www.columbus.xyz/admin, assuming columbus.xyz is the domain name.
(venv) /home/prod/venv/columbus$ python manage.py createsuperuser
# Deactivate the virtual environment after creating the superuser
(venv) /home/prod/venv/columbus$ deactivate
Copyright (c) 2017 Johnson Kachikaran, Colorado State University
Licensed under MIT License (the "License"). The License is included in the software distribution and you may also view the same here.
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.