Big Data Tools

Simple and easy to use provisioning, deployment and administration tools for Big Data / NoSQL environments

Provisioning is available for both Vagrant/VirtualBox and AWS EC2, along with Fabric-based deployment tools for the software listed under Software Specific Configuration.

Prerequisites (Mac OSX)

Miniconda - Python package manager

wget https://repo.continuum.io/miniconda/Miniconda2-latest-MacOSX-x86_64.sh -O miniconda.sh
chmod +x miniconda.sh
./miniconda.sh -b

Add the following to your ~/.profile:

export PATH="$HOME/miniconda2/bin:$PATH"

Then run the following commands to create and activate the environment:

source ~/.profile
conda env create
source activate bigdata-tools

Provisioning

You can use the provided Vagrant or AWS EC2 provisioning tools to quickly create virtual machines, or you can create your own.

Deployment

Once you have configured and launched your virtual hosts, you are now ready to deploy software to those machines.

1. Configuration

Sample deployment configuration files are provided for Vagrant and AWS. Copy one to use as a starting point:

cp deploy/config/vagrant.yml.sample deploy/config/vagrant.yml

OR

cp deploy/aws.yml.sample deploy/aws.yml

Update the configuration file based on the software you want to deploy. All software follows the same pattern, which includes the host name, public IP, private IP, software and software arguments, as shown below:

- name: <host name>
  public-ip: <public ip address>
  private-ip: <private ip address>
  software:
    - name: <software to install>
      <argument key>: <argument values>
      <argument key>: <argument values>
      <argument key>: <argument values>
    - name: <software to install>
      <argument key>: <argument values>

Additionally, you can use the all-hosts section to install the same software on all hosts. The all-hosts and hosts configuration settings are merged automatically when deploying.
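
The merge described above might look roughly like the following. This is a minimal sketch, not the project's actual code: the top-level keys `all-hosts` and `hosts`, the shape of the `all-hosts` section, the function name `merge_all_hosts`, and the rule that shared software is placed before host-specific software are all assumptions made for illustration.

```python
import copy


def merge_all_hosts(config):
    """Return host entries with the shared software list prepended.

    Assumes (hypothetically) that `config` has an `all-hosts` mapping with a
    `software` list and a `hosts` list of per-host mappings.
    """
    shared = config.get("all-hosts", {}).get("software", [])
    merged = []
    for host in config.get("hosts", []):
        entry = copy.deepcopy(host)  # leave the input config untouched
        own = entry.get("software", [])
        own_names = {item["name"] for item in own}
        # Shared software comes first; a host-level entry with the same name wins.
        inherited = [copy.deepcopy(s) for s in shared if s["name"] not in own_names]
        entry["software"] = inherited + own
        merged.append(entry)
    return merged


# Example data only; the IP addresses are documentation placeholders.
config = {
    "all-hosts": {"software": [{"name": "java-8"}]},
    "hosts": [
        {"name": "node1", "public-ip": "192.0.2.10", "private-ip": "10.0.0.10",
         "software": [{"name": "zookeeper", "port": 2181}]},
        {"name": "node2", "public-ip": "192.0.2.11", "private-ip": "10.0.0.11"},
    ],
}

for host in merge_all_hosts(config):
    print("{}: {}".format(host["name"], [s["name"] for s in host["software"]]))
```

In this sketch, `node1` ends up with `java-8` followed by `zookeeper`, and `node2` gets only `java-8`.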

The arguments for each tool are shown in the next section, under a heading that matches the software name.

Software Specific Configuration

cassandra

argument	default value	description
cluster-name	`'Test Cluster'`	unique name of your Cassandra cluster
data-file-directory	`/var/lib/cassandra/data`	data file directory
commit-log-directory	`/var/lib/cassandra/commit_log`	commit log directory
saved-caches-directory	`/var/lib/cassandra/saved_caches`	saved caches directory
endpoint-snitch	`SimpleSnitch`	determines data center and racks for nodes
seeds	host's private ip	comma-separated list of cassandra nodes as `host:port`
listen-address	host's private ip	ip address used to connect to node
rpc-address	host's private ip	listen ip address for client connections
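
To illustrate how the defaults in the table combine with the values in a configuration entry, here is a minimal sketch. The default values are taken from the table above; the function name `resolve_cassandra_args` and the way defaults are applied are assumptions, not the project's actual implementation.

```python
# Static defaults copied from the cassandra argument table.
CASSANDRA_DEFAULTS = {
    "cluster-name": "Test Cluster",
    "data-file-directory": "/var/lib/cassandra/data",
    "commit-log-directory": "/var/lib/cassandra/commit_log",
    "saved-caches-directory": "/var/lib/cassandra/saved_caches",
    "endpoint-snitch": "SimpleSnitch",
}

# Arguments whose default is the host's private IP according to the table.
PRIVATE_IP_DEFAULTS = ("seeds", "listen-address", "rpc-address")


def resolve_cassandra_args(software_entry, private_ip):
    """Fill in table defaults for any argument the entry does not set (hypothetical helper)."""
    resolved = dict(CASSANDRA_DEFAULTS)
    for key in PRIVATE_IP_DEFAULTS:
        resolved[key] = private_ip
    # Explicit values override defaults; `name` is the software name, not an argument.
    resolved.update({k: v for k, v in software_entry.items() if k != "name"})
    return resolved


entry = {"name": "cassandra", "cluster-name": "analytics"}
args = resolve_cassandra_args(entry, private_ip="10.0.0.10")
print(args["cluster-name"])      # value set in the entry
print(args["endpoint-snitch"])   # default from the table
print(args["listen-address"])    # falls back to the host's private IP
```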

cassandra-lucene-index

No arguments required.

crate

argument	default value	description
cluster-name	`my-cluster`	name of your crate cluster
data-dir		array of data directories (see sample config)
heap-size		Java heap size
security-group-name		name of the EC2 security group used to discover crate nodes (optional)
product-tag		EC2 instance product tag to discover crate nodes (optional)
aws-access-key		AWS access key (optional)
aws-secret-key		AWS secret key (optional)

citusdb

argument	default value	description
db-name		name of database
db-user		name of database user
db-password		database password
data-dir		data directory where Postgres data is stored

java-8

No arguments required.

kafka-broker

argument	default value	description
version	`0.10.0.1`	kafka version
zookeeper-hosts	`localhost:2181`	comma-separated connect string of `host:port` nodes
broker-id	`1`	unique identifier for each broker
log-directories	`/var/lib/kafka-logs`	comma-separated directories for Kafka data

Note: zookeeper is required for kafka-broker and should be installed first. This can be specified by how the entries are ordered in your YAML configuration file.
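
Because install order follows the order of entries, a quick check before deploying can catch a misordered list. The sketch below covers only the case where one host runs both packages; the function name `check_install_order` is hypothetical and is not part of the project.

```python
def check_install_order(software, required="zookeeper", dependent="kafka-broker"):
    """Return True if `required` is listed before `dependent`, or if `dependent` is absent."""
    names = [item["name"] for item in software]
    if dependent not in names:
        return True  # nothing to check for this host
    return required in names and names.index(required) < names.index(dependent)


good = [{"name": "java-8"}, {"name": "zookeeper"}, {"name": "kafka-broker", "broker-id": 1}]
bad = [{"name": "kafka-broker", "broker-id": 1}, {"name": "zookeeper"}]

print(check_install_order(good))
print(check_install_order(bad))
```

Here the first list passes the check and the second does not, since `kafka-broker` is listed ahead of `zookeeper`.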

kafka-manager

argument	default value	description
zookeeper-hosts	`localhost:2181`	comma-separated connect string of `host:port` nodes

Note: zookeeper is required for kafka-manager and should be installed first. This can be specified by how the entries are ordered in your YAML configuration file.
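
The `zookeeper-hosts` value used by kafka-broker and kafka-manager is a comma-separated list of `host:port` pairs. As a small illustration, the string can be built from a list of addresses; the helper name `zookeeper_connect_string` is hypothetical, and the default port is the `2181` shown in the tables.

```python
def zookeeper_connect_string(hosts, port=2181):
    """Join host addresses into a comma-separated `host:port` connect string."""
    return ",".join("{}:{}".format(host, port) for host in hosts)


# Example private IPs only; substitute the addresses of your ZooKeeper nodes.
connect = zookeeper_connect_string(["10.0.0.10", "10.0.0.11", "10.0.0.12"])
print(connect)
```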

redis

argument	default value	description
version	`3.2.6`	redis version
port	`6379`	redis port
data-directory	`/var/lib/redis/`	location of redis data on disk

riak-kv

No arguments required.

zookeeper

argument	default value	description
port	`2181`	zookeeper port
nodes		array of ZK server nodes (see sample config)

2. Deployment

Once the configuration file has been edited, you are ready to deploy software to your hosts. This is done using Fabric:

# fab deploy:"<relative location of your deployment config file>"

fab deploy:"deploy/config/vagrant.yml"

3. Administration

For a list of administration tools available:

fab -l

Most commands require the path to your deployment config file, for example:

fab cassandra.nodetool:"deploy/config/aws-cassandra-cluster.yml","status"
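
The command above uses the `task:arg1,arg2` form for passing arguments to a Fabric task. If you want to call such commands from a script, the argument list can be assembled as in the sketch below. The helper name `fab_argv` is hypothetical, and the sketch assumes no argument contains a comma.

```python
def fab_argv(task, *args):
    """Build the argv list for `fab task:arg1,arg2,...` (no escaping of commas)."""
    if not args:
        return ["fab", task]
    return ["fab", "{}:{}".format(task, ",".join(args))]


argv = fab_argv("cassandra.nodetool", "deploy/config/aws-cassandra-cluster.yml", "status")
print(argv)
```

Passing the resulting list to `subprocess.call(argv)` avoids shell quoting, which is why the quotation marks from the command line are not needed here.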


kylebush/bigdata-tools
