This is a standalone cluster that includes the big data tools required by BNDF. The cluster is built and configured with Docker. Extending and scaling up to a multi-node cluster can easily be done with Docker Swarm or another container orchestration tool such as Kubernetes.
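As a rough sketch of how such a scale-out could look (the stack name and the use of the repository's compose file are assumptions, not a documented workflow):

$ sudo docker swarm init                               # turn the current host into a Swarm manager
$ sudo docker stack deploy -c docker-compose.yml bndf  # deploy the compose services as a Swarm stack

Additional worker nodes could then join the cluster with the token printed by docker swarm init.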
The tools configured in this cluster are summarized in the following table.
Tool | Dependencies | Dependency Versions | Version |
---|---|---|---|
Apache Hadoop | Java | 8, 11 | 2.7.7, 3.2.1 |
Apache Hive | Apache Hadoop, PostgreSQL | 2.7.7, 12 | 2.3.7 |
Apache Spark | Apache Hive, Apache Hadoop | 2.3.7, 3.2.1 | 3.0.0 |
Apache Zeppelin | Apache Spark | 3.0.0 | 0.9.0 |
MongoDB | - | - | 4.2.6 |
Netdata | - | - | latest |
Service | Port |
---|---|
Spark Master | 8080 |
Spark Job WebUi | 4042 |
HDFS Namenode | 9874 |
HDFS Datanode | 9864 |
Zeppelin | 8085 |
Zeppelin Jobs WebUi | 4040 |
MongoDB | 27017 |
Netdata | 19999 |
Hive | 10000 |
Each service WebUi is accessible at http://MACHINE_IP:PORT, where MACHINE_IP is either localhost or the remote server's IPv4 address.
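For example, assuming a local deployment, the Spark Master UI is reachable at http://localhost:8080, and a quick availability check can be done with curl:

$ curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8080   # should print 200 once the Spark Master UI is up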
Docker and Docker Compose should be installed in order to create the cluster. Docker is generally supported on all major operating systems.
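If Docker is not already installed, one option is the official convenience script (shown below as a sketch; a package-manager installation may be preferable depending on the operating system):

$ curl -fsSL https://get.docker.com | sh   # install the Docker engine via the official convenience script
$ sudo pip install docker-compose          # one way to install Docker Compose; distribution packages also exist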
$ git clone https://github.com/M0h3eN/bndfcluster.git
$ cd bndfcluster
Directories are configurable, and their paths can be changed by the user.
- The volumes directory includes the configs and data of the services; it should be placed on a disk with abundant capacity.
- The sample-data directory corresponds to the input data directory.
- The jars directory includes extra jar files that the user needs.
- The appJars directory includes the BNDF jar file.
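Relative to the repository root, the default layout therefore looks roughly as follows (a sketch; the actual contents depend on the services and data used):

bndfcluster/
├── volumes/      # service configs and data (place on a large disk)
├── sample-data/  # input data
├── jars/         # extra user jar files
└── appJars/      # BNDF jar file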
The cluster can be created with the create-hdfs-spark-cluster.sh
script. This script takes two parameters, VOLUMES_PATH and DATA_PATH respectively, which correspond to the volumes and sample-data directories.
$ sudo ./create-hdfs-spark-cluster.sh ./volumes ./sample-data
This will create the cluster with the default paths. The first run could take some time, since all required docker images are pulled from Docker Hub. The cluster status can be checked by running
$ sudo docker ps
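For a more compact view of the running containers, the standard --format flag of docker ps can be used (a sketch; the container names depend on the compose configuration):

$ sudo docker ps --format "table {{.Names}}\t{{.Status}}\t{{.Ports}}"   # show only names, status and exposed ports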
Sample data to run BNDF can be fetched with the get-data.sh
script. It takes the DATA_PATH parameter.
$ ./get-data.sh ./sample-data
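Once the script finishes, the downloaded files should be visible in the data directory (the file names depend on the dataset and are not listed here):

$ ls -lh ./sample-data   # verify the sample data was downloaded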
The BNDF jar file can be fetched into the appJars directory with the get-bndf-jar.sh script.
$ ./get-bndf-jar.sh ./appJars
The RecordingDataLoader module can be run with the run-recording-data-loader.sh
script. It takes five parameters in the following order:
- VOLUMES_PATH
- DATA_PATH
- SPARK_EXECUTOR_MEMORY
- SPARK_EXECUTOR_CORES
- SPARK_DRIVER_MEMORY
$ sudo ./run-recording-data-loader.sh ./volumes ./sample-data 35 18 10
This runs the RecordingDataLoader module with the default path configuration, 35 GB of Spark executor memory, 18 Spark executor cores, and 10 GB of Spark driver memory.
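While the job runs, its progress can be followed through the Spark Job WebUi on port 4042 (see the port table above) or by tailing the container logs. The container name below is an assumption; check sudo docker ps for the actual name:

$ sudo docker logs -f spark-master   # container name is an assumption; replace with the name shown by docker ps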
The Sorting module can be run with the run-sorting.sh
script. It takes six parameters in the following order:
- VOLUMES_PATH
- DATA_PATH
- SPARK_EXECUTOR_MEMORY
- SPARK_EXECUTOR_CORES
- SPARK_DRIVER_MEMORY
- Experiment/Session Name
The Experiment/Session list can be obtained after running run-recording-data-loader.sh
. It is accessible either in the Meta Data Database or by running the get-sessionOrExperiment-list.sh
script on the cluster.
$ sudo ./get-sessionOrExperiment-list.sh
Experiment_Kopo_2018-04-25_J9_8600
Experiment_Kopo_2018-04-25_J9_8900
For example, to sort the Experiment_Kopo_2018-04-25_J9_8600
experiment:
$ sudo ./run-sorting.sh ./volumes ./sample-data 35 18 10 Experiment_Kopo_2018-04-25_J9_8600
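After sorting finishes, the results referenced in the Meta Data Database can be inspected through MongoDB on port 27017. The exact database and collection names are not documented here, so the command below only lists the available databases (a sketch):

$ mongo --port 27017 --eval "db.adminCommand('listDatabases')"   # list databases; drill into the relevant collections from there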