Graph Analytics

The Graph Analytics benchmark relies on the Spark framework to perform graph analytics on large-scale datasets. Apache provides a graph processing library, GraphX, designed to run on top of Spark. This benchmark performs various algorithms including PageRank on a Twitter dataset.

Dockerfiles

Supported tags and their respective Dockerfiles:

  • latest contains the algorithm and application entrypoint.

Datasets

The benchmark uses a graph dataset generated from Twitter. To create the dataset container from the cloudsuite/twitter-dataset-graph image:

$ docker create --name twitter-data cloudsuite/twitter-dataset-graph
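
To verify that the dataset container was created and exposes its data as a Docker volume, a quick sanity check (using the standard Docker CLI) is to inspect its mounts:

$ docker inspect -f '{{ .Mounts }}' twitter-data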

Running/Tweaking the Benchmark

The benchmark can run three graph algorithms using GraphX: PageRank, connected components, and triangle count.

To run the benchmark, run the following command:

$ docker run --rm --volumes-from twitter-data -e WORKLOAD_NAME=pr cloudsuite/graph-analytics \
    --driver-memory 8g --executor-memory 8g

Note that any arguments passed to the container are forwarded to spark-submit, the interface for submitting jobs to Spark. In the command above, the --driver-memory and --executor-memory arguments are passed to spark-submit so that Spark has enough memory to execute the benchmark in memory.
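
Because all arguments are forwarded verbatim, any other spark-submit option can be passed the same way. As an illustrative sketch (the parallelism value below is an assumption, not a benchmark default), you could tune the number of partitions through Spark's generic --conf mechanism:

$ docker run --rm --volumes-from twitter-data -e WORKLOAD_NAME=pr cloudsuite/graph-analytics \
    --driver-memory 8g --executor-memory 8g \
    --conf spark.default.parallelism=64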

The environment variable WORKLOAD_NAME selects the graph algorithm that the container executes: use pr, cc, or tc for PageRank, connected components, or triangle count, respectively. If the variable is not set, PageRank is selected by default.
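
For example, running connected components instead of PageRank only requires changing the environment variable:

$ docker run --rm --volumes-from twitter-data -e WORKLOAD_NAME=cc cloudsuite/graph-analytics \
    --driver-memory 8g --executor-memory 8g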

All of these analytics workloads require a large amount of memory to finish. As a reference, running tc on a single CPU core requires 8GB; if you allocate more cores, more memory is necessary. If you do not provide enough memory, the job fails with an OutOfMemoryError exception. We recommend more than 16GB of memory per core to minimize GC activity, which can otherwise influence your measurements.
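
Following this guideline, a tc run sized according to the recommendation above might look like the sketch below (the memory figures come from the recommendation, not from measured requirements on your machine):

$ docker run --rm --volumes-from twitter-data -e WORKLOAD_NAME=tc cloudsuite/graph-analytics \
    --driver-memory 16g --executor-memory 16g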

Multi-node deployment

This section explains how to run the benchmark using multiple Spark workers (each running in a Docker container) that can be spread across multiple nodes in a cluster.

First, create the dataset container on every physical node where Spark workers will be running.

$ docker create --name twitter-data cloudsuite/twitter-dataset-graph

Start the Spark master and the Spark workers. You can start the master with the following command:

$ docker run -dP --net host --name spark-master \
    cloudsuite/spark:3.3.2 master

By default, the container uses its hostname as the listening address for connections from the worker nodes, so if you keep this default, ensure that all worker machines can reach the master machine by that hostname. You can also override the listening address by setting the environment variable SPARK_MASTER_IP with the container option -e SPARK_MASTER_IP=X.X.X.X.
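
For instance, to make the master listen on a specific address instead of its hostname (10.0.0.1 below is a placeholder for your master machine's address):

$ docker run -dP --net host --name spark-master \
    -e SPARK_MASTER_IP=10.0.0.1 \
    cloudsuite/spark:3.3.2 master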

The workers get access to the dataset with --volumes-from twitter-data.

# Set up worker 1
$ docker run -dP --net host --volumes-from twitter-data --name spark-worker-01 \
    cloudsuite/spark:3.3.2 worker spark://SPARK-MASTER:7077

# Set up worker 2
$ docker run -dP --net host --volumes-from twitter-data --name spark-worker-02 \
    cloudsuite/spark:3.3.2 worker spark://SPARK-MASTER:7077
    
# ...

Here, SPARK-MASTER stands for the Spark master's listening address.
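
Before submitting the job, it can help to confirm that the workers registered with the master. Two common ways, assuming Spark's default standalone master behavior, are to grep the master container's logs or to query the master's web UI (port 8080 by default):

$ docker logs spark-master | grep -i "registering worker"

# From any machine that can reach the master:
$ curl http://SPARK-MASTER:8080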

Finally, run the benchmark as the client that submits the job to the Spark master:

$ docker run --rm --net host --volumes-from twitter-data -e WORKLOAD_NAME=pr \
    cloudsuite/graph-analytics \
    --driver-memory 8g --executor-memory 8g \
    --master spark://SPARK-MASTER:7077
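
The WORKLOAD_NAME values and spark-submit arguments described in the single-node section apply unchanged here. As an illustrative addition (not something the benchmark requires), spark-submit's --total-executor-cores option can cap how many cores the job takes from the standalone cluster:

$ docker run --rm --net host --volumes-from twitter-data -e WORKLOAD_NAME=pr \
    cloudsuite/graph-analytics \
    --driver-memory 8g --executor-memory 8g \
    --total-executor-cores 8 \
    --master spark://SPARK-MASTER:7077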