Addressing transient workload spikes with Cloud Bursting in Apache Flink

Problem statement:

  • Problem: This project provides a solution for handling sudden spikes in workload in Apache Flink applications. This matters because Apache Flink is a distributed stream processing framework designed to handle large amounts of data in real time; when the input rate exceeds the processing capacity, the system becomes overwhelmed, leading to increased latency, degraded performance, or even failures.
  • Technique: Cloud bursting is an alternative to back-pressure, the mechanism that slows down the input rate when the system is overwhelmed. Instead of throttling the source, cloud bursting offloads the excess workload to a cloud provider, such as AWS, where it is processed in parallel. This helps the system keep operating at a high level of performance even during a sudden spike in workload (see the sketch below).
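
As an illustration, the bursting decision inside a Flink operator might look like the minimal sketch below. The class and field names (BurstRouter, busyRatio) are assumptions for illustration, not the project's actual InvokeOperator; the 0.06 threshold mirrors the BUSY_TIME_THRESHOLD used in the latency experiments documented below.

```java
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;

// Hypothetical sketch: divert records to a side output (to be batched and sent
// to AWS Lambda) whenever the operator reports itself busier than a threshold.
public class BurstRouter extends ProcessFunction<String, String> {

    // Records emitted here are picked up by the Lambda invoker downstream.
    public static final OutputTag<String> TO_LAMBDA =
            new OutputTag<String>("to-lambda") {};

    // Busy-time threshold (fraction of time the operator is busy); 0.06 is
    // the value used in the experiments below.
    private static final double BUSY_TIME_THRESHOLD = 0.06;

    // Assumed to be refreshed periodically from Flink's metrics REST API.
    private volatile double busyRatio = 0.0;

    @Override
    public void processElement(String value, Context ctx, Collector<String> out) {
        if (busyRatio > BUSY_TIME_THRESHOLD) {
            ctx.output(TO_LAMBDA, value); // burst: offload to the cloud
        } else {
            out.collect(value);           // process locally as usual
        }
    }
}
```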

Source:

Nexmark is a benchmarking suite for queries over continuous data streams. This project is inspired by the NEXMark research paper and Apache Beam Nexmark.

These are multiple queries over a three-entity model representing an online auction system:

  • Person represents a person submitting an item for auction and/or making a bid on an auction.
  • Auction represents an item under auction.
  • Bid represents a bid for an item under auction.
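
For orientation, the three entities map naturally onto plain Java classes. The fields below are illustrative (drawn from the standard Nexmark schema), not necessarily the exact ones used in this project:

```java
// Minimal sketch of the three-entity auction model (illustrative fields only).
public class NexmarkModel {

    /** A person who submits items for auction and/or places bids. */
    static class Person {
        long id;
        String name;
        String emailAddress;
    }

    /** An item under auction. */
    static class Auction {
        long id;
        long seller;  // id of the Person selling the item
        long expires; // when the auction closes
    }

    /** A bid on an item under auction. */
    static class Bid {
        long auction; // id of the Auction being bid on
        long bidder;  // id of the Person placing the bid
        long price;
    }
}
```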

Basic Architecture:

Figure 1: Solution architecture.

Demo:

Please view the recording here.

Run instructions:

  • Environment -

    • Java 8
    • Flink version - 1.16.0
    • IntelliJ
  • Using IntelliJ -

    1. Build the Maven project with mvn clean package, which generates a JAR file for the Flink pipeline; alternatively, create a Maven run configuration in IntelliJ that runs clean package.
    2. Install all dependencies using Maven.
    3. Create another application run configuration with the main class operator.FlinkPipeline. Select the Add dependencies with 'provided' scope to classpath option. Set the working directory to team-6/Code/deliverable-2.
    4. Build the project using the application run configuration to generate the jar files in the target folder.
  • Using Command Line (MacOS) -

    1. Install Maven brew install maven
    2. Create a ~/.mavenrc file and add the line export JAVA_HOME=<path>. You can get the path by running /usr/libexec/java_home. We built the project with Java 8, so that is the version we can vouch for.
    3. Go to Code/deliverable-2 directory and run mvn clean package
    4. This should create the flink-java-project-0.1.jar file under the target folder.
  • To run on Flink web UI -

    1. Start the Flink cluster ( ./start-cluster.sh )
    2. Go to http://localhost:8081 in your browser
    3. Upload the jar file in the Flink web UI under the Submit New Job tab
  • To switch between Java and Python AWS Lambda Functions -

    1. In the file Code/deliverable-2/src/main/java/lambda/LambdaInvokerUsingURLPayload.java, update line:18 with configs/JavaConfig.json for invoking the AWS Lambda written in Java, or configs/PythonConfig.json for invoking the AWS Lambda written in Python (see the sketch after this list).
    2. In the file Code/deliverable-2/src/main/java/operator/InvokeOperator.java, Change line:131 to getResultFromJsonPython while using configs/PythonConfig.json or to getResultFromJsonJava while using configs/JavaConfig.json.
  • To test for input rates greater than the experiments documented -

    1. In the file Code/deliverable-2/src/main/java/operator/InvokeOperator.java, comment out line:110 to line:119 and uncomment line:120 to use a standard input threshold of 0.4 in the logic.
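
For orientation, the runtime switch described above amounts to pointing the invoker at one of the two JSON configs and picking the matching response parser. A hypothetical sketch, not the repo's actual code:

```java
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical sketch of the config switch; the real code lives in
// LambdaInvokerUsingURLPayload.java and InvokeOperator.java.
public class LambdaConfigSwitch {
    static final String JAVA_CONFIG = "configs/JavaConfig.json";
    static final String PYTHON_CONFIG = "configs/PythonConfig.json";

    public static void main(String[] args) throws Exception {
        // Pick the config matching the Lambda runtime you deployed.
        String configPath = PYTHON_CONFIG; // or JAVA_CONFIG
        String config = new String(Files.readAllBytes(Paths.get(configPath)), "UTF-8");
        System.out.println("Loaded Lambda config: " + config);
        // Remember to also switch the response parser in InvokeOperator
        // (getResultFromJsonPython vs. getResultFromJsonJava).
    }
}
```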

Source Configurations:

To change configurations at the source, navigate to the file NexmarkConfiguration.java under team-6/Code/deliverable-2.

For instance, to change the input rate, edit the configuration variables firstEventRate and nextEventRate; to change the duration of the spike, edit the variable ratePeriodSec. A sketch follows.
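
A hypothetical excerpt showing these knobs (the values are examples, not the project's defaults):

```java
// Hypothetical excerpt of NexmarkConfiguration.java; only the rate-related
// knobs mentioned above are shown, with example values.
public class NexmarkConfiguration {

    /** Event rate at the start of the run, in events/sec. */
    public int firstEventRate = 500;

    /** Event rate the generator switches to next; raising this above
     *  firstEventRate produces the workload spike. */
    public int nextEventRate = 1000;

    /** How long each rate period lasts, i.e. the duration of the spike (sec). */
    public int ratePeriodSec = 60;
}
```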

Lambda Deployments:

Although the Lambda functions are already deployed behind an exposed API Gateway, the following are the steps to deploy them in your own account.
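
Since the functions sit behind API Gateway, invoking them is an ordinary HTTPS POST with a JSON payload, in the spirit of LambdaInvokerUsingURLPayload. A minimal Java 8 sketch; the endpoint URL and payload shape are placeholders, not the project's real ones:

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

// Hypothetical sketch: POST a batch of records to the Lambda's API Gateway URL.
public class LambdaHttpInvoke {
    public static void main(String[] args) throws Exception {
        // Placeholder endpoint; substitute the URL printed by your deployment.
        URL url = new URL("https://example.execute-api.us-east-1.amazonaws.com/dev/process");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);

        // Placeholder payload shape.
        byte[] payload = "{\"records\":[\"bid-1\",\"bid-2\"]}".getBytes(StandardCharsets.UTF_8);
        try (OutputStream os = conn.getOutputStream()) {
            os.write(payload);
        }

        // Read and print the Lambda's response body.
        try (Scanner s = new Scanner(conn.getInputStream(), StandardCharsets.UTF_8.name())) {
            System.out.println(s.useDelimiter("\\A").next());
        }
    }
}
```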

  • Setup IAM:

    1. Go to AWS account > IAM
    2. Click on Add user and add a new user with Programmatic Access
    3. Add permissions for AWSLambdaFullAccess
    4. Save the Access ID and Secret Access Key for the new user.
  • Python:

    1. Run npm i -g serverless in terminal.
    2. Navigate to team-6/Code/PythonCloudBurstOperator/cloud-burst-operator-python
    3. Run serverless config credentials --provider aws --key <access-id> --secret <secret-access-key>
    4. Run serverless deploy
  • Java:

    1. Install the AWS Toolkit plugin for IntelliJ.
    2. Ensure Docker is installed and is properly running.
    3. Navigate to team-6/Code/CloudBurstOperator
    4. Right-click on the project root in IntelliJ and click on the Serverless Application sync option.

Experimental Setup:

  • Pipelines:

    Apart from the solution pipeline shown in Figure 1, we also ran our experiments on two baseline pipelines for comparison.

    • Baseline 1
    • Baseline 2
  • Measurements:

    • Parallelism:

      1. The input rate is set to 500 events/sec and 100000 events are generated from the source.
      2. The job parallelism is set to 1, 2, 4, and 8, and the time to completion is measured.
      3. Configuration values kept constant:
        • Input Rate = 1000 events/s
        • Duration of Experiment = 5mins
        • Metric Call Latency = 800 ms
        • Max Records Sent = 100000
    • Backpressure duration:

      1. Each run lasts 5 minutes.
      2. Input rates - 100, 250, 500, 750, 1000 events/sec.
      3. Configuration values kept constant:
        • Duration of Experiment = 5mins
        • Parallelism = 1
        • Max Records Sent = unlimited
        • Batch size changed every 8 secs
        • Metric Call Latency = 1000 ms
    • Metric calculation interval:

      1. The input rate is set to 500 events/sec and 100000 events are generated from the source.
      2. Metric calculation delay values tested - 700 ms, 800 ms, 1000 ms, 1100 ms, 1200 ms.
      3. The completion time and the accumulated backpressure are measured.
      4. We found 800 ms to work best in our experiments.
      5. Configuration values kept constant:
        • Input Rate = 1000 events/s
        • Duration of Experiment = 5mins
        • Parallelism = 1
        • Max Records Sent = unlimited
        • Batch size changed every 8 secs
    • Lambda:

      1. The pipeline is run from the source to the invoker operator, removing all the other operators.
      2. Batch size:
        • Java: 1-10; step size=1
        • Python: 1-10; step size=1 and 20-100; step size=10
      3. 5 experiments are run on each batch size
      4. Latency is measured for each batch and throughput is measured for all the data.
    • Latency:

      1. Add the current timestamp when the event enters the pipeline. Before writing to the sink, subtract the timestamp embedded in the event from the current timestamp to get that record's latency (see the sketch after this list).
      2. Performed on all 3 pipelines - vanilla, modified pipeline (without Lambda), and the proposed solution.
      3. Input rates - 100, 250, 500, 750, 1000, 1250, 1500 events/sec - each tested for 5 minutes.
      4. 3 experiments are run on each input rate.
      5. For the proposed solution, configuration values:
        • MEASURE_INTERVAL_MS = 800ms
        • INPUT_THRESHOLD = 500
        • BUSY_TIME_THRESHOLD = 0.06
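
A minimal sketch of this latency measurement, assuming a simple event type with an embedded ingestion timestamp (all names here are illustrative):

```java
import org.apache.flink.api.common.functions.MapFunction;

// Hypothetical sketch of the latency measurement described above: stamp each
// event on entry, then subtract that stamp just before the sink.
public class LatencyMeasurement {

    /** Illustrative event type carrying its ingestion timestamp. */
    public static class Event {
        public long ingestTimeMs;
        public String payload;
    }

    // Applied right after the source: record when the event entered the pipeline.
    public static final MapFunction<Event, Event> STAMP = e -> {
        e.ingestTimeMs = System.currentTimeMillis();
        return e;
    };

    // Applied just before the sink: end-to-end latency for this record, in ms.
    public static final MapFunction<Event, Long> LATENCY =
            e -> System.currentTimeMillis() - e.ingestTimeMs;
}
```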

Experiments Result Plots:

  • Latency3Pipelines: Latency plot for all 3 pipelines. [1]
  • LatencyProposedSolutionPython: Latency plot scaled to just the proposed solution (AWS Lambda written in Python). [1]
  • LatencyProposedSolutionJava: Latency plot scaled to just the proposed solution (AWS Lambda written in Java). [1]
  • BackPressure3Pipelines: Backpressure accumulation at various input rates for all 3 pipelines; the proposed solution self-rectifies backpressure in the subsequent interval cycles. [2]
  • MetricDelay: Metric delay interval vs. latency for the proposed solution. [2]
  • Parallelism: Pipeline parallelism vs. time to pipeline completion for the proposed solution. [2]
  • JavaLatencyBox: Time taken by the AWS Lambda function written in Java to process events of different batch sizes, as a box plot. [3]
  • JavaLatencyLine: Average time taken by the AWS Lambda function written in Java to process events of different batch sizes, as a line plot. [3]
  • PythonLatencyBox: Time taken by the AWS Lambda function written in Python to process events of different batch sizes, as a box plot. [4]
  • PythonLatencyLine: Average time taken by the AWS Lambda function written in Python to process events of different batch sizes, as a line plot. [4]
  • JavaThroughput: Average throughput of the AWS Lambda function written in Java across different batch sizes. [3]
  • PythonlatencyLine: Average throughput of the AWS Lambda function written in Python across different batch sizes. [4]
  • [1] Refer to experiments/latency for sample data collection files. For complete data collected during latency experiments, go here
  • [2] Refer to experiments/metricdelay-backpressure-parallelism for sample data collection files.
  • [3] Refer to experiments/invoker/java for sample data collection files.
  • [4] Refer to experiments/invoker/python for sample data collection files.
  • Refer to assets/ for image files of all the plots for a closer view.

Future work:

  • Handling stateful operations/operators.
  • Security in AWS Lambda
  • Fault tolerance - the project currently relies on Flink's and AWS's out-of-the-box fault tolerance, but the plan is to incorporate systems like Kafka in the pipeline to ensure no records are lost between Flink and AWS Lambda.
