PyctureStream

Description
Demo architecture for scalable, distributed image stream processing using Kafka, TensorFlow and Spark Streaming. Our prototype contains a client that streams webcam images into Kafka, and an analytics component that detects objects in those images using TensorFlow on Spark. The results are sent back to the client and are also aggregated and reported in a dashboard built with Plotly Dash.

Context
Master programme Data Science & Business Analytics
Lecture "BI and Big Data Architectures"
At the University of Media, Stuttgart (DE)

Goal / Task
Come up with a fictional Big Data use case, design an architecture for it, justify the architectural decisions, and implement a proof of concept.

Authors
Marcus and me (dynobo)

Timeline
Feb. 2018 - Mar. 2018

Repo
Code & documentation for the proof of concept; slides of our final presentation; project documentation (in German); screencast of the workflow in action.


Table of Contents

  • Prepare Infrastructure
  • Setup Software
  • Run the Pipeline
  • Testing
  • Links & Resources

Prepare Infrastructure

Prerequisites: a local computer with sufficient resources (8 GB RAM, better 16 GB; 10-15 GB free disk space) and a current version of Oracle VirtualBox.

Install Cloudera Quickstart in VirtualBox

1. Install the Cloudera Quickstart image

2. Change the VM settings in VirtualBox

  • Give the VM as many resources as possible! (My specs: 32 GB RAM, i5 quad-core ~4 GHz, SSD)
  • Make sure "Network" is set to "NAT"
  • Add "Port Forwarding" (we'll use 90 as the prefix for VM ports; see the command-line sketch after this list):
    • SSH: Host 127.0.0.1:9022 to Guest 10.0.2.15:22
    • HUE: Host 127.0.0.1:9033 to Guest 10.0.2.15:8888 (9088 is reserved for Jupyter)
    • JUPYTER: Host 127.0.0.1:9088 to Guest 10.0.2.15:8889
    • KAFKA: Host 127.0.0.1:9092 to Guest 10.0.2.15:9092
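The same forwarding rules can also be created from the command line with VBoxManage; a minimal sketch, assuming the VM is named "Cloudera Quickstart" (adjust the name to your setup, and run this while the VM is powered off):

VBoxManage modifyvm "Cloudera Quickstart" --natpf1 "ssh,tcp,127.0.0.1,9022,10.0.2.15,22"
VBoxManage modifyvm "Cloudera Quickstart" --natpf1 "hue,tcp,127.0.0.1,9033,10.0.2.15,8888"
VBoxManage modifyvm "Cloudera Quickstart" --natpf1 "jupyter,tcp,127.0.0.1,9088,10.0.2.15,8889"
VBoxManage modifyvm "Cloudera Quickstart" --natpf1 "kafka,tcp,127.0.0.1,9092,10.0.2.15,9092"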

Setup Software

In Cloudera VM

  • SSH into the Cloudera VM, e.g. via ssh cloudera@127.0.0.1 -p 9022 (default user + password: cloudera)
  • In the VM, run setup_cloudera_vm.sh to install additional packages and apply various configurations (no interaction needed):
wget https://raw.githubusercontent.com/dynobo/PyctureStream/master/setup_cloudera_vm.sh && chmod +x ./setup_cloudera_vm.sh && ./setup_cloudera_vm.sh
  • When the setup is finished, reboot the VM.

On Local Machine

  • You need the Anaconda distribution with Python 3.6+ for the code that runs locally.
  • The dependencies are listed in environment.yml. You can use this file to recreate the environment with Anaconda:
conda env create -f environment.yml

Run the Pipeline

Run the different components of the pipeline in parallel (e.g. using multiple consoles):

Video Capturing (Client)

Kafka producer that captures images from a connected webcam (make sure you have one attached!) and sends them together with metadata into the pycturestream topic in Kafka. It also stores the latest image in input.jpg for monitoring purposes.

Run client/stream_webcam_to_kafka.py on the host.
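For illustration, a minimal sketch of such a producer, assuming kafka-python and OpenCV (the actual client/stream_webcam_to_kafka.py differs in details such as the metadata format):

import time
import cv2
from kafka import KafkaProducer

# Connect to the broker exposed via the VM's port forwarding
producer = KafkaProducer(bootstrap_servers='127.0.0.1:9092')
camera = cv2.VideoCapture(0)  # first attached webcam

try:
    while True:
        success, frame = camera.read()
        if not success:
            break
        _, jpeg = cv2.imencode('.jpg', frame)
        producer.send('pycturestream', jpeg.tobytes())  # frame as JPEG bytes
        cv2.imwrite('input.jpg', frame)  # keep latest image for monitoring
        time.sleep(1)  # throttle to roughly one frame per second
finally:
    camera.release()
    producer.flush()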

Object Detection (Processing)

Consumes the pycturestream topic, detects objects in the pictures, and pushes the results into the resultstream topic. Leverages TensorFlow and Spark.

Run processing/detect_objects.py in the VM.
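A minimal sketch of the Spark Streaming side, assuming the spark-streaming-kafka integration that ships with the Cloudera Quickstart VM (the real processing/detect_objects.py additionally runs TensorFlow inference and sends the results to resultstream):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName='PyctureStreamSketch')
ssc = StreamingContext(sc, 5)  # 5-second micro-batches

# Read the raw image messages directly from the Kafka topic
stream = KafkaUtils.createDirectStream(
    ssc, ['pycturestream'], {'metadata.broker.list': 'localhost:9092'})

def detect_objects(message):
    # Placeholder: decode the JPEG bytes and run a TensorFlow model here;
    # the result would then be produced into the resultstream topic.
    return len(message)

stream.map(lambda kv: detect_objects(kv[1])).pprint()

ssc.start()
ssc.awaitTermination()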

Output Results (Client)

Kafka consumer for the resultstream topic that displays the results (detected objects, probabilities) as console output and stores the latest image in output.jpg.

Run client/receive_results_from_kafka.py on the host; open client/monitor_imagestream.html in a browser to monitor the input and output images.
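Conceptually, the consumer side boils down to a few lines; a sketch assuming kafka-python (illustrative, not the actual script):

from kafka import KafkaConsumer

consumer = KafkaConsumer('resultstream', bootstrap_servers='127.0.0.1:9092')
for message in consumer:
    # Each message carries the detection results for one frame
    print(message.value)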

Dashboard (Reporting)

Kafka consumer for the resultstream topic that transforms the data and stores the detected objects (without images) in events.json.

Run reporting/store_results_from_kafka.py on the host.
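A minimal sketch of that idea, assuming kafka-python and JSON-encoded messages (the field names are assumptions; see reporting/store_results_from_kafka.py for the real logic):

import json
from kafka import KafkaConsumer

consumer = KafkaConsumer('resultstream', bootstrap_servers='127.0.0.1:9092')
events = []
for message in consumer:
    event = json.loads(message.value)
    event.pop('image', None)  # drop the image payload, keep only the metadata
    events.append(event)
    with open('events.json', 'w') as f:  # persist for the dashboard
        json.dump(events, f)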

The dashboard itself was built with Dash and displays the data from events.json. (The dashboard was implemented by Marcus.)

Open reporting/dashboard.ipynb in Jupyter Notebook on the host, or run it directly in non-interactive mode: jupyter nbconvert --to notebook --execute dashboard.ipynb
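To illustrate the reporting idea, a minimal Dash sketch that plots how often each object was detected; the field name 'object' is an assumption, and the real dashboard in reporting/dashboard.ipynb is more elaborate:

import json
import dash
import dash_core_components as dcc
import dash_html_components as html

with open('events.json') as f:
    events = json.load(f)

# Count detections per object label ('object' is an assumed field name)
counts = {}
for event in events:
    label = event.get('object', 'unknown')
    counts[label] = counts.get(label, 0) + 1

app = dash.Dash(__name__)
app.layout = html.Div([
    html.H1('PyctureStream Dashboard'),
    dcc.Graph(figure={'data': [{'type': 'bar',
                                'x': list(counts.keys()),
                                'y': list(counts.values())}]}),
])

if __name__ == '__main__':
    app.run_server()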

Testing

A set of test procedures, useful for debugging and identifying problems.

Test WebCam

  • Make sure your webcam is connected. If in doubt, test with other software to confirm it works at all.
  • Run the test script, which checks the first 4 video devices for certain parameters, e.g.:
python ./client/test_webcam.py
  • Interpreting results:
VIDEOIO ERROR: V4L: index 1 is not correct!         <--- camera might not be connected
Unable to stop the stream: Device or resource busy  <--- camera in use by other program
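Roughly, the check can look like this, assuming OpenCV (client/test_webcam.py may probe more parameters):

import cv2

for index in range(4):
    capture = cv2.VideoCapture(index)
    # A device counts as working if it opens and delivers a frame
    works = capture.isOpened() and capture.read()[0]
    print('Device %d: %s' % (index, 'OK' if works else 'not available'))
    capture.release()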

Test Kafka

  1. Create a topic:
kafka-topics --create --zookeeper localhost:2181 --topic wordcounttopic --partitions 1 --replication-factor 1
  2. Open a console consumer:
/usr/bin/kafka-console-consumer --zookeeper localhost:2181 --topic wordcounttopic
  3. Open a console producer:
kafka-console-producer --broker-list localhost:9092 --topic wordcounttopic
  4. Produce some events and see if the consumer receives them.

Test Spark Streaming + Kafka

  1. Create a topic (if not already done):
kafka-topics --create --zookeeper localhost:2181 --topic wordcounttopic --partitions 1 --replication-factor 1
  2. Open ./notebooks/kafka_wordcount.ipynb in Jupyter in the VM (as consumer)
  3. Open a console producer:
kafka-console-producer --broker-list localhost:9092 --topic wordcounttopic
  4. Produce some events and see if the consumer receives & processes them correctly.
  5. Check the Job Browser in Hue (http://127.0.0.1:9033/) for the status of the Spark job.

Links & Resources

Some of the resources that helped me with the implementation.
