
achugr/flink-comms-processing



About The Project

Motivation

The main motivation is to try Apache Flink in the context of an ETL task and to understand the system's capabilities in general.

Use cases

Events archiving

Consume events and prepare personal archives for long-term storage and efficient extraction.

  • Archives are organized by time range (weekly archives, for example)
  • Each archive has a main person (i.e. all incoming/outgoing communication of that person for the date range)
  • Events should be searchable, so that we know the participants and dates of the events we are potentially interested in. Having this, we can efficiently extract all communication of a concrete employee for concrete dates from deep archives like AWS Glacier (a sketch of such archive metadata is shown after this list).
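As a rough illustration of the metadata each archive needs for this use case, here is a minimal Java sketch (the class and field names are assumptions for illustration, not the project's actual data model):

    import java.time.LocalDate;
    import java.util.Set;

    // Hypothetical descriptor of one weekly archive: enough metadata to find it in the
    // search index and then pull the matching archive object from deep storage
    // (e.g. AWS Glacier) only when it is actually needed.
    public class ArchiveDescriptor {
        String mainPerson;          // the person whose in/out communication the archive holds
        LocalDate weekStart;        // archives are organized by time range, e.g. weekly
        LocalDate weekEnd;
        Set<String> participants;   // searchable: who took part in the archived events
        String objectStorageKey;    // where the actual archive lives, e.g. an S3 key
    }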

Getting Started

Prerequisites

This sample requires:

  • Java (checked with 1.8)
  • Docker and Docker Compose
  • Gradle

Installation

  1. Clone the repo
  2. Download the Enron email dataset
  3. Open the project in an IDE like IntelliJ IDEA
  4. Run the containers with sudo docker-compose up -d from the docker folder. This starts:
    • Minio (S3-compatible) - for result archives, Flink checkpoints and large payload storage
    • Elasticsearch - for the data index
    • DejaVu - a viewer for Elasticsearch
  5. Run the CommsEtlJob class with the Java option -DSOURCE_FOLDER=./enron to point it at the directory with the extracted Enron dataset (a sketch of how the property can be read is shown after this list)
  6. Check the Flink dashboard at http://localhost:8081/
  7. Check the built index at http://localhost:1359/?appname=data&url=http://localhost:9200&mode=edit
  8. Check the prepared archives at http://localhost:9000/minio/archived/
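For illustration only, a minimal Java sketch of how a Flink job entry point could pick up the -DSOURCE_FOLDER option (the class and wiring here are hypothetical and simplified, not the actual CommsEtlJob):

    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class SourceFolderExample {

        public static void main(String[] args) throws Exception {
            // -DSOURCE_FOLDER=./enron is read as a plain JVM system property
            String sourceFolder = System.getProperty("SOURCE_FOLDER", "./enron");

            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // Read raw email files from the extracted Enron dataset and print a sample
            env.readTextFile(sourceFolder)
               .print();

            env.execute("comms-etl-sample");
        }
    }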

Future work

  • Add tests. The implemented logic is quite simple, but it would be great to have at least an integration test to check overall job correctness.
  • Introduce a simple dataset generator and test with more events in cluster mode
  • Add more event types (with an event hierarchy and so on)
  • Handle attachments
  • Make heavy parts of the payload (like large attachments) lazy - fetch them from object storage only on request
  • Add some analytics pipelines (try to detect anomalies, build a social graph, etc. - the dataset doesn't contain really significant emails, but the reason it is public is the well-known Enron scandal)

Results

The Enron dataset (about 1.7 GB unarchived) can be processed locally in less than 10 minutes, though this alone doesn't say much.

[Screenshots: archives bucket folder structure; indexed contents]

Main outcomes

Flink looks pretty good for this type of task. The fact that events can be of arbitrary size adds complexity, but an approach like reference-based messaging solves this problem. Even complex, task-specific logic can be introduced (calling different storages during processing, creating many data flows with their own sinks, etc.), but one day it might become a source of serious performance and logical issues. On the other hand, the limitations of the paradigm force you to think twice before introducing complex, potentially unnecessary logic, or before bringing into the pipeline things that can be done outside of its scope.
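A minimal sketch of the reference-based messaging idea mentioned above (class names and the storage interface are assumptions, not the project's actual code): the heavy payload is written to object storage such as Minio up front, and only a small reference travels through the Flink pipeline; operators that really need the payload fetch it by key.

    import java.io.Serializable;
    import java.util.UUID;

    // Hypothetical event wrapper: the heavy payload is uploaded to object storage
    // once, and only this small, serializable reference flows between operators.
    public class PayloadRef implements Serializable {
        public final String bucket;
        public final String key;

        public PayloadRef(String bucket, String key) {
            this.bucket = bucket;
            this.key = key;
        }

        // Assumed helper: store the bytes and return a reference to pass downstream.
        public static PayloadRef store(ObjectStorage storage, String bucket, byte[] payload) {
            String key = UUID.randomUUID().toString();
            storage.put(bucket, key, payload);
            return new PayloadRef(bucket, key);
        }
    }

    // Minimal interface standing in for an S3/Minio client in this sketch.
    interface ObjectStorage {
        void put(String bucket, String key, byte[] payload);
        byte[] get(String bucket, String key);
    }

This keeps the records flowing between operators small and serializable, while the heavy bytes stay in S3-compatible storage.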

Contributing

Contributions are welcome: from refactoring and improvements to new jobs for analyzing archived events.

License

Distributed under the MIT License. See LICENSE for more information.
