💾 A collection of Apache Spark and Apache Flink scripts used to get familiar with the processing of big data.

johanneshagspiel/big-data-scripts

Apache Spark Logo Apache Flink Logo



Big Data Scripts

This repository contains a collection of Apache Spark scripts used to get familiar with the basics of batch processing of big data and a collection of Apache Flink scripts used to get familiar with the basics of stream processing of big data.

Features

The Apache Spark scripts cover a range of topics such as:

  • manipulating RDDs via:
    • functional programming principles like pattern matching
    • regex
    • functions like:
      • map
      • flatMap
      • reduceByKey
      • flatten
      • filter
  • manipulating DataFrames via:
    • Spark SQL
    • custom aggregation functions using Window
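
The Spark topics above can be illustrated with a minimal sketch. It is not taken from the scripts in this repository: the log format, the sample data, and all names (`SparkFeaturesSketch`, `wordCountsForLevel`, `runningTotals`) are assumptions made up for the example, and it assumes Spark SQL is on the classpath and runs in local mode.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum

object SparkFeaturesSketch {
  // Hypothetical log format "<LEVEL> <message>", chosen only for this example.
  private val LogLine = """^(ERROR|WARN|INFO)\s+(.*)$""".r

  /** RDD side: regex with pattern matching, then flatMap, filter, map, reduceByKey. */
  def wordCountsForLevel(spark: SparkSession, lines: Seq[String], level: String): Map[String, Int] =
    spark.sparkContext
      .parallelize(lines)
      .flatMap {
        // Lines that match the regex are kept as (level, message); others are dropped.
        case LogLine(lvl, message) => Seq((lvl, message))
        case _                     => Seq.empty[(String, String)]
      }
      .filter { case (lvl, _) => lvl == level }
      .flatMap { case (_, message) => message.split("\\s+").toSeq }
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .collect()
      .toMap

  /** DataFrame side: an aggregate evaluated over a Window (running total per region). */
  def runningTotals(spark: SparkSession): DataFrame = {
    import spark.implicits._
    // Made-up sample data: (region, day, amount).
    val sales = Seq(("north", 1, 10.0), ("north", 2, 20.0), ("south", 1, 5.0))
      .toDF("region", "day", "amount")
    val byRegionOverTime = Window.partitionBy("region").orderBy("day")
    sales.withColumn("running_total", sum("amount").over(byRegionOverTime))
  }

  /** DataFrame side: the same data queried through Spark SQL. */
  def regionTotals(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(("north", 1, 10.0), ("north", 2, 20.0), ("south", 1, 5.0))
      .toDF("region", "day", "amount")
      .createOrReplaceTempView("sales")
    spark.sql("SELECT region, SUM(amount) AS total FROM sales GROUP BY region")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("spark-features-sketch")
      .master("local[*]") // assumption: local mode, no cluster required
      .getOrCreate()
    val lines = Seq("ERROR disk full", "INFO job started", "ERROR disk unreachable", "malformed line")
    println(wordCountsForLevel(spark, lines, "ERROR"))
    runningTotals(spark).show()
    regionTotals(spark).show()
    spark.stop()
  }
}
```

The helpers take the `SparkSession` as a parameter so that each can be called on its own from a test or a notebook. `flatten` is left out because it is a method of Scala collections rather than of `RDD`.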

The Apache Flink scripts cover a range of topics such as:

  • basic manipulation of DataStreams via functions like:
    • map
    • filter
    • flatMap
  • working with stateful streams via keyBy
  • dealing with infinite streams via:
    • different kinds of window assigners like TumblingEventTimeWindows or SlidingEventTimeWindows
    • keyed and non-keyed windows
    • new ProcessWindowFunction
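
As a rough illustration of the Flink topics above, the following sketch filters a small stream, keys it, assigns tumbling event-time windows, and computes one average per key and window with a `ProcessWindowFunction`. It is not taken from the scripts in this repository: the `Reading` and `WindowAverage` types, the sample data, and the 10-second window size are assumptions for the example. It also assumes the Flink Scala DataStream API (available in Flink 1.x) with event time as the default, as in Flink 1.12 and later.

```scala
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.scala.function.ProcessWindowFunction
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

// Hypothetical event and result types, invented for this example.
case class Reading(sensor: String, timestampMs: Long, value: Double)
case class WindowAverage(sensor: String, windowEndMs: Long, average: Double)

// Emits one average per key and window; it sees all elements of the window at once.
class AveragePerWindow
    extends ProcessWindowFunction[Reading, WindowAverage, String, TimeWindow] {
  override def process(
      key: String,
      context: Context,
      elements: Iterable[Reading],
      out: Collector[WindowAverage]): Unit =
    out.collect(WindowAverage(key, context.window.getEnd, FlinkFeaturesSketch.average(elements)))
}

object FlinkFeaturesSketch {
  // Pure helper so the window logic can be checked without starting a Flink job.
  def average(readings: Iterable[Reading]): Double =
    readings.map(_.value).sum / readings.size

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    env
      .fromElements(
        Reading("a", 1000L, 1.0),
        Reading("b", 2000L, 10.0),
        Reading("a", 4000L, 3.0),
        Reading("a", 12000L, 5.0),
        Reading("a", 13000L, -1.0))
      .filter(_.value >= 0.0)                    // basic DataStream manipulation
      .assignAscendingTimestamps(_.timestampMs)  // event time taken from the record itself
      .keyBy(_.sensor)                           // keyed stream, so windows are per sensor
      .window(TumblingEventTimeWindows.of(Time.seconds(10)))
      .process(new AveragePerWindow)
      .print()

    env.execute("flink-features-sketch")
  }
}
```

Swapping the assigner for `SlidingEventTimeWindows.of(Time.seconds(10), Time.seconds(5))` would produce overlapping windows instead, and a non-keyed variant would use `windowAll` on the unkeyed stream together with a `ProcessAllWindowFunction`.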

Tools

Purpose | Name
Programming language | Scala
Cluster computing framework | Apache Spark, Apache Flink

Installation Process

It is assumed that a Java JDK and an IDE such as IntelliJ are installed and that the user's operating system is Windows.

  • Install the Scala support plugin for your IDE.
  • Import the subfolder of this repository that you want to work with as a Maven project and resolve all dependencies.

Licence

These Big Data scripts are published under the MIT licence, which can be found in the LICENSE file. For this repository, the terms laid out there shall not apply to any individual who is currently enrolled as a student at a higher education institution. Those individuals shall not interact with any part of this repository other than this README in any way, for example by cloning it or looking at its source code, or have someone else interact with this repository in any way.

References

The Apache Spark logo was taken from Wikipedia and the Apache Flink logo from .
