kotlin-beam-starter

If you clone this repository to start your own project, choose whichever license you prefer and delete everything related to the license you are dropping.

Built With

Kotlin ApacheBeam ApacheSpark ApacheFlink ApacheKafka Gradle

Getting Started

Requirements

Build

./gradlew clean build

Source file structure

This project aims to support multiple modules/pipelines. To achieve that, every pipeline should be a module structured like example-spark-runner. Each pipeline module has a few files:

  • The main sources /src/main/kotlin
    • App.kt defines the main() entrypoint (sketched below)
    • /schema contains all schema definitions
    • /pipeline contains the pipeline definition
    • /ptranform contains all the pipeline transformations (PTransforms)
  • The test sources src/test/kotlin define the unit tests
  • The integration test sources src/integrationTest/kotlin define the integration tests
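
For orientation, here is a minimal sketch of what App.kt's main() typically looks like in a Beam project (a sketch only, not the exact contents of this repository):

```kotlin
import org.apache.beam.sdk.Pipeline
import org.apache.beam.sdk.options.PipelineOptionsFactory

fun main(args: Array<String>) {
    // Parse --runner and other pipeline options from the command line
    val options = PipelineOptionsFactory.fromArgs(*args).withValidation().create()
    val pipeline = Pipeline.create(options)

    // Transforms from /pipeline and /ptranform would be applied here before running
    pipeline.run().waitUntilFinish()
}
```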

FAQ

What examples does this starter include?

✅ = Done ⚠️ = TODO

  • Apache Spark Runner ✅
  • Read/Write .CSV files ✅ (see the TextIO sketch after this list)
  • AWS S3 integration ✅
  • Kafka integration ⚠️
  • Oracle JDBC Integration ✅
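
To give a taste of the CSV example, a minimal read/write round trip with Beam's TextIO looks roughly like this (the file paths are placeholders; with the AWS IO module on the classpath, s3:// paths work the same way):

```kotlin
import org.apache.beam.sdk.Pipeline
import org.apache.beam.sdk.io.TextIO
import org.apache.beam.sdk.options.PipelineOptionsFactory

fun main(args: Array<String>) {
    val options = PipelineOptionsFactory.fromArgs(*args).withValidation().create()
    val pipeline = Pipeline.create(options)

    pipeline
        .apply("ReadCsv", TextIO.read().from("input.csv"))  // one String element per line
        .apply("WriteCsv", TextIO.write().to("output").withSuffix(".csv").withNumShards(1))

    pipeline.run().waitUntilFinish()
}
```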

How can I use another runner?

To keep this template small, it only includes the Direct Runner for runtime tests and the Spark Runner as an integration test example. For a comparison of what each runner currently supports, see the Beam Capability Matrix. To add a new runner, visit that runner's page for instructions on how to include it.
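
For example, switching to the Flink runner usually amounts to one extra dependency plus a --runner flag. A minimal sketch, assuming the Gradle Kotlin DSL; the artifact and version numbers below are placeholders, so check the Flink runner page for the coordinates matching your Beam and Flink versions:

```kotlin
// build.gradle.kts of the pipeline module (hypothetical versions)
dependencies {
    runtimeOnly("org.apache.beam:beam-runners-flink-1.16:2.48.0")
}
```

The pipeline code itself stays unchanged; the runner is selected at launch time via --runner=FlinkRunner.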

What does X mean? (Glossary)

Pipeline: A Pipeline encapsulates your entire data processing task, from start to finish. This includes reading input data, transforming that data, and writing output data. All Beam driver programs must create a Pipeline. When you create the Pipeline, you must also specify the execution options that tell the Pipeline where and how to run.
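
A minimal sketch of creating a Pipeline with explicit execution options (the DirectRunner choice is just an example):

```kotlin
import org.apache.beam.sdk.Pipeline
import org.apache.beam.sdk.options.PipelineOptionsFactory

fun main() {
    // The options tell the pipeline where and how to run
    val options = PipelineOptionsFactory.fromArgs("--runner=DirectRunner").create()
    val pipeline = Pipeline.create(options)
    pipeline.run().waitUntilFinish()
}
```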

PCollection: A PCollection represents a distributed data set that your Beam pipeline operates on. The data set can be bounded, meaning it comes from a fixed source like a file, or unbounded, meaning it comes from a continuously updating source via a subscription or other mechanism. Your pipeline typically creates an initial PCollection by reading data from an external data source, but you can also create a PCollection from in-memory data within your driver program. From there, PCollections are the inputs and outputs for each step in your pipeline.
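
For instance, a bounded PCollection can be created from in-memory data in the driver program (a sketch; the element values are arbitrary):

```kotlin
import org.apache.beam.sdk.Pipeline
import org.apache.beam.sdk.options.PipelineOptionsFactory
import org.apache.beam.sdk.transforms.Create
import org.apache.beam.sdk.values.PCollection

fun main() {
    val pipeline = Pipeline.create(PipelineOptionsFactory.create())
    // A bounded PCollection built from in-memory data
    val words: PCollection<String> = pipeline.apply(Create.of("kotlin", "beam", "starter"))
    pipeline.run().waitUntilFinish()
}
```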

PTransform: A PTransform represents a data processing operation, or a step, in your pipeline. Every PTransform takes one or more PCollection objects as input, performs a processing function that you provide on the elements of that PCollection, and produces zero or more output PCollection objects.
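
A tiny PTransform in action, using Beam's built-in MapElements to turn a PCollection<String> into a PCollection<Int> (illustrative only):

```kotlin
import org.apache.beam.sdk.Pipeline
import org.apache.beam.sdk.options.PipelineOptionsFactory
import org.apache.beam.sdk.transforms.Create
import org.apache.beam.sdk.transforms.MapElements
import org.apache.beam.sdk.transforms.ProcessFunction
import org.apache.beam.sdk.values.TypeDescriptors

fun main() {
    val pipeline = Pipeline.create(PipelineOptionsFactory.create())
    pipeline
        .apply(Create.of("kotlin", "beam"))
        // A PTransform: one input PCollection<String>, one output PCollection<Integer>
        .apply(MapElements.into(TypeDescriptors.integers())
            .via(ProcessFunction { word: String -> word.length }))
    pipeline.run().waitUntilFinish()
}
```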

Contributing

Thank you for your interest in contributing! All contributions are welcome! 🎉🎊

License

This software is distributed under the terms of both the MIT license and the Apache License (Version 2.0).

See LICENSE for details.