Apache Gobblin

Apache Gobblin is a highly scalable data management solution for structured and byte-oriented data in heterogeneous data ecosystems.

Capabilities

Ingestion and export of data from a variety of sources and sinks into and out of the data lake. Gobblin is designed and optimized for ELT patterns, with lightweight inline transformations on ingest (the "small t").
Data Organization within the lake (e.g. compaction, partitioning, deduplication)
Lifecycle Management of data within the lake (e.g. data retention)
Compliance Management of data across the ecosystem (e.g. fine-grained data deletions)

Highlights

Battle tested at scale: Runs in production at petabyte scale at companies such as LinkedIn, PayPal, and Verizon.
Feature rich: Supports task partitioning, state management for incremental processing, atomic data publishing, data quality checking, job scheduling, fault tolerance, and more.
Supports stream and batch execution modes
Control Plane (Gobblin-as-a-service) supports programmatic triggering and orchestration of data plane operations.

Common Patterns used in production

Stream / batch ingestion from Kafka into the data lake (HDFS, S3, ADLS)
Bulk-loading serving stores from the Data Lake (e.g. HDFS -> Couchbase)
Data sync across a federated data lake (HDFS <-> HDFS, HDFS <-> S3, S3 <-> ADLS)
Integrating external vendor APIs (e.g. Salesforce, Dynamics) with data stores (e.g. HDFS, Couchbase)
Enforcing Data retention policies and GDPR deletion on HDFS / ADLS

Apache Gobblin is NOT

A general-purpose data transformation engine like Spark or Flink. Gobblin can delegate complex data-processing tasks to Spark, Hive, etc.
A data storage system like Apache Kafka or HDFS. Gobblin integrates with these systems as sources or sinks.
A general-purpose workflow execution system like Airflow, Azkaban, Dagster, Luigi.

Requirements

Java >= 1.8

If building the distribution with tests turned on:

Maven version 3.5.3
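The Java requirement above can be checked before starting a build. The following is a minimal sketch, not part of the Gobblin tooling: the function names `java_major` and `check_java` are made up for illustration, and the parsing assumes the usual `java -version` output format (a quoted version string on the first line).

```shell
#!/bin/sh
# Sketch: verify that the installed Java is at least 1.8 (major version 8).
# java_major and check_java are hypothetical helper names.

java_major() {
  # Map a Java version string to its major version.
  # Legacy scheme: "1.8.0_292" -> 8. Modern scheme: "11.0.2" -> 11.
  case "$1" in
    1.*) v="${1#1.}"; echo "${v%%.*}" ;;
    *)   echo "${1%%[.+-]*}" ;;
  esac
}

check_java() {
  # `java -version` prints to stderr, e.g.: openjdk version "1.8.0_292"
  raw="$(java -version 2>&1 | sed -n 's/.* version "\([^"]*\)".*/\1/p' | head -n 1)"
  [ -n "$raw" ] && [ "$(java_major "$raw")" -ge 8 ]
}
```

After sourcing the snippet, `check_java || echo "Java >= 1.8 is required"` reports an unsuitable or missing JDK. The Maven version can be inspected separately with `mvn -version`.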

Instructions to download gradle wrapper

If you are going to build Gobblin from the source distribution, run one of the following commands to download gradle-wrapper.jar from the Gobblin git repository into the gradle/wrapper directory (replace GOBBLIN_VERSION in the URL with the version you downloaded).

wget --no-check-certificate -P gradle/wrapper https://github.com/apache/gobblin/raw/${GOBBLIN_VERSION}/gradle/wrapper/gradle-wrapper.jar

(or)

curl --insecure -L https://github.com/apache/gobblin/raw/${GOBBLIN_VERSION}/gradle/wrapper/gradle-wrapper.jar > gradle/wrapper/gradle-wrapper.jar

Alternatively, you can download it manually from: https://github.com/apache/gobblin/blob/${GOBBLIN_VERSION}/gradle/wrapper/gradle-wrapper.jar

Make sure that you download it to the gradle/wrapper directory.
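The download step can also be wrapped in a small script that creates the target directory and checks the result. This is only a sketch under stated assumptions: `wrapper_url` and `fetch_wrapper` are hypothetical helper names, and the version you pass must be a tag or branch name that exists in the repository. Unlike the commands above, it leaves TLS verification enabled; add `--insecure` back if your environment requires it.

```shell
#!/bin/sh
# Sketch: fetch gradle-wrapper.jar for a given Gobblin version.
# wrapper_url and fetch_wrapper are hypothetical helper names.

wrapper_url() {
  # Build the raw GitHub URL of the wrapper jar for the given version.
  printf 'https://github.com/apache/gobblin/raw/%s/gradle/wrapper/gradle-wrapper.jar\n' "$1"
}

fetch_wrapper() {
  version="$1"
  dest_dir="${2:-gradle/wrapper}"
  mkdir -p "$dest_dir" || return 1
  # -f: fail on HTTP errors instead of saving an error page; -L: follow redirects
  curl -fL -o "$dest_dir/gradle-wrapper.jar" "$(wrapper_url "$version")" || return 1
  # A jar is a zip archive, so a valid download starts with the bytes "PK".
  [ "$(head -c 2 "$dest_dir/gradle-wrapper.jar")" = "PK" ]
}
```

After sourcing the snippet from the root of the extracted source tree, `fetch_wrapper "$GOBBLIN_VERSION"` downloads the jar into gradle/wrapper and returns non-zero if the download fails or the file does not look like a jar.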

Instructions to run Apache RAT (Release Audit Tool)

Extract the archive file to your local directory.
Run ./gradlew rat. The report will be generated at build/rat/rat-report.html.

Instructions to build the distribution

Extract the archive file to your local directory.
Skip tests and build the distribution. Run:

./gradlew build -x findbugsMain -x test -x rat -x checkstyleMain

The distribution will be created in the build/gobblin-distribution/distributions directory.

(or)

Run tests and build the distribution (requires Maven). Run:

./gradlew build

The distribution will be created in the build/gobblin-distribution/distributions directory.
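The two build variants above differ only in the Gradle arguments, so they can be selected with a small wrapper. The sketch below is illustrative rather than part of the Gobblin build: `build_args` and `build_gobblin` are hypothetical names, and the mode names "fast" and "full" are made up here.

```shell
#!/bin/sh
# Sketch: choose between the two documented build variants.
# build_args and build_gobblin are hypothetical helper names.

build_args() {
  # "fast" skips tests and static checks; any other mode runs the full build.
  if [ "$1" = "fast" ]; then
    echo "build -x findbugsMain -x test -x rat -x checkstyleMain"
  else
    echo "build"
  fi
}

build_gobblin() {
  # Intentionally unquoted so the argument string splits into separate flags.
  ./gradlew $(build_args "$1") || return 1
  # Succeed only if the distributions directory contains at least one file.
  [ -n "$(ls -A build/gobblin-distribution/distributions 2>/dev/null)" ]
}
```

After sourcing the snippet from the root of the extracted archive, `build_gobblin fast` runs the quick build and `build_gobblin full` runs the build with tests. Both return non-zero if Gradle fails or no distribution was produced.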
