Data Integration Library

LinkedIn Data Integration Library (DIL) is a collection of generic data integration components that can be mixed and matched to form ready-to-use connectors. Data integration frameworks such as Apache Gobblin, or event processing frameworks such as Apache Kafka, can then use these connectors to ingress or egress data between cloud services or APIs.

Highlights

Generic components: data transmission protocol components and data format components are designed generically, with neither depending on the other, which greatly eases the challenge of handling the variety of cloud APIs and services.
Multistage architecture: data integration is never a one-step process, so the library inherently supports multistage integration processes, allowing complex data integration scenarios to be handled with simple generic components.
Bidirectional transmission: ingress and egress are just business logic in DIL. Both work the same way and use the same set of configurations, because ingress to one end is egress to the other.
Extensible compression and encryption: users can easily plug in additional data compression and encryption algorithms.
Flexible pagination: DIL supports a wide range of pagination methods to break large payloads into smaller chunks.

Common Patterns used in production

Asynchronous bulk ingestion from REST APIs, such as Salesforce.com, to a data lake (HDFS, S3, ADLS)
Data upload to REST APIs, such as Google API, with response tracking
Ingestion of data from one REST API and egress to another REST API in the cloud

Requirements

JDK 1.8

If building the distribution with tests turned on:

Maven version 3.5.3

Instructions to build the distribution

Extract the archive file to your local directory.
Set JAVA_HOME to use JDK 1.8 (JDK 11+ is not supported).
Build

./gradlew build
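
Because the build requires JDK 1.8 and does not support JDK 11+, it can help to confirm which JDK JAVA_HOME points to before building. The following is a minimal sketch for a POSIX shell, not part of the repository's tooling: the function names java_major and check_jdk8 and the error messages are hypothetical, and the parsing assumes the usual `java -version` output, where the version appears in double quotes.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with DIL): extract the major version
# from a Java version string, e.g. "1.8.0_292" -> 8, "11.0.2" -> 11.
java_major() {
  case "$1" in
    1.*) echo "$1" | cut -d. -f2 ;;                # legacy scheme: 1.8.x is Java 8
    *)   echo "$1" | cut -d. -f1 | cut -d- -f1 ;;  # modern scheme: 11.0.2, 17-ea
  esac
}

# Hypothetical helper: fail unless JAVA_HOME points to a JDK 1.8 install.
check_jdk8() {
  if [ -z "${JAVA_HOME:-}" ]; then
    echo "JAVA_HOME is not set" >&2
    return 1
  fi
  # `java -version` writes to stderr; the version is the first quoted field.
  version=$("$JAVA_HOME/bin/java" -version 2>&1 \
    | awk -F'"' '/version/ {print $2; exit}')
  if [ "$(java_major "$version")" != "8" ]; then
    echo "Expected JDK 1.8 but found: $version" >&2
    return 1
  fi
}
```

With these functions loaded in the current shell, the build step could be guarded as `check_jdk8 && ./gradlew build`.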

Instructions to contribute

To contribute, please submit a pull request (PR) for committers to merge.

Create your own fork on GitHub off the main repository
Clone your fork to your local computer
- git clone https://github.com/<<your-github-login>>/data-integration-library.git
Add upstream and verify
- git remote add upstream https://github.com/linkedin/data-integration-library.git
- git remote -v
Change, test, commit, and push to your fork
- git status
- git add .
- git commit -m "comments"
- git push origin master
Create a pull request on GitHub with the following details
- Title
- Detailed description
- Document the tests done
- Links to the updated documents
Publish to local Maven repository
- ./gradlew publishToMavenLocal
Refresh your fork
- If upstream has no conflicts with your fork, go to your forked repository and use the "Fetch upstream" function to sync your fork.
- If upstream has conflicts with your fork, GitHub will ask you to create a pull request to merge.
  - If the conflicts are too significant, it is better to copy everything from upstream (the main repository) to your fork, using the following procedure:
    - Follow steps 2 and 3 above (clone your fork, then add upstream and verify)
    - git fetch upstream
    - git reset --hard upstream/master
    - git push origin +master
    - Check that your fork is now in sync with the main repository
