# Data Weaver

Data Weaver is a data processing and ETL (Extract, Transform, Load) tool built on Apache Spark. It allows you to define data pipelines using YAML configuration files and execute them using Spark for data transformation and integration.

## Table of Contents

- [Getting Started](#getting-started)
  - [Prerequisites](#prerequisites)
  - [Installation](#installation)
- [Usage](#usage)
- [Configuration](#configuration)

## Getting Started

### Prerequisites

Before using Data Weaver, make sure you have the following prerequisites installed:

- Apache Spark (Data Weaver executes its pipelines on Spark)
- A JDK compatible with your Spark version

### Installation

1. Clone the Data Weaver repository to your local machine:

   ```shell
   git clone https://github.com/netsirius/data-weaver.git
   ```

## Usage

### Defining Data Pipelines

Data pipelines are defined using YAML configuration files. You can create your pipeline configurations and place them in a directory of your choice. Each configuration should define data sources, transformations, and sinks.

Here's an example of a simple pipeline configuration:

```yaml
name: ExamplePipeline
tag: example
dataSources:
  - id: testSource
    type: MySQL
    query: >
      SELECT name
      FROM test_table
    config:
      readMode: ReadOnce # ReadOnce, Incremental, ...
      connection: testConnection # Name of a connection defined in application.conf
transformations:
  - id: transform1
    type: SQLTransformation
    sources:
      - testSource # Id of a data source defined in this pipeline
    query: >
      SELECT name AS id
      FROM testSource
      WHERE column1 = 'value'
  - id: transform2
    type: ScalaTransformation
    sources:
      - transform1 # Id of a data source or transformation defined in this pipeline
    action: dropDuplicates
sinks:
  - id: sink1
    type: BigQuery
    config:
      saveMode: Append # Append, Overwrite, Merge, ...
      profile: testProfile # Name of a profile defined in application.conf
```

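The `connection` and `profile` fields above refer to entries in `application.conf`. The exact schema of that file is project-specific and not documented here; as a purely illustrative sketch (the key names `connections`, `testConnection`, `profiles`, and `testProfile` below are assumptions, not documented settings), it might look like:

```hocon
# Hypothetical application.conf sketch -- key names are illustrative only.
connections {
  testConnection {
    # Assumed JDBC-style settings for the MySQL source above
    url = "jdbc:mysql://localhost:3306/test_db"
    user = "weaver"
    password = ${?DB_PASSWORD} # HOCON substitution from an environment variable
  }
}

profiles {
  testProfile {
    # Assumed settings for the BigQuery sink above
    project = "my-gcp-project"
    dataset = "my_dataset"
  }
}
```

Keeping credentials in environment-variable substitutions rather than hard-coded values is the usual HOCON practice for files like this.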
### Running Data Pipelines

To run data pipelines, you can use the Data Weaver command-line interface (CLI). Here's how to execute a pipeline:

```shell
weaver run --pipelines /path/to/pipelines/folder --tag 1d
```

## Configuration

You can configure Data Weaver by editing the `flow.conf` file located in the `config` directory. This configuration file contains various settings for Data Weaver, including the Spark configuration.
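The schema of `flow.conf` is likewise not documented here. As a hedged sketch of how Spark settings might be expressed in it (the `spark` block structure and key names below are assumptions; only `spark.sql.shuffle.partitions` is a real Spark property):

```hocon
# Hypothetical flow.conf sketch -- keys are illustrative, not documented settings.
spark {
  master = "local[*]"   # Run Spark locally using all available cores
  appName = "data-weaver"
  config {
    "spark.sql.shuffle.partitions" = 8 # Fewer shuffle partitions for small local runs
  }
}
```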
