KennethanCeyer/awesome-data-pipeline

Awesome Data Pipeline

Awesome list for Data Pipeline

A Data Pipeline is:
A series of steps that moves data from a source to a destination efficiently and automatically.
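
As a rough illustration of that definition, the sketch below chains three steps (extract, transform, load) using only the Python standard library. The sample data, the function names, and the choice of CSV as source and JSON as destination are all assumptions made for the example; a real pipeline would typically use the tools listed in the sections below.

```python
import csv
import io
import json

# Hypothetical source data; in practice this might be a file, an API, or a queue.
SOURCE_CSV = "user,amount\nalice,10\nbob,5\nalice,7\n"


def extract(raw):
    """Read rows from the source (here, an in-memory CSV string)."""
    return list(csv.DictReader(io.StringIO(raw)))


def transform(rows):
    """Sum the amounts per user."""
    totals = {}
    for row in rows:
        totals[row["user"]] = totals.get(row["user"], 0) + int(row["amount"])
    return totals


def load(totals):
    """Write the result to the destination (here, a JSON string)."""
    return json.dumps(totals, sort_keys=True)


def run_pipeline(raw):
    """Run the steps in order, with no manual hand-off between them."""
    return load(transform(extract(raw)))


print(run_pipeline(SOURCE_CSV))
```

Each step only depends on the output of the previous one, which is what lets a workflow manager schedule and retry the steps independently.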

Contents

Components

Workflow Management

Data Ingestion

  • Apache Flume - (Apache foundation / Data Ingestion / Open Source / Free).
  • Stitch - (Talend / ETL / Subscription fee).
  • Logstash - (Elastic / Data Ingestion / Cloud or On-prem / Hybrid fee).
  • Filebeat - (Elastic / Data Ingestion / Cloud or On-prem / Hybrid fee).
  • Fluentd - (CNCF foundation / Open Source / Free or License fee).
  • Datadog - (Datadog / Cloud / APM / Subscription fee).
  • New Relic - (New Relic / Cloud / APM / Subscription fee).

Data Lake

Data Warehouse

  • Apache Hive - (Apache foundation / Hadoop-friendly / MapReduce / Free).
  • Snowflake - (Multi-cloud / SQL-friendly / Subscription fee).
  • AWS Redshift - (AWS Cloud / SQL-friendly / Subscription fee).
  • Azure Synapse Analytics - (Azure Cloud / SQL-friendly / Subscription fee).
  • GCP BigQuery - (Google Cloud / SQL-friendly / On-demand fee).
  • IBM DB2 - (IBM / On-prem / SQL-friendly / Subscription fee).

Data Store

  • Apache Druid - (Apache foundation / Real-time datastore / Free).
  • Apache Pinot - (Apache foundation / Real-time datastore / Free).
  • AWS Aurora - (AWS Cloud / Rich-cloud datastore / Subscription fee).
  • GCP Cloud Spanner - (Google Cloud / HA datastore that breaks away from CAP / Subscription fee).
  • Azure Cosmos DB - (Azure Cloud / NoSQL datastore / Subscription fee).

Query Engine

  • Presto - (Facebook / Open Source / SQL-friendly / Free or License fee).
  • Apache Impala - (Apache foundation / Cloudera / Open Source / SQL-friendly / Free or License fee).
  • AWS Athena - (AWS Cloud / SQL-friendly / On-demand fee).
  • AWS Redshift Spectrum - (AWS Cloud / SQL-friendly / On-demand fee).

Streaming

  • Apache Kafka - (Apache foundation / Confluent / LinkedIn / Message Broker / Open Source / Free or License fee).
  • RabbitMQ - (VMware / Messaging Queue / Free or License fee).
  • AWS Kinesis - (AWS Cloud / Message Broker / Subscription fee).
  • AWS SQS - (AWS Cloud / Messaging Queue / Subscription fee).
  • GCP PubSub - (Google Cloud / Message Broker / Subscription fee).
  • Azure Event Hub - (Azure Cloud / Message Broker / Subscription fee).

Data Transformation

  • Apache Spark - (Apache foundation / Databricks / In-memory processing / Open Source / Free or License fee).
  • Apache Beam - (Apache foundation / Google / Data processing / Open Source / Free or License fee).
  • Apache Storm - (Apache foundation / Backtype / Twitter / Stream processing / Open Source / Free).
  • Apache Flink - (Apache foundation / Stream processing / Open Source / Free).
  • AWS Glue - (AWS Cloud / Integrated Data System / ETL / On-demand fee).

Data Analysis

  • Apache Superset - (Apache foundation / Airbnb / Business Intelligence (BI) / Open Source / Free).
  • Airpal - (Airbnb / Query Editor / Open Source / Free).
  • Hue - (Cloudera / Query Editor / Open Source / Free).
  • Kibana - (Elastic / Dashboard / Hybrid fee).
  • Databricks Notebook - (Databricks / Notebook / Hybrid fee).
  • Jupyter Notebook - (Jupyter / Notebook / Open Source / Free).
  • Pandas - (NumFOCUS / Data processing / Open Source / Free).
  • Plotly - (Plotly / Data visualization / Hybrid fee).

Data Format

  • Apache Parquet - (Apache foundation / Data Format / Open Source / Free).
  • Apache ORC - (Apache foundation / Hortonworks / Facebook / Data Format / Open Source / Free).
  • Apache Avro - (Apache foundation / Data Format / Open Source / Free).
  • Apache Kudu - (Apache foundation / Cloudera / Data Format / Open Source / Free).
  • Apache Arrow - (Apache foundation / Data Format / Open Source / Free).
  • Delta - (Databricks / Data Format / Free or License fee).
  • JSON - (Data Format / Free).
  • CSV - (Data Format / Free).
  • TSV - (Data Format / Free).
  • HDF5 - (The HDF Group / Data Format / Open Source (licensed by HDF5) / Free).
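
To make the difference between the plain-text formats above concrete, here is a small sketch that serializes the same records as CSV, TSV, and JSON with the Python standard library. The sample records and the `to_delimited` helper are hypothetical; binary and columnar formats such as Parquet, ORC, or Avro need third-party libraries and are not shown.

```python
import csv
import io
import json

# Hypothetical sample records used only for this illustration.
records = [{"id": 1, "name": "alice"}, {"id": 2, "name": "bob"}]


def to_delimited(rows, delimiter):
    """Serialize rows as delimited text; ',' gives CSV and a tab gives TSV."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=["id", "name"], delimiter=delimiter, lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


csv_text = to_delimited(records, ",")
tsv_text = to_delimited(records, "\t")
json_text = json.dumps(records)

print(csv_text)
print(tsv_text)
print(json_text)
```

CSV and TSV carry no type information, so `id` comes back as a string when read; JSON keeps the number but repeats the field names in every record.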

Business Intelligence

  • Apache Zeppelin - (Apache foundation / Business Intelligence (BI) / Open Source / Free or License fee).
  • Tableau - (Salesforce / Business Intelligence (BI) / Hybrid fee).
  • Redash - (Redash Inc / Databricks / Business Intelligence (BI) / Hybrid fee).
  • Looker - (Looker Data Sciences Inc / Business Intelligence (BI) / Subscription fee).
  • Data Studio - (Google Cloud / Business Intelligence (BI) / Free).
  • PowerBI - (Microsoft / Business Intelligence (BI) / Subscription fee).

AI/ML

  • H2O - (H2O.ai / Model Evaluation / Subscription fee).
  • Feast - (Tecton / Gojek / Feature Store / Open Source / Free).
  • Vertex AI - (Google Cloud / Hybrid Features for AI / Subscription fee).
  • Data Robot - (DataRobot Inc / Feature Engineering / Subscription fee).
  • WandB - (Weights & Biases / Model Evaluation / Subscription fee).

Community

Vendors

Open Source / Foundation

Materials

Books

Dummies Guide