Awesome Open Source Data Engineering

A curated list of open source tools used in analytical stacks and the data engineering ecosystem. For more information about the above compiled landscape for 2024, please read the published blog post on Substack or Medium.

STORAGE SYSTEMS

Relational DBMS

PostgreSQL - Advanced object-relational database management system
MySQL - One of the most popular open source databases
MariaDB - A popular MySQL server fork
Supabase - An open source Firebase alternative
SQLite - The most popular embedded database engine

Distributed SQL DBMS

Citus - A popular extension that turns PostgreSQL into a distributed database
CockroachDB - A cloud-native distributed SQL database
YugabyteDB - A cloud-native distributed SQL database
TiDB - A cloud-native, distributed, MySQL-compatible database
OceanBase - A scalable distributed relational database
ShardingSphere - A distributed SQL transaction & query engine
Neon - A serverless open-source alternative to AWS Aurora Postgres

Cache Store

Redis - A popular key-value based cache store
Memcached - A high performance multithreaded key-value cache store
Dragonfly - A modern cache store compatible with Redis and Memcached APIs

In-memory SQL Database

Apache Ignite - A distributed, ACID-compliant in-memory DBMS
ReadySet - A MySQL and Postgres wire-compatible caching layer
VoltDB - A distributed, horizontally-scalable, ACID-compliant database

Document Store

MongoDB - A cross-platform, document-oriented NoSQL database
RavenDB - An ACID NoSQL document database
RethinkDB - A distributed document-oriented database for real-time applications
CouchDB - A scalable document-oriented NoSQL database
Couchbase - A modern cloud-native NoSQL distributed database
FerretDB - A truly open source MongoDB alternative

NoSQL Multi-model

OrientDB - A multi-model DBMS supporting Graph, Document, Reactive, Full-Text and Geospatial models
ArangoDB - A multi-model database with flexible data models for documents, graphs, and key-values
SurrealDB - A scalable, distributed, collaborative, document-graph database
EdgeDB - A graph-relational database with declarative schema

Graph Database

Neo4j - A high performance leading graph database
JanusGraph - A highly scalable distributed graph database
HugeGraph - A fast and highly scalable graph database
NebulaGraph - A distributed, horizontally scalable, fast open-source graph database
Cayley - Inspired by the graph database behind Google's Knowledge Graph
Dgraph - A horizontally scalable and distributed GraphQL database with a graph backend

Distributed Key-value Store

Riak - A decentralized key-value datastore from Basho Technologies
FoundationDB - A distributed, transactional key-value store from Apple
etcd - A distributed reliable key-value store written in Go
TiKV - A distributed transactional key-value database, originally created to complement TiDB
Immudb - A database with built-in cryptographic proof and verification

Wide-column Key-value Store

Apache Cassandra - A highly-scalable LSM-Tree based partitioned row store
Apache HBase - A distributed wide column-oriented store modeled after Google's Bigtable
Scylla - An LSM-Tree based wide-column store, API-compatible with Apache Cassandra and Amazon DynamoDB
Apache Accumulo - A distributed key-value store with scalable data storage and retrieval, on top of Hadoop

Embedded Key-value Store

LevelDB - A fast key-value storage library written at Google
RocksDB - An embeddable, persistent key-value store developed by Meta (Facebook)
MyRocks - A RocksDB storage engine for MySQL
BadgerDB - An embeddable, fast key-value database written in pure Go

Search Engine

Apache Solr - A fast distributed search database built on Apache Lucene
Elasticsearch - A distributed, RESTful search engine optimized for speed
Sphinx - A full-text search engine with fast indexing
Meilisearch - A fast search API with great integration support
OpenSearch - A community-driven, open source fork of Elasticsearch and Kibana
Quickwit - A fast cloud-native search engine for observability data

Streaming Database

RisingWave - A scalable Postgres for stream processing, analytics, and management
Materialize - A real-time data warehouse purpose-built for operational workloads
EventStoreDB - An event-native database designed for event sourcing and event-driven architectures
ksqlDB - A database for building stream processing applications on top of Apache Kafka

Time-Series Database

InfluxDB - A scalable datastore for metrics, events, and real-time analytics
TimescaleDB - A fast-ingest time-series SQL database packaged as a PostgreSQL extension
Apache IoTDB - An Internet of Things database with seamless integration with the Hadoop and Spark ecosystem
Netflix Atlas - An in-memory dimensional time series database developed and open sourced by Netflix
QuestDB - A time-series database for fast ingest and SQL queries
TDengine - A high-performance, cloud native time-series database optimized for Internet of Things (IoT)
KairosDB - A scalable time series database written in Java

Columnar OLAP Database

Apache Kudu - A column-oriented data store for the Apache Hadoop ecosystem
Greenplum - A column-oriented massively parallel PostgreSQL for analytics
MonetDB - A high-performance columnar database originally developed by the CWI database research group
DuckDB - An in-process SQL OLAP Database Management System
Databend - An elastic, workload-aware cloud-native data warehouse built in Rust
ByConity - A cloud-native data warehouse forked from ClickHouse
Hydra - A fast column-oriented Postgres extension

Real-time OLAP Engine

ClickHouse - A real-time column-oriented database originally developed at Yandex
Apache Pinot - A real-time distributed OLAP datastore open sourced by LinkedIn
Apache Druid - A high performance real-time OLAP engine developed and open sourced by Metamarkets
Apache Kylin - A distributed OLAP engine designed to provide multi-dimensional analysis on Hadoop
Apache Doris - A high-performance and real-time analytical database based on MPP architecture
StarRocks - A sub-second OLAP database supporting multi-dimensional analytics (Linux Foundation project)

DATA LAKE PLATFORM

Distributed File System

Apache Hadoop HDFS - A highly scalable distributed block-based file system
GlusterFS - A scalable distributed storage that can scale to several petabytes
JuiceFS - A distributed POSIX file system built on top of Redis and S3
Lustre - A distributed parallel file system purpose-built to provide a global POSIX-compliant namespace

Distributed Object Store

Apache Ozone - A scalable, redundant, and distributed object store for Apache Hadoop
Ceph - A distributed object, block, and file storage platform
MinIO - A high performance object store that is API-compatible with Amazon S3

Serialisation Framework

Apache Parquet - An efficient columnar binary storage format that supports nested data
Apache Avro - An efficient and fast row-based binary serialisation framework
Apache ORC - A self-describing type-aware columnar file format designed for Hadoop

Open Table Format

Apache Hudi - An open table format designed to support incremental data ingestion on cloud and Hadoop
Apache Iceberg - A high-performance table format for large analytic tables developed at Netflix
Delta Lake - A storage framework for building Lakehouse architecture developed by Databricks
Apache Paimon - An Apache incubating project to support high-speed streaming data ingestion
OneTable - A unified framework supporting interoperability across multiple open-source table formats

DATA INTEGRATION

Data Integration Platform

Airbyte - A data integration platform for ETL / ELT data pipelines with a wide range of connectors
Apache NiFi - A reliable, scalable low-code data integration platform with good enterprise support
Apache Camel - An embeddable integration framework supporting many enterprise integration patterns
Apache Gobblin - A distributed data integration framework built by LinkedIn supporting both streaming and batch data
Apache InLong - An integration framework for massive data, originally built at Tencent
Meltano - A declarative code-first data integration engine
Apache SeaTunnel - A high-performance, distributed data integration tool supporting various ingestion patterns

CDC Tool

Debezium - A change data capture framework supporting a variety of databases
Kafka Connect - A streaming data integration framework and runtime on top of Apache Kafka supporting CDC
Flink CDC Connectors - CDC Connectors for Apache Flink engine supporting different databases
Brooklin - A distributed platform for streaming data between various heterogeneous source and destination systems
RudderStack - A headless Customer Data Platform to build data pipelines, open alternative to Segment

Log & Event Collection

CloudQuery - An ETL tool for syncing data from cloud APIs to a variety of supported destinations
Snowplow - A cloud-native engine for collecting behavioral data and loading it into various cloud storage systems
EventMesh - A serverless event middleware for collecting and loading event data into various targets
Apache Flume - A scalable distributed log aggregation service
Steampipe - A zero-ETL solution for getting data directly from APIs and services

Event Hub

Apache Kafka - A highly scalable distributed event store and streaming platform
NSQ - A realtime distributed messaging platform designed to operate at scale
Apache Pulsar - A scalable distributed pub-sub messaging system
Apache RocketMQ - A cloud native messaging and streaming platform
Redpanda - A high performance Kafka API compatible streaming data platform
Memphis - A scalable data streaming platform for building event-driven applications

DATA PROCESSING AND COMPUTATION

Unified Processing

Apache Beam - A unified programming model supporting execution on popular distributed processing backends
Apache Spark - A unified analytics engine for large-scale data processing
Dinky - A unified streaming & batch computation platform based on Apache Flink

Batch Processing

Hadoop MapReduce - A highly scalable distributed batch processing framework from the Apache Hadoop project
Apache Tez - A distributed data processing pipeline built for Apache Hive and Hadoop

Stream Processing

Apache Flink - A scalable high throughput stream processing framework
Apache Samza - A distributed stream processing framework which uses Kafka and Hadoop, originally developed by LinkedIn
Apache Storm - A distributed realtime computation system based on Actor Model framework
Benthos - A high performance declarative stream processing engine
Akka - A highly concurrent, distributed, message-driven processing system based on the Actor Model
Bytewax - A Python stream processing framework with a Rust distributed processing engine

Parallel Python Execution

Vaex - A high performance Python library for big tabular datasets
Dask - A flexible parallel computing library for analytics
Polars - A multithreaded DataFrame library with a vectorized query engine, written in Rust
PySpark - An interface for Apache Spark in Python
Ray - A unified framework with distributed runtime for scaling Python applications
Apache Arrow - An efficient in-memory data format

WORKFLOW MANAGEMENT & DATAOPS

Workflow Orchestration

Apache Airflow - A platform for creating and scheduling workflows as directed acyclic graphs (DAGs) of tasks
Prefect - A Python-based workflow orchestration tool
Argo - A container-native workflow engine for orchestrating parallel jobs on Kubernetes
Azkaban - A batch workflow job scheduler created at LinkedIn to run Hadoop jobs
Cadence - A distributed, scalable, highly available orchestration engine with client libraries for different languages
Dagster - A cloud-native data pipeline orchestrator written in Python
Apache DolphinScheduler - A low-code high performance workflow orchestration platform
Luigi - A Python library for building complex pipelines of batch jobs
Flyte - A scalable and flexible workflow orchestration platform for both data and ML workloads
Kestra - A declarative language-agnostic workflow orchestration and scheduling platform
Mage.ai - A platform for integrating, scheduling and managing data pipelines
Temporal - A resilient workflow management system, originated as a fork of Uber's Cadence
Windmill - A fast workflow engine, and open-source alternative to Airplane and Retool

Data Quality

Data-diff - A tool for comparing tables within or across databases
Great Expectations - A data validation and profiling tool written in Python

Data Versioning

lakeFS - A version control system for data stored in data lakes
Project Nessie - A transactional catalog for data lakes with Git-like semantics

Data Modeling

dbt - A data modeling and transformation tool for data pipelines
SQLMesh - A data transformation and modeling framework that is backwards compatible with dbt

DATA INFRASTRUCTURE

Resource Scheduling

Apache YARN - The default resource scheduler for Apache Hadoop clusters
Apache Mesos - A resource scheduling and cluster resource abstraction framework developed by Ph.D. students at UC Berkeley
Kubernetes - A production-grade container scheduling and management tool
Docker - The popular OS-level virtualization and containerization software

Cluster Administration

Apache Ambari - A tool for provisioning, managing, and monitoring of Apache Hadoop clusters
Apache Helix - A generic cluster management framework developed at LinkedIn

Security

Apache Knox - A gateway and SSO service for managing access to Hadoop clusters
Apache Ranger - A security and governance platform for Hadoop and other popular services
Kerberos - A popular enterprise network authentication protocol

Metrics Store

InfluxDB - A scalable datastore for metrics and events
Mimir - A scalable long-term metrics storage for Prometheus, developed by Grafana Labs
OpenTSDB - A distributed, scalable time series database written on top of Apache HBase
M3 - A distributed TSDB and metrics storage and aggregator

Observability Framework

Prometheus - A popular metric collection and management tool
ELK - A popular observability stack comprising Elasticsearch, Kibana, Beats, and Logstash
Graphite - An established infrastructure monitoring and observability system
OpenTelemetry - A collection of APIs, SDKs, and tools for managing and monitoring metrics
VictoriaMetrics - A scalable monitoring solution with a time series database
Zabbix - A real-time infrastructure and application monitoring service

Monitoring Dashboard

Grafana - A popular open and composable observability and data visualization platform
Kibana - The visualisation and search dashboard for Elasticsearch
RConsole - A UI for monitoring and managing Apache Kafka and Redpanda workloads

Log & Metrics Pipeline

Fluentd - A metric collection, buffering and router service
Fluent Bit - A fast log processor and forwarder, and part of the Fluentd ecosystem
Logstash - A server-side log and metric transport and processor, as part of the ELK stack
Telegraf - A plugin-driven server agent for collecting & reporting metrics, developed by InfluxData
Vector - A high-performance, end-to-end (agent & aggregator) observability data pipeline
StatsD - A network daemon for collection, aggregation and routing of metrics

METADATA MANAGEMENT

Metadata Platform

Amundsen - A data discovery and metadata engine developed by Lyft engineers
Apache Atlas - A data governance and metadata platform for the Apache Hadoop ecosystem
DataHub - A metadata platform for the modern data stack developed at LinkedIn
Marquez - A metadata service for the collection, aggregation, and visualization of metadata
CKAN - A data management system for cataloging, managing and accessing data
OpenMetadata - A unified platform for discovery and governance, using a central metadata repository

Open Standards

OpenLineage - An open standard for lineage metadata collection
OpenMetadata - A unified metadata platform providing open standards for managing metadata
Egeria - Open metadata and governance standards to facilitate metadata exchange

Schema Service

Hive Metastore - A popular schema management and metastore service as part of the Apache Hive project
Confluent Schema Registry - A schema registry for Kafka, developed by Confluent

ANALYTICS & VISUALISATION

BI & Dashboard

Apache Superset - A popular open source data visualization and data exploration platform
Metabase - A simple data visualisation and exploration dashboard
Redash - A tool to explore, query, visualize, and share data with many data source connectors
Streamlit - A Python tool to package and share data as web apps

Query & Collaboration

Hue - A query and data exploration tool with Hadoop ecosystem support, developed by Cloudera
Apache Zeppelin - A web-based notebook for interactive data analytics and collaboration for Hadoop
Querybook - A simple query and notebook UI developed by Pinterest
Jupyter - A popular interactive web-based notebook application

MPP Query Engine

Apache Hive - A data warehousing and MPP engine on top of Hadoop
Apache Impala - An MPP engine mainly for Hadoop clusters, developed by Cloudera
Presto - A distributed SQL query engine for big data
Trino - The former PrestoSQL distributed SQL query engine
Apache Drill - A distributed MPP query engine against NoSQL and Hadoop data storage systems

Semantic Layer

Alluxio - A data orchestration and virtual distributed storage system
Cube - A semantic layer for building data applications supporting popular database engines
Apache Linkis - A computation middleware to facilitate connection and orchestration between applications and data engines

ML/AI PLATFORM

Vector Storage

Milvus - A cloud-native vector database and storage layer for AI applications
Qdrant - A high-performance, scalable vector database for AI
Chroma - An AI-native embedding database for building LLM apps
Marqo - An end-to-end vector search engine for both text and images
LanceDB - A serverless vector database for AI applications written in Rust
Weaviate - A scalable, cloud-native vector database supporting storage of both objects and vectors
Deep Lake - An AI database with a storage format optimized for deep-learning applications
Vespa - A platform to store and organize vectors, tensors, text and structured data
Vald - A scalable distributed approximate nearest neighbor (ANN) dense vector search engine
pgvector - A Postgres extension for vector similarity search

MLOps

MLflow - A platform to streamline machine learning development and lifecycle management
Metaflow - A tool to build and manage ML/AI and data science projects, developed at Netflix
SkyPilot - A framework for running LLMs, AI, and batch jobs on any cloud
Jina - A tool to build multimodal AI applications with a cloud-native stack
NNI - An AutoML toolkit for automating the machine learning lifecycle, from Microsoft
BentoML - A framework for building reliable and scalable AI applications
Determined AI - An ML platform that simplifies distributed training, tuning and experiment tracking
Ray - A unified framework for scaling AI and Python applications
Kubeflow - A cloud-native platform for ML operations: pipelines, training and deployment
