pracdata/awesome-open-source-data-engineering

Awesome Open Source Data Engineering

A curated list of open source tools used in analytical stacks and the data engineering ecosystem.

Open Source Data Engineering Landscape 2024

For more information about the landscape compiled above for 2024, please read the published blog post on Substack or Medium.

Table of contents

STORAGE SYSTEMS

Relational DBMS

  • PostgreSQL - Advanced object-relational database management system
  • MySQL - One of the most popular open source databases
  • MariaDB - A popular MySQL server fork
  • Supabase - An open source Firebase alternative
  • SQLite - The most popular embedded database engine

Distributed SQL DBMS

  • Citus - A popular PostgreSQL extension that turns PostgreSQL into a distributed database
  • CockroachDB - A cloud-native distributed SQL database
  • YugabyteDB - A cloud-native distributed SQL database
  • TiDB - A cloud-native, distributed, MySQL-compatible database
  • OceanBase - A scalable distributed relational database
  • ShardingSphere - A Distributed SQL transaction & query engine
  • Neon - A serverless open-source alternative to AWS Aurora Postgres

Cache Store

  • Redis - A popular key-value based cache store
  • Memcached - A high-performance multithreaded key-value cache store
  • Dragonfly - A modern cache store compatible with Redis and Memcached APIs

In-memory SQL Database

  • Apache Ignite - A distributed, ACID-compliant in-memory DBMS
  • ReadySet - A MySQL and Postgres wire-compatible caching layer
  • VoltDB - A distributed, horizontally-scalable, ACID-compliant database

Document Store

  • MongoDB - A cross-platform, document-oriented NoSQL database
  • RavenDB - An ACID NoSQL document database
  • RethinkDB - A distributed document-oriented database for real-time applications
  • CouchDB - A scalable document-oriented NoSQL database
  • Couchbase - A modern cloud-native NoSQL distributed database
  • FerretDB - A truly Open Source MongoDB alternative!

NoSQL Multi-model

  • OrientDB - A Multi-model DBMS supporting Graph, Document, Reactive, Full-Text and Geospatial models
  • ArangoDB - A multi-model database with flexible data models for documents, graphs, and key-values
  • SurrealDB - A scalable, distributed, collaborative, document-graph database
  • EdgeDB - A graph-relational database with declarative schema

Graph Database

  • Neo4j - A high performance leading graph database
  • JanusGraph - A highly scalable distributed graph database
  • HugeGraph - A fast and highly scalable graph database
  • NebulaGraph - A distributed, horizontally scalable, fast open-source graph database
  • Cayley - Inspired by the graph database behind Google's Knowledge Graph
  • Dgraph - A horizontally scalable and distributed GraphQL database with a graph backend

Distributed Key-value Store

  • Riak - A decentralized key-value datastore from Basho Technologies
  • FoundationDB - A distributed, transactional key-value store from Apple
  • etcd - A distributed reliable key-value store written in Go
  • TiKV - A distributed transactional key-value database, originally created to complement TiDB
  • Immudb - A database with built-in cryptographic proof and verification

Wide-column Key-value Store

  • Apache Cassandra - A highly-scalable LSM-Tree based partitioned row store
  • Apache HBase - A distributed wide column-oriented store modeled after Google's Bigtable
  • Scylla - An LSM-Tree based wide-column store, API-compatible with Apache Cassandra and Amazon DynamoDB
  • Apache Accumulo - A distributed key-value store with scalable data storage and retrieval, on top of Hadoop

Embedded Key-value Store

  • LevelDB - A fast key-value storage library written at Google
  • RocksDB - An embeddable, persistent key-value store developed by Meta (Facebook)
  • MyRocks - A RocksDB storage engine for MySQL
  • BadgerDB - An embeddable, fast key-value database written in pure Go

Search Engine

  • Apache Solr - A fast distributed search database built on Apache Lucene
  • Elasticsearch - A distributed, RESTful search engine optimized for speed
  • Sphinx - A full-text search engine with fast indexing
  • Meilisearch - A fast search API with great integration support
  • OpenSearch - A community-driven, open source fork of Elasticsearch and Kibana
  • Quickwit - A fast cloud-native search engine for observability data

Streaming Database

  • RisingWave - A scalable Postgres for stream processing, analytics, and management
  • Materialize - A real-time data warehouse purpose-built for operational workloads
  • EventStoreDB - An event-native database designed for event sourcing and event-driven architectures
  • ksqlDB - A database for building stream processing applications on top of Apache Kafka

Time-Series Database

  • InfluxDB - A scalable datastore for metrics, events, and real-time analytics
  • TimescaleDB - A fast-ingest time-series SQL database packaged as a PostgreSQL extension
  • Apache IoTDB - An Internet of Things database with seamless integration with the Hadoop and Spark ecosystems
  • Netflix Atlas - An in-memory dimensional time series database developed and open sourced by Netflix
  • QuestDB - A time-series database for fast ingest and SQL queries
  • TDengine - A high-performance, cloud-native time-series database optimized for Internet of Things (IoT)
  • KairosDB - A scalable time series database written in Java

Columnar OLAP Database

  • Apache Kudu - A column-oriented data store for the Apache Hadoop ecosystem
  • Greenplum - A column-oriented massively parallel PostgreSQL for analytics
  • MonetDB - A high-performance columnar database originally developed by the CWI database research group
  • DuckDB - An in-process SQL OLAP Database Management System
  • Databend - An elastic, workload-aware cloud-native data warehouse built in Rust
  • ByConity - A cloud-native data warehouse forked from ClickHouse
  • hydra - A fast column-oriented Postgres extension

Real-time OLAP Engine

  • ClickHouse - A real-time column-oriented database originally developed at Yandex
  • Apache Pinot - A real-time distributed OLAP datastore open sourced by LinkedIn
  • Apache Druid - A high performance real-time OLAP engine developed and open sourced by Metamarkets
  • Apache Kylin - A distributed OLAP engine designed to provide multi-dimensional analysis on Hadoop
  • Apache Doris - A high-performance and real-time analytical database based on MPP architecture
  • StarRocks - A sub-second OLAP database supporting multi-dimensional analytics (Linux Foundation project)

DATA LAKE PLATFORM

Distributed File System

  • Apache Hadoop HDFS - A highly scalable distributed block-based file system
  • GlusterFS - A scalable distributed storage that can scale to several petabytes
  • JuiceFS - A distributed POSIX file system built on top of Redis and S3
  • Lustre - A distributed parallel file system purpose-built to provide a global POSIX-compliant namespace

Distributed Object Store

  • Apache Ozone - A scalable, redundant, and distributed object store for Apache Hadoop
  • Ceph - A distributed object, block, and file storage platform
  • MinIO - A high-performance object store that is API-compatible with Amazon S3

Serialisation Framework

  • Apache Parquet - An efficient columnar binary storage format that supports nested data
  • Apache Avro - An efficient and fast row-based binary serialisation framework
  • Apache ORC - A self-describing type-aware columnar file format designed for Hadoop

Open Table Format

  • Apache Hudi - An open table format designed to support incremental data ingestion on cloud and Hadoop
  • Apache Iceberg - A high-performance table format for large analytic tables developed at Netflix
  • Delta Lake - A storage framework for building Lakehouse architecture developed by Databricks
  • Apache Paimon - An Apache incubating project to support high-speed streaming data ingestion
  • OneTable - A unified framework supporting interoperability across multiple open-source table formats

DATA INTEGRATION

Data Integration Platform

  • Airbyte - A data integration platform for ETL / ELT data pipelines with a wide range of connectors
  • Apache NiFi - A reliable, scalable low-code data integration platform with good enterprise support
  • Apache Camel - An embeddable integration framework supporting many enterprise integration patterns
  • Apache Gobblin - A distributed data integration framework built by LinkedIn supporting both streaming and batch data
  • Apache InLong - An integration framework for supporting massive data, originally built at Tencent
  • Meltano - A declarative code-first data integration engine
  • Apache SeaTunnel - A high-performance, distributed data integration tool supporting various ingestion patterns

CDC Tool

  • Debezium - A change data capture framework supporting a variety of databases
  • Kafka Connect - A streaming data integration framework and runtime on top of Apache Kafka supporting CDC
  • Flink CDC Connectors - CDC Connectors for Apache Flink engine supporting different databases
  • Brooklin - A distributed platform for streaming data between various heterogeneous source and destination systems
  • RudderStack - A headless Customer Data Platform to build data pipelines, open alternative to Segment

Log & Event Collection

  • CloudQuery - An ETL tool for syncing data from cloud APIs to a variety of supported destinations
  • Snowplow - A cloud-native engine for collecting behavioral data and loading it into various cloud storage systems
  • EventMesh - A serverless event middleware for collecting and loading event data into various targets
  • Apache Flume - A scalable distributed log aggregation service
  • Steampipe - A zero-ETL solution for getting data directly from APIs and services

Event Hub

  • Apache Kafka - A highly scalable distributed event store and streaming platform
  • NSQ - A realtime distributed messaging platform designed to operate at scale
  • Apache Pulsar - A scalable distributed pub-sub messaging system
  • Apache RocketMQ - A cloud-native messaging and streaming platform
  • Redpanda - A high performance Kafka API compatible streaming data platform
  • Memphis - A scalable data streaming platform for building event-driven applications

DATA PROCESSING AND COMPUTATION

Unified Processing

  • Apache Beam - A unified programming model supporting execution on popular distributed processing backends
  • Apache Spark - A unified analytics engine for large-scale data processing
  • Dinky - A unified streaming & batch computation platform based on Apache Flink

Batch processing

  • Hadoop MapReduce - A highly scalable distributed batch processing framework from the Apache Hadoop project
  • Apache Tez - A distributed data processing pipeline built for Apache Hive and Hadoop

Stream Processing

  • Apache Flink - A scalable high throughput stream processing framework
  • Apache Samza - A distributed stream processing framework which uses Kafka and Hadoop, originally developed by LinkedIn
  • Apache Storm - A distributed realtime computation system based on Actor Model framework
  • Benthos - A high performance declarative stream processing engine
  • Akka - A highly concurrent, distributed, message-driven processing system based on the Actor Model
  • Bytewax - A Python stream processing framework with a Rust distributed processing engine

Parallel Python Execution

  • Vaex - A high performance Python library for big tabular datasets.
  • Dask - A flexible parallel computing library for analytics
  • Polars - A multithreaded Dataframe with vectorized query engine, written in Rust
  • PySpark - An interface for Apache Spark in Python
  • Ray - A unified framework with a distributed runtime for scaling Python applications
  • Apache Arrow - An efficient in-memory data format

WORKFLOW MANAGEMENT & DATAOPS

Workflow Orchestration

  • Apache Airflow - A platform for creating and scheduling workflows as directed acyclic graphs (DAGs) of tasks
  • Prefect - A Python-based workflow orchestration tool
  • Argo - A container-native workflow engine for orchestrating parallel jobs on Kubernetes
  • Azkaban - A batch workflow job scheduler created at LinkedIn to run Hadoop jobs
  • Cadence - A distributed, scalable, highly available orchestration engine with client libraries for different languages
  • Dagster - A cloud-native data pipeline orchestrator written in Python
  • Apache DolphinScheduler - A low-code, high-performance workflow orchestration platform
  • Luigi - A Python library for building complex pipelines of batch jobs
  • Flyte - A scalable and flexible workflow orchestration platform for both data and ML workloads
  • Kestra - A declarative language-agnostic workflow orchestration and scheduling platform
  • Mage.ai - A platform for integrating, scheduling and managing data pipelines
  • Temporal - A resilient workflow management system, originated as a fork of Uber's Cadence
  • Windmill - A fast workflow engine, and open-source alternative to Airplane and Retool

Data Quality

  • Data-diff - A tool for comparing tables within or across databases
  • Great Expectations - A data validation and profiling tool written in Python

Data Versioning

  • lakeFS - A version control system for data stored in data lakes
  • Project Nessie - A transactional Catalog for Data Lakes with Git-like semantics

Data Modeling

  • dbt - A data modeling and transformation tool for data pipelines
  • SQLMesh - A data transformation and modeling framework that is backwards compatible with dbt.

DATA INFRASTRUCTURE

Resource Scheduling

  • Apache YARN - The default resource scheduler for Apache Hadoop clusters
  • Apache Mesos - A resource scheduling and cluster resource abstraction framework developed by Ph.D. students at UC Berkeley
  • Kubernetes - A production-grade container scheduling and management tool
  • Docker - The popular OS-level virtualization and containerization software

Cluster Administration

  • Apache Ambari - A tool for provisioning, managing, and monitoring of Apache Hadoop clusters
  • Apache Helix - A generic cluster management framework developed at LinkedIn

Security

  • Apache Knox - A gateway and SSO service for managing access to Hadoop clusters
  • Apache Ranger - A security and governance platform for Hadoop and other popular services
  • Kerberos - A popular enterprise network authentication protocol

Metrics Store

  • InfluxDB - A scalable datastore for metrics and events
  • Mimir - A scalable long-term metrics storage for Prometheus, developed by Grafana Labs
  • OpenTSDB - A distributed, scalable time series database written on top of Apache HBase
  • M3 - A distributed TSDB and metrics storage and aggregator

Observability Framework

  • Prometheus - A popular metric collection and management tool
  • ELK - A popular observability stack comprising Elasticsearch, Kibana, Beats, and Logstash
  • Graphite - An established infrastructure monitoring and observability system
  • OpenTelemetry - A collection of APIs, SDKs, and tools for generating, collecting, and exporting telemetry data (metrics, logs, and traces)
  • VictoriaMetrics - A scalable monitoring solution with a time series database
  • Zabbix - A real-time infrastructure and application monitoring service

Monitoring Dashboard

  • Grafana - A popular open and composable observability and data visualization platform
  • Kibana - The visualisation and search dashboard for Elasticsearch
  • Redpanda Console - A UI for monitoring and managing Apache Kafka and Redpanda workloads

Log & Metrics Pipeline

  • Fluentd - A log and metric collection, buffering, and routing service
  • Fluent Bit - A fast log processor and forwarder, and part of the Fluentd ecosystem
  • Logstash - A server-side log and metric transport and processor, as part of the ELK stack
  • Telegraf - A plugin-driven server agent for collecting & reporting metrics, developed by InfluxData
  • Vector - A high-performance, end-to-end (agent & aggregator) observability data pipeline
  • StatsD - A network daemon for collection, aggregation and routing of metrics

METADATA MANAGEMENT

Metadata Platform

  • Amundsen - A data discovery and metadata engine developed by Lyft engineers
  • Apache Atlas - A metadata management and data governance platform for the Apache Hadoop ecosystem
  • DataHub - A metadata platform for the modern data stack developed at LinkedIn
  • Marquez - A metadata service for the collection, aggregation, and visualization of metadata
  • CKAN - A data management system for cataloging, managing and accessing data
  • Open Metadata - A unified platform for discovery and governance, using a central metadata repository

Open Standards

  • Open Lineage - An open standard for lineage metadata collection
  • Open Metadata - A unified metadata platform providing open standards for managing metadata
  • Egeria - Open metadata and governance standards to facilitate metadata exchange

Schema Service

ANALYTICS & VISUALISATION

BI & Dashboard

  • Apache Superset - A popular open source data visualization and data exploration platform
  • Metabase - A simple data visualisation and exploration dashboard
  • Redash - A tool to explore, query, visualize, and share data with many data source connectors
  • Streamlit - A Python tool to package and share data as web apps

Query & Collaboration

  • Hue - A query and data exploration tool with Hadoop ecosystem support, developed by Cloudera
  • Apache Zeppelin - A web-based notebook for interactive data analytics and collaboration for Hadoop
  • Querybook - A simple query and notebook UI developed by Pinterest
  • Jupyter - A popular interactive web-based notebook application

MPP Query Engine

  • Apache Hive - A data warehousing and MPP engine on top of Hadoop
  • Apache Impala - An MPP engine mainly for Hadoop clusters, developed by Cloudera
  • Presto - A distributed SQL query engine for big data
  • Trino - The former PrestoSQL distributed SQL query engine
  • Apache Drill - A distributed MPP query engine against NoSQL and Hadoop data storage systems

Semantic Layer

  • Alluxio - A data orchestration and virtual distributed storage system
  • Cube - A semantic layer for building data applications supporting popular database engines
  • Apache Linkis - A computation middleware to facilitate connection and orchestration between applications and data engines

ML/AI PLATFORM

Vector Storage

  • milvus - A cloud-native vector database, storage for AI applications
  • qdrant - A high-performance, scalable Vector database for AI
  • chroma - An AI-native embedding database for building LLM apps
  • marqo - An end-to-end vector search engine for both text and images
  • LanceDB - A serverless vector database for AI applications written in Rust
  • weaviate - A scalable, cloud-native vector database supporting storage of both objects and vectors
  • deeplake - An AI database with a storage format optimized for deep-learning applications
  • Vespa - A storage to organize vectors, tensors, text and structured data
  • vald - A scalable distributed approximate nearest neighbor (ANN) dense vector search engine
  • pgvector - Vector similarity search as a Postgres extension

MLOps

  • mlflow - A platform to streamline machine learning development and lifecycle management
  • Metaflow - A tool to build and manage ML/AI, and data science projects, developed at Netflix
  • SkyPilot - A framework for running LLMs, AI, and batch jobs on any cloud
  • Jina - A tool to build multimodal AI applications with cloud-native stack
  • NNI - An AutoML toolkit for automating the machine learning lifecycle, from Microsoft
  • BentoML - A framework for building reliable and scalable AI applications
  • Determined AI - An ML platform that simplifies distributed training, tuning and experiment tracking
  • Ray - A unified framework for scaling AI and Python applications
  • kubeflow - A cloud-native platform for ML operations - pipelines, training and deployment