Data platform on Kubernetes

This project deploys a complete data platform on Kubernetes. Many services are available to build end-to-end data engineering projects, from ingestion to visualization.

Prerequisites

  • Docker
  • Kubernetes (Minikube cluster for local development)
  • kubectl
  • Helm
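
A quick sanity check that the prerequisites are installed, plus a local cluster start (the Minikube resource sizes below are only a suggestion, not a project requirement):
docker version
kubectl version --client
helm version
minikube start --cpus=4 --memory=8192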

Available services

  • Data ingestion
    • NiFi
  • Data integration
    • Airbyte
  • Message queue
    • Kafka
    • RabbitMQ
  • Change Data Capture
    • Debezium
  • Database
    • Cassandra
    • Druid
    • MongoDB
    • MySQL/phpMyAdmin
    • PostgreSQL/pgAdmin
  • Data warehouse
    • ClickHouse
  • Datalake
    • MinIO
  • Data transformation
    • dbt
    • Flink
    • Spark
  • Data quality
    • Great Expectations
  • Distributed SQL query engine
    • Trino
  • Visualization
    • Metabase
    • Superset
  • Machine learning
    • Kubeflow
  • Orchestration
    • Airflow
    • Argo Workflows
  • Monitoring
    • Grafana/Prometheus
  • Notebook
    • JupyterHub

Data formats

  • Delta Lake
  • Apache Iceberg (coming soon)

How to deploy the data platform on Kubernetes

Before deploying to the cluster, choose the services you want to start in the .config file (y|n).
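A sketch of what that selection might look like (these service keys are placeholders; the authoritative key names live in the repository's .config file):
# enable (y) or disable (n) each service before running start.sh
airflow=y
kafka=y
superset=n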

Deploy the data platform
./start.sh

You may need to wait a few minutes for all services to start. You can check pod status with the following command: kubectl get all -A
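To block until everything is ready instead of polling by hand, kubectl wait can help (the 10-minute timeout is an arbitrary choice, and pods belonging to completed jobs may never become Ready):
kubectl wait --for=condition=Ready pods --all --all-namespaces --timeout=600s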

Turn off the data platform
./stop.sh

Helpful:

Some services are accessible through a URL, for example:
http://dataplatform.<service-name>.io/
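For these hostnames to resolve locally with Minikube, a common approach is to point them at the cluster IP in /etc/hosts (the Superset hostname below is just one instance of the pattern above):
echo "$(minikube ip) dataplatform.superset.io" | sudo tee -a /etc/hosts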

Access another service from inside the cluster:
<service-name>.<namespace>.svc.cluster.local:<service-port>
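For example, to probe a service on that address from a throwaway pod (the service name, namespace, and port here are assumptions; substitute your own):
kubectl run dns-test --rm -it --image=busybox --restart=Never -- nc -zv postgresql.postgresql.svc.cluster.local 5432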

Get the default values of a Helm chart:
helm show values <repo/chart> > values.yaml
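For example, to dump the defaults of the Bitnami PostgreSQL chart before overriding them (just an illustration; any repo/chart works the same way):
helm repo add bitnami https://charts.bitnami.com/bitnami
helm show values bitnami/postgresql > values.yaml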

Config file
Edit the .config file to choose which services to enable or disable.

Enable the Minikube ingress addon:
minikube addons enable ingress
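To confirm the ingress controller is running (on recent Minikube versions it lives in the ingress-nginx namespace; older versions used kube-system):
kubectl get pods -n ingress-nginx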

Open the Kubernetes dashboard:
minikube dashboard --url