This repo contains details related to Data Engineering tech stacks on GCP.
Updated Jun 1, 2024 - Jupyter Notebook
Apache Spark is an open source distributed general-purpose cluster-computing framework. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
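The "implicit data parallelism" described above follows the map/reduce pattern. As a minimal local sketch (plain Python, not Spark's actual API), the classic word count can be split into a per-partition "map" stage and a merging "reduce" stage; Spark would run the map stage on each partition in parallel across the cluster:

```python
from collections import Counter

def count_partition(lines):
    # "map" stage: turn one partition of lines into local word counts;
    # Spark would execute this on every partition in parallel
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

def word_count(partitions):
    # "reduce" stage: merge the per-partition counts, analogous to
    # what Spark's reduceByKey does across the cluster
    total = Counter()
    for part in map(count_partition, partitions):
        total += part
    return total

partitions = [["spark is fast", "spark scales out"],
              ["fault tolerance is implicit"]]
print(word_count(partitions)["spark"])  # 2
```

The point of the sketch is that the user writes ordinary collection transforms; the framework decides where each partition runs and re-executes lost partitions for fault tolerance.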
YTsaurus is a scalable and fault-tolerant open-source big data platform.
Apache Kyuubi is a distributed and multi-tenant gateway to provide serverless SQL on data warehouses and lakehouses.
DataPulse is a platform for developers to build, schedule and monitor data pipelines.
Big data computing platform based on Spark (至轻云: building a big data computing platform).
This project implements an end-to-end tech stack for a data platform and can be used in production.
An open source, standard data file format for graph data storage and retrieval.
🧙 Build, run, and manage data pipelines for integrating and transforming data.
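The build, schedule, and monitor loop of a pipeline orchestrator can be sketched as a toy DAG runner using only the standard library; the function and task names below are invented for illustration and are not any particular tool's API:

```python
from graphlib import TopologicalSorter  # Python 3.9+

def run_pipeline(tasks, deps):
    """Run tasks in dependency order and record each outcome.

    tasks: {name: zero-arg callable}
    deps:  {name: set of upstream task names}
    """
    status = {}
    # "schedule": topological order guarantees upstream tasks run first
    for name in TopologicalSorter(deps).static_order():
        try:
            tasks[name]()            # "run" the task
            status[name] = "success"
        except Exception:
            status[name] = "failed"  # "monitor": record the failure
    return status

results = []
tasks = {
    "extract": lambda: results.append("raw"),
    "transform": lambda: results.append(results[-1].upper()),
    "load": lambda: results.append("loaded:" + results[-1]),
}
deps = {"transform": {"extract"}, "load": {"transform"}}
print(run_pipeline(tasks, deps))
```

Real orchestrators add retries, skipping downstream tasks of a failed upstream, and persistence of run history; this sketch only shows the ordering-and-status core.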
Spark accelerator framework; it enables secondary indexes on remote data stores.
Make your company data-driven. Connect to any data source, and easily visualize, dashboard, and share your data.
Generate relevant synthetic data quickly for your projects. The Databricks Labs synthetic data generator (aka `dbldatagen`) may be used to generate large simulated/synthetic data sets for tests, POCs, and other uses in Databricks environments, including in Delta Live Tables pipelines.
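`dbldatagen` itself builds Spark DataFrames from a column specification. Outside a Spark environment, the underlying idea, generating reproducible rows from a per-column spec with a fixed seed, can be sketched in plain Python; the spec format here is invented for illustration and is not dbldatagen's API:

```python
import random

def generate_rows(spec, n_rows, seed=42):
    # spec maps column name -> ("int", lo, hi) or ("choice", [values]).
    # A fixed seed makes the synthetic data set reproducible, which
    # matters when tests compare results against stored expectations.
    rng = random.Random(seed)
    rows = []
    for _ in range(n_rows):
        row = {}
        for col, kind in spec.items():
            if kind[0] == "int":
                row[col] = rng.randint(kind[1], kind[2])  # inclusive bounds
            elif kind[0] == "choice":
                row[col] = rng.choice(kind[1])
        rows.append(row)
    return rows

spec = {"customer_id": ("int", 1, 10_000),
        "plan": ("choice", ["free", "pro", "enterprise"])}
sample = generate_rows(spec, 5)
print(sample[0])
```

The real library adds weighted value distributions, templated strings, and partition-parallel generation on Spark; the sketch only shows the spec-driven, seeded core.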
Cloud-based AI / ML workflow and data application development framework
This construct builds some elements for you to quickly launch an EMR Serverless application. After submitting the EMR Serverless job, you can also launch an EMR notebook via a cluster template to check the outcome of the EMR Serverless application.
Apache Spark: created by Matei Zaharia; released May 26, 2014.