This repo contains details related to Data Engineering tech stacks on GCP.
Updated Jun 1, 2024 - Jupyter Notebook
Apache Spark is an open source distributed general-purpose cluster-computing framework. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
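The "implicit data parallelism" described above follows the map/reduce pattern. As a minimal local sketch (plain Python, not Spark's actual API), the classic word count can be split into a per-partition "map" stage and a merging "reduce" stage; Spark would run the map stage on each partition in parallel across the cluster:

```python
from collections import Counter

def count_partition(lines):
    # "map" stage: turn one partition of lines into local word counts;
    # Spark would execute this on every partition in parallel
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

def word_count(partitions):
    # "reduce" stage: merge the per-partition counts, analogous to
    # what Spark's reduceByKey does across the cluster
    total = Counter()
    for part in map(count_partition, partitions):
        total += part
    return total

partitions = [["spark is fast", "spark scales out"],
              ["fault tolerance is implicit"]]
print(word_count(partitions)["spark"])  # 2
```

The point of the sketch is that the user writes ordinary collection transforms; the framework decides where each partition runs and re-executes lost partitions for fault tolerance.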
YTsaurus is a scalable and fault-tolerant open-source big data platform.
Apache Kyuubi is a distributed and multi-tenant gateway to provide serverless SQL on data warehouses and lakehouses.
DataPulse is a platform for developers to build, schedule and monitor data pipelines.
Big data computing platform based on Spark (至轻云: building a big data computing platform).
This project implements an end-to-end tech stack for a data platform and can be used in production.
An open source, standard data file format for graph data storage and retrieval.
🧙 Build, run, and manage data pipelines for integrating and transforming data.
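The build, schedule, and monitor loop of a pipeline orchestrator can be sketched as a toy DAG runner using only the standard library; the function and task names below are invented for illustration and are not any particular tool's API:

```python
from graphlib import TopologicalSorter  # Python 3.9+

def run_pipeline(tasks, deps):
    """Run tasks in dependency order and record each outcome.

    tasks: {name: zero-arg callable}
    deps:  {name: set of upstream task names}
    """
    status = {}
    # "schedule": topological order guarantees upstream tasks run first
    for name in TopologicalSorter(deps).static_order():
        try:
            tasks[name]()            # "run" the task
            status[name] = "success"
        except Exception:
            status[name] = "failed"  # "monitor": record the failure
    return status

results = []
tasks = {
    "extract": lambda: results.append("raw"),
    "transform": lambda: results.append(results[-1].upper()),
    "load": lambda: results.append("loaded:" + results[-1]),
}
deps = {"transform": {"extract"}, "load": {"transform"}}
print(run_pipeline(tasks, deps))
```

Real orchestrators add retries, skipping downstream tasks of a failed upstream, and persistence of run history; this sketch only shows the ordering-and-status core.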
Spark accelerator framework; it enables secondary indexes on remote data stores.
Make your company data-driven. Connect to any data source, and easily visualize, dashboard, and share your data.
Generate relevant synthetic data quickly for your projects. The Databricks Labs synthetic data generator (aka `dbldatagen`) may be used to generate large simulated/synthetic data sets for tests, POCs, and other uses in Databricks environments, including in Delta Live Tables pipelines.
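`dbldatagen` itself builds Spark DataFrames from a column specification. Outside a Spark environment, the underlying idea, generating reproducible rows from a per-column spec with a fixed seed, can be sketched in plain Python; the spec format here is invented for illustration and is not dbldatagen's API:

```python
import random

def generate_rows(spec, n_rows, seed=42):
    # spec maps column name -> ("int", lo, hi) or ("choice", [values]).
    # A fixed seed makes the synthetic data set reproducible, which
    # matters when tests compare results against stored expectations.
    rng = random.Random(seed)
    rows = []
    for _ in range(n_rows):
        row = {}
        for col, kind in spec.items():
            if kind[0] == "int":
                row[col] = rng.randint(kind[1], kind[2])  # inclusive bounds
            elif kind[0] == "choice":
                row[col] = rng.choice(kind[1])
        rows.append(row)
    return rows

spec = {"customer_id": ("int", 1, 10_000),
        "plan": ("choice", ["free", "pro", "enterprise"])}
sample = generate_rows(spec, 5)
print(sample[0])
```

The real library adds weighted value distributions, templated strings, and partition-parallel generation on Spark; the sketch only shows the spec-driven, seeded core.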
Cloud-based AI / ML workflow and data application development framework
This construct builds some elements for you to quickly launch an EMR Serverless application. After submitting the EMR Serverless job, you can also launch an EMR notebook via a cluster template to check the outcome of the EMR Serverless application.
Apache Spark: created by Matei Zaharia; released May 26, 2014.