Upserts, Deletes And Incremental Processing on Big Data.
Updated May 25, 2024 - Java
This is a repo with links to everything you'd ever want to learn about data engineering.
FLaNK AI Weekly covering Apache NiFi, Apache Flink, Apache Kafka, Apache Spark, Apache Iceberg, Apache Ozone, Apache Pulsar, and more...
Code for blog at: https://www.startdataengineering.com/post/docker-for-de/
This GitHub repository contains a detailed document on the basics of the Scala language.
An end-to-end data engineering pipeline that orchestrates data ingestion, processing, and storage using Apache Airflow, Python, Apache Kafka, Apache Zookeeper, Apache Spark, and Cassandra. All components are containerized with Docker for easy deployment and scalability.
Run your first analysis project on Apache Zeppelin using Scala (Spark), Shell, and SQL
SparkSQL.jl enables Julia programs to work with Apache Spark data using just SQL.
This repository will help you learn Databricks concepts through examples. It covers the important topics a data engineer needs in real-world work, using PySpark and Spark SQL for development. The course ends with a few case studies.
Developed a real-time streaming analytics pipeline using Apache Spark to calculate and store KPIs for e-commerce sales data, including total volume of sales, orders per minute, rate of return, and average transaction size. Used Spark Streaming to read data from Kafka, Spark SQL to calculate KPIs, and Spark DataFrame to write KPIs to JSON files.
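The KPI logic this pipeline describes (total sales volume, rate of return, average transaction size, and orders per minute) can be sketched in plain Python rather than Spark SQL; the field names (`amount`, `timestamp`, `is_return`) are hypothetical stand-ins for whatever schema the Kafka records actually carry:

```python
from collections import Counter
from datetime import datetime

def compute_kpis(orders):
    """Compute e-commerce KPIs over one micro-batch of order events.

    Each order is a dict with hypothetical fields:
    'amount' (float), 'timestamp' (ISO 8601 string), 'is_return' (bool).
    """
    if not orders:
        return {"total_volume": 0.0, "rate_of_return": 0.0,
                "avg_transaction_size": 0.0, "orders_per_minute": {}}
    total_volume = sum(o["amount"] for o in orders)
    returns = sum(1 for o in orders if o["is_return"])
    # Orders per minute: bucket event timestamps by minute.
    per_minute = Counter(
        datetime.fromisoformat(o["timestamp"]).strftime("%Y-%m-%d %H:%M")
        for o in orders
    )
    return {
        "total_volume": total_volume,
        "rate_of_return": returns / len(orders),
        "avg_transaction_size": total_volume / len(orders),
        "orders_per_minute": dict(per_minute),
    }

orders = [
    {"amount": 100.0, "timestamp": "2024-05-25T10:00:10", "is_return": False},
    {"amount": 50.0,  "timestamp": "2024-05-25T10:00:40", "is_return": True},
    {"amount": 30.0,  "timestamp": "2024-05-25T10:01:05", "is_return": False},
]
kpis = compute_kpis(orders)
```

In the actual project, Spark Streaming would apply the same aggregations per window before writing the results to JSON files.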
PySpark is a distributed data processing library for Python that makes it possible to process large volumes of data on clusters using the Apache Spark framework, offering high performance and an integrated toolset for large-scale data analysis and handling.
US superstore opening analysis
A brief analysis of the most common words in Dracula, by Bram Stoker
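The core of such a word-frequency analysis can be sketched in a few lines of plain Python; the excerpt and stopword list below are illustrative placeholders, not the repo's actual inputs:

```python
import re
from collections import Counter

def most_common_words(text, n=3,
                      stopwords=frozenset({"the", "a", "of", "and", "to"})):
    """Return the n most frequent words in text, ignoring case and stopwords."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in stopwords)
    return counts.most_common(n)

# A short invented excerpt stands in for the full novel here.
sample = "The castle is a veritable prison, and I am a prisoner! The castle stood silent."
top = most_common_words(sample, n=2)
```

At the scale of a single novel this runs comfortably on one machine; Spark becomes interesting when the same counting is spread across a much larger corpus.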
A Capstone Project that covers several aspects of Data Engineering (Data Exploration, Cleaning, Modeling, Pipelining, Processing)
Here you will find the demo code for my Data+AI 2020 talk about customizing the Apache Spark state store.
Repository for Lab “Distributed Big Data Analytics” (MA-INF 4223), University of Bonn
Use this project to join data from multiple CSV files. It currently supports one-to-one and one-to-many joins, and also shows how to use a Kafka producer efficiently with Spark.
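The one-to-many join such a project performs can be sketched in plain Python as a simple hash join over rows read with the standard `csv` module (in Spark itself this would be a `DataFrame.join`); the column names here are invented for illustration:

```python
import csv
import io
from collections import defaultdict

def one_to_many_join(left_rows, right_rows, key):
    """Join two lists of dict rows on `key`; each left row may
    match several right rows (one-to-many)."""
    index = defaultdict(list)
    for r in right_rows:
        index[r[key]].append(r)
    joined = []
    for left in left_rows:
        for right in index.get(left[key], []):
            joined.append({**left, **right})
    return joined

# In-memory CSV stands in for the input files.
customers_csv = "id,name\n1,Ada\n2,Grace\n"
orders_csv = "id,order\n1,book\n1,pen\n2,lamp\n"
customers = list(csv.DictReader(io.StringIO(customers_csv)))
orders = list(csv.DictReader(io.StringIO(orders_csv)))
result = one_to_many_join(customers, orders, "id")
```

A one-to-one join is the special case where each key appears at most once on the right-hand side.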