Trying out a best-case Apache Spark working environment for robust data pipelines
Apache Spark is an open-source, general-purpose distributed cluster-computing framework. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
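A minimal sketch of what that implicit parallelism looks like from PySpark; the dataset and partition count below are illustrative, not from any of the projects listed here:

```python
# A toy demonstration of implicit data parallelism: the same code runs
# unchanged on one machine or a cluster, depending on the master URL.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parallelism-demo").getOrCreate()

# parallelize() partitions the data across executors automatically;
# failed tasks are re-run elsewhere, which is the fault-tolerance part.
rdd = spark.sparkContext.parallelize(range(1_000_000), numSlices=8)
print(rdd.map(lambda x: x * x).sum())

spark.stop()
```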
Dummy code snippets for reading, manipulating, and building a simple ML model with PySpark.
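Roughly what such a read/manipulate/model flow looks like; the input file flights.csv and the columns delay, distance, and hour are hypothetical placeholders, not taken from the project:

```python
# Read -> manipulate -> model with PySpark. The input file and column
# names (delay, distance, hour) are hypothetical placeholders.
from pyspark.sql import SparkSession, functions as F
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("pyspark-ml-demo").getOrCreate()

# Read: load a CSV with a header row and inferred column types.
df = spark.read.csv("flights.csv", header=True, inferSchema=True)

# Manipulate: derive a binary label and drop rows with missing values.
df = df.withColumn("late", (F.col("delay") > 15).cast("int")).dropna()

# Model: pack feature columns into one vector, then fit a classifier.
assembler = VectorAssembler(inputCols=["distance", "hour"], outputCol="features")
model = LogisticRegression(labelCol="late").fit(assembler.transform(df))
print(model.summary.areaUnderROC)
```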
🛸 This project showcases an Extract, Load, Transform (ELT) pipeline built with Python, Apache Spark, Delta Lake, and Docker. The objective of the project is to scrape UFO sighting data from NUFORC and process it through the Medallion architecture to create a star schema in the Gold layer that is ready for analysis.
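A minimal sketch of one Medallion hop (bronze to silver) with Delta Lake; the lake paths, column names, and session configuration are assumptions, and the NUFORC scraper and Gold-layer star-schema build are not reproduced here:

```python
# One Medallion hop (bronze -> silver) on Delta Lake. Paths and column
# names are assumptions; requires the delta-spark package on the classpath.
from pyspark.sql import SparkSession, functions as F

spark = (
    SparkSession.builder.appName("ufo-elt")
    # Standard Delta Lake session configuration.
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Bronze: raw scraped sightings, stored as they arrived.
bronze = spark.read.format("delta").load("/lake/bronze/ufo_sightings")

# Silver: deduplicated, typed records ready for dimensional modeling.
silver = (
    bronze.dropDuplicates(["sighting_id"])
    .withColumn("sighted_at", F.to_timestamp("sighted_at"))
)
silver.write.format("delta").mode("overwrite").save("/lake/silver/ufo_sightings")
```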
Distributed processing challenge
Practice with Spark on Azure Databricks
Real-time analysis pipeline
Parser for XML files using PySpark
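A minimal sketch, assuming the spark-xml connector (com.databricks:spark-xml) and an input file whose rows are delimited by a hypothetical &lt;record&gt; tag:

```python
# Reading XML with the spark-xml connector; each <record> element becomes
# one row. File path, rowTag, and package version are assumptions.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("xml-parser")
    # Fetch the connector at startup; match the _2.12 suffix to your Scala build.
    .config("spark.jars.packages", "com.databricks:spark-xml_2.12:0.17.0")
    .getOrCreate()
)

# Nested child elements are parsed into struct columns automatically.
df = (
    spark.read.format("xml")
    .option("rowTag", "record")
    .load("data/records.xml")
)
df.printSchema()
```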
A forecasting project based on Apache Spark and implemented with the Naive Bayes classifier.
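A minimal Naive Bayes sketch with Spark MLlib, trained on an inline toy dataset rather than the project's actual forecasting data:

```python
# Multinomial Naive Bayes with Spark MLlib on an inline toy dataset.
from pyspark.sql import SparkSession
from pyspark.ml.classification import NaiveBayes
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("nb-demo").getOrCreate()

# Toy rows of (label, feature vector); multinomial NB needs
# non-negative feature values, e.g. counts or frequencies.
train = spark.createDataFrame(
    [
        (0.0, Vectors.dense([4.0, 0.0])),
        (0.0, Vectors.dense([3.0, 1.0])),
        (1.0, Vectors.dense([0.0, 2.0])),
        (1.0, Vectors.dense([1.0, 3.0])),
    ],
    ["label", "features"],
)

model = NaiveBayes(smoothing=1.0, modelType="multinomial").fit(train)
model.transform(train).select("label", "prediction").show()
```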
Data analysis using Spark 2.2.0 Datasets
Gallery of Apache Zeppelin notebooks using Enth-Spark-AI.
Big data concept for the Lambda Architecture design pattern: Apache Kafka for streaming/data-pipeline integration, Apache Spark for distributed data processing, and Akka HTTP for RESTful services
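A sketch of the Kafka-to-Spark speed-layer hop in such an architecture; the broker address and topic name are assumptions, and the Akka HTTP serving layer is outside this snippet:

```python
# Speed layer: Spark Structured Streaming consuming a Kafka topic and
# keeping per-minute counts. Broker and topic names are assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("lambda-speed-layer").getOrCreate()

# The Kafka source needs the spark-sql-kafka connector on the classpath.
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "events")
    .load()
)

# Kafka values arrive as bytes; cast to string and aggregate by the
# record timestamp that the Kafka source attaches to every row.
counts = (
    events.select(F.col("value").cast("string").alias("event"), "timestamp")
    .groupBy(F.window("timestamp", "1 minute"))
    .count()
)

query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()
```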
An image for running Scala Jupyter notebooks and Apache Spark in the cloud on OpenShift
Created by Matei Zaharia
Released May 26, 2014