# pyspark

Here are 3,371 public repositories matching this topic...

mitchelllisle / sparkdantic

✨ A Pydantic to PySpark schema library

schema pyspark pydantic

Updated May 22, 2024
Python

apache / incubator-graphar

An open source, standard data file format for graph data storage and retrieval.

big-data spark etl graph pyspark graph-analysis data-orchestration graph-storage

Updated May 22, 2024
C++

ibis-project / ibis

The portable Python dataframe library

Updated May 21, 2024
Python

KevinShindel / MachineLearning

Pandas, scikit-learn, SparkML

scikit-learn pandas pyspark

Updated May 21, 2024
Jupyter Notebook

slevine / pyspark-pandas-vs-pandas

Dataframe Performance Comparison - Polars, Pandas on Spark, and Pandas

python spark pandas pyspark polars

Updated May 21, 2024
Jupyter Notebook

jupyter-incubator / sparkmagic

Jupyter magics and kernels for working with remote Spark clusters

magic spark kernel jupyter notebook cluster pandas-dataframe jupyter-notebook sql-query pyspark kerberos livy

Updated May 21, 2024
Python

microsoft / SynapseML

Simple and Distributed Machine Learning

Updated May 21, 2024
Scala

longNguyen010203 / Youtube-ETLT-Pipeline

💜🌈📊 A data engineering project that implements an ETL data pipeline using Dagster, Apache Spark, Streamlit, MinIO, Apache Superset, and dbt 🌺

mysql processing docker dockerfile machine-learning spark docker-compose postgresql pyspark data-engineering minio dbt data-engineer etl-pipeline data-engineering-pipeline cleaning-data dagster

Updated May 21, 2024
Jupyter Notebook

frizzleqq / pyspark-deltalake

Example of local pyspark setup including DeltaLake for unit-testing

spark pytest pyspark delta-lake

Updated May 21, 2024
Python

logicalclocks / hopsworks

Hopsworks - Data-Intensive AI platform with a Feature Store

python aws data-science machine-learning serverless azure gcp ml pyspark feature-engineering governance model-serving mlops feature-store feature-management hopsworks kserve

Updated May 21, 2024
Java

ev2900 / Glue_Aggregate_Small_Files

PySpark script to aggregate small Parquet files in a prefix into larger files. Designed to run on AWS Glue.

aws s3 glue pyspark small-files

Updated May 21, 2024
Python

ev2900 / Glue_Examples

PySpark code samples designed for AWS Glue

aws glue pyspark aws-glue

Updated May 21, 2024
Python

longNguyen010203 / Zillow-Home-Value-Prediction

🌈📊📈 The Zillow Home Value Prediction project uses linear regression models on Kaggle datasets to forecast house prices. 📉💰 Running Apache Spark (PySpark) in a Docker setup provides distributed, parallel data preprocessing, exploration, analysis, visualization, and model building.

visualization docker machine-learning apache-spark analysis docker-compose linear-regression models parallel-computing distributed-computing jupyter-notebook pyspark jupyterlab preprocessing feature-engineering prediction-model

Updated May 21, 2024
Jupyter Notebook

apache / linkis

Apache Linkis builds a computation middleware layer to facilitate connection, governance and orchestration between the upper applications and the underlying data engines.

Updated May 21, 2024
Java

raghul3 / IPL_Data_Analysis

Analysis of a large IPL dataset (data through 2017) using PySpark.

s3-bucket pyspark spark-sql databricks-notebooks

Updated May 21, 2024
Jupyter Notebook

databrickslabs / dbldatagen

Generate relevant synthetic data quickly for your projects. The Databricks Labs synthetic data generator (aka `dbldatagen`) may be used to generate large simulated / synthetic data sets for test, POCs, and other uses in Databricks environments including in Delta Live Tables pipelines

python spark faker pyspark spark-streaming data-generation databricks synthetic-data datagen datagenerator deltalake datageneration delta-live-tables

Updated May 21, 2024
Python

mahmoudparsian / big-data-mapreduce-course

Big Data Modeling, MapReduce, Spark, PySpark @ Santa Clara University

Updated May 21, 2024
HTML

Azure / azure-cosmosdb-spark

Apache Spark Connector for Azure Cosmos DB

spark apache-spark connector jupyter-notebook pyspark databricks changefeed lambda-architecture azure-cosmos-db databricks-notebooks cosmos-db azure-databricks

Updated May 20, 2024
Scala

drisskhattabi6 / Real-Time-Twitter-Sentiment-Analysis

This repo contains a Big Data project: "Real Time Twitter Sentiment Analysis via Kafka, Spark Streaming, MongoDB and Django Dashboard".

docker django kafka big-data spark mongodb sentiment-analysis pyspark spark-streaming kafka-producer real-time-processing sentiment-classification etl-pipeline tweets-classification big-data-projects django-dashboard

Updated May 20, 2024
Jupyter Notebook

JohnSnowLabs / spark-nlp

State of the Art Natural Language Processing

Updated May 22, 2024
Scala
