Ophelia, a PySpark analytics wrapper. (Python; updated May 14, 2024.)
This is the GitHub repo for Learning Spark: Lightning-Fast Data Analytics (2nd Edition).
Solving big-data problems using the Spark framework in Java, running on HDFS clusters (BigData@Polito) to obtain the results.
This package contains the code for calculating external clustering validity indices in Spark. The package includes Chi Index among others.
A machine learning tutorial covering machine learning with NumPy, scikit-learn, and TensorFlow, as well as using Spark and Flink to speed up model training, aiming to give readers a fairly complete introduction to machine learning.
Spark library for generalized K-Means clustering. Supports general Bregman divergences. Suitable for clustering probabilistic data, time series data, high dimensional data, and very large data.
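The library above generalizes k-means to arbitrary Bregman divergences. As a hedged illustration of the core idea (not the library's API), here is a minimal NumPy sketch of Lloyd's algorithm with the KL divergence as the distortion measure; the names `kl_div` and `bregman_kmeans` are invented for this example. The key property exploited is that for any Bregman divergence the cost-minimizing centroid is still the arithmetic mean, so only the assignment step changes:

```python
import numpy as np

def kl_div(p, q):
    # Row-wise KL divergence KL(p || q) between probability vectors.
    eps = 1e-12
    return np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)

def bregman_kmeans(X, k, n_iters=50):
    # Deterministic farthest-point initialization.
    centers = [X[0]]
    for _ in range(k - 1):
        d = np.min(np.stack([kl_div(X, c) for c in centers], axis=1), axis=1)
        centers.append(X[d.argmax()])
    centers = np.array(centers)
    for _ in range(n_iters):
        # Assignment step: nearest center under the Bregman divergence.
        d = np.stack([kl_div(X, c) for c in centers], axis=1)
        labels = d.argmin(axis=1)
        # Update step: the arithmetic mean minimizes any Bregman divergence.
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers
```

This is suited to probabilistic data (rows are probability vectors), one of the use cases the library lists; swapping `kl_div` for another Bregman divergence leaves the rest of the algorithm unchanged.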
This repository contains all the code developed during the Big Data processing and analytics laboratories. Data are processed and analyzed using Hadoop and Spark.
Abandoned in favor of FastAPI and a new repo.
Application that trains a classifier and predicts flight arrival delays based on past information. Uses the libraries pyspark.ml and pyspark.sql, performs feature engineering, cross-validation and tests various ML algorithms.
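The cross-validation workflow described above can be sketched generically. The following is a minimal scikit-learn example of the same pattern (pipeline, parameter grid, k-fold cross-validation), standing in here for `pyspark.ml`'s `Pipeline` and `CrossValidator`; the synthetic dataset and parameter values are illustrative only, not the project's actual features or models:

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for engineered flight features and a delay label.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Pipeline: feature scaling followed by a classifier.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# 5-fold cross-validation over a small regularization grid.
grid = GridSearchCV(pipe, {"clf__C": [0.1, 1.0, 10.0]}, cv=5)
grid.fit(X, y)
```

In `pyspark.ml` the same structure appears as a `Pipeline` of transformer/estimator stages wrapped in a `CrossValidator` with a `ParamGridBuilder` grid.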
🐍💥Python and Spark for Big Data
Big Data projects for beginners
Learning Apache Spark: the PySpark library.
Maven project covering Scala (sparkml, spark_streaming, spark_dataframe, ...) and Java (threadpool, kafka, jpa, timer, request api).
Development of an AutoML System to Predict the Compressive Strength of Concrete
This project implemented a lambda architecture for analyzing domestic flight data in the US from 2009 to 2020. It used Apache Spark for batch processing, Spark Streaming for real-time analysis, and SVM models to predict flight cancellations and delays, with Docker for cluster management and Grafana for real-time visualization.
Developed a streaming Spark ML pipeline to identify potential customers who may purchase top-up services in the future.
This repository includes a web application connected to a product recommendation system built on the comprehensive Amazon Review Data (2018) dataset (nearly 233.1 million records, roughly 128 GB of storage), using MongoDB, PySpark, and Apache Kafka.
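Recommendation systems like the one above are commonly built on matrix factorization; Spark MLlib ships an alternating least squares (ALS) implementation for exactly this. As a hedged, self-contained sketch of the idea (not the project's code, and not MLlib's API), here is a tiny ALS factorizer in NumPy; the function `als` and its parameters are invented for illustration:

```python
import numpy as np

def als(R, mask, k=2, n_iters=30, reg=1e-3, seed=0):
    # Factor a ratings matrix R (users x items) as U @ V.T,
    # fitting only the observed entries marked True in `mask`.
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    U = rng.normal(scale=0.1, size=(n_users, k))
    V = rng.normal(scale=0.1, size=(n_items, k))
    I = np.eye(k)
    for _ in range(n_iters):
        # Fix V, solve a ridge regression per user.
        for u in range(n_users):
            idx = mask[u].nonzero()[0]
            if idx.size:
                Vi = V[idx]
                U[u] = np.linalg.solve(Vi.T @ Vi + reg * I, Vi.T @ R[u, idx])
        # Fix U, solve a ridge regression per item.
        for i in range(n_items):
            idx = mask[:, i].nonzero()[0]
            if idx.size:
                Ui = U[idx]
                V[i] = np.linalg.solve(Ui.T @ Ui + reg * I, Ui.T @ R[idx, i])
    return U, V
```

At the scale described above (hundreds of millions of reviews), the per-user and per-item solves are what Spark's `pyspark.ml.recommendation.ALS` distributes across the cluster.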