NVIDIA/spark-rapids-examples

spark-rapids-examples

This is the RAPIDS Accelerator for Apache Spark examples repo. RAPIDS Accelerator for Apache Spark accelerates Spark applications with no code changes. You can download the latest version of RAPIDS Accelerator here. This repo contains examples and applications that showcase the performance and benefits of using RAPIDS Accelerator in data processing and machine learning pipelines. There are broadly five categories of examples in this repo:

  1. SQL/Dataframe
  2. Spark XGBoost
  3. Deep Learning/Machine Learning
  4. RAPIDS UDF
  5. Databricks Tools demo notebooks

For more information on each example, see its respective category.
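As a rough illustration of what "no code changes" means in practice, the accelerator is switched on through Spark configuration rather than through edits to the application. The sketch below only assembles the settings as `spark-submit` arguments. The keys `spark.plugins` (set to `com.nvidia.spark.SQLPlugin`) and `spark.rapids.sql.enabled` are the plugin's standard settings, while the helper names and the jar path are hypothetical placeholders, and a real deployment usually needs further GPU resource settings that are not shown here.

```python
# Minimal sketch: build the Spark settings that enable the RAPIDS Accelerator.
# The helper names and the jar path are illustrative assumptions, not part of this repo.

def rapids_conf(jar_path, enabled=True):
    """Return Spark properties that load the RAPIDS Accelerator plugin."""
    return {
        # Location of the RAPIDS Accelerator jar (placeholder path supplied by the caller).
        "spark.jars": jar_path,
        # Plugin class that hooks the accelerator into Spark SQL.
        "spark.plugins": "com.nvidia.spark.SQLPlugin",
        # Toggle GPU acceleration of SQL/DataFrame operations.
        "spark.rapids.sql.enabled": str(enabled).lower(),
    }


def to_submit_args(conf):
    """Flatten a property dict into `--conf key=value` arguments for spark-submit."""
    args = []
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    return args
```

Calling `to_submit_args(rapids_conf("/path/to/rapids-4-spark.jar"))` yields arguments that can be appended to an existing `spark-submit` command, leaving the application code itself untouched.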

Here is the list of notebooks in this repo:

| # | Category | Notebook Name | Description |
|---|----------|---------------|-------------|
| 1 | SQL/DF | Microbenchmark | Spark SQL operations such as expand, hash aggregate, windowing, and cross joins with up to 20x performance benefits |
| 2 | SQL/DF | Customer Churn | Data federation for modeling customer churn with sample telco customer data |
| 3 | XGBoost | Agaricus (Scala) | Uses the XGBoost classifier function to create a model that can accurately differentiate between edible and poisonous mushrooms with the agaricus dataset |
| 4 | XGBoost | Mortgage (Scala) | End-to-end ETL + XGBoost example to predict mortgage default with Fannie Mae Single-Family Loan Performance Data |
| 5 | XGBoost | Taxi (Scala) | End-to-end ETL + XGBoost example to predict taxi trip fare amount with the NYC taxi trips data set |
| 6 | ML/DL | Criteo Training | ETL and deep learning training of the Criteo 1TB Click Logs dataset |
| 7 | ML/DL | PCA End-to-End | Spark MLlib based PCA example to train and transform with a synthetic dataset |
| 8 | UDF | cuSpatial - Point in Polygon | Spark cuSpatial example for the Point in Polygon function using the NYC Taxi pickup location dataset |

Here is the list of Apache Spark applications (Scala and PySpark) that can be built for running on GPU with RAPIDS Accelerator in this repo:

| # | Category | Application Name | Description |
|---|----------|------------------|-------------|
| 1 | XGBoost | Agaricus (Scala) | Uses the XGBoost classifier function to create a model that can accurately differentiate between edible and poisonous mushrooms with the agaricus dataset |
| 2 | XGBoost | Mortgage (Scala) | End-to-end ETL + XGBoost example to predict mortgage default with Fannie Mae Single-Family Loan Performance Data |
| 3 | XGBoost | Taxi (Scala) | End-to-end ETL + XGBoost example to predict taxi trip fare amount with the NYC taxi trips data set |
| 4 | ML/DL | PCA End-to-End | Spark MLlib based PCA example to train and transform with a synthetic dataset |
| 5 | UDF | cuSpatial - Point in Polygon | Spark cuSpatial example for the Point in Polygon function using the NYC Taxi pickup location dataset |
| 6 | UDF | URL Decode | Decodes URL-encoded strings using the Java APIs of RAPIDS cudf |
| 7 | UDF | URL Encode | URL-encodes strings using the Java APIs of RAPIDS cudf |
| 8 | UDF | CosineSimilarity | Computes the cosine similarity between two float vectors using native code |
| 9 | UDF | StringWordCount | Implements a Hive simple UDF using native code to count words in strings |
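For orientation, the row-level behavior of the UDF examples listed above can be sketched in plain Python on the CPU. This is not the repo's implementation, which relies on cudf Java APIs and native code. The function names are hypothetical, the URL helpers assume form-style encoding (space as `+`) via `urllib.parse`, and the word count assumes words are separated by whitespace; the actual examples may differ on both points.

```python
# CPU reference sketch of what the UDF examples compute per row.
# Function names and edge-case choices are illustrative assumptions.
import math
from urllib.parse import quote_plus, unquote_plus


def url_encode(text):
    """URL-encode a string (form-style: spaces become '+')."""
    return quote_plus(text)


def url_decode(text):
    """Decode a URL-encoded string (form-style: '+' becomes a space)."""
    return unquote_plus(text)


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length float vectors: dot(a, b) / (|a| * |b|)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def string_word_count(text):
    """Count whitespace-separated words in a string."""
    return len(text.split())
```

A reference like this can help when checking GPU results against expected values on a small sample before running the full applications.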