GoogleCloudPlatform/dataproc-scala-examples
Dataproc Scala Examples

Dataproc Scala Examples is an effort to assist in the creation of Spark jobs written in Scala to run on Dataproc. Google provides several pre-implemented Spark jobs and technical guides for running them on GCP.

This guide is based on the WordCount ETL example with common sources and sinks (Kafka, GCS, BigQuery, etc.).
It is intended to speed up your development of Spark jobs written in Scala for Dataproc.

The guides demonstrate how to run Spark jobs using Dataproc Submit, Serverless, and Workflow, and how to orchestrate them with Cloud Composer.

Templates

If you are looking to use Dataproc Templates, please refer to this repository.

Quickstarts

See the quickstart documentation to get started.

Dataproc Scala Examples - Versions

Scala = 2.12.14
Spark = 3.1.2
sbt = 1.6.1
Python = 3.8.12
Airflow = 2.2.3
Composer = composer-2.0.6-airflow-2.2.3
Dataproc = 2.0-debian10

Note: if you use Dataproc Serverless (described in the guides as one of the options for running jobs), recompile the jobs against Spark version 3.2.0.
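As an illustration of how the versions above might be pinned in a build, here is a minimal `build.sbt` sketch. It is not the repository's actual build file: the project name is hypothetical, and the single `spark-sql` dependency is an assumption about what a WordCount job needs. The sbt version itself would normally be pinned separately in `project/build.properties` as `sbt.version=1.6.1`.

```scala
// build.sbt: a minimal sketch under stated assumptions, not the repository's build file
ThisBuild / scalaVersion := "2.12.14"

// Switch to "3.2.0" when compiling for Dataproc Serverless (see the note on versions).
val sparkVersion = "3.1.2"

lazy val root = (project in file("."))
  .settings(
    name := "wordcount-sketch", // hypothetical project name
    libraryDependencies ++= Seq(
      // "provided" keeps Spark out of the packaged JAR, on the assumption
      // that the Dataproc runtime already supplies it on the classpath.
      "org.apache.spark" %% "spark-sql" % sparkVersion % "provided"
    )
  )
```

Keeping the Spark version in a single `val` makes the Serverless recompile a one-line change.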

Before you start

  • The data stored in GCS in this guide uses the Parquet format.
  • The jobs in this guide are configured to run the main class, although Dataproc also lets you specify a different class to run.

Set up your Scala environment

Follow the setup instructions for installing, testing and compiling the project.

Use the Cloud Shell environment in GCP

Open in Cloud Shell


Dataproc Spark Use Cases

  1. Create Mock Dataset
    • Creates input and output mock WordCount datasets in GCS and BQ to use in other examples
  2. Streaming - Kafka to GCS
    • Runs a Spark Structured Streaming WordCount example from Kafka to GCS
  3. Batch - GCS to GCS
    • Runs a Spark WordCount example from GCS to GCS
      • Appendix: Load from GCS to BQ
      • Appendix: Create BQ External table pointing to GCS data
  4. Batch - GCS to BQ
    • Runs a Spark WordCount example from GCS to BQ
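To make the batch GCS to GCS use case more concrete, the following is a rough sketch of what such a WordCount job could look like. It is not the code from the guides. It assumes that the input Parquet files contain a string column named `line`, and that the input and output GCS paths are passed as the first two program arguments; the object name, column name, and argument order are all hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode, lower, split}

// Hypothetical job object; names are illustrative only.
object WordCountSketch {

  // Lower-cases each line, splits it on whitespace, and counts each word.
  // Assumes the input has a string column named "line" (an assumption).
  def countWords(lines: DataFrame): DataFrame =
    lines
      .select(explode(split(lower(col("line")), "\\s+")).as("word"))
      .filter(col("word") =!= "") // drop empty tokens from leading whitespace
      .groupBy("word")
      .count()

  def main(args: Array[String]): Unit = {
    require(args.length >= 2, "usage: WordCountSketch <inputPath> <outputPath>")
    val inputPath  = args(0) // e.g. a gs:// path to Parquet input (hypothetical)
    val outputPath = args(1) // e.g. a gs:// path for Parquet output (hypothetical)

    val spark = SparkSession.builder().appName("wordcount-sketch").getOrCreate()
    try {
      val lines = spark.read.parquet(inputPath)
      countWords(lines).write.mode("overwrite").parquet(outputPath)
    } finally {
      spark.stop()
    }
  }
}
```

Keeping the transformation in `countWords`, separate from the read and write calls, means the same logic could be reused when only the source or sink changes, as in the GCS to BQ variant.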

Orchestrate with Cloud Composer

This part of the guide provides example DAGs to run on Cloud Composer to orchestrate the jobs from the section above.

A) Batch - Dataproc Submit - Creating and Deleting Cluster
B) Batch - Dataproc Workflow
C) Batch - Dataproc Serverless
D) Load from GCS to BQ


References

GCP Resources

Spark Resources

Composer Resources

Initialisms

GCP = Google Cloud Platform  
GCS = Google Cloud Storage  
BQ = BigQuery  
DAG = Directed Acyclic Graph

Contributing

See the contributing instructions to get started.

License

All solutions within this repository are provided under the Apache 2.0 license. Please see the LICENSE file for more detailed terms and conditions.

Disclaimer

This repository and its contents are not an official Google product.
