GoogleCloudDataproc/cloud-dataproc

Google Cloud Dataproc

This repository contains code and documentation for use with Google Cloud Dataproc.

Samples in this Repository

  • codelabs/opencv-haarcascade provides the source code for the OpenCV Dataproc Codelab, which demonstrates a Spark job that adds facial detection to a set of images.
  • codelabs/spark-bigquery provides the source code for the PySpark for Preprocessing BigQuery Data Codelab, which demonstrates using PySpark on Cloud Dataproc to process data from BigQuery.
  • codelabs/spark-nlp provides the source code for the PySpark for Natural Language Processing Codelab, which demonstrates using the spark-nlp library for Natural Language Processing.
  • notebooks/python provides example Jupyter notebooks that demonstrate using PySpark with the BigQuery Storage Connector and the Spark GCS Connector.
  • spark-tensorflow provides an example of using Spark as a preprocessing toolchain for TensorFlow jobs. Optionally, it also demonstrates using the spark-tensorflow-connector to convert CSV files to TFRecords.
  • spark-translate provides a simple demo Spark application that translates words using Google's Translation API and runs on Cloud Dataproc.

See each directory's README for more information.

Additional Dataproc Repositories

You can find more Dataproc resources in these GitHub repositories:

  • Dataproc projects
  • Connectors
  • Kubernetes Operators
  • Examples

For more information

For more information, review the Dataproc documentation. You can also ask questions on Stack Overflow using the tag google-cloud-dataproc. See our other Google Cloud Platform GitHub repos for sample applications and scaffolding for other frameworks and use cases.

Contributing changes

Licensing