
Dataproc Templates

Dataproc templates are an effort to solve simple, but large, in-cloud data tasks, including data import/export/backup/restore and bulk API operations. The technology under the hood that makes these operations possible is the serverless Spark functionality of the Google Cloud Dataproc service.

Google provides this collection of pre-implemented Dataproc templates as a reference and as an easy starting point for developers who want to extend their functionality.

Open in Cloud Shell

Templates

The repository currently includes the following templates; execution examples for each are given below:

  • HIVETOGCS (Hive to GCS)
  • HIVETOBIGQUERY (Hive to BigQuery)
  • SPANNERTOGCS (Spanner to GCS)
  • PUBSUBTOBQ (Pub/Sub to BigQuery)
  • GCSTOBIGQUERY (GCS to BigQuery)

Getting Started

Requirements

  • Java 8
  • Maven 3
  1. Clone this repository and change into its directory:

     git clone https://github.com/GoogleCloudPlatform/dataproc-templates.git
     cd dataproc-templates
    
  2. Configure required properties at resources/template.properties
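
    The file uses standard Java properties syntax. As an illustrative sketch only (the key names below are placeholders, not necessarily the repository's actual keys; consult the checked-in resources/template.properties for the real ones):

     # Placeholder keys for illustration -- see resources/template.properties
     # in the repository for the properties each template actually reads.
     project.id=my-gcp-project
     log.level=INFO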

  3. Obtain authentication credentials.

    Create local credentials by running the following command and completing the OAuth 2.0 flow (see the gcloud auth application-default login reference documentation for details):

     gcloud auth application-default login
    

    Or manually set the GOOGLE_APPLICATION_CREDENTIALS environment variable to point to a service account key JSON file path.
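
    For example (the key file path below is a placeholder):

     export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account-key.json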

    Learn more in the Google Cloud documentation under Setting Up Authentication for Server to Server Production Applications.

    Note: Application Default Credentials can find credentials implicitly as long as the application is running on Compute Engine, Kubernetes Engine, App Engine, or Cloud Functions.

  4. Format Code [Optional]

    From either the root directory or the v2/ directory, run:

    mvn spotless:apply

    This formats the code and adds a license header. To verify that the code is formatted correctly, run:

    mvn spotless:check

    Choose the directory to run the commands from based on whether your changes are under v2/ or not.

  5. Build the Project

    Build the entire project with Maven:

    mvn clean install
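
    If you want a faster build during iteration, the standard Maven flag for skipping unit tests applies here as well:

    mvn clean install -DskipTests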
  6. Execute a Template

    Once the template is staged on Google Cloud Storage, it can be executed using the gcloud CLI tool. The runtime parameters required by the template are passed in the parameters field as a comma-separated list of paramName=value pairs, as illustrated below.
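
    For example, several Spark runtime properties can be chained in a single flag (the values here are placeholders):

     --properties=spark.executor.instances=2,spark.driver.memory=4g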

    • Set required variables.
      [Required]
      export PROJECT=my-gcp-project
      export REGION=gcp-region
      export SUBNET=subnet-id (Example projects/<gcp-project>/regions/<region>/subnetworks/test-subnet1)
      export GCS_STAGING_BUCKET=gs://my-bucket/temp
      
      [Optional]
      export HISTORY_SERVER_CLUSTER=permanent-history-server-id (Id of Dataproc cluster running permanent history server to access historic logs. Example projects/<project-id>/regions/<region>/clusters/per-hs)
      
    • Execute the required template, for example:
      bin/start.sh GCP_PROJECT=${PROJECT} \
      REGION=${REGION} \
      SUBNET=${SUBNET} \
      GCS_STAGING_BUCKET=${GCS_STAGING_BUCKET} \
      TEMPLATE_NAME=HIVETOGCS \
      --properties=spark.hadoop.hive.metastore.uris=thrift://<hostname-or-ip>:9083
      
    1. Execute the Hive to GCS template. Detailed instructions at README.md.

      # HISTORY_SERVER_CLUSTER is optional; omit that line if unused.
      bin/start.sh GCP_PROJECT=${PROJECT} \
        REGION=${REGION} \
        SUBNET=${SUBNET} \
        GCS_STAGING_BUCKET=${GCS_STAGING_BUCKET} \
        HISTORY_SERVER_CLUSTER=${HISTORY_SERVER_CLUSTER} \
        TEMPLATE_NAME=HIVETOGCS \
        --properties=spark.hadoop.hive.metastore.uris=thrift://<hostname-or-ip>:9083
      
    2. Execute the Hive to BigQuery template. Detailed instructions at README.md.

      # HISTORY_SERVER_CLUSTER is optional; omit that line if unused.
      bin/start.sh GCP_PROJECT=${PROJECT} \
        REGION=${REGION} \
        SUBNET=${SUBNET} \
        GCS_STAGING_BUCKET=${GCS_STAGING_BUCKET} \
        HISTORY_SERVER_CLUSTER=${HISTORY_SERVER_CLUSTER} \
        TEMPLATE_NAME=HIVETOBIGQUERY \
        --properties=spark.hadoop.hive.metastore.uris=thrift://<hostname-or-ip>:9083
      
    3. Execute the Spanner to GCS template. Detailed instructions at README.md.

      # HISTORY_SERVER_CLUSTER is optional; omit that line if unused.
      bin/start.sh GCP_PROJECT=${PROJECT} \
        REGION=${REGION} \
        SUBNET=${SUBNET} \
        GCS_STAGING_BUCKET=${GCS_STAGING_BUCKET} \
        HISTORY_SERVER_CLUSTER=${HISTORY_SERVER_CLUSTER} \
        TEMPLATE_NAME=SPANNERTOGCS
      
    4. Execute the Pub/Sub to BigQuery template.

      # HISTORY_SERVER_CLUSTER is optional; omit that line if unused.
      bin/start.sh GCP_PROJECT=${PROJECT} \
        REGION=${REGION} \
        SUBNET=${SUBNET} \
        GCS_STAGING_BUCKET=${GCS_STAGING_BUCKET} \
        HISTORY_SERVER_CLUSTER=${HISTORY_SERVER_CLUSTER} \
        TEMPLATE_NAME=PUBSUBTOBQ
      
    5. Execute the GCS to BigQuery template.

      # HISTORY_SERVER_CLUSTER is optional; omit that line if unused.
      bin/start.sh GCP_PROJECT=${PROJECT} \
        REGION=${REGION} \
        SUBNET=${SUBNET} \
        GCS_STAGING_BUCKET=${GCS_STAGING_BUCKET} \
        HISTORY_SERVER_CLUSTER=${HISTORY_SERVER_CLUSTER} \
        TEMPLATE_NAME=GCSTOBIGQUERY
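
    After submission, a batch can be monitored with the gcloud CLI, for example:

      # List recent Dataproc Serverless batches in the region
      gcloud dataproc batches list --region=${REGION}

      # Inspect one batch by its ID (placeholder shown)
      gcloud dataproc batches describe my-batch-id --region=${REGION}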

Flow diagram

The flow diagram below shows the execution flow for Dataproc templates:

[Figure: Dataproc templates flow diagram]

Contributing

See the contributing instructions to get started.

License

All solutions within this repository are provided under the Apache 2.0 license. Please see the LICENSE file for more detailed terms and conditions.

Disclaimer

This repository and its contents are not an official Google Product.

Contact

Questions, issues, and comments should be directed to professional-services-oss@google.com.
