Dataproc Templates (Java - Spark)

Please refer to the Dataproc Templates (Java - Spark) README for more information.

...

Requirements

  • Java 8
  • Maven 3
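
A quick way to confirm both are on your PATH before building (version output shown is what you should expect):

java -version   # should report a 1.8.x (Java 8) runtime
mvn -version    # should report Maven 3.x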

Running Templates

The Dataproc Templates (Java - Spark) support both serverless and cluster execution modes. Serverless mode is the default; to run on a Dataproc cluster instead, follow the steps below.

Serverless Mode (Default)

Submits the job to Dataproc Serverless using the batches submit spark command.

Cluster Mode

Submits the job to a Dataproc Standard cluster using the jobs submit spark command.

To run the templates on an existing cluster, you must specify the JOB_TYPE=CLUSTER and CLUSTER=<full clusterId> environment variables. For example:

export JOB_TYPE=CLUSTER
export CLUSTER=${DATAPROC_CLUSTER_NAME}
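
Under the hood, the two modes correspond roughly to the following gcloud invocations. This is a sketch of what bin/start.sh assembles for you; the jar, class, and template arguments it fills in are omitted here:

# Serverless mode (default)
gcloud dataproc batches submit spark \
--project=${PROJECT} \
--region=${REGION} \
...

# Cluster mode (JOB_TYPE=CLUSTER)
gcloud dataproc jobs submit spark \
--project=${PROJECT} \
--region=${REGION} \
--cluster=${CLUSTER} \
...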

Note: Certain templates may require a newer Dataproc image version. Before running a template, make sure your cluster's Dataproc image version includes the supported dependency versions listed in pom.xml.
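
One way to check an existing cluster's image version is gcloud's describe command (the --format field path assumes the current Dataproc API shape):

gcloud dataproc clusters describe <cluster-name> --region=<region> \
  --format='value(config.softwareConfig.imageVersion)'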

Templates that require a custom image to execute, such as some HBase templates, are not yet supported in CLUSTER mode.

Submit templates

  1. Format Code [Optional]

    From either the root directory or v2/ directory, run:

    mvn spotless:apply

    This will format the code and add a license header. To verify that the code is formatted correctly, run:

    mvn spotless:check

    Run the commands from the root directory or from the v2/ directory, depending on whether your changes are under v2/.

  2. Building the Project

    Build the entire project with Maven:

    mvn clean install
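
    If you only need the artifacts and want a faster local build, tests can be skipped with Maven's standard flag (optional):

    mvn clean install -DskipTests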
  3. Executing a Template File

    Once the template is staged on Google Cloud Storage, it can be executed using the gcloud CLI.

    To stage and execute a template, you can use the start.sh script, which takes:

    • Environment variables on where and how to deploy the templates

    • Additional options for gcloud dataproc jobs submit spark or gcloud beta dataproc batches submit spark

    • Template options, such as the required --template option, which selects the template to run, and --templateProperty options for passing in properties at runtime (as an alternative to setting them in src/main/resources/template.properties; see the sketch after the usage example).

    • Other common template property: log.level, an optional parameter that defines the log level of the Spark context; it defaults to INFO. Possible choices are the Spark log levels: ["ALL", "DEBUG", "ERROR", "FATAL", "INFO", "OFF", "TRACE", "WARN"]. For example:

      --templateProperty log.level=ERROR
    • Usage syntax:

      start.sh [submit-spark-options] -- --template templateName [--templateProperty key=value] [extra-template-options]

      For example:

      # Set required environment variables.
      export PROJECT=my-gcp-project
      export REGION=gcp-region
      export GCS_STAGING_LOCATION=gs://my-bucket/temp
      # Set optional environment variables.
      export SUBNET=projects/<gcp-project>/regions/<region>/subnetworks/test-subnet1
      # ID of Dataproc cluster running permanent history server to access historic logs.
      export HISTORY_SERVER_CLUSTER=projects/<gcp-project>/regions/<region>/clusters/<cluster>
      
      # The submit spark options must be separated with a "--" from the template options
      bin/start.sh \
      --properties=<spark.something.key>=<value> \
      --version=... \
      -- \
      --template <TEMPLATE TYPE> \
      --templateProperty <key>=<value>
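
    As noted above, template properties can alternatively be set in src/main/resources/template.properties before building. A minimal sketch; apart from log.level, the key below is illustrative, so check the file shipped in the repo for the real names:

      # src/main/resources/template.properties (sketch)
      project.id=my-gcp-project   # illustrative key
      log.level=INFO              # documented above; defaults to INFO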
    1. Executing Hive to GCS template.

      Detailed instructions at README.md

      bin/start.sh \
      --properties=spark.hadoop.hive.metastore.uris=thrift://<hostname-or-ip>:9083 \
      -- --template HIVETOGCS
    2. Executing Hive to BigQuery template.

      Detailed instructions at README.md

      bin/start.sh \
      --properties=spark.hadoop.hive.metastore.uris=thrift://<hostname-or-ip>:9083 \
      -- --template HIVETOBIGQUERY
    3. Executing Spanner to GCS template.

      Detailed instructions at README.md

      bin/start.sh -- --template SPANNERTOGCS
    4. Executing PubSub to BigQuery template.

      bin/start.sh -- --template PUBSUBTOBQ
    5. Executing PubSub to GCS template.

      bin/start.sh -- --template PUBSUBTOGCS
    6. Executing GCS to BigQuery template.

      bin/start.sh -- --template GCSTOBIGQUERY
    7. Executing BigQuery to GCS template.

      bin/start.sh -- --template BIGQUERYTOGCS
    8. Executing General template.

      Detailed instructions at README.md

       bin/start.sh --files="gs://bucket/path/config.yaml" \
       -- --template GENERAL --config config.yaml

      With, for example, config.yaml:

      input:
        shakespeare:
          format: bigquery
          options:
            table: "bigquery-public-data:samples.shakespeare"
      query:
        wordcount:
          sql: "SELECT word, sum(word_count) cnt FROM shakespeare GROUP by word ORDER BY cnt DESC"
      output:
        wordcount:
          format: csv
          options:
            header: true
            path: gs://bucket/output/wordcount/
          mode: Overwrite
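
      Once the job finishes, the CSV output from the example config can be listed and inspected with standard gsutil commands; the part-file names follow Spark's usual output convention:

      gsutil ls gs://bucket/output/wordcount/
      gsutil cat gs://bucket/output/wordcount/part-*.csv | head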