
dwaiba/dataproc-terraform


Table of Contents (Dataproc cluster on Debian 9 with Zookeeper, Kafka, BigQuery and other tools, with Terraform)

  1. Pre-reqs
  2. Creation and destroying
  3. Cluster details
  4. Cloud Dataproc version
  5. URLs and extra components via dataproc-initialization-actions
  6. Terraform graph
  7. Automatic provisioning
  8. Testing Kafka
  9. Reporting bugs
  10. Patches and pull requests
  11. License
  12. Code of conduct

Pre-reqs

  1. Download and Install Terraform

  2. Download and install google cloud sdk

    • One may install the gcloud SDK silently for all users as root, with access to GCLOUD_HOME restricted to a specific user:

      export USER_NAME="<<your_user_name>>"

      export SHARE_DATA=/data

      su -c "export SHARE_DATA=/data && export CLOUDSDK_INSTALL_DIR=$SHARE_DATA && export CLOUDSDK_CORE_DISABLE_PROMPTS=1 && curl https://sdk.cloud.google.com | bash" $USER_NAME

      echo "source $SHARE_DATA/google-cloud-sdk/path.bash.inc" >> /etc/profile.d/gcloud.sh

      echo "source $SHARE_DATA/google-cloud-sdk/completion.bash.inc" >> /etc/profile.d/gcloud.sh

  3. Clone this repository and cd into the dataproc-terraform folder: git clone https://github.com/dwaiba/dataproc-terraform && cd dataproc-terraform

  4. Please create a Service Credential of type JSON via https://console.cloud.google.com/apis/credentials, download it, and save it as google.json in the credentials folder (a CLI alternative is sketched below).
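If you prefer the CLI to the console, the same key can be created with gcloud. This is a minimal sketch: the service-account name dataproc-tf and the broad roles/editor grant are illustrative assumptions, not requirements of this repo.

```bash
# Illustrative only: the service-account name and role below are assumptions.
PROJECT="<<your-google-cloud-project-name>>"
gcloud iam service-accounts create dataproc-tf --project "$PROJECT"
gcloud projects add-iam-policy-binding "$PROJECT" \
    --member "serviceAccount:dataproc-tf@${PROJECT}.iam.gserviceaccount.com" \
    --role "roles/editor"
# Write the key where the Terraform configuration expects it.
gcloud iam service-accounts keys create credentials/google.json \
    --iam-account "dataproc-tf@${PROJECT}.iam.gserviceaccount.com"
```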

Creation and destroying

terraform init

terraform plan -out "run.plan"

terraform apply "run.plan"

terraform destroy

Please note: firewall.tf opens all ports; please restrict or switch them off before creation if you want.

Cluster details

| Name | Role | Staging Bucket |
| --- | --- | --- |
| poccluster-m | Default 3 masters with auto HA via Zookeeper | dataproc-poc-staging-bucket |
| poccluster-w-* | Workers (number of workers is prompted) | dataproc-poc-staging-bucket |
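Once the cluster is up, the masters, workers and staging bucket can be cross-checked with gcloud. A small sketch; substitute the cluster name (cluster_dp_name) and region you actually use, as tstdppocclus and europe-west2 here only mirror the example values used later in this README.

```bash
# Both commands assume the cluster name and region passed to Terraform.
gcloud dataproc clusters list --region europe-west2
gcloud dataproc clusters describe tstdppocclus --region europe-west2
```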

Cloud Dataproc version

| Version | Includes | Base OS | Released On | Last Updated (sub-minor version) | Notes |
| --- | --- | --- | --- | --- | --- |
| 1.4-deb9 | Apache Spark 2.4.4<br>Apache Hadoop 2.9.2<br>Apache Pig 0.17.0<br>Apache Hive 2.3.6<br>Apache Tez 0.9.2*<br>Cloud Storage connector 1.9.17-hadoop2 | Debian 9 | 2020/02/03 | 2020/02/03 (1.4.21-deb9) | All releases on and after November 2, 2018 will be based on Debian 9. |
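As an optional sanity check, the component versions listed above can be confirmed from a master node after SSHing in (see the Testing Kafka section for the ssh command); the commands below are the stock CLIs shipped on the image.

```bash
# Run on a master node; each prints the installed component version.
hadoop version
spark-submit --version
hive --version
pig --version
```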

URLs and extra components via dataproc-initialization-actions

  • YARN ResourceManager: http://<<Master_External_IP>>:8088/cluster

  • HDFS NameNode: http://<<Master_External_IP>>:9870

  • Hadoop Job History Server: http://<<Master_External_IP>>:19888/jobhistory

  • Node Managers: http://<<Individual_Node_External_IP>>:8042

  • Ganglia: http://<<Master_External_IP>>:80/ganglia

  • Livy: http://<<Master_External_IP>>:8998

  • The latest Docker is installed on all nodes.
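The <<Master_External_IP>> placeholder in the URLs above can be resolved with gcloud; a small sketch, assuming the poccluster-m-0 master name from the cluster table (adjust to your own cluster name):

```bash
# Prints the external (NAT) IP of the first master instance.
gcloud compute instances list \
    --filter="name=poccluster-m-0" \
    --format="get(networkInterfaces[0].accessConfigs[0].natIP)"
```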

Terraform Graph

Please generate Graphviz (DOT format) graphs of the Terraform configuration for a visual representation of the repo:

terraform graph | dot -Tsvg > graph.svg
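The dot binary comes from Graphviz; if it is missing, the standard Debian/Ubuntu package (an assumption about your workstation) provides it:

```bash
# Installs Graphviz, which supplies the dot renderer used above.
sudo apt-get install -y graphviz
```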

Also, one can use Blast Radius on a live, initialized Terraform project to view the graph. Please run it in Dockerized form:

docker ps -a|grep blast-radius|awk '{print $1}'|xargs docker kill && rm -rf dataproc-terraform && git clone https://github.com/dwaiba/dataproc-terraform && cd dataproc-terraform/ && terraform init && docker run --cap-add=SYS_ADMIN -dit --rm -p 5005:5000 -v $(pwd):/workdir:ro 28mm/blast-radius && cd ../../
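With the -p 5005:5000 mapping above, the Blast Radius UI should then be reachable at http://localhost:5005 on the Docker host.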

A live example for this project is available here.

Automatic Provisioning

https://github.com/dwaiba/dataproc-terraform/

Pre-req:

  1. gcloud should be installed. A silent install is: export USER_NAME="<<your_user_name>>" && export SHARE_DATA=/data && su -c "export SHARE_DATA=/data && export CLOUDSDK_INSTALL_DIR=$SHARE_DATA && export CLOUDSDK_CORE_DISABLE_PROMPTS=1 && curl https://sdk.cloud.google.com | bash" $USER_NAME && echo "source $SHARE_DATA/google-cloud-sdk/path.bash.inc" >> /etc/profile.d/gcloud.sh && echo "source $SHARE_DATA/google-cloud-sdk/completion.bash.inc" >> /etc/profile.d/gcloud.sh

  2. Please create a Service Credential of type JSON via https://console.cloud.google.com/apis/credentials, download it, and save it as google.json in the credentials folder of this repository.

Plan:

terraform init && terraform plan -var bucket_name_dp=testbuckdp -var cluster_dp_name=tstdppocclus -var cluster_location=europe-west2 -var project=<<your-google-cloud-project-name>> -var worker_num_instances=<<number of workers for the default auto HA with Zookeeper 3 masters>> -out "run.plan"
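To avoid retyping the -var flags (and the interactive prompt for worker_num_instances), the same values can be kept in a terraform.tfvars file, which Terraform loads automatically. A minimal sketch, reusing the example values from the command above; worker_num_instances = 2 is only an illustrative count:

```bash
# Persist the plan variables so they need not be passed on every run.
cat > terraform.tfvars <<'EOF'
bucket_name_dp       = "testbuckdp"
cluster_dp_name      = "tstdppocclus"
cluster_location     = "europe-west2"
project              = "<<your-google-cloud-project-name>>"
worker_num_instances = 2
EOF
terraform init && terraform plan -out "run.plan"
```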

Apply:

terraform apply "run.plan"

Destroy:

terraform destroy -var bucket_name_dp=testbuckdp -var cluster_dp_name=tstdppocclus -var cluster_location=europe-west2 -var project=<<your-google-cloud-project-name>> -var worker_num_instances=<<number of workers for the default auto HA with Zookeeper 3 masters>>

Testing Kafka

Once the cluster has been created Kafka should be running on all worker nodes in the cluster, and Kafka libraries should be installed on the master node(s). You can test your Kafka setup by creating a simple topic and publishing to it with Kafka's command-line tools, after SSHing into one of your master nodes:

```bash
gcloud compute ssh <CLUSTER_NAME>-m-0

# Create a test topic, just talking to the local master's zookeeper server.
/usr/lib/kafka/bin/kafka-topics.sh --zookeeper localhost:2181 --create \
    --replication-factor 1 --partitions 1 --topic test
/usr/lib/kafka/bin/kafka-topics.sh --zookeeper localhost:2181 --list

# Use worker 0 as broker to publish 100 messages over 100 seconds
# asynchronously.
CLUSTER_NAME=$(/usr/share/google/get_metadata_value attributes/dataproc-cluster-name)
for i in {0..100}; do echo "message${i}"; sleep 1; done |
    /usr/lib/kafka/bin/kafka-console-producer.sh \
        --broker-list ${CLUSTER_NAME}-w-0:9092 --topic test &

# Use worker 1 as broker to consume those 100 messages as they come.
# This can also be run in any other master or worker node of the cluster.
/usr/lib/kafka/bin/kafka-console-consumer.sh \
    --bootstrap-server ${CLUSTER_NAME}-w-1:9092 \
    --topic test --from-beginning
```
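When you are done, the test topic can be removed again. A small follow-up sketch; it assumes topic deletion is enabled on the brokers installed by the initialization action:

```bash
# Delete the test topic created above.
/usr/lib/kafka/bin/kafka-topics.sh --zookeeper localhost:2181 \
    --delete --topic test
```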

You can find more information about using initialization actions with Dataproc in the Dataproc documentation.

Reporting bugs

Please report bugs by opening an issue in the GitHub Issue Tracker. An auto template is defined for bug reports; please view it here.

Patches and pull requests

Patches can be submitted as GitHub pull requests. If using GitHub, please make sure your branch applies to the current master as a 'fast forward' merge (i.e. without creating a merge commit). Use the git rebase command to update your branch to the current master if necessary.
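For example, a (hypothetical) feature branch can be brought up to date before opening the pull request:

```bash
# Rebase the current branch onto the latest master from this repository.
git fetch origin
git rebase origin/master
```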

License

Code of Conduct
