
dwaiba/dataproc-terraform


Table of Contents (Dataproc cluster on Debian 9 with Zookeeper, Kafka, BigQuery and other tools, with Terraform)

  1. Pre-reqs
  2. Creation and destroying
  3. Cluster details
  4. Cloud Dataproc version
  5. URLs and extra components via dataproc-initialization-actions
  6. Terraform graph
  7. Automatic provisioning
  8. Testing Kafka
  9. Reporting bugs
  10. Patches and pull requests
  11. License
  12. Code of conduct

Pre-reqs

  1. Download and Install Terraform

  2. Download and install google cloud sdk

    • One may install the gcloud SDK silently for all users as root, with access to GCLOUD_HOME restricted to a specific user:

      export USER_NAME="<<your_user_name>>"

      export SHARE_DATA=/data

      su -c "export SHARE_DATA=/data && export CLOUDSDK_INSTALL_DIR=$SHARE_DATA && export CLOUDSDK_CORE_DISABLE_PROMPTS=1 && curl https://sdk.cloud.google.com | bash" $USER_NAME

      echo "source $SHARE_DATA/google-cloud-sdk/path.bash.inc" >> /etc/profile.d/gcloud.sh

      echo "source $SHARE_DATA/google-cloud-sdk/completion.bash.inc" >> /etc/profile.d/gcloud.sh

  3. Clone this repository and cd into the dataproc-terraform folder: git clone https://github.com/dwaiba/dataproc-terraform && cd dataproc-terraform

  4. Please create a Service Credential of type JSON via https://console.cloud.google.com/apis/credentials, download it, and save it as google.json in the credentials folder (a CLI alternative is sketched below).
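If you prefer the CLI to the console, the same key can be created with gcloud. This is a minimal sketch: the service-account name dataproc-tf and the broad roles/editor grant are illustrative assumptions, not requirements of this repo.

```bash
# Illustrative only: the service-account name and role below are assumptions.
PROJECT="<<your-google-cloud-project-name>>"
gcloud iam service-accounts create dataproc-tf --project "$PROJECT"
gcloud projects add-iam-policy-binding "$PROJECT" \
    --member "serviceAccount:dataproc-tf@${PROJECT}.iam.gserviceaccount.com" \
    --role "roles/editor"
# Write the key where the Terraform configuration expects it.
gcloud iam service-accounts keys create credentials/google.json \
    --iam-account "dataproc-tf@${PROJECT}.iam.gserviceaccount.com"
```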

Creation and destroying

terraform init

terraform plan -out "run.plan"

terraform apply "run.plan"

terraform destroy

Please note: firewall.tf opens all ports; please restrict or switch them off before creation if you want.

Cluster details

| Name | Role | Staging Bucket |
| --- | --- | --- |
| poccluster-m | Default 3 masters with auto HA via Zookeeper | dataproc-poc-staging-bucket |
| poccluster-w-* | Workers (number of workers is prompted) | dataproc-poc-staging-bucket |
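Once the cluster is up, the masters, workers and staging bucket can be cross-checked with gcloud. A small sketch; substitute the cluster name (cluster_dp_name) and region you actually use, as tstdppocclus and europe-west2 here only mirror the example values used later in this README.

```bash
# Both commands assume the cluster name and region passed to Terraform.
gcloud dataproc clusters list --region europe-west2
gcloud dataproc clusters describe tstdppocclus --region europe-west2
```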

Cloud Dataproc version

| Version | Includes | Base OS | Released On | Last Updated (sub-minor version) | Notes |
| --- | --- | --- | --- | --- | --- |
| 1.4-deb9 | Apache Spark 2.4.4<br>Apache Hadoop 2.9.2<br>Apache Pig 0.17.0<br>Apache Hive 2.3.6<br>Apache Tez 0.9.2*<br>Cloud Storage connector 1.9.17-hadoop2 | Debian 9 | 2020/02/03 | 2020/02/03 (1.4.21-deb9) | All releases on and after November 2, 2018 will be based on Debian 9. |
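As an optional sanity check, the component versions listed above can be confirmed from a master node after SSHing in (see the Testing Kafka section for the ssh command); the commands below are the stock CLIs shipped on the image.

```bash
# Run on a master node; each prints the installed component version.
hadoop version
spark-submit --version
hive --version
pig --version
```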

URLs and extra components via dataproc-initialization-actions

  • YARN ResourceManager: http://<<Master_External_IP>>:8088/cluster

  • HDFS NameNode: http://<<Master_External_IP>>:9870

  • Hadoop Job History Server: http://<<Master_External_IP>>:19888/jobhistory

  • Node Managers: http://<<Individual_Node_External_IP>>:8042

  • Ganglia: http://<<Master_External_IP>>:80/ganglia

  • Livy: http://<<Master_External_IP>>:8998

  • The latest Docker is installed on all nodes.
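The <<Master_External_IP>> placeholder in the URLs above can be resolved with gcloud; a small sketch, assuming the poccluster-m-0 master name from the cluster table (adjust to your own cluster name):

```bash
# Prints the external (NAT) IP of the first master instance.
gcloud compute instances list \
    --filter="name=poccluster-m-0" \
    --format="get(networkInterfaces[0].accessConfigs[0].natIP)"
```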

Terraform Graph

Please generate Graphviz (DOT format) graphs of the Terraform configuration for a visual representation of the repo:

terraform graph | dot -Tsvg > graph.svg
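The dot binary comes from Graphviz; if it is missing, the standard Debian/Ubuntu package (an assumption about your workstation) provides it:

```bash
# Installs Graphviz, which supplies the dot renderer used above.
sudo apt-get install -y graphviz
```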

Also, one can use Blast Radius on a live, initialized Terraform project to view the graph. Please run it in Dockerized form:

docker ps -a|grep blast-radius|awk '{print $1}'|xargs docker kill && rm -rf dataproc-terraform && git clone https://github.com/dwaiba/dataproc-terraform && cd dataproc-terraform/ && terraform init && docker run --cap-add=SYS_ADMIN -dit --rm -p 5005:5000 -v $(pwd):/workdir:ro 28mm/blast-radius && cd ../../
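With the -p 5005:5000 mapping above, the Blast Radius UI should then be reachable at http://localhost:5005 on the Docker host.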

A live example for this project is available here.

Automatic Provisioning

https://github.com/dwaiba/dataproc-terraform/

Pre-req:

  1. gcloud should be installed. A silent install is: export USER_NAME="<<your_user_name>>" && export SHARE_DATA=/data && su -c "export SHARE_DATA=/data && export CLOUDSDK_INSTALL_DIR=$SHARE_DATA && export CLOUDSDK_CORE_DISABLE_PROMPTS=1 && curl https://sdk.cloud.google.com | bash" $USER_NAME && echo "source $SHARE_DATA/google-cloud-sdk/path.bash.inc" >> /etc/profile.d/gcloud.sh && echo "source $SHARE_DATA/google-cloud-sdk/completion.bash.inc" >> /etc/profile.d/gcloud.sh

  2. Please create a Service Credential of type JSON via https://console.cloud.google.com/apis/credentials, download it, and save it as google.json in the credentials folder of this repository.

Plan:

terraform init && terraform plan -var bucket_name_dp=testbuckdp -var cluster_dp_name=tstdppocclus -var cluster_location=europe-west2 -var project=<<your-google-cloud-project-name>> -var worker_num_instances=<<number of workers for the default auto HA with Zookeeper 3 masters>> -out "run.plan"
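To avoid retyping the -var flags (and the interactive prompt for worker_num_instances), the same values can be kept in a terraform.tfvars file, which Terraform loads automatically. A minimal sketch, reusing the example values from the command above; worker_num_instances = 2 is only an illustrative count:

```bash
# Persist the plan variables so they need not be passed on every run.
cat > terraform.tfvars <<'EOF'
bucket_name_dp       = "testbuckdp"
cluster_dp_name      = "tstdppocclus"
cluster_location     = "europe-west2"
project              = "<<your-google-cloud-project-name>>"
worker_num_instances = 2
EOF
terraform init && terraform plan -out "run.plan"
```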

Apply:

terraform apply "run.plan"

Destroy:

terraform destroy -var bucket_name_dp=testbuckdp -var cluster_dp_name=tstdppocclus -var cluster_location=europe-west2 -var project=<<your-google-cloud-project-name>> -var worker_num_instances=<<number of workers for the default auto HA with Zookeeper 3 masters>>

Testing Kafka

Once the cluster has been created Kafka should be running on all worker nodes in the cluster, and Kafka libraries should be installed on the master node(s). You can test your Kafka setup by creating a simple topic and publishing to it with Kafka's command-line tools, after SSHing into one of your master nodes:

```bash
gcloud compute ssh <CLUSTER_NAME>-m-0

# Create a test topic, just talking to the local master's zookeeper server.
/usr/lib/kafka/bin/kafka-topics.sh --zookeeper localhost:2181 --create \
    --replication-factor 1 --partitions 1 --topic test
/usr/lib/kafka/bin/kafka-topics.sh --zookeeper localhost:2181 --list

# Use worker 0 as broker to publish 100 messages over 100 seconds
# asynchronously.
CLUSTER_NAME=$(/usr/share/google/get_metadata_value attributes/dataproc-cluster-name)
for i in {0..100}; do echo "message${i}"; sleep 1; done |
    /usr/lib/kafka/bin/kafka-console-producer.sh \
        --broker-list ${CLUSTER_NAME}-w-0:9092 --topic test &

# Use worker 1 as broker to consume those 100 messages as they come.
# This can also be run in any other master or worker node of the cluster.
/usr/lib/kafka/bin/kafka-console-consumer.sh \
    --bootstrap-server ${CLUSTER_NAME}-w-1:9092 \
    --topic test --from-beginning
```
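When you are done, the test topic can be removed again. A small follow-up sketch; it assumes topic deletion is enabled on the brokers installed by the initialization action:

```bash
# Delete the test topic created above.
/usr/lib/kafka/bin/kafka-topics.sh --zookeeper localhost:2181 \
    --delete --topic test
```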

You can find more information about using initialization actions with Dataproc in the Dataproc documentation.

Reporting bugs

Please report bugs by opening an issue in the GitHub Issue Tracker. An auto template is defined for bug reports; please view it here.

Patches and pull requests

Patches can be submitted as GitHub pull requests. If using GitHub, please make sure your branch applies to the current master as a 'fast forward' merge (i.e. without creating a merge commit). Use the git rebase command to update your branch to the current master if necessary.
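For example, a (hypothetical) feature branch can be brought up to date before opening the pull request:

```bash
# Rebase the current branch onto the latest master from this repository.
git fetch origin
git rebase origin/master
```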

License

Code of Conduct
