Useful commands/scripts for use on Google Cloud Platform.
- Clean your Cloud resources
- Create a Virtual Machine from CloudShell
- Check your CloudSQL instance from CloudShell
- Authorizing CloudShell to access a CloudSQL instance
- Dataproc
- Create a firewall rule to access Dataproc tools (HDFS, Datalab, Hadoop) from your browser
- Creating Dataproc clusters with the CLI/CloudShell
- Create a Cloud Storage bucket for your Dataproc cluster
- Copying source code and input files to your bucket
- Run jobs on your cluster
- Authorizing Dataproc cluster to access a CloudSQL instance
- Programmatically install customized software on your Dataproc cluster
- Delete your Dataproc cluster at once
- Big Data and ML Fundamentals using Datalab
- More links
Whatever you build on GCP, it can cost a lot if you forget to delete compute resources.
Here is how to check and then delete an entire project from the shell:
gcloud config list
gcloud config set project <your_project_name>
gcloud projects delete <your_project_name>
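If you would rather keep the project and only remove the expensive pieces, you can list what is still running and delete resources one by one. A minimal sketch (instance, cluster, and bucket names are placeholders; newer gcloud releases may also ask for a --region flag on the dataproc commands):
# Compute Engine VMs
gcloud compute instances list
gcloud compute instances delete <instance_name> --zone=<zone>
# Dataproc clusters
gcloud dataproc clusters list
gcloud dataproc clusters delete <cluster_name>
# Cloud Storage buckets
gsutil ls
gsutil rm -r gs://<bucket_name>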
To create a Datalab VM from Cloud Shell (the script below expects a datalab.yaml container manifest in the current directory):
#!/bin/bash
# Remove "_" from USER because "_" cannot be used in an instance name
export INSTANCE_NAME=datalabvm-${USER//_/}
export ZONE=us-central1-a
export MACHINE_TYPE=n1-standard-1
echo "Creating instance name=$INSTANCE_NAME in zone=$ZONE of type=$MACHINE_TYPE"
gcloud compute instances create $INSTANCE_NAME \
--image-family=container-vm --image-project=google-containers \
--zone $ZONE --machine-type $MACHINE_TYPE \
--metadata "google-container-manifest=$(cat datalab.yaml)" \
--scopes cloud-platform
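Once the VM is running, you can connect to it or delete it using the same variables (a minimal sketch):
gcloud compute ssh $INSTANCE_NAME --zone $ZONE
gcloud compute instances delete $INSTANCE_NAME --zone $ZONE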
$ mysql --host=35.234.239.132 --user=root --password
MySQL [(none)]> show databases;
MySQL [(none)]> use recommendation_spark;
MySQL [recommendation_spark]> show tables;
MySQL [recommendation_spark]> DESCRIBE Accommodation;
MySQL [recommendation_spark]> SELECT id, count(*) FROM Accommodation GROUP BY id;
MySQL [recommendation_spark]> SELECT DISTINCT userId, count(*) FROM Rating GROUP BY userId;
MySQL [recommendation_spark]> SELECT count(*) FROM Rating;
MySQL [recommendation_spark]> SELECT * FROM Recommendation LIMIT 5;
MySQL [recommendation_spark]> SELECT count(*) FROM Recommendation;
MySQL [recommendation_spark]> select r.userid, r.accoid, r.prediction, a.title, a.location, a.price, a.rooms, a.rating, a.type from Recommendation as r, Accommodation as a where r.accoid = a.id and r.userid = 10;
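Rather than hard-coding the instance's IP address (35.234.239.132 above), you can look it up from Cloud Shell first. A sketch assuming your CloudSQL instance is named rentals (the exact format field may vary with your gcloud version):
gcloud sql instances describe rentals --format='value(ipAddresses[0].ipAddress)'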
Assuming your CloudSQL instance is named rentals:
#!/bin/bash
gcloud sql instances patch rentals \
--authorized-networks `wget -qO - http://ipecho.net/plain`/32
This can be done before you create the cluster:
- Find your IP with http://ip4.me/ and use it, followed by /32, in the --source-ranges option.
- Allow TCP ports 8088 (Hadoop), 9870 (HDFS), and 8080 (Datalab).
gcloud compute --project=coursera-gcp-course02 firewall-rules create <name-your-default-allow-dataproc-access> \
--description="description here" \
--direction=INGRESS --priority=1000 --network=default --action=ALLOW \
--rules=tcp:8088,tcp:9870,tcp:8080 \
--source-ranges=95.130.218.4/32
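You can check the rule afterwards (a quick verification, using whatever name you gave the rule above):
gcloud compute firewall-rules list
gcloud compute firewall-rules describe <name-your-default-allow-dataproc-access>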
This command creates a Dataproc cluster named my-second-cluster in the us-central1-a zone, with a master node (1 vCPU, 50 GB disk) and 2 worker nodes with the same resources:
gcloud dataproc clusters create my-second-cluster --zone us-central1-a \
--master-machine-type n1-standard-1 --master-boot-disk-size 50 \
--num-workers 2 --worker-machine-type n1-standard-1 \
--worker-boot-disk-size 50
or
gcloud dataproc clusters create my-cluster --zone us-central1-a \
--master-machine-type n1-standard-1 --master-boot-disk-size 50 \
--num-workers 2 --worker-machine-type n1-standard-1 \
--worker-boot-disk-size 50 --network=default
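To confirm the cluster is up and to reach its master node, a couple of optional follow-up commands (a sketch; the master node is named <cluster>-m, and newer gcloud releases may also require a --region flag):
gcloud dataproc clusters list
gcloud dataproc clusters describe my-cluster
gcloud compute ssh my-cluster-m --zone us-central1-a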
In Google Cloud Shell, enter the following command to create a Cloud Storage bucket with the same name as your project ID, in the same region as your cluster. Both Cloud Storage bucket names and project IDs have to be globally unique, so unless you are very unlucky your project ID will not already be in use as a bucket name.
gsutil mb -c regional -l us-central1 gs://$DEVSHELL_PROJECT_ID
In Google Cloud Shell, enter the commands below to copy some pre-created files into your bucket (make sure to plug in your bucket name).
git clone https://github.com/GoogleCloudPlatform/training-data-analyst
cd training-data-analyst/courses/unstructured
./replace_and_upload.sh <YOUR-BUCKET-NAME>
# or
# ./replace_and_upload.sh $DEVSHELL_PROJECT_ID
To run the job using the CLI, go back to the Google Cloud Shell and paste in the following command. Don't forget to replace <YOUR-BUCKET-NAME> with the name of your bucket.
gcloud dataproc jobs submit pyspark \
--cluster my-cluster gs://<YOUR-BUCKET-NAME>/unstructured/lab2.py
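You can optionally follow the job from the same shell; a sketch, using the job ID printed when you submit:
gcloud dataproc jobs list --cluster my-cluster
gcloud dataproc jobs describe <job-id>
gcloud dataproc jobs wait <job-id>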
Type pyspark at the command prompt to open the PySpark shell.
patrick_merlot@my-cluster-m:~$ pyspark
Python 2.7.9 (default, Jun 29 2016, 13:08:31)
[GCC 4.9.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
ivysettings.xml file not found in HIVE_HOME or HIVE_CONF_DIR,/etc/hive/conf.dist/ivysettings.xml will be used
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/__ / .__/\_,_/_/ /_/\_\ version 2.2.1
/_/
Using Python version 2.7.9 (default, Jun 29 2016 13:08:31)
SparkSession available as 'spark'.
>>>
>>>
>>> data = [0, 1, 2, 3, 4, 5] # range(6)
>>> distData = sc.parallelize(data)
>>> squares = distData.map(lambda x : x*x)
>>> res = squares.reduce(lambda a, b : a + b)
>>> print res
55
>>>
>>> # Compute the square root of the sum of the first 1000 terms of the series 8.0/((2k+1)(2k+1)), starting at k=0
>>> import numpy as np
>>> data = range(1000)
>>> distData = sc.parallelize(data)
>>> terms = distData.map(lambda k : 8.0/((2*k+1)*(2*k+1)))
>>> res = np.sqrt(terms.sum())
>>> print res
3.1412743276
>>>
>>> quit()
patrick_merlot@my-cluster-m:~$
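For reference, the second computation approximates π: the infinite series Σ 8/((2k+1)(2k+1)) over k = 0, 1, 2, ... sums to π², so the square root of the first 1000 terms comes out close to π, which matches the 3.1412743276 printed above.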
Step 1
In the Cloud Console, navigate to Storage and click on your bucket. It should have some files in the unstructured folder. Click on the file lab2-input.txt and view its contents. This file contains a comma-separated list of keys and values.
Also view the contents of the file lab2.py. This is a PySpark job that organizes the input file by key and totals the number of each type of pet. Notice that both the code and the data are on Cloud Storage; neither has been copied to the cluster.
Step 2
Navigate to the Dataproc service in the Web Console.
Step 3
In the left-hand navigation pane select Jobs. Then click the Submit job button.
Step 4
At this point you should have one cluster called my-cluster. Make sure it is selected in the Cluster dropdown.
In the Job type dropdown, select PySpark.
In the Main python file text box enter the path to the PySpark file lab2.py that is in your bucket. It should be in the form shown below, but replace <YOUR-BUCKET-NAME> with the name of your bucket:
gs://<YOUR-BUCKET-NAME>/unstructured/lab2.py
Step 5
No other options are required, so click the Submit button at the bottom of the form.
Step 6
Wait for the job to succeed and then click on the Job ID to see its details. Take a look at the job output to see the results.
Step 7
To run the job again, click the Clone button at the top, then submit the job a second time.
SSH into your cluster's master node, create a working folder, and enter the following command to copy a data file and a Pig script into it. Make sure to plug in your actual bucket name.
gsutil -m cp gs://<YOUR-BUCKET-NAME>/unstructured/pet-details.* .
patrick_merlot@my-cluster-m:~$ mkdir lab2
patrick_merlot@my-cluster-m:~$ cd lab2
patrick_merlot@my-cluster-m:~/lab2$ gsutil -m cp gs://coursera-gcp-course02/unstructured/pet-details.* .
Copying gs://coursera-gcp-course02/unstructured/pet-details.pig...
Copying gs://coursera-gcp-course02/unstructured/pet-details.txt...
/ [2/2 files][ 687.0 B/ 687.0 B] 100% Done
Operation completed over 2 objects/687.0 B.
patrick_merlot@my-cluster-m:~/lab2$ cat pet-details.txt
Dog,Noir,Schnoodle,Black,21
Dog,Bree,MaltePoo,White,10
Dog,Pickles,Golden Retriever,Yellow,30
Dog,Sparky,Mutt,Gray,13
Cat,Tom,Alley,Yellow,11
Cat,Cleo,Siamese,Gray,22
Frog,Kermit,Bull,Green,1
Pig,Bacon,Pot Belly,Pink,30
Pig,Babe,Domestic,White,150
Dog,Rusty,Poodle,White,20
Cat,Joe,Mix,Black,15
patrick_merlot@my-cluster-m:~/lab2$
patrick_merlot@my-cluster-m:~/lab2$ cat pet-details.pig
rmf /GroupedByType
x1 = load '/pet-details' using PigStorage(',') as (Type:chararray,Name:chararray,Breed:chararray,Color:chararray,Weight:int);
x2 = filter x1 by Type != 'Type';
x3 = foreach x2 generate Type, Name, Breed, Color, Weight, Weight / 2.24 as Kilos:float;
x4 = filter x3 by LOWER(Color) == 'black' or LOWER(Color) == 'white';
x5 = group x4 by Type;
store x5 into '/GroupedByType';
patrick_merlot@my-cluster-m:~/lab2$
Let's copy the text file into HDFS:
patrick_merlot@my-cluster-m:~/lab2$ hadoop fs -mkdir /pet-details
18/05/14 13:01:04 INFO gcs.GoogleHadoopFileSystemBase: GHFS version: 1.6.5-hadoop2
patrick_merlot@my-cluster-m:~/lab2$ hadoop fs -put pet-details.txt /pet-details
18/05/14 13:01:09 INFO gcs.GoogleHadoopFileSystemBase: GHFS version: 1.6.5-hadoop2
patrick_merlot@my-cluster-m:~/lab2$
Go back to the web console and open the details page of your master node. Find the master node's external IP address and copy it to the clipboard. Then open a new tab in your browser, paste in the IP address, and add :9870. This will open the Hadoop management site. From the Utilities menu on the right, select Browse the file system.
Verify that you have a folder called pet-details and inside it you should have a file called pet-details.txt.
Run the following command to run Pig:
pig < pet-details.pig
The job will take about a minute to run. Wait until it completes.
You will get something like:
...
HadoopVersion PigVersion UserId StartedAt FinishedAt Features
2.8.3 0.16.0 patrick_merlot 2018-05-14 13:10:17 2018-05-14 13:10:51 GROUP_BY,FILTER
Success!
Job Stats (time in seconds):
JobId Maps Reduces MaxMapTime MinMapTime AvgMapTime MedianMapTime MaxReduceTime MinReduceTime AvgReduceTime MedianReducetime Alias Feature Outputs
job_1526300672789_0004 1 1 5 5 5 5 5 5 5 5 x1,x2,x3,x5 GROUP_BY /GroupedByType,
Input(s):
Successfully read 11 records (668 bytes) from: "/pet-details"
Output(s):
Successfully stored 3 records (249 bytes) in: "/GroupedByType"
Counters:
Total records written : 3
Total bytes written : 249
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
Job DAG:
job_1526300672789_0004
2018-05-14 13:10:51,911 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at my-cluster-m/10.128.0.3:8032
2018-05-14 13:10:51,920 [main] INFO org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2018-05-14 13:10:51,999 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at my-cluster-m/10.128.0.3:8032
2018-05-14 13:10:52,010 [main] INFO org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2018-05-14 13:10:52,101 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at my-cluster-m/10.128.0.3:8032
2018-05-14 13:10:52,113 [main] INFO org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2018-05-14 13:10:52,206 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
grunt> 2018-05-14 13:10:52,254 [main] INFO org.apache.pig.Main - Pig script completed in 40 seconds and 957 milliseconds (40957 ms)
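As an aside, the same Pig script could have been submitted as a Dataproc job from Cloud Shell instead of being run inside the SSH session. A sketch, assuming the script is still in your bucket (the script's HDFS paths are fine because the job still runs on the cluster):
gcloud dataproc jobs submit pig --cluster my-cluster \
    --file gs://<YOUR-BUCKET-NAME>/unstructured/pet-details.pig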
Go back to the tab with the Hadoop management site and again browse the file system. The output from this Pig job should be in a folder called GroupedByType. If you look in that folder you should see a file named part-r-00000.
Let's look at the output file.
First you have to get the file off the HDFS file system. Go back to your SSH session where you are connected to the master node. You should currently be in the folder lab2.
Make a directory below it and move into it, then get the output file from HDFS and copy it into this folder with the following commands.
patrick_merlot@my-cluster-m:~/lab2$ mkdir ~/lab2/output
patrick_merlot@my-cluster-m:~/lab2$ cd ~/lab2/output
patrick_merlot@my-cluster-m:~/lab2/output$ hadoop fs -get /GroupedByType/part* .
18/05/14 13:15:35 INFO gcs.GoogleHadoopFileSystemBase: GHFS version: 1.6.5-hadoop2
patrick_merlot@my-cluster-m:~/lab2/output$
Finally, enter the following command to view the results.
patrick_merlot@my-cluster-m:~/lab2/output$ cat *
Cat {(Cat,Joe,Mix,Black,15,6.696428571428571)}
Dog {(Dog,Rusty,Poodle,White,20,8.928571428571427),(Dog,Bree,MaltePoo,White,10,4.4642857142857135),(Dog,Noir,Schnoodle,Black,21,9.374999999999998)}
Pig {(Pig,Babe,Domestic,White,150,66.96428571428571)}
patrick_merlot@my-cluster-m:~/lab2/output$
Assuming that:
- your CloudSQL instance is named rentals
- your Dataproc cluster has:
  - name: cluster-1
  - zone: us-east1-b
  - number of workers: 2
#!/bin/bash
if [ "$#" -ne 3 ]; then
echo "authorize_dataproc.sh clustername zone numworkers"
echo " e.g. ./authorize_dataproc.sh cluster-1 us-east1-b 2"
exit
fi
CLUSTER=$1
CLOUDSQL=rentals
ZONE=$2
NWORKERS=$3
machines="$CLUSTER-m"
for w in `seq 0 $(($NWORKERS - 1))`; do
machines="$machines $CLUSTER-w-$w"
done
echo "Machines to authorize: $machines in $ZONE ... finding their IP addresses"
ips=""
for machine in $machines; do
IP_ADDRESS=$(gcloud compute instances describe $machine --zone=$ZONE --format='value(networkInterfaces.accessConfigs[].natIP)' | sed "s/\[u'//g" | sed "s/'\]//g" )/32
echo "IP address of $machine is $IP_ADDRESS"
if [ -z $ips ]; then
ips=$IP_ADDRESS
else
ips="$ips,$IP_ADDRESS"
fi
done
echo "Authorizing [$ips] to access cloudsql=$CLOUDSQL"
- Library of Initialization Actions (="installation scripts"): https://github.com/GoogleCloudPlatform/dataproc-initialization-actions
Note: Initialization actions are just executables that run while a cluster is being created. They are used to install additional software or customize your cluster as required by your programs. You can include one or more initialization actions when creating Dataproc clusters.
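As a hypothetical illustration (this is not the contents of the lab's init-script.sh), an initialization action is just a shell script; it can read the node's role from instance metadata to do extra work only on the master:
#!/bin/bash
# Sketch of a custom initialization action (package names are placeholders)
apt-get update
pip install some-python-package                  # runs on every node
ROLE=$(/usr/share/google/get_metadata_value attributes/dataproc-role)
if [[ "${ROLE}" == 'Master' ]]; then
  pip install some-master-only-package           # runs on the master node only
fi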
Step 1
In your Cloud Console, open a Google Cloud Shell and enter the following command to create a Cloud Storage bucket with the same name as your project ID:
gsutil mb -c regional -l us-central1 gs://$DEVSHELL_PROJECT_ID
Step 2
In your Cloud Shell, git clone the course repository, and upload the custom initialization script to GCS. Change the bucket name as necessary.
git clone https://github.com/GoogleCloudPlatform/training-data-analyst
cd training-data-analyst/courses/unstructured/
Use this script to replace the placeholder bucket name in the lab's scripts with yours:
bash replace_and_upload.sh <YOUR-BUCKET-NAME>
Step 3
View the custom initialization script. Change the bucket name as necessary.
gsutil cat gs://<YOUR-BUCKET-NAME>/unstructured/init-script.sh
What does this initialization action do on all nodes? What does it do only on the master node?
Create a cluster that will include two initialization actions:
- a pre-built action from Google to install Datalab, and
- a custom initialization action to install a Python package.
https://codelabs.developers.google.com/codelabs/cpb102-dataproc-with-gcp/#3
Follow the steps. To create the cluster, either click the Create button, or click the Command line link, copy the command to your clipboard, and run it from Google Cloud Shell:
gcloud dataproc clusters create mycluster --subnet default --zone us-central1-a \
--master-machine-type n1-standard-1 --master-boot-disk-size 500 --num-workers 2 \
--worker-machine-type n1-standard-1 --worker-boot-disk-size 500 \
--scopes 'https://www.googleapis.com/auth/cloud-platform' \
--project coursera-gcp-course02 \
--initialization-actions 'gs://dataproc-initialization-actions/datalab/datalab.sh','gs://coursera-gcp-course02/unstructured/init-script.sh'
Then you are going to allow access to your Dataproc cluster, but only to your machine. To do this, you will need to know your IP Address. Go to the following URL to find out what it is: http://ip4.me/
In the Cloud Console, click the menu on the left and select Networking from the Compute section. Click Firewall rules in the left-hand navigation pane. Click the Create Firewall Rule button, then enter the following:
- Name the rule default-allow-dataproc-access.
- Select IP ranges from the Source filter dropdown.
- In the Source IP ranges text box, enter your IP address followed by /32. So if your IP address is 1.2.3.4, the text box would read 1.2.3.4/32.
- For Targets, select "All instances in the network".
- For Protocols and ports, select "Specified protocols and ports" and enter the following in the text box: tcp:8088;tcp:9870;tcp:8080
When your cluster is finished initializing, click on its name to go to its details page, then click on the VM Instances tab, and finally click on the master node to view its details.
Scroll down and find the master node's external IP address and copy it to your clipboard.
Open a new browser tab, paste in this IP address and then add :8080 after the address. This opens Datalab; you will be redirected to the Datalab main screen.
You can interact with BigQuery in a notebook.
Now Using PySpark in a Datalab Notebook (https://codelabs.developers.google.com/codelabs/cpb102-dataproc-with-gcp/#6)
The last notebook didn't run anything in parallel on your Dataproc cluster. This time, let's get a notebook from the GitHub repository and execute it. This notebook uses PySpark and makes use of your Spark cluster.
Use ungit to clone the repository https://github.com/GoogleCloudPlatform/training-data-analyst into your Datalab instance (the ungit steps are described later in this document).
Back on the Datalab home page, click the Home icon and navigate to datalab/notebooks/training-data-analyst/courses/unstructured. Click on PySpark-Test-Solution.ipynb to open that notebook.
Delete your Dataproc cluster when you are done:
gcloud dataproc clusters delete <my-cluster-name>
Check that you are already authenticated and that the project is already set to your PROJECT_ID:
$ gcloud auth list
$ gcloud config list project
[core]
project = api-project-56998XXXXXX
Your active configuration is: [cloudshell-XXXXX]
If it is not, you can set it with this command:
$ gcloud config set project <PROJECT_ID>
Pick a zone in a geographically close region:
$ gcloud compute zones list
NAME REGION STATUS NEXT_MAINTENANCE TURNDOWN_DATE
us-east1-b us-east1 UP
us-east1-c us-east1 UP
us-east1-d us-east1 UP
...
$ datalab create bdmlvm --zone europe-west2-b
...
Creating the network datalab-network
Creating the firewall rule datalab-network-allow-ssh
Creating the disk bdmlvm-pd
Creating the repository datalab-notebooks
...
Datalab will take about 5 minutes to start 😴
From CloudShell, click on the Web Preview icon on the top-left corner of the Cloud Shell ribbon. Switch to port 8081.
Note: The connection to your Datalab instance remains open for as long as the datalab command is active. If the cloud shell used for running the datalab command is closed or interrupted, the connection to your Cloud Datalab VM will terminate.
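If the connection does drop, you can reattach to the same Datalab VM from a new Cloud Shell session; a sketch, assuming the instance created above:
datalab connect bdmlvm --zone europe-west2-b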
In Datalab, click on the icon for "Open ungit" in the top-right ribbon.
In the Ungit window, select the text that reads /content/datalab/notebooks and remove "notebooks" so that it reads /content/datalab, then hit Enter.
In the panel that comes up, type the following as the GitHub repository to clone from: https://github.com/GoogleCloudPlatform/training-data-analyst
Then, click on Clone repository.
Depending on what you develop, you will need to enable certain services or APIs.
From the GCP console menu, select APIs and services, then select Library.
For example, type "vision" in the search box to find the Google Cloud Vision API, click on its entry, and click Enable if necessary.
If you do not already have an API key, from the GCP console menu, select APIs and services and select Credentials.
Click the Create credentials button and select API key. Once created, click Close. You will need this API key in the notebook later.
You can find a demo notebook under ML APIs.
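As a quick sanity check that the key works, you could call one of the ML APIs directly from Cloud Shell. A sketch for the Vision API, with placeholder key and image URI:
export API_KEY=<your-api-key>
curl -s -X POST "https://vision.googleapis.com/v1/images:annotate?key=${API_KEY}" \
  -H "Content-Type: application/json" \
  -d '{"requests":[{"image":{"source":{"imageUri":"gs://<YOUR-BUCKET-NAME>/some-image.jpg"}},"features":[{"type":"LABEL_DETECTION"}]}]}'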