Batch ETL with Cloud Flower

Cloud Flower is a project that uses a cloud environment (in this case, Google Cloud Platform) to perform batch Extract, Transform, Load (ETL).

Cloud Flower stands for:

  • Cloud, which is the production environment
  • Flower, which stands for Dataflow + Composer

Data Sources:

  1. CSV files stored in GCS
  2. BigQuery Table

Tech Stack

  1. Cloud Composer
  2. Google Cloud Storage
  3. BigQuery
  4. Dataflow
  5. Cloud Build

Setup

Composer Environment

Composer is a managed Airflow environment on Google Cloud Platform. With Composer you don't need to spend time managing Airflow itself; you can just focus on writing the DAGs.

Here is how to set up the environment:

  1. Enable the Composer API if you haven't already. You can refer to this guide: https://cloud.google.com/endpoints/docs/openapi/enable-api#console
  2. Open https://console.cloud.google.com/
  3. On the Navigation Menu, go to Big Data > Composer and wait until the Composer page is shown
  4. Click the CREATE button to create a new Composer environment
  5. Fill in the required fields. This project uses the following machine config:
    • Name: cloud-flower
    • Location: asia-south1
    • Node count: 3
    • Zone: asia-south1-a
    • Machine Type: n1-standard-1
    • Disk size (GB): 20 GB
    • Service account: choose your Compute Engine service account.
    • Image Version: composer-1.17.0-preview.1-airflow-2.0.1
    • Python Version: 3
    • Cloud SQL machine type: db-n1-standard-2
    • Web server machine type: db-n1-standard-2
  6. Fill in the Environment Variables section with the following (see the sketch after these steps for how DAG code reads it):
    • ENVIRONMENT: production
  7. Then, click the CREATE button
  8. Note that you will need to wait several minutes until the environment is successfully created. (screenshot: composer-env)

After the environment has been created successfully, you will see something like this: (screenshot: composer-env-success)
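Cloud Composer exposes the environment variables defined in step 6 to the Airflow processes as ordinary OS environment variables. Here is a minimal sketch of how DAG code could read ENVIRONMENT; only the variable name and its production value come from this setup, the runner switch is purely illustrative:

    import os

    # Composer injects the environment variables defined in the setup above,
    # so they can be read like any other OS environment variable.
    ENVIRONMENT = os.environ.get("ENVIRONMENT", "development")

    # Illustrative use: only pick the Dataflow runner in production.
    BEAM_RUNNER = "DataflowRunner" if ENVIRONMENT == "production" else "DirectRunner"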

Airflow webserver on Composer

After the environment has been created in the previous steps, it's time to access the Airflow web UI. Here are the steps:

  1. Click the Airflow webserver link in the Composer Environments list. This opens the Airflow web UI, the same page we would get at http://localhost:8080 when running Airflow locally.

  2. Create a JSON file in your project that stores the Airflow variables; let's call it variables.json. At a minimum, the file must contain these keys (the DAGs read them through Airflow's Variable API, as sketched after these steps):

     {
       "PROJECT_ID": "",
       "GCS_TEMP_LOCATION": "",
       "BUCKET_NAME": "",
       "ALL_KEYWORDS_BQ_OUTPUT_TABLE": "",
       "GCS_STG_LOCATION": "",
       "MOST_SEARCHED_KEYWORDS_BQ_OUTPUT_TABLE": "",
       "EVENTS_BQ_TABLE": ""
     }
    • PROJECT_ID: your GCP project ID
    • GCS_TEMP_LOCATION: temp location to store data before it is loaded to BigQuery. The format is gs:// followed by your bucket and object name
    • BUCKET_NAME: name of your GCS bucket
    • ALL_KEYWORDS_BQ_OUTPUT_TABLE: output table in BigQuery that stores keyword search data. The value format is dataset_id.table_id
    • GCS_STG_LOCATION: staging location to store data before it is loaded to BigQuery. The format is gs:// followed by your bucket and object name
    • MOST_SEARCHED_KEYWORDS_BQ_OUTPUT_TABLE: output table in BigQuery that stores the most searched keyword data. The value format is dataset_id.table_id
    • EVENTS_BQ_TABLE: output table in BigQuery that stores event data from the reverse-engineering result. The value format is dataset_id.table_id
  3. Go to Admin > Variables

  4. Choose the file, then click Import Variables. (screenshot: airflow-variables)
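Once the variables have been imported, DAG code can read them with Airflow's Variable API. A minimal sketch, assuming the key names from variables.json above:

    from airflow.models import Variable

    # These keys must match the ones imported from variables.json.
    PROJECT_ID = Variable.get("PROJECT_ID")
    EVENTS_BQ_TABLE = Variable.get("EVENTS_BQ_TABLE")
    GCS_TEMP_LOCATION = Variable.get("GCS_TEMP_LOCATION")

    # Example: Beam's BigQuery sink accepts the "project:dataset.table" format,
    # and EVENTS_BQ_TABLE is stored as "dataset_id.table_id".
    events_table_spec = f"{PROJECT_ID}:{EVENTS_BQ_TABLE}"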

BigQuery

  1. On your GCP console, go to BigQuery. You can find it under Big Data > BigQuery
  2. Then create your dataset. (screenshot: create-dataset)
  3. Fill in the dataset fields. In this project we only need to set:
    • Dataset ID (example: my_first_bigquery)
    • Data location: choose the one nearest to you. (screenshot: fill-dataset)
  4. Click CREATE DATASET
  5. Ensure that your dataset has been created. (screenshot: ensure-dataset-created)
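If you prefer to script this step instead of using the console, the same dataset can be created with the google-cloud-bigquery client. A minimal sketch; the dataset ID and location below are just examples:

    from google.cloud import bigquery

    client = bigquery.Client()  # uses your application default credentials

    # The same two fields as in the console: dataset ID and data location.
    dataset = bigquery.Dataset(f"{client.project}.my_first_bigquery")
    dataset.location = "asia-southeast1"  # pick the location nearest to you

    dataset = client.create_dataset(dataset, exists_ok=True)
    print(f"Created dataset {dataset.full_dataset_id}")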

Google Cloud Storage

  1. Back in your GCP console, choose Cloud Storage. You can find it under Storage > Cloud Storage
  2. Click the CREATE BUCKET button, then fill in the fields:
    • Name your bucket (example: blank-space-de-batch1)
    • Choose where to store your data
      • I would suggest the Region option because it offers the lowest latency within a single region; if you need higher availability, consider the other location type options.
      • Then choose the location nearest to you
    • Leave the rest of the fields at their defaults.
  3. Click CREATE
  4. Your bucket will be created and shown in the GCS browser. (screenshot: success-create-bucket)
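The bucket can also be created from code with the google-cloud-storage client. A minimal sketch, with the bucket name and location as examples:

    from google.cloud import storage

    client = storage.Client()

    # Regional bucket, mirroring the console choices above.
    bucket = client.create_bucket("blank-space-de-batch1", location="asia-southeast1")
    print(f"Created bucket {bucket.name} in {bucket.location}")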

Dataflow

From the Google Dataflow docs, the definition is:

Fully managed streaming analytics service that minimizes latency, processing time, and cost through autoscaling and batch processing.

We can run Dataflow jobs using our own Apache Beam script or by using templated jobs, which you can read more about here: https://cloud.google.com/dataflow/docs/concepts/dataflow-templates. In this case, I wrote an Apache Beam script and triggered it with the BeamRunPythonPipelineOperator from Airflow.
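A minimal sketch of how such a task could be declared inside the DAG, assuming the apache-beam Airflow provider is installed; the script path, job name, and region below are illustrative, not the exact values used in this repository:

    from airflow.models import Variable
    from airflow.providers.apache.beam.operators.beam import BeamRunPythonPipelineOperator
    from airflow.providers.google.cloud.operators.dataflow import DataflowConfiguration

    # Values imported earlier from variables.json.
    PROJECT_ID = Variable.get("PROJECT_ID")
    BUCKET_NAME = Variable.get("BUCKET_NAME")
    GCS_TEMP_LOCATION = Variable.get("GCS_TEMP_LOCATION")
    GCS_STG_LOCATION = Variable.get("GCS_STG_LOCATION")

    run_keyword_pipeline = BeamRunPythonPipelineOperator(
        task_id="run_keyword_pipeline",
        # Illustrative path; point this at your own Beam script in the bucket.
        py_file=f"gs://{BUCKET_NAME}/dags/pipelines/keyword_searches.py",
        runner="DataflowRunner",
        pipeline_options={
            "project": PROJECT_ID,
            "temp_location": GCS_TEMP_LOCATION,
            "staging_location": GCS_STG_LOCATION,
        },
        py_requirements=["apache-beam[gcp]"],
        py_interpreter="python3",
        py_system_site_packages=False,
        dataflow_config=DataflowConfiguration(
            job_name="keyword-searches",
            project_id=PROJECT_ID,
            location="asia-southeast1",
        ),
    )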

Dataflow jobs: (screenshot: dataflow-job-list)

One of the Dataflow pipelines: (screenshot: dataflow-job-pipeline)
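For reference, the overall shape of such a Beam script is roughly as follows; the input path, column names, and schema are placeholders, not the actual pipeline from this repository:

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions


    def parse_csv_line(line):
        # Placeholder parser: adapt to the real CSV layout.
        user_id, keyword, search_ts = line.split(",")
        return {"user_id": user_id, "keyword": keyword, "search_ts": search_ts}


    def run():
        # Runner, project, temp/staging locations, etc. arrive as command-line
        # options passed in by the Airflow operator.
        options = PipelineOptions()
        with beam.Pipeline(options=options) as p:
            (
                p
                | "ReadCSV" >> beam.io.ReadFromText(
                    "gs://your-bucket/keyword_searches.csv", skip_header_lines=1)
                | "ParseCSV" >> beam.Map(parse_csv_line)
                | "WriteToBQ" >> beam.io.WriteToBigQuery(
                    "your-project:your_dataset.keyword_searches",
                    schema="user_id:STRING,keyword:STRING,search_ts:TIMESTAMP",
                    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                    create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
                )
            )


    if __name__ == "__main__":
        run()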

Cloud Build

Cloud Build is a CI/CD service on Google Cloud Platform. I used Cloud Build to automatically synchronize my GitHub repository with the GCS bucket that stores the Cloud Composer data, so I don't need to copy each file to the bucket manually every time I make a change. Here is how to set it up (a minimal sketch of the cloudbuild.yaml is shown at the end of this section):

  1. On the GCP console, go to Tools > Cloud Build
  2. Click the CREATE TRIGGER button
  3. Fill in the required fields:
    • Name: your trigger name; this will be visible in your repository
    • Description: your trigger description
    • Event: in this project I used Push new tag
    • Source:
      • Connect to your repository. (screenshot: connect-repo)
      • Tag: the tag pattern that will trigger the build
    • Configuration
      • Type: Cloud build configuration (YAML or JSON)
      • Location: Repository
      • Cloud Build configuration file location: cloudbuild.yaml
  4. Don't forget to fill in the substitution variable in the Advanced section. In this project, I set the _GCS_BUCKET variable to the Composer environment's GCS bucket name, which can be accessed in cloudbuild.yaml as ${_GCS_BUCKET}. (screenshot: cloud-build-variable)
  5. Click CREATE after all required fields are filled in

(screenshot: cloud-build-config)

If your Cloud Build runs successfully, you will see a green check mark; clicking that icon shows something like this: (screenshot: success-cloud-build)
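For reference, a minimal cloudbuild.yaml for this kind of sync could look like the sketch below, using the gsutil Cloud Builder to rsync the repository's DAG folder into the Composer bucket. Only the ${_GCS_BUCKET} substitution comes from the setup above; the folder names are assumptions:

    steps:
      # Sync the repository's dags/ folder into the Composer environment's bucket.
      # _GCS_BUCKET is the substitution variable configured on the trigger.
      - name: gcr.io/cloud-builders/gsutil
        args: ["-m", "rsync", "-r", "-d", "dags/", "gs://${_GCS_BUCKET}/dags"]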

BigQuery Output Table

This ETL process produces 3 tables in BigQuery:

  1. event
    • This table contains event data produced by a reverse-engineering process that backfills/restores some missing data in the transactional table from the data warehouse table.
  2. keyword_searches
    • This table contains information about users' keyword searches on the platform from 10 March until 15 March 2021. This table is partitioned, with Day as the partition type (see the sketch after this list).
  3. most_searched_keywords
    • This table contains the most searched keywords from the keyword_searches table
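As an illustration of that Day partitioning, the keyword_searches table could be declared like this with the BigQuery Python client; this is a sketch, and the schema and field names are assumptions:

    from google.cloud import bigquery

    client = bigquery.Client()

    # Illustrative schema; the real table has the project's own columns.
    schema = [
        bigquery.SchemaField("user_id", "STRING"),
        bigquery.SchemaField("keyword", "STRING"),
        bigquery.SchemaField("search_ts", "TIMESTAMP"),
    ]

    table = bigquery.Table(f"{client.project}.your_dataset.keyword_searches", schema=schema)
    # Day partitioning, matching the partition type described above.
    table.time_partitioning = bigquery.TimePartitioning(
        type_=bigquery.TimePartitioningType.DAY,
        field="search_ts",
    )
    table = client.create_table(table, exists_ok=True)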

Output:
(screenshot: bq-output)
