Apache Beam pipeline to analyze London bicycle hiring dataset with GCP Dataflow

wovago/beam-dataflow-bicycle-hire-analysis


This repository contains an Apache Beam pipeline that was used to analyze the London bicycle hiring dataset using Google Cloud Dataflow. The dataset contains 83,205,227 bicycle hiring events in London. The pipeline outputs the total number of bicycle hires for every combination of bicycle stations, as well as the total distance covered by those bicycle hire events.

The pipeline sources the full dataset from Google BigQuery. Although the whole analysis could be performed directly with SQL queries, the downstream analysis is deliberately done in Apache Beam. All subsequent data transformations, such as cleaning the station IDs, counting the number of bike hires per station, and calculating the total distance covered between stations, are performed by the Apache Beam pipeline running on GCP Dataflow.
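The transformations described above can be sketched in plain Python. This is a minimal illustration of the logic only, not the repository's actual code. The station coordinates, the function names, and the use of the haversine (great-circle) formula for distance are all assumptions made for this example. In a Beam pipeline, the cleaning step would typically be a `beam.Map` and the per-pair counting a `beam.CombinePerKey`.

```python
import math
from collections import Counter

# Hypothetical station coordinates (id -> (lat, lon)); illustrative values only.
STATIONS = {1: (51.5290, -0.1100), 2: (51.4990, -0.1970)}


def clean_station_id(raw):
    """Return the station ID as an int, or None if it cannot be parsed."""
    try:
        return int(str(raw).strip())
    except ValueError:
        return None


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points (assumed metric)."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def summarize(hires):
    """Map each (start, end) station pair to (hire count, total distance in km)."""
    counts = Counter()
    for start_raw, end_raw in hires:
        start, end = clean_station_id(start_raw), clean_station_id(end_raw)
        # Drop hires whose station IDs are unparseable or unknown.
        if start in STATIONS and end in STATIONS:
            counts[(start, end)] += 1
    return {
        pair: (n, n * haversine_km(STATIONS[pair[0]], STATIONS[pair[1]]))
        for pair, n in counts.items()
    }
```

Calling `summarize([(" 1", "2"), (1, 2)])` would group both hires under the station pair `(1, 2)`. The total distance is the straight-line distance between the two stations multiplied by the hire count, which approximates rather than measures the route actually cycled.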

An overview of all pipeline processing steps as executed on GCP Dataflow can be seen in the pipeline DAG below.

Note: running this pipeline on GCP Dataflow will incur costs on your Google billing account, so use it at your own risk!

[Figure: DAG of the Apache Beam pipeline to analyze the London bicycle sharing dataset]
