NYC Taxi Pipeline

Design/implement a stream/batch data processing architecture on NYC taxi data.

INTRO

Architect batch/stream data processing systems on top of nyc-tlc-trip-records-data. The batch ETL process is E (extract : tlc-trip-record-data.page -> S3) -> T (transform : S3 -> Spark) -> L (load : Spark -> MySQL); the stream process is Event -> Event digest -> Event storage. The system can then support calculations such as top drivers by area, orders by time window, latest top driver, and top busy areas (a minimal Spark sketch of one such aggregation is shown below).

Batch data : nyc-tlc-trip-records-data

Stream data : TaxiEvent, streamed from a file.

Please also check NYC_Taxi_Trip_Duration if you are interested in a data science project using a similar taxi dataset.
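
As a concrete example of the kind of calculation the pipeline supports, here is a minimal Spark sketch of a "top busy areas" aggregation. The parquet path and column name (pickup_location_id) are illustrative assumptions, not the repo's actual schema.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object TopBusyAreasSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("TopBusyAreasSketch")
      .getOrCreate()

    // Hypothetical cleaned trip table produced by the batch pipeline.
    val trips = spark.read.parquet("data/processed/green_tripdata")

    // Count pickups per area and keep the 10 busiest areas.
    val topBusyAreas = trips
      .groupBy(col("pickup_location_id"))
      .agg(count(lit(1)).as("num_trips"))
      .orderBy(desc("num_trips"))
      .limit(10)

    topBusyAreas.show()
    spark.stop()
  }
}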

Architecture

  • Architecture idea (Batch)
  • Architecture idea (Stream)

File structure

├── Dockerfile    : Scala/Spark Dockerfile
├── build.sbt     : Scala sbt build file
├── config        : Configuration files for DB/Kafka/AWS, etc.
├── data          : Raw/processed/output data (batch/stream)
├── doc           : All repo references/docs/pics
├── elk           : ELK (Elasticsearch, Logstash, Kibana) config/scripts
├── fluentd       : Fluentd helper scripts
├── kafka         : Kafka helper scripts
├── pyspark       : Legacy pipeline code (Python)
├── requirements.txt
├── script        : Helper scripts (env/services)
├── src           : Batch/stream processing code (Scala)
└── utility       : Helper scripts (pipeline)

Prerequisites

Quick start

Quick-Start-Batch-Pipeline-Manually
# STEP 1) Download the dataset
bash script/download_sample_data.sh

# STEP 2) sbt build
sbt compile
sbt assembly

# STEP 3) Load data 
spark-submit \
 --class DataLoad.LoadReferenceData \
 target/scala-2.11/nyc_taxi_pipeline_2.11-1.0.jar

spark-submit \
 --class DataLoad.LoadGreenTripData \
 target/scala-2.11/nyc_taxi_pipeline_2.11-1.0.jar

spark-submit \
 --class DataLoad.LoadYellowTripData \
 target/scala-2.11/nyc_taxi_pipeline_2.11-1.0.jar

# STEP 4) Transform data 
spark-submit \
 --class DataTransform.TransformGreenTaxiData \
 target/scala-2.11/nyc_taxi_pipeline_2.11-1.0.jar

spark-submit \
 --class DataTransform.TransformYellowTaxiData \
 target/scala-2.11/nyc_taxi_pipeline_2.11-1.0.jar

# STEP 5) Create view 
spark-submit \
 --class CreateView.CreateMaterializedView \
 target/scala-2.11/nyc_taxi_pipeline_2.11-1.0.jar

# STEP 6) Save to JDBC (MySQL)
spark-submit \
 --class SaveToDB.JDBCToMysql \
 target/scala-2.11/nyc_taxi_pipeline_2.11-1.0.jar

# STEP 7) Save to Hive
spark-submit \
 --class SaveToHive.SaveMaterializedviewToHive \
 target/scala-2.11/nyc_taxi_pipeline_2.11-1.0.jar
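
For reference, here is a minimal, self-contained sketch of a load-transform-save job in the same spirit as the DataLoad / DataTransform / SaveToDB steps above. It is an illustration only: the input path, selected columns, JDBC URL, and credentials are placeholder assumptions (the repo's actual settings live under config), and the class is not part of the repo.

import java.util.Properties
import org.apache.spark.sql.{SaveMode, SparkSession}

object LoadAndSaveSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("LoadAndSaveSketch")
      .getOrCreate()

    // Extract: read the raw green-trip CSV downloaded in STEP 1
    // (path and schema are assumptions).
    val rawTrips = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("data/raw/green_tripdata_sample.csv")

    // Transform: keep a few columns and drop rows with missing values.
    val cleaned = rawTrips
      .select("VendorID", "lpep_pickup_datetime", "lpep_dropoff_datetime", "trip_distance")
      .na.drop()

    // Load: write the result to MySQL over JDBC (placeholder credentials).
    val jdbcProps = new Properties()
    jdbcProps.setProperty("user", "root")
    jdbcProps.setProperty("password", "password")
    jdbcProps.setProperty("driver", "com.mysql.jdbc.Driver")

    cleaned.write
      .mode(SaveMode.Overwrite)
      .jdbc("jdbc:mysql://localhost:3306/nyc_taxi", "green_trip_cleaned", jdbcProps)

    spark.stop()
  }
}
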
Quick-Start-Stream-Pipeline-Manually
# STEP 1) sbt build
sbt compile
sbt assembly

# STEP 2) Create Taxi event
spark-submit \
 --class TaxiEvent.CreateBasicTaxiEvent \
 target/scala-2.11/nyc_taxi_pipeline_2.11-1.0.jar

# check the event
curl localhost:44444

# STEP 3) Process Taxi event
spark-submit \
 --class EventLoad.SparkStream_demo_LoadTaxiEvent \
 target/scala-2.11/nyc_taxi_pipeline_2.11-1.0.jar

# STEP 4) Send Taxi event to Kafka
# start zookeeper, kafka
brew services start zookeeper
brew services start kafka

# create kafka topics
kafka-topics --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic first_topic
kafka-topics --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic streams-taxi

# curl event to kafka producer
curl localhost:44444 | kafka-console-producer  --broker-list  127.0.0.1:9092 --topic first_topic

# STEP 5) Spark process kafka stream
spark-submit \
 --class KafkaEventLoad.LoadKafkaEventExample \
 target/scala-2.11/nyc_taxi_pipeline_2.11-1.0.jar

# STEP 6) Spark process kafka stream and write the output back to Kafka
spark-submit \
 --class KafkaEventLoad.LoadTaxiKafkaEventWriteToKafka \
 target/scala-2.11/nyc_taxi_pipeline_2.11-1.0.jar

# STEP 7) Run Elasticsearch, Kibana, Logstash
# make sure curl localhost:44444 can get the taxi event
cd ~ 
kibana-7.6.1-darwin-x86_64/bin/kibana
elasticsearch-7.6.1/bin/elasticsearch
logstash-7.6.1/bin/logstash -f /Users/$USER/NYC_Taxi_Pipeline/elk/logstash/logstash_taxi_event_file.conf

# test insert toy data to logstash 
# (logstash config: elk/logstash.conf)
#nc 127.0.0.1 5000 < data/event_sample.json

# then visit the Kibana UI : localhost:5601
# then go to "Management" -> "Index Patterns" -> "Create index pattern"
# create a new index pattern : logstash-* (do not select a timestamp field as the time filter)
# then visit the "Discover" tab and check the data
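
To make the Kafka-consuming steps (STEP 5/6) more concrete, below is a minimal sketch of a Spark Structured Streaming job that reads the first_topic topic produced above and counts taxi events per one-minute window. The window size, watermark, and console sink are illustrative assumptions rather than the repo's LoadKafkaEventExample implementation, and the job needs the spark-sql-kafka-0-10 package on the classpath (e.g. via --packages).

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object KafkaTaxiEventCountSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("KafkaTaxiEventCountSketch")
      .getOrCreate()

    // Read raw taxi events from Kafka; each record's value is one event line.
    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "127.0.0.1:9092")
      .option("subscribe", "first_topic")
      .load()
      .selectExpr("CAST(value AS STRING) AS event", "timestamp")

    // Count events per 1-minute window, tolerating 2 minutes of late data.
    val counts = events
      .withWatermark("timestamp", "2 minutes")
      .groupBy(window(col("timestamp"), "1 minute"))
      .count()

    // Print the running counts to the console.
    val query = counts.writeStream
      .outputMode("update")
      .format("console")
      .option("truncate", "false")
      .start()

    query.awaitTermination()
  }
}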

Dependency

  1. Spark 2.4.3

  2. Java 8

  3. Apache Hadoop 2.7

  4. Jars

  5. build.sbt (see the illustrative sketch below)
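
For reference, a build.sbt for a Spark 2.4.3 / Scala 2.11 project of this shape typically looks like the sketch below; the repo ships its own build.sbt, so treat the exact library versions here as assumptions.

name := "nyc_taxi_pipeline"
version := "1.0"
scalaVersion := "2.11.12"

// Spark itself is usually "provided" by spark-submit; the Kafka and MySQL
// connectors get bundled into the assembly jar built by `sbt assembly`.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"           % "2.4.3" % "provided",
  "org.apache.spark" %% "spark-sql"            % "2.4.3" % "provided",
  "org.apache.spark" %% "spark-streaming"      % "2.4.3" % "provided",
  "org.apache.spark" %% "spark-sql-kafka-0-10" % "2.4.3",
  "mysql"            %  "mysql-connector-java" % "5.1.47"
)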

Ref

  • ref.md - dataset links, code references, and other references
  • doc - all reference docs

TODO

# 1. Tune the main pipeline for large-scale data (to process the whole nyc-tlc-trip dataset)
# 2. Add a front-end UI (Flask) to visualize supply & demand and surge pricing
# 3. Add tests
# 4. Dockerize the project
# 5. Tune the Spark batch/stream code
# 6. Tune the Kafka / ZooKeeper cluster settings