Spatial Hotspot Analysis on Geo-Spatial Data using Getis-Ord Statistic

A major peer-to-peer taxi cab firm has hired your team to develop and run multiple spatial queries on its large database, which contains geographic data as well as real-time location data of its customers. A spatial query is a special type of query supported by geodatabases and spatial databases; it differs from a traditional SQL query in that it operates on points, lines, and polygons and considers the relationships between these geometries. Since the database is large and mostly unstructured, your client wants you to use a popular Big Data software application, SparkSQL. The goal of the project is to extract data from this database that the client can use for operational (day-to-day) and strategic-level (long-term) decisions.

Description

This task will focus on applying spatial statistics to spatio-temporal big data in order to identify statistically significant spatial hot spots using Apache Spark. The topic of this task is from ACM SIGSPATIAL GISCUP 2016.

The Problem Definition page is here: http://sigspatial2016.sigspatial.org/giscup2016/problem

To Get Started

Install Apache Spark and SparkSQL on your computer

You will be using Apache Spark and SparkSQL in this project. Apache Spark is a sophisticated Big Data software application. Each team member needs to install Apache Spark and SparkSQL on their computer by carefully following the instructions at https://spark.apache.org/docs/latest/

To get started, team members will need to do some research about Apache SparkSQL and spatial queries.

Required Resource:

https://www.tutorialspoint.com/spark_sql/spark_sql_quick_guide.htm

Special requirements (different from GISCUP)

As stated on the Problem Definition page, in this task we implemented a Spark program to calculate the Getis-Ord statistic of the NYC Taxi Trip datasets. We call it "Hot cell analysis".
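For reference, the standard Getis-Ord statistic of cell i over n cells, with x_j the attribute value of cell j (here, its pickup count) and w_{i,j} the spatial weight between cells i and j, is (in LaTeX notation):

    G_i^* = \frac{\sum_{j=1}^{n} w_{i,j} x_j - \bar{X} \sum_{j=1}^{n} w_{i,j}}
                 {S \sqrt{\left[ n \sum_{j=1}^{n} w_{i,j}^2 - \left( \sum_{j=1}^{n} w_{i,j} \right)^2 \right] / (n-1)}}

    \text{where } \bar{X} = \frac{1}{n} \sum_{j=1}^{n} x_j
    \text{ and } S = \sqrt{\frac{1}{n} \sum_{j=1}^{n} x_j^2 - \bar{X}^2}

See the Problem Definition page above for the authoritative formulation.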

To reduce the required computation, we made the following changes:

  1. The input is a monthly taxi trip dataset from 2009 - 2012. For example, "yellow_tripdata_2009-01_point.csv", "yellow_tripdata_2010-02_point.csv".
  2. Each cell unit size is 0.01 * 0.01 in terms of latitude and longitude degrees (see the mapping sketch after this list).
  3. We used 1 day as the time step size. The first day of a month is step 1, and every month is treated as having 31 days.
  4. We considered only the pick-up location.
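Under these conventions, a pickup record maps to a space-time cell roughly as follows. This is a minimal sketch; the helper name and floor-based rounding are assumptions, not the template's actual code:

    // Map one pickup record to a (x, y, z) space-time cell.
    // Cell size: 0.01 x 0.01 degrees; time step: day of month (1..31).
    def pickupToCell(longitude: Double, latitude: Double, dayOfMonth: Int): (Int, Int, Int) = {
      val x = math.floor(longitude / 0.01).toInt // e.g. -73.985 -> -7399
      val y = math.floor(latitude / 0.01).toInt  // e.g.  40.755 ->  4075
      (x, y, dayOfMonth)                         // z is the time step
    }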

Coding template specification

Input parameters

  1. Output path (Mandatory)
  2. Task name: "hotzoneanalysis" or "hotcellanalysis"
  3. Task parameters:
     (1) Hot zone (2 parameters): NYC taxi data path, zone path
     (2) Hot cell (1 parameter): NYC taxi data path

Example

test/output hotzoneanalysis src/resources/point-hotzone.csv src/resources/zone-hotzone.csv hotcellanalysis src/resources/yellow_trip_sample_100000.csv

Note:

  1. The number and order of tasks do not matter (see the dispatch sketch below).
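Because tasks can appear in any number and order, the entry point has to scan the argument list and consume each task's parameters as it goes. A sketch of that dispatch logic; the template's actual Entrance code may differ, and the run* helpers are hypothetical:

    // Illustrative dispatch: args(0) is the output path, followed by any
    // number of (task name, task parameters) groups, in any order.
    def dispatch(args: Array[String]): Unit = {
      val outputPath = args(0)
      var i = 1
      while (i < args.length) {
        args(i).toLowerCase match {
          case "hotzoneanalysis" =>
            val pointPath = args(i + 1) // NYC taxi data path
            val zonePath  = args(i + 2) // zone path
            // runHotZone(outputPath, pointPath, zonePath) // hypothetical helper
            i += 3
          case "hotcellanalysis" =>
            val pointPath = args(i + 1) // NYC taxi data path
            // runHotCell(outputPath, pointPath) // hypothetical helper
            i += 2
          case other =>
            throw new IllegalArgumentException(s"Unknown task: $other")
        }
      }
    }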

Input data format

The main entry point is the "cse512.Entrance" Scala file.

  1. Point data: the input point dataset consists of the pickup points of the New York Taxi trip datasets. The data format in this phase is the original NYC taxi trip format, which differs from Phase 2, but the coding template already parses it for you. Find the data in the Google Drive shared folder: https://drive.google.com/file/d/1AMzJzr3JDKegbBJ3er6-xRVpnBT20IBe/view?usp=sharing

  2. Zone data (only for hot zone analysis): located at "src/resources/zone-hotzone" in the template

Hot zone analysis

The input point data can be any small subset of the NYC taxi dataset.
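Hot zone analysis counts how many pickup points fall inside each zone rectangle and sorts the result by the rectangle string (see the output format below). A minimal SparkSQL sketch of that join-and-count, assuming the point and zone CSVs are already registered as temp views named points and zones with string columns point and rectangle; these names, and the containment UDF, are illustrative rather than the template's own wiring:

    import org.apache.spark.sql.SparkSession

    def hotZone(spark: SparkSession) = {
      // Rectangle given as "lon1,lat1,lon2,lat2"; point given as "lon,lat".
      spark.udf.register("rect_contains", (rect: String, point: String) => {
        val Array(x1, y1, x2, y2) = rect.split(",").map(_.trim.toDouble)
        val Array(px, py) = point.split(",").map(_.trim.toDouble)
        px >= math.min(x1, x2) && px <= math.max(x1, x2) &&
          py >= math.min(y1, y2) && py <= math.max(y1, y2)
      })
      spark.sql(
        """SELECT z.rectangle, COUNT(*) AS count
          |FROM zones z, points p
          |WHERE rect_contains(z.rectangle, p.point)
          |GROUP BY z.rectangle
          |ORDER BY z.rectangle ASC""".stripMargin)
    }

Note that older Spark 2.x versions may require enabling spark.sql.crossJoin.enabled for this kind of non-equi join.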

Hot cell analysis

The input point data is a monthly NYC taxi trip dataset (2009-2012), e.g. "yellow_tripdata_2009-01_point.csv".

Output data format

Hot zone analysis

All zones with their counts, sorted by the "rectangle" string in ascending order:

"-73.795658,40.743334,-73.753772,40.779114",1
"-73.797297,40.738291,-73.775740,40.770411",1
"-73.832707,40.620010,-73.746541,40.665414",20

Hot cell analysis

The coordinates of the top 50 hottest cells, sorted by their G score in descending order:

-7399,4075,15
-7399,4075,29
-7399,4075,22
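A compact sketch of the G score computation over aggregated cell counts, using a weight of 1 for a cell and each of its (up to 26) adjacent cells, matching the formula above. The helper shape is an assumption, and for brevity it keeps a full 27-cell neighborhood even at the boundary of the space-time cube, which a faithful implementation would clip:

    // Getis-Ord G* for cells keyed by (x, y, z) with their pickup counts.
    def gScores(cellCounts: Map[(Int, Int, Int), Long],
                numCells: Long): Seq[((Int, Int, Int), Double)] = {
      val n = numCells.toDouble // total cells in the space-time cube
      val mean = cellCounts.values.sum / n
      val stdDev = math.sqrt(cellCounts.values.map(v => v.toDouble * v).sum / n - mean * mean)
      cellCounts.keys.toSeq.map { case cell @ (x, y, z) =>
        // The cell plus all cells one step away in each dimension, weight 1 each.
        val neighbors = for { dx <- -1 to 1; dy <- -1 to 1; dz <- -1 to 1 }
          yield (x + dx, y + dy, z + dz)
        val w = neighbors.size.toDouble // sum of weights (= sum of squared weights)
        val weightedSum = neighbors.map(c => cellCounts.getOrElse(c, 0L).toDouble).sum
        (cell, (weightedSum - mean * w) / (stdDev * math.sqrt((n * w - w * w) / (n - 1))))
      }
    }

The top 50 would then be gScores(counts, totalCells).sortBy(-_._2).take(50).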

Example answers

An example input and answer are provided in the "testcase" folder of the coding template.

How to debug your code in IDE

If you are using the Scala template

  1. Use IntelliJ IDEA with the Scala plug-in, or any other Scala IDE.
  2. Append .master("local[*]") after .config("spark.some.config.option", "some-value") to tell the IDE that the Spark master is the local machine (see the sketch after this list).
  3. In some cases, you may need to go to the "build.sbt" file and change % "provided" to % "compile" in order to debug your code in the IDE.
  4. Run your code in the IDE.
  5. You must revert Steps 2 and 3 above and recompile your code before using spark-submit!
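For Step 2, the builder chain would look roughly like this; the app name is a placeholder, and the config key/value pair is the template's own placeholder:

    import org.apache.spark.sql.SparkSession

    // Debug-only setup: .master("local[*]") runs Spark inside the IDE process,
    // using all local cores. Remove it (and revert build.sbt) before packaging
    // the jar for spark-submit.
    val spark = SparkSession.builder()
      .appName("Hotspot-Analysis") // placeholder name
      .config("spark.some.config.option", "some-value")
      .master("local[*]")
      .getOrCreate()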

How to submit your code to Spark

If you are using the Scala template

  1. Go to project root folder
  2. Run sbt clean assembly. You may need to install sbt in order to run this command.
  3. Find the packaged jar in "./target/scala-2.11/CSE512-Project-Hotspot-Analysis-Template-assembly-0.1.0.jar"
  4. Submit the jar to Spark using the "./bin/spark-submit" command. An example (adjust the jar path to your machine): ./bin/spark-submit ~/GitHub/CSE512-Project-Hotspot-Analysis-Template/target/scala-2.11/CSE512-Project-Hotspot-Analysis-Template-assembly-0.1.0.jar test/output hotzoneanalysis src/resources/point-hotzone.csv src/resources/zone-hotzone.csv hotcellanalysis src/resources/yellow_tripdata_2009-01_point.csv