Setup and demonstration of Apache Spark with HDFS on Amazon EC2 machines and execution of geo-spatial queries (SparkSQL).

iamjagdeesh/Geo-Spatial-Data-Analysis-using-SparkSQL

Large Scale Geo-Spatial Data Analysis using SparkSQL

CSE 512 - Distributed and Parallel Database Systems

Project description

The project aimed to set up a Spark cluster with HDFS and run geo-spatial SparkSQL queries on it.

  • Spark's native (standalone) cluster manager was used to manage the cluster.
  • Hadoop Distributed File System (HDFS) was used as the distributed storage system.
  • The cluster nodes were Amazon EC2 virtual machines.
  • The following spatial queries were executed: range query, range join query, distance query, distance join query, hot zone analysis and hot cell analysis.
    • The spatial queries rely on user-defined functions, such as ST_contains and ST_within, implemented in Scala.
    • ST_contains takes a point and a rectangle and returns a boolean indicating whether the point is inside the rectangle.
    • ST_within takes two points and a distance and returns a boolean indicating whether the distance between the points is at most the given distance.
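
The two predicates described above could be sketched in plain Scala roughly as follows. This is an illustrative sketch, not the project's actual code. It assumes that points arrive as comma-separated strings such as "x,y" and rectangles as "x1,y1,x2,y2" (two opposite corners), that points on the rectangle boundary count as inside, and that distance means Euclidean distance. The names `SpatialPredicates`, `stContains`, `stWithin` and `parse` are hypothetical.

```scala
object SpatialPredicates {

  // Assumed input encoding: comma-separated coordinates, e.g. "1.0,2.0".
  private def parse(s: String): Array[Double] =
    s.split(",").map(_.trim.toDouble)

  // True if the point lies inside the rectangle (boundary included, by assumption).
  // The rectangle is given by two opposite corners in any order.
  def stContains(point: String, rectangle: String): Boolean = {
    val Array(px, py) = parse(point)
    val Array(x1, y1, x2, y2) = parse(rectangle)
    val (minX, maxX) = (math.min(x1, x2), math.max(x1, x2))
    val (minY, maxY) = (math.min(y1, y2), math.max(y1, y2))
    px >= minX && px <= maxX && py >= minY && py <= maxY
  }

  // True if the Euclidean distance between the two points is at most `distance`.
  def stWithin(p1: String, p2: String, distance: Double): Boolean = {
    val Array(x1, y1) = parse(p1)
    val Array(x2, y2) = parse(p2)
    math.hypot(x1 - x2, y1 - y2) <= distance
  }
}
```

Assuming a SparkSession named `spark`, functions like these can be exposed to SQL with `spark.udf.register`, for example `spark.udf.register("ST_contains", SpatialPredicates.stContains _)`, after which a range query can call ST_contains in its WHERE clause.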

Technology used: Apache Spark, Hadoop Distributed File System (HDFS), Scala, sbt build tool, Amazon EC2

Team members

  1. Bhavani Balasubramanyam
  2. Jagdeesh Basavaraju
  3. Sahan Vishwas
  4. Suraj Somachand Kattige
