jim113/Advanced-Database-Topics-NTUA

Advanced Databases NTUA

A MapReduce implementation of the k-means algorithm.
The input data is read from a file in HDFS.
The data contains records of all trips completed by yellow taxis in NYC from January 2015 through June 2015.
The algorithm returns the coordinates of the top five central points (cluster centroids).
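
The map and reduce phases of one k-means iteration can be sketched in plain Python as follows. This is only an illustration, not the code in kmeans_with_map_reduce.py: the function names, the Euclidean distance, the choice of the first k points as initial centroids, and the fixed iteration count are all assumptions made for the example. In PySpark the same pattern would typically be expressed with an RDD `map` followed by `reduceByKey`.

```python
from collections import defaultdict
from math import hypot


def closest(point, centroids):
    """Return the index of the centroid nearest to `point` (Euclidean distance, assumed)."""
    return min(
        range(len(centroids)),
        key=lambda i: hypot(point[0] - centroids[i][0], point[1] - centroids[i][1]),
    )


def kmeans_step(points, centroids):
    """Run one map/reduce round and return the updated centroids."""
    # Map: emit (centroid index, (x, y, 1)) for every point.
    mapped = [(closest(p, centroids), (p[0], p[1], 1)) for p in points]

    # Reduce: sum coordinates and counts per centroid index.
    sums = defaultdict(lambda: (0.0, 0.0, 0))
    for key, (x, y, n) in mapped:
        sx, sy, sn = sums[key]
        sums[key] = (sx + x, sy + y, sn + n)

    # New centroid = mean of its assigned points; keep the old one if none were assigned.
    return [
        (sums[i][0] / sums[i][2], sums[i][1] / sums[i][2]) if i in sums else c
        for i, c in enumerate(centroids)
    ]


def kmeans(points, k, iterations=3):
    """Hypothetical driver: start from the first k points and iterate a fixed number of times."""
    centroids = list(points[:k])
    for _ in range(iterations):
        centroids = kmeans_step(points, centroids)
    return centroids
```

For this project's output, `k` would be 5 and each point would be a coordinate pair taken from a trip record; which columns are used is not stated here.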

How to run

  1. Install pyspark
       pip3 install pyspark
  2. Upload the data to the Hadoop Distributed File System (HDFS)
       hadoop fs -put ./yellow_tripdata_1m.csv hdfs://master:9000/yellow_tripdata_1m.csv
  3. Submit the job to Spark
       spark-submit kmeans_with_map_reduce.py
  4. Copy the results to a local file
       hadoop fs -getmerge hdfs://master:9000/kmeans_with_map_reduce.results ./kmeans_with_map_reduce.results
  5. View the results
       cat kmeans_with_map_reduce.results

About

Advanced Topics Databases, NTUA 2019-2020
