Hadoop Mini Project

Post-Sale Automobile Report

In this project, we use data from an automobile tracking platform that records important incidents occurring after the initial sale of a new vehicle. Such incidents include subsequent private sales, repairs, and accident reports. The platform gives second-hand buyers a good reference for understanding the vehicles they are interested in.

The report data is stored as CSV files in HDFS with the following schema:

[Screenshot: CSV schema]

Learning Objectives

  • Writing MapReduce jobs in Python (see the sketch below).
  • Leveraging the MapReduce processing model to process large-scale data by breaking a complex problem into smaller tasks.
  • Getting familiar with the VirtualBox environment.
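
As a rough illustration of the MapReduce pattern used here, below is a minimal sketch of a Hadoop Streaming mapper and reducer in Python that count records per make and year, similar in spirit to the job behind the make_year_count output shown later. The column positions used for make and year are illustrative assumptions, not the project's actual schema; refer to the repository's mapper and reducer scripts for the real logic.

#!/usr/bin/env python
# mapper.py (sketch): emit one tab-separated "<make>-<year>  1" pair per record
import sys

for line in sys.stdin:
    fields = line.strip().split(',')
    if len(fields) < 5:
        continue  # skip malformed or header rows
    make = fields[3]  # hypothetical column position
    year = fields[4]  # hypothetical column position
    print('%s-%s\t%d' % (make, year, 1))

#!/usr/bin/env python
# reducer.py (sketch): sum the counts per key; Hadoop Streaming delivers input sorted by key
import sys

current_key = None
count = 0
for line in sys.stdin:
    key, value = line.strip().split('\t', 1)
    if key == current_key:
        count += int(value)
    else:
        if current_key is not None:
            print('%s\t%d' % (current_key, count))
        current_key = key
        count = int(value)
if current_key is not None:
    print('%s\t%d' % (current_key, count))

Before submitting the job to the cluster, a mapper/reducer pair like this can be sanity-checked locally with cat data.csv | python mapper.py | sort | python reducer.py.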

Setting up Hadoop using Hortonworks Hadoop Sandbox

Step 1:

From your local terminal, run upload_files.sh to upload the files to the root directory in the VirtualBox VM:

  • You have to enter the root account's password in order to upload the files.

[Screenshot: running upload_files.sh from the local terminal]

Step 2:

From the Sandbox's Web Shell Client (http://localhost:4200), log in as the root account and put data.csv into the Hadoop file system:

$ hadoop fs -mkdir test_dir
$ hadoop fs -put data.csv /user/root/test_dir  

Double check the uploaded file in the Ambari Files View:

  • Note: the owner of the folder and the file must be root!

[Screenshot: Ambari Files View showing the uploaded file]

Step 3:

From the Sandbox's Web Shell Client, run the auto.sh script:

$ bash auto.sh

Step 4:

After all the MapReduce jobs have executed successfully, check the output:

  • From the all_accidents folder: [Screenshot: all_accidents output]

  • From the make_year_count folder: [Screenshot: make_year_count output]


NOTE:

  • The default Python environment in the VirtualBox sandbox is version 2, so you should either update the Python environment to version 3 (or above) or tailor your code to run under Python 2.

For example, Python 2 doesn't support f-strings like Python 3 does, which can cause errors when you run the MapReduce Python scripts. Instead, use %-formatting, where %s acts as a placeholder for a string and %d acts as a placeholder for a number, as in the sketch below.
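
A minimal sketch of the difference (the variable names are only for illustration):

make = 'Toyota'
count = 12

# f-strings require Python 3.6+ and raise a SyntaxError under the sandbox's Python 2:
# print(f'{make}: {count}')

# %-formatting works on both Python 2 and Python 3:
print('%s: %d' % (make, count))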

  • The easiest way to check whether your Python script is compatible with Python 2 is to run python mapper1.py (or any other script) in the Sandbox's Web Shell Client (http://localhost:4200). If no error occurs, your code is good for Python 2.
