CSGY-6513-Big-Data-Project-Analysis-of-NYC-Open-Data

Sai Likhith

This repository contains the code, outputs, and execution instructions for the profiling and analysis of datasets from NYC Open Data. A copy of the data is also available on NYU HDFS at the following path: /user/rv1130/2019BD-project-results

Introduction

Big data has revolutionized how we look at things. Due to its ever-changing nature and the enormous impact it has around us, it is important to understand what big data is and how it works. The key objective of this work is to simulate a set of real-world big data problems and solve them meaningfully. The solutions that we propose are meant to act as examples for solving such problems in any domain, provided there is data involved. Essentially, this body of work strives to solve the general problems encountered in data analysis: generic data profiling, semantic analysis, and data analysis to derive insights from data. To simulate a real-world scenario, we apply our solution to a large number of datasets taken from NYC Open Data while adhering to strict limitations on the time and resources needed to perform these tasks.

Objectives

This project deals with the following big data tasks, applied to 1,900 datasets from NYC Open Data:

  1. Generic profiling: The datasets that we are dealing with contain very little metadata, which is not enough to work with. Our task involves structure and content discovery, to help a person or machine read, understand, and then analyse the data. In other words, we analyse the datasets to gain more metadata that, in turn, better describes them. To extract structure from the data, we perform data profiling, deriving metadata that can then be used for data discovery, for querying, and for identifying data quality issues.

  2. Semantic profiling: This task deals with the semantic analysis of columns. Semantic analysis in our case refers to deducing the type of real world information that is being depicted in a column. Examples of semantic types that we have used are person's name, business name, vehicle make, building type, etc.

  3. Data Analysis: This task involves deriving knowledge that is hidden within the data. We try to extract higher-order information, something that is not visible at first glance but can help answer meaningful questions about the real world.

Implementation

  1. Connect to the Dumbo high-performance computing cluster.
  2. Load the Python 3.6.5 and Spark 2.2.0 modules.
  3. Run task1.py for generic profiling.
  4. Run task2.py for semantic profiling.
  5. Run task3.py for data analysis.
  6. Run task1_visualization.py to analyse the outputs of task 1.
  7. Run task2_visualization.py to analyse the outputs of task 2.

Generic profiling

Generic data profiling is fast for small datasets; however, for large collections like NYC Open Data, where datasets contain thousands of records, profiling with plain Python is difficult. We therefore execute the task using a distributed approach, making use of the Hadoop Distributed File System (HDFS). This is achieved with PySpark, the Spark framework for Python, which provides parallel processing.
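
As a rough, minimal sketch of this setup (the HDFS path, file format, and session options here are illustrative assumptions, not the project's actual layout), a profiling script would start from something like:

```python
from pyspark.sql import SparkSession

# Start a Spark session; on the cluster this runs on top of YARN/HDFS.
spark = SparkSession.builder.appName("generic-profiling").getOrCreate()

# Hypothetical HDFS path to one of the ~1900 NYC Open Data dumps (tab-separated).
df = spark.read.csv(
    "hdfs:///user/example/nyc_open_data/some_dataset.tsv",
    sep="\t",
    header=True,
    inferSchema=True,
)

print(df.count(), "rows,", len(df.columns), "columns")
```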

Results:

For every table in our dataset, we compute the following features, whose values convey information about the table's content and structure. We then output these values as a JSON file for that table and merge the outputs into a single JSON file. The features we calculate for each file in our dataset are listed below; a rough PySpark sketch of how they might be computed follows the list.

Count: The number of values in the selected column.

Missing values: There are various types of missing values, such as null, NaN, blanks, and whitespace, which we identify and count.

Minimum of column: Finding the minimum value in a column.

Maximum of column: Finding the maximum value in a column.

Frequent values: This helps us identify which values occur more than 5 times in a column.

Standard deviation: The standard deviation of the values, calculated for integer and float columns.
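
A minimal sketch of how these per-column features might be computed and written out as JSON (column handling and file names are simplified assumptions; the actual task1.py may differ):

```python
import json
from pyspark.sql import functions as F

def profile_column(df, col):
    """Compute the generic-profiling features for one column (sketch)."""
    total = df.count()
    # treat null, empty strings, and pure whitespace as missing
    present = df.filter(
        F.col(col).isNotNull() & (F.trim(F.col(col).cast("string")) != "")
    )
    stats = {
        "column": col,
        "count": present.count(),
        "missing": total - present.count(),
        "min": df.agg(F.min(col)).first()[0],
        "max": df.agg(F.max(col)).first()[0],
        # values that occur more than 5 times in the column
        "frequent_values": [
            row[col]
            for row in df.groupBy(col).count().filter(F.col("count") > 5).collect()
        ],
    }
    # standard deviation only makes sense for numeric columns
    if dict(df.dtypes)[col] in ("int", "bigint", "float", "double"):
        stats["stddev"] = df.agg(F.stddev(col)).first()[0]
    return stats

profile = [profile_column(df, c) for c in df.columns]
with open("some_dataset_profile.json", "w") as out:
    json.dump(profile, out, indent=2, default=str)
```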

Semantic Profiling

Semantic analysis refers to deducing the type of real-world information that is depicted in a column. The first step of this process is to write an indicator function for each of these semantic types. Then, we use regular expressions, fuzzy string matching, and an external NLP library (Stanford's named-entity-recognition library) to classify entries as one of the various semantic types that we are dealing with.
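
As an illustration, an indicator function for one type might combine a regular-expression check with fuzzy matching against a reference list (the list below is a made-up placeholder; the project loads its reference data from the external databases listed further down):

```python
import re
from fuzzywuzzy import fuzz

# Hypothetical reference list of vehicle types; in practice this would be
# loaded from the NY DMV vehicle-type sheet referenced below.
VEHICLE_TYPES = ["SEDAN", "SUBN", "PICK", "VAN", "TAXI", "BUS"]

def is_vehicle_type(value, threshold=85):
    """Indicator function: does `value` look like a vehicle type? (sketch)"""
    value = value.strip().upper()
    # cheap structural check first: short alphabetic tokens only
    if not re.fullmatch(r"[A-Z/ ]{2,20}", value):
        return False
    # fuzzy string matching against the reference list
    return any(fuzz.ratio(value, v) >= threshold for v in VEHICLE_TYPES)

print(is_vehicle_type("Sedan"))   # True
print(is_vehicle_type("12345"))   # False
```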

There are 23 semantic types considered for this task. Some of these are:

  1. Vehicle Types

  2. Building Classification

  3. Park/PlayGround

  4. NYC Agencies List

Results:

For every file in our dataset, we identify a semantic type for each of its entries and produce a JSON file that records the number of occurrences of these semantic types in that file. Then, we merge these files into a single file that contains the outputs for the whole dataset.
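
A minimal sketch of the merge step, assuming each per-dataset output is a flat JSON object mapping semantic types to counts (the paths and file layout are illustrative, not the project's actual structure):

```python
import glob
import json
from collections import Counter

# Merge the per-dataset semantic-type counts into one file for the whole collection.
totals = Counter()
for path in glob.glob("task2_output/*.json"):
    with open(path) as f:
        counts = json.load(f)  # e.g. {"PERSON_NAME": 120, "VEHICLE_TYPE": 45}
    totals.update(counts)

with open("task2_merged.json", "w") as f:
    json.dump(dict(totals), f, indent=2)
```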

Data Analysis

Now we solve the third and arguably most important part of the problem: deriving insights from the data. Here, we analyse the complaint dataset sourced from NYC Open Data, with an emphasis on 311 complaints. We examine the frequency of 311 complaints for each borough and then propose explanations for the patterns observed.
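
A rough sketch of the per-borough aggregation with PySpark (the HDFS path and the Borough column name are assumptions based on the public 311 dataset; the actual task3.py may differ):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("311-analysis").getOrCreate()

# Hypothetical path to the 311 service-request dump on HDFS.
complaints = spark.read.csv(
    "hdfs:///user/example/311_service_requests.csv",
    header=True,
    inferSchema=True,
)

# Number of 311 complaints per borough, most affected boroughs first.
per_borough = (
    complaints
    .groupBy("Borough")
    .count()
    .orderBy(F.desc("count"))
)
per_borough.show()
```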

Libraries and resources used

Python libraries:

  1. datetime
  2. time
  3. seaborn
  4. json
  5. pandas
  6. numpy
  7. modin.pandas
  8. os
  9. string
  10. re
  11. fuzzywuzzy
  12. matplotlib etc.

External libraries:

  1. Named-entity-recognition (NER) library by Stanford

External databases:

  1. Vehicle types - https://data.ny.gov/api/assets/83055271-29A6-4ED4-9374-E159F30DB5AE

  2. Building classification- https://www1.nyc.gov/assets/finance/jump/hlpbldgcode.html

  3. Types of location- Google places api

  4. Parks- https://en.wikipedia.org/wiki/List_of_New_York_City_parks

  5. Colleges - https://en.wikipedia.org/wiki/List_of_colleges_and_universities_in_New_York_City

  6. Agencies- https://www1.nyc.gov/nyc-resources/agencies.page

  7. Area of study- https://www.princetonreview.com/majors/all

  8. Colors- https://data.ny.gov/api/assets/83055271-29A6-4ED4-9374-E159F30DB5AE
