NYCOpenData-Profiling-Analysis

Data profiling, quality assessment, and analysis of public datasets on NYC OpenData.

Dataset

Task 1 : Generic Profiling

Open data often comes with little or no metadata. You will profile a large collection of open data sets and derive metadata that can be used for data discovery, querying, and identification of data quality problems.

For each column in the dataset collection, you will extract the following metadata:

Number of non-empty cells
Number of empty cells (i.e., cells with no data)
Number of distinct values
Top-5 most frequent value(s)
Data types (a column may contain values belonging to multiple types)
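
The per-column counts listed above can be sketched in plain Python. This is a minimal illustration, not the project's actual implementation: the function name `basic_column_profile`, the output keys, and the rule that `None` and whitespace-only strings count as empty are all assumptions.

```python
from collections import Counter

def basic_column_profile(values, top_k=5):
    """Count empty/non-empty cells, distinct values, and the top-k frequent values."""
    # Assumption: None and whitespace-only strings are treated as empty cells.
    non_empty = [v for v in values if v is not None and str(v).strip() != ""]
    counts = Counter(non_empty)
    return {
        "number_non_empty_cells": len(non_empty),
        "number_empty_cells": len(values) - len(non_empty),
        "number_distinct_values": len(counts),
        # most_common() orders by descending frequency.
        "frequent_values": [value for value, _ in counts.most_common(top_k)],
    }

# Made-up sample column, for illustration only.
column = ["BROOKLYN", "QUEENS", "", "BROOKLYN", None, "BRONX"]
profile = basic_column_profile(column)
```

On a large collection the same counts would more likely be computed with a distributed engine; the logic per column stays the same.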

Identify the data types for each distinct column value as one of the following:

INTEGER (LONG)
REAL
DATE/TIME
TEXT
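
One possible way to assign each value to one of the four types above is to try the stricter parsers first and fall back to TEXT. The sketch below is an assumption about how this could be done, not the method used in this repository; in particular, the list of accepted date formats in `DATE_FORMATS` is hypothetical and would need to be extended for real data.

```python
import math
from datetime import datetime

# Hypothetical set of accepted date/time layouts; real data will need more.
DATE_FORMATS = (
    "%Y-%m-%d",
    "%m/%d/%Y",
    "%Y-%m-%dT%H:%M:%S",
    "%m/%d/%Y %I:%M:%S %p",
)

def infer_type(value):
    """Return one of 'INTEGER (LONG)', 'REAL', 'DATE/TIME', or 'TEXT'."""
    text = str(value).strip()
    try:
        int(text)
        return "INTEGER (LONG)"
    except ValueError:
        pass
    try:
        # Reject 'nan' and 'inf', which float() would otherwise accept.
        if math.isfinite(float(text)):
            return "REAL"
    except ValueError:
        pass
    for fmt in DATE_FORMATS:
        try:
            datetime.strptime(text, fmt)
            return "DATE/TIME"
        except ValueError:
            continue
    return "TEXT"
```

The order of the checks matters: every integer string also parses as a float, so the integer test has to come first.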

For each column, count the total number of values as well as the number of distinct values for each of the above data types.
For columns that contain at least one value of type INTEGER or REAL, report:

Maximum value
Minimum value
Mean
Standard Deviation
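
As a rough sketch, the four numeric statistics can be computed with the standard library. The function name `numeric_summary` and the output keys are hypothetical, and the choice of the population standard deviation (`statistics.pstdev`) rather than the sample one is an assumption, since the task description does not say which is expected.

```python
import math
import statistics

def numeric_summary(values):
    """Max, min, mean, and standard deviation of the numeric values in a column."""
    numbers = []
    for v in values:
        try:
            number = float(v)
        except (TypeError, ValueError):
            continue  # skip cells that are not numeric
        if math.isfinite(number):
            numbers.append(number)
    if not numbers:
        return None
    return {
        "max_value": max(numbers),
        "min_value": min(numbers),
        "mean": statistics.mean(numbers),
        # Assumption: population standard deviation.
        "stddev": statistics.pstdev(numbers),
    }

# Made-up sample column mixing numeric and non-numeric cells.
summary = numeric_summary(["1", "2", "3", "n/a", None])
```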

For columns that contain at least one value of type DATE/TIME, report:

Maximum value
Minimum value

For columns that contain at least one value of type TEXT report:

Top-5 Shortest value(s) (the values with shortest length)
Top-5 Longest value(s) (the values with longest length)
Average value length
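
The text statistics above could be computed along these lines. This is only a sketch: the name `text_summary` and the output keys are hypothetical, and two choices are assumptions, namely that the shortest and longest lists are taken over distinct values and that the average length is taken over all values, duplicates included.

```python
def text_summary(values, k=5):
    """Shortest and longest distinct values, plus the average value length."""
    # Sort distinct values by length, breaking ties alphabetically.
    distinct = sorted(set(values), key=lambda s: (len(s), s))
    return {
        "shortest_values": distinct[:k],
        "longest_values": distinct[::-1][:k],
        "average_length": sum(len(s) for s in values) / len(values) if values else 0.0,
    }

# Made-up sample column, for illustration only.
texts = ["NY", "BRONX", "QUEENS", "NY", "STATEN ISLAND"]
text_stats = text_summary(texts, k=2)
```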

Task 2 : Semantic Profiling

For each column, identify and summarize semantic types present in the column. These can be generic types (e.g., city, state) or collection-specific types (NYU school names, NYC agency).
For each semantic type T identified, enumerate all the values encountered for T in all columns present in the collection.
You will look for the following types and add one or more semantic type labels to the column metadata together with their frequency in the column:

Person name (Last name, First name, Middle name, Full name)
Business name
Phone Number
Address
Street name
City
Neighborhood
LAT/LON coordinates
Zip code
Borough
School name (Abbreviations and full names)
Color
Car make
City agency (Abbreviations and full names)
Areas of study (e.g., Architecture, Animal Science, Communications)
Subjects in school (e.g., MATH A, MATH B, US HISTORY)
School Levels (K-2, ELEMENTARY, ELEMENTARY SCHOOL, MIDDLE)
College/University names
Websites (e.g., ASESCHOLARS.ORG)
Building Classification (e.g., R0-CONDOMINIUM, R2-WALK-UP)
Vehicle Type (e.g., AMBULANCE, VAN, TAXI, BUS)
Type of location (e.g., ABANDONED BUILDING, AIRPORT TERMINAL, BANK, CHURCH, CLOTHING/BOUTIQUE)
Parks/Playgrounds (e.g., CLOVE LAKES PARK, GREENE PLAYGROUND)
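
For the more regular types in the list above (phone number, zip code, website, LAT/LON coordinates, borough), a simple starting point is pattern matching plus a small lookup set. The sketch below is illustrative only and is not the approach documented in this repository: the label names, the regular expressions, and the `label_value` and `label_column` helpers are all hypothetical, and types such as person or business names would need dictionaries or a trained model instead.

```python
import re
from collections import Counter

BOROUGHS = {"BRONX", "BROOKLYN", "MANHATTAN", "QUEENS", "STATEN ISLAND"}

# Hypothetical patterns; each is deliberately strict and US-centric.
MATCHERS = {
    "phone_number": re.compile(r"\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}"),
    "zip_code": re.compile(r"\d{5}(-\d{4})?"),
    "website": re.compile(
        r"(https?://)?(www\.)?[a-z0-9-]+(\.[a-z0-9-]+)*\.[a-z]{2,}(/\S*)?",
        re.IGNORECASE,
    ),
    "lat_lon_coordinates": re.compile(
        r"\(?\s*-?\d{1,2}\.\d+\s*,\s*-?\d{1,3}\.\d+\s*\)?"
    ),
}

def label_value(value):
    """Return a semantic label for one cell, or 'other' if nothing matches."""
    text = value.strip()
    if text.upper() in BOROUGHS:
        return "borough"
    for name, pattern in MATCHERS.items():
        if pattern.fullmatch(text):
            return name
    return "other"

def label_column(values):
    """Count how often each semantic label occurs in a column."""
    return Counter(label_value(v) for v in values)

# Made-up sample column, for illustration only.
sample = ["212-555-0100", "11201", "ASESCHOLARS.ORG", "BROOKLYN",
          "(40.7128, -74.0060)", "blue"]
labels = label_column(sample)
```

Because a column may mix several semantic types, the result is a frequency per label rather than a single label for the column.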

Task 3 : Data Analysis

Identify the three most frequent 311 complaint types by borough.
Are the same complaint types frequent in all five boroughs of the City?
How might you explain the differences?
How does the distribution of complaints change over time for certain neighborhoods and how could this be explained?
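
The first question above amounts to a grouped frequency count. A minimal sketch follows, assuming each 311 record is a dictionary with `borough` and `complaint_type` keys; those key names, the helper `top_complaints_by_borough`, and the sample records are hypothetical and do not come from the dataset.

```python
from collections import Counter, defaultdict

def top_complaints_by_borough(records, k=3):
    """Return the k most frequent complaint types for each borough."""
    per_borough = defaultdict(Counter)
    for record in records:
        borough = record.get("borough")
        complaint = record.get("complaint_type")
        if borough and complaint:  # skip records missing either field
            per_borough[borough][complaint] += 1
    return {b: counts.most_common(k) for b, counts in per_borough.items()}

# Made-up records, for illustration only.
records = [
    {"borough": "BROOKLYN", "complaint_type": "Noise"},
    {"borough": "BROOKLYN", "complaint_type": "Noise"},
    {"borough": "BROOKLYN", "complaint_type": "Heating"},
    {"borough": "QUEENS", "complaint_type": "Blocked Driveway"},
    {"borough": "QUEENS", "complaint_type": None},
]
top = top_complaints_by_borough(records)
```

Comparing the resulting lists across boroughs gives a direct answer to whether the same complaint types dominate everywhere.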

Data Visualizations

Types of complaints across the different boroughs

Distribution of "closed-dates" across the different boroughs

Heat Map Representing Status of Complaints Across The Different Boroughs

Heat Map Representing Count Of Complaints Across The Different Boroughs

Distribution of Complaint Types and their resolution dates

Types of complaints across various different locations

Heat Map representing the Types of complaints that are open in the Brooklyn region

Team

Support

If you found this useful, please consider starring (★) the repo so that it can reach a broader audience.

License

This project is licensed under the MIT License. Feel free to create a pull request to add implementations or suggest new ideas that make the analysis more insightful.

