Skip to content

This repository is meant to showcase my Data Analytics & Data Science projects.

Notifications You must be signed in to change notification settings

Karthikeyan-Pal/Data-Analytics-and-Science

Repository files navigation

Data-Analytics-and-Science

Analyzing US Economic Data & Creating Dashboard

Programming Language: Python

• Created a function make_dashboard from bokeh.plotting library that will produce a dashboard as well as an html file which can be used to share a dashboard.

• Created Gross Domestic Product (GDP) & Unemployment data frame from respective csv files.

• Produced a dashboard displaying Year Vs. GDP Change & Unemployment

Application of Classification Algorithms on Loan Dataset

Programming Language: Python

• Loaded a loan dataset and applied classification algorithms and found the best one for this specific dataset by accuracy evaluation methods.

• The Loan_train.csv data set includes details of 346 customers whose loan are already paid off or defaulted. It includes following fields: Loan_status, Principal, Terms, Effective_date, Due_date, Age, Education, Gender.

• Performed data exploratory analysis using seaborn library.

• Built a model and use the test set to report the accuracy of the model using following algorithms: K Nearest Neighbor(KNN), Decision Tree, Support Vector Machine, Logistic Regression.

• Evaluated the above model using metrics such as Jaccard index, F1-score & LogLoss.

COVID-19 Data Analysis

Programming Language: Python, SQL

BI Tool: Tableau

• Created a live COVID-19 vaccination, cases & deaths statistics dashboard; Performed web scraping, data cleaning in Python; Stored clean live dataset on SQL database as an extract data source for Tableau Dashboard.

• Pulled live data from Our World Data Github repository- https://github.com/owid/covid-19-data/tree/master/public/data maintained by John Hopkins University.

• Used sqlalchemy engine to store data frames into respective sql tables.

• Tableau Dashboard link- https://public.tableau.com/profile/karthikeyan.palanikumar#!/vizhome/Covid-19Analysis_16216368189340/VaccineDashboard

Earth Data Analysis

BI Tool: Tableau

• Performed web scraping to collect relevant data about Earth and answered each of the below question using data viz(please refer to the project output folder for viz)

  1. Total number of countries- continent wise- https://www.enchantedlearning.com/math/tables/reading/continents/

  2. Top 10 countries with largest land mass- https://www.worldometers.info/geography/largest-countries-in-the-world/

  3. Top 10 countries with smallest land mass- https://www.countries-ofthe-world.com/smallest-countries.html

  4. Top 10 countries with longest coastlines- https://www.jagranjosh.com/general-knowledge/top-10-countries-with-longest-coastline-in-the-world-1540468718-1

  5. Oceans along with number of countries sharing it.

  6. Top 10 countries with largest human population- https://www.worldometers.info/world-population/population-by-country/

  7. Top 10 countries with highest population density - https://www.worldometers.info/world-population/population-by-country/

  8. Top 10 countries with highest GDP- https://www.thebalance.com/gdp-by-country-3-ways-to-compare-3306012

  9. Total number of people based on sex- continent-wise - https://www.worldometers.info/world-population/world-population-gender-age.php

  10. Total number of people based on age- continent-wise- https://www.worldometers.info/world-population/world-population-gender-age.php

  11. Types of terrain and its type on earth- continent-wise- http://chartsbin.com/view/wwu#:~:text=The%20percentages%20of%20earth's%20land,land%20doesn't%20have%20topsoil

  12. Top pinnacle points on earth along with height- https://www.infoplease.com/world/geography/highest-mountain-peaks-world

  13. Top nadir points on earth along with depth- https://en.wikipedia.org/wiki/Extreme_points_of_Earth

  14. Top hottest places on earth along with temperature- https://www.earth.com/news/10-hottest-places-earth/

  15. Top coldest places on earth along with temperature- https://www.sciencefocus.com/planet-earth/what-are-the-10-coldest-places-on-earth/

  16. Top windiest places on earth along with speed- https://www.earth.com/news/10-windiest-places-earth/

  17. Top elements on earth’s crust- https://www.mindat.org/a/common_minerals

  18. Highly populated species on earth- https://www.npr.org/sections/thetwo-way/2011/11/03/141946751/along-with-humans-who-else-is-in-the-7-billion-club

  19. Large fresh water sources on earth- https://www.the71percent.org/the-worlds-fresh-water-sources/

  20. Highly used fuel resources on the earth- https://www.visualcapitalist.com/energy-consumption-by-source-and-country-1969-2018/

Engineering Applications of Stochastic Processses

Software: Wolfram Mathematica - Technical Computing Software

• Created functions to calculate probabilities of Binomial Random Variables.

• Calculated probabilities & plotted functions of Single Sampling Plan.

• Calculated probabilities & plotted ASN function of Double Sampling Plan with & without curtailment.

• Computed probabilities, plotted AOQ & AFI function & deduced its maximum value for Single Sample Rectifying Inspection Plan.

• Computed probabilities, plotted AOQ & AFI function of Continuous Sampling Plan (CSP-1).

• Deduced Modelling Arrival Times using Poisson Process probabilities.

• Constructed probability function & plotted for Competing Poisson Process & Non- Homogeneous Poisson Process.

• Constructed Availability & Average Availability functions & plotted for Continuous-time Markov Chain Model of CFR Repairable System Models.

• Constructed & plotted functions of Availability based on optimal age-based PM policy in Age Replacement Model.

House Sales in King County USA

Programming Language: Python

• This dataset contains house sale prices for King County, which includes Seattle. It includes homes sold between May 2014 and May 2015. The dataset contains the following fields,

   o	id :a notation for a house

   o	date: Date house was sold

   o	price: Price is prediction target

   o	bedrooms: Number of Bedrooms/House

   o	bathrooms: Number of bathrooms/bedrooms

   o	sqft_living: square footage of the home

   o	sqft_lot: square footage of the lot

   o	floors :Total floors (levels) in house

   o	waterfront :House which has a view to a waterfront

   o	view: Has been viewed

   o	condition :How good the condition is Overall

   o	grade: overall grade given to the housing unit, based on King County grading system

   o	sqft_above :square footage of house apart from basement

   o	sqft_basement: square footage of the basement

   o	yr_built :Built Year

   o	yr_renovated :Year when house was renovated

   o	zipcode:zip code

   o	lat: Latitude coordinate

   o	long: Longitude coordinate

   o	sqft_living15 :Living room area in 2015(implies-- some renovations) This might or might not have affected the lotsize area

   o	sqft_lot15 :lotSize area in 2015(implies-- some renovations)

• In data wrangling, used the method describe() to obtain a statistical summary of the dataset. Replaced the missing values using the method replace()

• Performed exploratory data analysis using box plot, regression plot from seaborn library.

• Utilized the Pandas method corr() to find the feature other than price that is most correlated with price.

• Developed a linear regression model from sklearn library to predict the ‘price’ using the list of features and calculated the R^2

• Split the data into training and testing set. Created and fit a Ridge regression object using the training data, setting the regularization parameter to 0.1 and calculated the R^2 using the test data.

Loan Report

Programming Language: Python

• The dataset Loan.csv contains 2000 records of the following fields:

o	 'LOAN_NUMBER'

o	 'SAMPLE_DATE'

o	 'FILE_REQ_DATE'

o	 'SECOND_REQUEST_DATE'

o	 'SENT_TO_IMAGING_DATE'

o	 'LENDER_RESPONSE_DUE_DATE'

o	 'DATE_LOAN_FINALIZED'

o	 'LENDER_ID'

o	 'LENDER_NAME'

o	 'LENDER_LOAN_ID'

o	 'PROP_STATE'

o	 'NEW_CONSTRUCTION_INDICATOR'

o	 'CONDOMINIUM_INDICATOR'

o	 'LOAN_ORIG_DATE'

o	 'CURRENT_BALANCE '

o	 'FICO_SCORE'

o	 'LTV'

o	 'ORIG_VALUE'

o	 'AVM_VALUE'

o	 'FIELD_REVIEW_VALUE_SUPPORTED'

o	 'FIELD_REVIEW_VALUE'

o	 'PURPOSE_CODE'

o	 'OCCUPANCY_CODE'

o	 'LENDER_INST_TYPE_DESCRIPTION'

o	 'UNDERWRITER_NAME'

o	 'REVIEW_DATE'

o	 'REVIEW_STATUS'

o	 'DEAL_NAME'

o	 'START_DATE'

• Grouped data by Institution types & aggregated 'Loan Count', 'Average Current Balance', 'Max Current Balance', 'Min Current Balance' to create statistical report 1 and stored it to a csv file.

• Grouped data by Loan_to_value_cohorts & aggregated 'Loan Count', 'Average Current Balance', 'Max Current Balance' ,'Min Current Balance' to create statistical report 2 and stored it to a csv file.

• Grouped data by Loan_to_value_cohorts & aggregating 'Loan Count', 'Average Current Balance', 'Max Current Balance', 'Min Current Balance' to create statistical report 3 and stored it to a csv file.

Maps, Waffle Charts, Word Cloud, Regression Plots

Programming Language: Python

• The Dataset: Immigration to Canada from 1980 to 2013 contains annual data on the flows of international migrants as recorded by the countries of destination. The data presents both inflows and outflows according to the place of birth, citizenship, or place of previous / next residence both for foreigners and nationals.

• create_wafflechart is created from scratch to display progress toward goals using Matplotlib library. This function would take the following parameters as input:

1.	categories: Unique categories or classes in dataframe.

2.	values: Values corresponding to categories or classes.

3.	height: Defined height of waffle chart.

4.	width: Defined width of waffle chart.

5.	colormap: Colormap class.

6.	value_sign: In order to make our function more generalizable, we will add this parameter to address signs that could be associated with a value such as %, $, and so on.          value_sign has a default value of empty string.

• Word clouds are created out of alice_novel.txt data using wordcloud library and superimposed the words onto a mask of any shape. Word clouds (also known as text clouds or tag clouds) work in a simple way: the more a specific word appears in a source of textual data (such as a speech, blog post, or database), the bigger and bolder it appears in the word cloud.

• Created regression plots between ‘Total Immigration’ & ‘Year’ using seaborn library.

• Created a scatter plot with a regression line to visualize the total immigration from Denmark, Sweden, and Norway to Canada from 1980 to 2013.

Summarizing Sales Data

Programming Language: Python

• Imported 3 csv files- category_sales, date_dim, store_lookup & converted it into data frames using pandas library.

• Merged the above 3 dataframes, identified the missing values(null values) in running_sales and replaced it with average value.

• Created a summary table containing ‘channel-rollup’, ‘rtl_yr’, ‘avg_running_sales’ & ‘tot_running_sales’

Twitter Sentiment & Text Analysis

Programming Language: Python

• Secured access to Twitter data using an application under Twitter developer account & extracted the recent 100 tweets from a Twitter user by creating an API object.

• After transforming into pandas data frame & cleaning, created two user-defined functions -' Subjectivity' & 'Polarity' to evaluate the sentiment of each tweet.

• Using the WordCloud library, created a word cloud which is a cluster of words depicted in different sizes. The more a specific word appears in a source of textual data, the bigger and bolder it appears in the word cloud.

• Classified each tweet as 'Positive,' 'Negative,' or 'Neutral' by defining a user-defined function getAnalysis() that evaluates a tweet based on a 'Polarity' score.

• Created a scatter plot between polarity & subjectivity for all '100' tweets and a bar plot showing the value counts of Positive, Negative & Neutral tweets.

• Split the data frame into strings to stem the words to their root using natural language tool kit Snowball Stemmer. After calculating word frequency, a bar plot shows the top 20 words used in the 100 recent tweets.

• Created a user-defined function show_ents() to scan each string and assign an entity to it & displayed a bar chart showing top organizations mentioned in the 100 tweets.

Analysis of IT Spending

BI Tool: Microsoft Power BI

• Imported 9 CSV files into Power BI Desktop & converted them into 7 relational tables and established inter-relationships among those tables in the Model pane.

• Created 3 new measures- Budget running total, Actual running total & Forecast running total from existing columns in each table and visualized Budget vs. Forecast % in the form of Stacked bar chart by Region, Cost Element Group & IT Area. For more viz on this report, please refer to Headline.PNG

• Created 2 interactive matrices to show the difference between Budget & Forecast by Region, IT Area & Cost Element Group. Also, displayed a Filled Map showing actual spending by Country. For more viz on this report, please refer to Region.PNG

• Analyzed Budget vs. Forecast in the form of interactive Decomposition tree & explained it by Cost Element Group, Cost Element, Country, IT Department. To view this report, please refer to Decomposition.PNG