grumpyclimber/portfolio

Update

This repo was created when I was learning how to code (so I could get my first job in data). I haven't updated the repo in ages, and looking at some of my code now, with real-world experience, can be... hmm, quite entertaining :) Nevertheless, it's a good indicator of how your projects can look if you're also trying to get into the data world.

Portfolio

Hello world, here are my humble beginnings of a portfolio! Most of the projects below are based on well-known datasets, but in each and every one of them I've pushed myself to go beyond the standard scope of the project. If I couldn't find anything new in the data, and my project would have looked the same as every other notebook, I collected new data (like in the Fandango project). No idea or room for new data? Then I grabbed a brush and worked on the visualizations (e.g. Star Wars). At this stage I'm focusing on machine learning projects. My newest addition is a project feedback notebook, where I used a web scraper to extract all the comments on students' guided projects and analyzed that data.

The portfolio is divided into 3 parts:

🔍 1. Exploratory Data Analysis

🎰 2. Machine Learning

🪄 3. Tricks and intros

Get in touch: LinkedIn icon  Stack icon  

Exploratory Data Analysis

🚗 Ebay cars - A deeper analysis of a well-known car dataset.

This dataset very often serves as an introduction to pandas. Students focused on surviving their first coding project often forget to unleash their curiosity. Because of that, the dataset has a lot of untapped potential: extracting engine size from the car names, identifying sonstige_autos, and identifying the issue with post-2015 entries, to name a few. It's also a perfect dataset for a basic introduction to geopandas.
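A minimal sketch of the engine-size idea, assuming the listing names carry the displacement as a one-decimal litre figure such as `1.4` or `2,0`. The function name, the regular expression, and the sample names are illustrative assumptions, not the notebook's actual code:

```python
import re

# A single digit, a dot or comma, and a single digit, not embedded in a longer number.
ENGINE_SIZE = re.compile(r"(?<![\d.,])(\d)[.,](\d)(?![\d.,])")


def extract_engine_size(name):
    """Return the engine size in litres found in a listing name, or None."""
    match = ENGINE_SIZE.search(name)
    if match is None:
        return None
    # Normalise the decimal comma used in German listings to a dot.
    return float(f"{match.group(1)}.{match.group(2)}")


# Illustrative listing names (assumed format).
for listing in ["Golf_4_1.4__3TÜRER", "Audi_A4_2,0_TDI", "BMW_320d"]:
    print(listing, extract_engine_size(listing))
```

With pandas, the same function could be applied to the name column via `Series.apply`. Names without a recognisable figure simply yield `None`.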

👾 Star Wars - This one is all about the style...

The Star Wars fan survey is a small dataset that doesn't give us a lot of potential for analysis. To make it more interesting, I decided to work on the visuals of this notebook: custom fonts, color palettes, and lots of plots. I've even plotted a Death Star. The force is strong with this one.

🎥 Fandango - Extended version of Fandango ratings analysis.

To dig deeper into Fandango's rating shift, I've gathered more data, specifically distribution company and budget data for each movie. I've set up a BeautifulSoup scraper to get the required information from Wikipedia. That gives us a better look at how movie budgets and distributors affect the ratings.

🚑 Road fatalities - A basic analysis of road fatalities on Australian roads

A bit of a break from recent ML projects - a quick EDA on a relatively simple dataset. Australian roads have become much safer over the last 30 years. But that change doesn't affect everybody equally: some social groups and locations account for a growing share of road fatalities.


Machine Learning

🚙 ML car prices - Introduction to ML with the k-nearest neighbors algorithm.

I've extended the project by testing multiple random seeds, checking many column combinations, and comparing different dataframe versions (based on cleaning techniques).
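A rough sketch of the random-seed check, assuming scikit-learn's `KNeighborsRegressor` and a simple train/test split. The data is synthetic and the helper name `rmse_per_seed` is made up for illustration, so treat it as one possible setup and not the notebook's code:

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor


def rmse_per_seed(X, y, seeds, k=5):
    """Return {seed: test RMSE} for a k-NN regressor, one train/test split per seed."""
    scores = {}
    for seed in seeds:
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.25, random_state=seed
        )
        model = KNeighborsRegressor(n_neighbors=k).fit(X_train, y_train)
        mse = mean_squared_error(y_test, model.predict(X_test))
        scores[seed] = float(np.sqrt(mse))
    return scores


# Synthetic stand-in data (made up): a price driven by two features plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 2))
y = 10_000 + 5_000 * X[:, 0] - 2_000 * X[:, 1] + rng.normal(0, 300, size=200)

scores = rmse_per_seed(X, y, seeds=range(5))
values = list(scores.values())
print(f"RMSE mean {np.mean(values):.0f}, std {np.std(values):.0f}")
```

The spread of RMSE across seeds gives a feel for how much of a score difference between two column combinations is just split noise.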

🏠 ML house prices - Building a linear regression model to predict house prices.

Multiple feature engineering layers merge various numeric and categorical columns into one. The project also uses feature selection techniques and tests different outlier removal methods.

🚕 ML NYC taxi trips - An ongoing project with large datasets of NYC taxi trips.

The core idea of this project is to experience working with large data: using pandas big-data techniques or the Dask library to manage importing and merging datasets, all while trying to fit within the strict memory limits of a Kaggle notebook.
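One common pandas technique for this is reading a file in chunks with explicit, smaller dtypes and aggregating as you go. The sketch below assumes that approach; the column names and values are made up, and a tiny in-memory CSV stands in for a real trip file:

```python
import io

import pandas as pd

# Tiny in-memory CSV standing in for a multi-gigabyte trip file (made-up values).
csv_data = io.StringIO(
    "passenger_count,trip_distance\n"
    "1,2.5\n"
    "2,0.8\n"
    "1,7.1\n"
    "3,1.2\n"
)

total_distance = 0.0
row_count = 0

# Smaller dtypes cut memory per column; chunksize caps the rows held in memory at once.
for chunk in pd.read_csv(
    csv_data,
    dtype={"passenger_count": "int8", "trip_distance": "float32"},
    chunksize=2,
):
    total_distance += float(chunk["trip_distance"].sum())
    row_count += len(chunk)

mean_distance = total_distance / row_count
print(f"{row_count} rows, mean distance {mean_distance:.2f}")
```

Only running totals survive each iteration, so peak memory depends on the chunk size and not on the file size.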

📋 Project feedback - Scraping and analyzing project feedback.

Another BeautifulSoup scraping session gathered feedback on all of the published projects on the Dataquest forum. Having gathered a lot of text data, I tested different NLP techniques and applied supervised and unsupervised machine learning models to analyze it.

🚲 Bike Sharing - Using multiple regression models to predict rental count

Random Forest hyperparameter optimization using GridSearch, gathering more weather data using meteostat, testing various regression models, and small steps into stacking models: averaging predictions of multiple models and using a neural network as a meta-model.
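The simplest of those stacking steps, averaging the predictions of several models, might look like the sketch below. It assumes two scikit-learn regressors and synthetic data in place of the bike rental features. The model choice and every number are illustrative, not taken from the notebook:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (made up): a partly linear, partly non-linear target.
rng = np.random.default_rng(42)
X = rng.uniform(0, 1, size=(300, 3))
y = 100 * X[:, 0] + 50 * np.sin(6 * X[:, 1]) + rng.normal(0, 5, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

models = [
    LinearRegression(),
    RandomForestRegressor(n_estimators=50, random_state=0),
]

# One column of test-set predictions per fitted model.
predictions = np.column_stack(
    [model.fit(X_train, y_train).predict(X_test) for model in models]
)

# Plain average across models; a meta-model would learn the weights instead.
blended = predictions.mean(axis=1)
```

Whether the blend beats the individual models depends on the data, so it is worth scoring `blended` and each column of `predictions` against `y_test` side by side.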


Tricks and intros

🔡 Scraping data - scraping data from Wikipedia pages.

An introduction to web scraping with BeautifulSoup: we'll develop a function to extract budget data from the website.
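A hedged sketch of what such a function could look like, assuming the budget sits in a "Budget" row of an infobox table. The HTML below is hand-written to imitate that layout, and the function name `extract_budget` is an assumption. A real page would first have to be downloaded, and its markup may differ:

```python
from bs4 import BeautifulSoup

# Hand-written HTML imitating a film infobox (not fetched from any real page).
SAMPLE_HTML = """
<table class="infobox">
  <tr><th>Directed by</th><td>Jane Doe</td></tr>
  <tr><th>Budget</th><td>$150 million</td></tr>
</table>
"""


def extract_budget(html):
    """Return the text of the infobox "Budget" row, or None if it is missing."""
    soup = BeautifulSoup(html, "html.parser")
    infobox = soup.find("table", class_="infobox")
    if infobox is None:
        return None
    for row in infobox.find_all("tr"):
        header = row.find("th")
        value = row.find("td")
        if header is None or value is None:
            continue
        if header.get_text(strip=True) == "Budget":
            return value.get_text(strip=True)
    return None


print(extract_budget(SAMPLE_HTML))
```

The function returns the raw cell text. Turning that into a number is a separate cleaning step.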

🎣 Tricks - Mix of short and easy tricks, hacks and intros.

Giving back, improving on others' work, and explaining your own work are essential parts of learning how to code. In this folder I'll try to include some of my notebooks that can be helpful.

🌐 Maps - Quick and easy intro to geopandas.

Using the eBay dataset for a quick tutorial on geospatial visualization with Geopandas.


languages: Python, HTML

libraries:

  • Pandas
  • Numpy
  • Matplotlib
  • Geopandas
  • Seaborn
  • Scikit-learn
  • Wikipedia
  • Missingno
  • BeautifulSoup
  • Dask
  • Textwrap
  • Meteostat

Adam Kubalica
