shejz/Machine-Learning-Projects

Machine-Learning-Projects

Solved end-to-end machine learning projects

In this data science project, you will develop automated methods for predicting the cost and severity of insurance claims.

Description:

When you've been devastated by a serious car accident, your focus is on the things that matter the most: family, friends, and other loved ones. Pushing paper with your insurance agent is the last place you want your time or mental energy spent. This is why Allstate, a personal insurer in the United States, is continually seeking fresh ideas to improve their claims service for the over 16 million households they protect.

  • Basic exploratory analysis using the claims data
  • Insights from exploratory data analysis
  • Factors to be considered for claims processing and severity prediction
  • Implementation of the model using R
  • Building smarter predictive models including XGBoost

In this machine learning project, we will use binary leaf images and extracted features, including shape, margin, and texture to accurately identify plant species using different benchmark classification techniques.

Description:

The objective of this machine learning project is to use binary leaf images and extracted features, including shape, margin, and texture, to accurately identify 99 species of plants. Leaves, due to their volume, prevalence, and unique characteristics, are an effective means of differentiating plant species. They also provide a fun introduction to applying techniques that involve image-based features. We are going to apply different classification techniques to benchmark how well each classifier performs on an image classification problem.

  • Image Processing
  • Feature selection
  • Classifier comparison
  • Benchmarking
  • Prediction

There are different time series forecasting methods for forecasting stock prices, demand, and so on. In this machine learning project, you will learn which forecasting method to use in which situation and how to apply it, using a time series forecasting example.

Description:

In this machine learning project, we will take publicly available open source datasets and discuss various methods and techniques for time series forecasting. We will discuss traditional methods such as the Holt-Winters method, the autoregressive integrated moving average (ARIMA) method, and exponential smoothing methods, and we will compare them with modern forecasting methods based on neural network models.

  • Understanding the importance of time series
  • Understanding the mathematics of time series
  • Discussion about methods/techniques
  • Application of the models using R or Python
  • Making conclusions
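
As a rough illustration of the exponential smoothing family mentioned above, the following is a minimal sketch of simple exponential smoothing in plain Python. It is not the project's code: the function name `simple_exponential_smoothing`, the toy demand series, and the choice `alpha=0.5` are all assumptions made for this example. Holt-Winters extends the same idea with trend and seasonal components, which are not shown here.

```python
def simple_exponential_smoothing(series, alpha):
    """Return the smoothed level after each observation.

    level_t = alpha * y_t + (1 - alpha) * level_{t-1}, with level_0 = y_0.
    """
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    if not series:
        raise ValueError("series must not be empty")
    levels = [series[0]]
    for y in series[1:]:
        levels.append(alpha * y + (1 - alpha) * levels[-1])
    return levels


# Toy demand series (made-up numbers, for illustration only).
demand = [10.0, 12.0, 11.0, 13.0]
levels = simple_exponential_smoothing(demand, alpha=0.5)

# For simple exponential smoothing, the one-step-ahead forecast
# is the last smoothed level.
forecast = levels[-1]
print(levels, forecast)
```

A larger `alpha` makes the forecast react faster to recent observations, while a smaller one smooths more heavily.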

In this data science project, we will predict credit card fraud in a transactional dataset using several predictive models.

  • Handle imbalanced data
  • Create a classifier
  • Compare accuracy
  • Use deep learning to classify
  • Implementation using R

The Credit Card Fraud Detection dataset contains transactions made by credit cards in September 2013 by European cardholders. This dataset presents transactions that occurred in two days, where we have 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions.
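
The imbalance figures above can be checked with a few lines of arithmetic. The following is a minimal sketch in Python, not the project's code. The helper name `balanced_class_weights` is hypothetical, and the weighting formula n_samples / (n_classes * class_count) is one common heuristic assumed here for illustration.

```python
# Counts taken from the dataset description.
n_total = 284_807
n_fraud = 492
n_legit = n_total - n_fraud

fraud_rate = n_fraud / n_total
print(f"fraud rate: {fraud_rate:.2%}")

# A model that always predicts "not fraud" is correct on every
# legitimate transaction, so its accuracy equals the legitimate share.
baseline_accuracy = n_legit / n_total
print(f"always-negative accuracy: {baseline_accuracy:.2%}")


def balanced_class_weights(counts):
    """Hypothetical helper: weight = n_samples / (n_classes * class_count)."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {label: n_samples / (n_classes * c) for label, c in counts.items()}


# The rare class receives a much larger weight than the common class.
weights = balanced_class_weights({"legit": n_legit, "fraud": n_fraud})
print(weights)
```

Because a classifier that never flags fraud already scores very high accuracy on this data, precision and recall on the fraud class are more informative than accuracy alone when comparing models.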

The dataset has been collected and analyzed during a research collaboration of Worldline and the Machine Learning Group of ULB (Université Libre de Bruxelles) on big data mining and fraud detection. More details on current and past projects on related topics are available on http://mlg.ulb.ac.be/BruFence and http://mlg.ulb.ac.be/ARTML

Description:

The weekly sales transaction dataset consists of weekly purchased quantities of 800 products over 52 weeks. Normalised values are provided too. The objective of this data science project in R is to find product bundles that can be put on sale together. Market Basket Analysis is typically used to identify such bundles; here we are going to compare the relative importance of time series clustering in identifying product bundles.

  • Time series clustering
  • K-means
  • Hierarchical clustering (HC)
  • Model Based clustering
  • Comparison of clustering

Build a recommendation engine that predicts which products an Instacart consumer will purchase again.

  • Read data from large size files
  • Perform Exploratory Data Analysis (EDA)
  • Apply logic to derive insights
  • Create association rule model
  • Implementation using R

Description:

Whether you shop from meticulously planned grocery lists or let whimsy guide your grazing, our unique food rituals define who we are. Instacart, a grocery ordering and delivery app, aims to make it easy to fill your refrigerator and pantry with your personal favorites and staples when you need them. After selecting products through the Instacart app, personal shoppers review your order and do the in-store shopping and delivery for you.

Instacart’s data science team plays a big part in providing this delightful shopping experience. Currently, they use transactional data to develop models that predict which products a user will buy again, try for the first time, or add to their cart next during a session. Recently, Instacart open-sourced this data; see their blog post, "3 Million Instacart Orders, Open Sourced" (https://tech.instacart.com/3-million-instacart-orders-open-sourced-d40d29ead6f2).

In this data science project, we are going to use this anonymized data on customer orders over time to predict which previously purchased products will be in a user's next order.
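
The association rule model listed in the outline for this project rests on three standard measures: support, confidence, and lift. The following is a minimal sketch in plain Python of how they are computed for a single rule. The function name `rule_metrics` and the four toy orders are made up for illustration and are not taken from the Instacart data.

```python
def rule_metrics(baskets, antecedent, consequent):
    """Support, confidence and lift for the rule antecedent -> consequent."""
    n = len(baskets)
    a = set(antecedent)
    c = set(consequent)
    n_a = sum(1 for b in baskets if a <= b)           # baskets containing antecedent
    n_c = sum(1 for b in baskets if c <= b)           # baskets containing consequent
    n_both = sum(1 for b in baskets if (a | c) <= b)  # baskets containing both
    support = n_both / n
    confidence = n_both / n_a if n_a else 0.0
    lift = confidence / (n_c / n) if n_c else 0.0
    return support, confidence, lift


# Toy orders (made-up data, for illustration only).
orders = [
    {"banana", "milk"},
    {"banana", "milk", "bread"},
    {"banana", "bread"},
    {"milk"},
]
support, confidence, lift = rule_metrics(orders, {"banana"}, {"milk"})
print(support, confidence, lift)
```

A lift above 1 suggests the two item sets appear together more often than they would if they were independent; a lift below 1 suggests the opposite.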

Work with KKBOX's Music Recommendation System dataset to build the best music recommendation engine.

Description:

The 11th ACM International Conference on Web Search and Data Mining (WSDM 2018) is challenging you to build a better music recommendation system using a donated dataset from KKBOX. WSDM (pronounced "wisdom") is one of the premier conferences on web-inspired research involving search and data mining. They're committed to publishing original, high-quality papers and presentations, with an emphasis on practical but principled novel models. KKBOX currently uses a collaborative filtering based algorithm with matrix factorization and word embedding in its recommendation system but believes new machine learning techniques could lead to better results.

In this machine learning project, you will be asked to predict the chances of a user listening to a song repetitively after the first observable listening event within a time window was triggered. If there are recurring listening event(s) triggered within a month after the user's very first observable listening event, its target is marked 1, and 0 otherwise in the training set. The same rule applies to the testing set.
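
The labelling rule above can be sketched as a small function. This is a hedged illustration in Python, not KKBOX's actual labelling code: the function name `repeat_target` is hypothetical, "within a month" is assumed to mean 30 days, and the dates are made up.

```python
from datetime import date, timedelta


def repeat_target(first_listen, later_listens, window_days=30):
    """Return 1 if any later listen falls within window_days of the first, else 0.

    later_listens must not include the first listening event itself.
    """
    deadline = first_listen + timedelta(days=window_days)
    return int(any(first_listen <= d <= deadline for d in later_listens))


# Made-up listening dates for one user-song pair.
first = date(2017, 1, 1)
target_repeat = repeat_target(first, [date(2017, 1, 20)])  # repeat inside the window
target_late = repeat_target(first, [date(2017, 3, 1)])     # repeat after the window
target_none = repeat_target(first, [])                     # no repeat at all
print(target_repeat, target_late, target_none)
```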

KKBOX provides a music dataset that consists of information of the first observable listening event for each unique user-song pair within a specific time duration. Metadata of each unique user and song pair is also provided. The use of public data to increase the level of accuracy of your prediction is encouraged.

The train and the test data are selected from users' listening history in a given time period. Note that this time period is chosen to be before the WSDM-KKBox Churn Prediction time period. The train and test sets are split based on time, and the split of public/private is based on unique user/song pairs.

  • Working with music data with several categories
  • EDA using several Visualization techniques
  • Building Automated Recommendation Engine
  • Solve this use case using Python and R
  • Tuning parameters to improve the algorithm

In this project, we are going to talk about Time Series Forecasting to predict the electricity requirement for a particular house using Prophet.

Description:

There are various methods to perform time series forecasting. Traditionally people have used AR, MA or ARIMA based models to perform forecasting. Prophet is an open source forecasting tool built by Facebook. It can be used for time series modeling and forecasting trends in the future. The advantage of using Prophet over traditional libraries is that one does not need to know the technicalities of time series; domain knowledge is not really required to do time series forecasting. In this Hackerday we are going to benchmark Prophet against other methods.

  • Time series forecasting using ARIMA
  • Time series forecasting using Prophet
  • Implementing Prophet
  • Knowing advantages of Prophet
  • Using Bayesian Method of forecasting

The aim of this project is to build a predictive model, and use historical data to predict sales for each particular product in different stores.

The data covers sales of 1559 products across 10 stores of the Big Mart chain in various cities. I will try to understand which properties of products and stores play a key role in increasing sales.

The train and test data can be found at Analytics Vidhya's Big Mart Sales Prediction Challenge.

Predict customer churn in the telecom sector and find the key drivers that lead to churn. Learn how a logistic regression model in R can be used to identify customer churn in a telecom dataset.

Description:

Customer churn refers to a customer's decision to end the business relationship. It is also referred to as the loss of clients or customers. Customer loyalty and customer churn always add up to 100%. If a firm has a 60% loyalty rate, then its churn rate is 40%. As per the 80/20 customer profitability rule, 20% of customers generate 80% of revenue. So, it is very important to predict which users are likely to churn and which factors affect their decisions. Here we are going to show how a logistic regression model in R can be used to identify customer churn in the telecom dataset.

  • Understand the customer behavior
  • Understand reasons for churn
  • What are the top factors
  • How to retain customers
  • Apply multiple classification models
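
To make the logistic regression idea concrete, the following is a minimal sketch of a one-feature logistic regression fitted by batch gradient descent. It is written in Python rather than R and is not the project's model. The feature (monthly support calls), the toy data, the learning rate, and the function names `sigmoid` and `fit_logistic` are all assumptions made for this example.

```python
import math


def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, lr=0.1, epochs=2000):
    """Fit p(churn) = sigmoid(w * x + b) by batch gradient descent on log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(w * x + b) - y  # derivative of log loss w.r.t. the score
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


# Made-up toy data: monthly support calls and whether the customer churned.
calls = [0, 1, 1, 2, 3, 4, 5, 6]
churned = [0, 0, 0, 0, 1, 1, 1, 1]

w, b = fit_logistic(calls, churned)
p_low = sigmoid(w * 0 + b)   # estimated churn probability with 0 calls
p_high = sigmoid(w * 6 + b)  # estimated churn probability with 6 calls
print(w, b, p_low, p_high)
```

In a sketch like this, the sign and size of `w` indicate how strongly the feature is associated with churn, which is the kind of reading used to identify key churn drivers.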

In this data science project, you will predict a borrower's chance of defaulting on credit loans by building a credit score prediction model.

Description:

Banks often depend on credit score prediction models to approve or deny a loan request. A good prediction model is necessary for a bank so that it can provide maximum credit without exceeding the risk threshold. This data science project uses a credit score dataset with a fairly large volume of data (250K). The predictive models will be built following various approaches: random forests, gradient boosting and logistic regression. At the end of the project you will build a predictive model that automatically scores each applicant with a credit score that is human readable and easy to interpret.

Given an employee's job role, predict their access needs using the Amazon employee database.

Description:

When an employee at any company starts work, they first need to obtain the computer access necessary to fulfill their role. This access may allow an employee to read/manipulate resources through various applications or web portals. It is assumed that employees fulfilling the functions of a given role will access the same or similar resources. It is often the case that employees figure out the access they need as they encounter roadblocks during their daily work (e.g. not able to log into a reporting portal). A knowledgeable supervisor then takes time to manually grant the needed access in order to overcome access obstacles. As employees move throughout a company, this access discovery/recovery cycle wastes a nontrivial amount of time and money.

There is a considerable amount of data regarding an employee’s role within an organization and the resources to which they have access. Given the data related to current employees and their provisioned access, models can be built that automatically determine access privileges as employees enter and leave roles within a company. In this data science project, we will build an auto-access model that minimizes the human involvement required to grant or revoke employee access.

In this machine learning project, you will uncover the predictive value in an uncertain world by using various artificial intelligence, machine learning, advanced regression and feature transformation techniques.

Description:

Two Sigma is a technology company dedicated to finding value in the world’s data. Since its founding in 2001, Two Sigma has built an innovative platform that combines extraordinary computing power, vast amounts of information, and advanced data science to produce breakthroughs in investment management, insurance, and related fields. Economic opportunity depends on the ability to deliver singularly accurate forecasts in a world of uncertainty.

By accurately predicting financial movements, you will learn about scientifically-driven approaches to unlocking significant predictive capability.

Two Sigma is excited to find predictive value and gain a better understanding of the skills offered by the global data science crowd.

  • Application of linear regression
  • Application of non-linear regression
  • Application of LASSO and elastic net regression
  • Application of XGBoost model
  • Interpretation of models

The Iris dataset is a classic dataset from the 1930s; it is one of the first modern examples of statistical classification.

The dataset is a collection of morphological measurements of several Iris flowers. These measurements will enable us to distinguish multiple species of the flowers. Today, species are identified by their DNA fingerprints, but in the 1930s, DNA's role in genetics had not yet been discovered.

In this machine learning and IoT project, we are going to train various predictive models on the experimental data, test them, and break down the energy usage.

  • Multiple linear regression
  • Support vector machine with radial kernel
  • Random forest and Gradient boosting machines (GBM)
  • Models trained with repeated cross-validation and evaluated on a test set

Description:

This IoT project presents and discusses data-driven predictive models for the energy use of appliances. Data used include measurements of temperature and humidity sensors from a wireless network, weather from a nearby airport station and recorded energy use of lighting fixtures. The machine learning project discusses data filtering to remove non-predictive parameters and feature ranking. The data set is sampled at 10-minute intervals over about 4.5 months. The house temperature and humidity conditions were monitored with a ZigBee wireless sensor network. Each wireless node transmitted the temperature and humidity conditions about every 3.3 minutes. Then, the wireless data was averaged over 10-minute periods. The energy data was logged every 10 minutes with m-bus energy meters. Weather from the nearest airport weather station (Chievres Airport, Belgium) was downloaded from a public data set from Reliable Prognosis (rp5.ru) and merged with the experimental data sets using the date and time column. Two random variables have been included in the data set for testing the regression models and to filter out non-predictive attributes (parameters).
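
The step of averaging sensor readings into 10-minute periods can be sketched as follows. This is a minimal illustration in plain Python under stated assumptions, not the procedure used to build the data set: the function name `average_by_window`, the use of fixed non-overlapping windows, and the temperature readings are all made up for this example.

```python
from collections import defaultdict


def average_by_window(readings, window_minutes=10):
    """Average (minute, value) readings inside fixed, non-overlapping windows.

    Returns {window_start_minute: mean_value}.
    """
    buckets = defaultdict(list)
    for minute, value in readings:
        start = int(minute // window_minutes) * window_minutes
        buckets[start].append(value)
    return {start: sum(v) / len(v) for start, v in sorted(buckets.items())}


# Made-up temperature readings arriving roughly every 3.3 minutes.
readings = [
    (0.0, 20.0), (3.3, 21.0), (6.6, 22.0), (9.9, 23.0),
    (13.2, 24.0), (16.5, 25.0),
]
averaged = average_by_window(readings)
print(averaged)
```

Once every sensor is on the same 10-minute grid, the readings can be joined with the energy and weather records on the date and time column.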

Build a machine learning algorithm to predict the future sale prices of homes.

  • Problem statement analysis
  • Exploratory Data Analysis
  • Input Data Visualization
  • Interpretation from Visualization

Description:

Zillow is asking you to predict the log-error between their Zestimate and the actual sale price, given all the features of a home. The log error is defined as logerror = log(Zestimate) - log(SalePrice), and it is recorded in the transactions file train.csv. In this project, you are going to predict the log error for the months in Fall 2017.
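
As a minimal sketch of the target variable, the log error between a Zestimate and a sale price can be computed as below. This assumes the definition log(Zestimate) - log(SalePrice) and the natural logarithm; the function name `log_error` and the prices are made up for illustration.

```python
import math


def log_error(zestimate, sale_price):
    """log(Zestimate) - log(SalePrice); positive when the Zestimate is too high."""
    if zestimate <= 0 or sale_price <= 0:
        raise ValueError("prices must be positive")
    return math.log(zestimate) - math.log(sale_price)


# Made-up prices for illustration.
over = log_error(105_000, 100_000)   # Zestimate 5% above the sale price
under = log_error(95_000, 100_000)   # Zestimate 5% below the sale price
exact = log_error(100_000, 100_000)  # Zestimate equal to the sale price
print(over, under, exact)
```

Working on the log scale means an error is measured relative to the price, so a 5% miss counts the same for a cheap home as for an expensive one.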

"Zestimates" are estimated home values based on 7.5 million statistical and machine learning models that analyze hundreds of data points on each property. And, by continually improving the median margin of error (from 14% at the onset to 5% today), Zillow has since become established as one of the largest, most trusted marketplaces for real estate information in the U.S. and a leading example of impactful machine learning.

In this data science project, we will develop a machine learning algorithm that makes predictions about the future sale prices of homes. We will also build a model to improve the Zestimate residual error. And finally, we'll build a home valuation algorithm from the ground up, using external data sources.

In this project, we are going to talk about H2O and its functionality for building Machine Learning models.

Description:

H2O.ai is focused on bringing AI to businesses through software.

H2O includes many common Machine Learning algorithms, such as generalized linear modeling (linear regression, logistic regression, etc.), Naive Bayes, principal components analysis, k-means clustering, and word2vec. H2O implements best-in-class algorithms at scale, such as distributed random forest, gradient boosting, and deep learning. H2O also includes a Stacked Ensembles method, which finds the optimal combination of a collection of prediction algorithms using a process known as stacking.

  • Data cleaning using H2O
  • Model Training using H2O
  • Model scalability using H2O in Hadoop environment
  • Driverless AI using H2O
