kwanit1142/Machine-Learning-Models-on-different-scenarios

Machine-Learning Models on Different Scenarios

These notebooks, together with their question statements and reports, were prepared for the course CSL2050, taught by Prof. Richa Singh.

Lab-1 :- Confusion Matrix

A CSV file has been provided. It contains three columns: the first holds the actual labels for a binary classification problem, and the second and third hold the predicted probabilities from two classifiers. You will convert these probabilities into final labels based on a threshold value. Helper code is provided in the notebook. You are expected to complete the functions that compute the evaluation metrics described in the Colab notebook at this link. You may download the notebook and work on it locally or in Colab; it is provided for a quick start. You will define functions for the following tasks:

i.) To calculate accuracy.

ii.) To calculate precision and recall.

iii.) To calculate F1 score.

Both per-class and per-sample average precision, recall and F1-scores need to be calculated.

Additionally, you are required to vary the threshold value (0.5, 0.4, 0.6, etc.) and compare and contrast the resulting metrics for both models.
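
The tasks above can be sketched in plain Python as follows. This is only a minimal illustration, not the notebook's helper code: the function names and the toy labels and probabilities are made up, and the positive class is assumed to be `1`.

```python
def threshold_labels(probs, threshold=0.5):
    """Convert predicted probabilities into 0/1 labels."""
    return [1 if p >= threshold else 0 for p in probs]


def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, FN, TN) for the chosen positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    tn = len(y_true) - tp - fp - fn
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def precision_recall_f1(y_true, y_pred, positive=1):
    tp, fp, fn, _ = confusion_counts(y_true, y_pred, positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy data (hypothetical, not taken from the provided CSV file).
y_true = [1, 0, 1, 1, 0, 0]
probs = [0.9, 0.45, 0.55, 0.3, 0.2, 0.7]

for thr in (0.4, 0.5, 0.6):
    y_pred = threshold_labels(probs, thr)
    print(thr, accuracy(y_true, y_pred), precision_recall_f1(y_true, y_pred))
```

Calling `precision_recall_f1` once with `positive=1` and once with `positive=0` gives the per-class values, which can then be averaged.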

Lab-2 :- Decision Tree

Q1: A CSV file has been provided. The dataset represents the mood of a student to go to class depending on the weather at IIT Jodhpur. We have become accustomed to online classes, so this is meant to give you a feeling of attending classes in the post-COVID scenario. A Colab notebook is attached for reference, describing the stepwise procedure to solve the exercise. The tasks are as follows:

i) Preprocessing the data.

ii) Cross-validation over the data.

iii) Training the final model after cross-validation

iv) Perform decision tree classification and calculate the prediction accuracy for the test data.

v) Plot the decision tree and the decision surface.

Q2: In the previous case, the nodes are split based on entropy/Gini impurity. The following dataset contains real-valued data. The column to be predicted is 'Upper 95% Confidence Interval for Trend', i.e., the last column in the dataset, using the other columns as features. The tasks are as follows:

i) Preprocessing the data.

ii) Cross-validation over the data.

iii) Training the final model after cross-validation

iv) Perform decision tree regression and calculate the squared error between the predicted and the ground-truth values for the test data.

v) Plot the decision tree and the decision surface.
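
To make the splitting criteria concrete, here is a small sketch of the Gini impurity and entropy mentioned in Q2, together with variance, which is one common choice of criterion for regression trees (that choice is an assumption here; the lab statement does not name one). The function names and the toy labels are hypothetical.

```python
from collections import Counter
from math import log2


def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(labels):
    """Shannon entropy (in bits) of the class proportions."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def variance(values):
    """Mean squared deviation from the mean (a regression-tree criterion)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def weighted_impurity(left, right, impurity):
    """Impurity of a candidate split: size-weighted average over both children."""
    n = len(left) + len(right)
    return len(left) / n * impurity(left) + len(right) / n * impurity(right)


# Toy node (hypothetical): two students go to class, two do not.
node = ["yes", "yes", "no", "no"]
print(gini(node), entropy(node))
print(weighted_impurity(["yes", "yes"], ["no", "no"], gini))
```

A split is preferred when its weighted impurity is lower than the impurity of the parent node.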

Lab-3 :- Random Forest and Bagging Classifier

Consider the credit sample dataset, and predict whether a customer will repay their credit within 90 days. This is a binary classification problem; we will assign customers into good or bad categories based on our prediction.

Data Description:-

Features --> Variable Type --> Value Type --> Description

Age --> Input Feature --> integer --> Customer age

Debt Ratio --> Input Feature --> real --> Total monthly loan payments (loan, alimony, etc.) / Total monthly income percentage.

Number_Of_Time_30-59_Days_Past_Due --> Input Feature --> integer --> The number of cases when a client has overdue 30-59 days (not worse) on other loans during the last 2 years.

Number_Of_Time_60-89_Days_Past_Due --> Input Feature --> integer --> The number of cases when the customer was 60-89 days past due (not worse) during the last 2 years.

Number_Of_Times_90_Days_Late --> Input Feature --> integer --> The number of cases when a customer was 90+ days past due on other credits.

Dependents --> Input Feature --> integer --> The number of customer dependents

Serious_Dlq_in_2yrs --> Target Variable --> Binary: 0 or 1 --> The customer hasn't paid the loan debt within 90 days

Perform the following tasks for this dataset:-

Question-1 (Random Forest):

1. Preprocessing the data.

2. Plot the distribution of the target variable.

3. Handle the NaN values.

4. Visualize the distribution of data for every feature.

5. Train the Random Forest Classifier with different parameters, e.g.:

Max_features = [1,2,4]

Max_depth = [2,3,4,5]

6. Perform 5-fold cross-validation and examine the ROC AUC for different values of the parameters (you may use the StratifiedKFold function for this), then perform a grid search to find the optimal parameter values (you may use GridSearchCV for this).

7. Get the best score from the grid search.

8. Find the feature which has the weakest impact in the Random Forest Model.
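
As a rough illustration of two ideas used in the steps above, the sketch below enumerates the parameter grid from step 5 and computes ROC AUC directly from its pairwise definition. It deliberately avoids scikit-learn so that the logic is visible; in the lab itself, `GridSearchCV` would normally handle both. The function name `roc_auc` and the toy scores are made up.

```python
from itertools import product

# The grid from step 5: every combination is one candidate model.
max_features = [1, 2, 4]
max_depth = [2, 3, 4, 5]
param_grid = [
    {"max_features": f, "max_depth": d} for f, d in product(max_features, max_depth)
]


def roc_auc(y_true, scores):
    """ROC AUC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("ROC AUC needs at least one sample of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy predictions (hypothetical).
print(len(param_grid))
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

The pairwise form is quadratic in the number of samples, so it is only suitable for small checks like this one.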

Question-2 (Bagging) :

1. Perform bagging-based classification using Decision Tree as the base classifier.

2. The number of trees to be considered is {2,3,4}.

3. Perform 5-fold cross-validation using the ROC AUC metric to evaluate the models and collect the cross-validation scores (use the function cross_val_score for this).

4. Summarize the performance by getting mean and standard deviation of scores

5. Plot the model performance for comparison using boxplot.

6. Compare the best performance of bagging with random forest by plotting using boxplot.
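
The two mechanical pieces of bagging, bootstrap resampling and majority voting, along with the mean and standard deviation summary from step 4, can be sketched without any library. The helper names and the example scores below are hypothetical; a real solution would train a decision tree on each bootstrap sample.

```python
import random
from collections import Counter
from statistics import mean, stdev


def bootstrap_indices(n, rng):
    """Draw n row indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n) for _ in range(n)]


def majority_vote(predictions):
    """Combine per-model prediction lists into one label per sample."""
    return [Counter(column).most_common(1)[0][0] for column in zip(*predictions)]


def summarize(scores):
    """Mean and sample standard deviation of cross-validation scores."""
    return mean(scores), stdev(scores)


rng = random.Random(0)
print(bootstrap_indices(10, rng))

# Three hypothetical base models voting on three samples.
print(majority_vote([[1, 0, 1], [1, 1, 0], [0, 1, 1]]))

# Made-up fold scores, only to show the summary step.
print(summarize([0.80, 0.90]))
```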

Lab-4 :- Adaboost and Bayes Classification

Perform the following tasks for this dataset:-

Question-1 (Boosting):

1. Preprocessing the data.

2. Plot the distribution of the target variable.

3. Visualize the distribution of data for every feature.

4. Perform boosting-based classification using Decision Tree as the base classifier.

5. Perform cross validation over the data and calculate accuracy for a weak learner.

6. Build the AdaBoost model using the weak learner by increasing the number of trees from 1 to 5 with a step of 1. Compute the model performance.
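
For intuition about step 6, here is a compact sketch of discrete AdaBoost with one-feature decision stumps as weak learners. It assumes labels in {-1, +1} and a single numeric feature, and the toy data is invented; the lab itself would more likely use scikit-learn's implementation on the real dataset.

```python
import math


def best_stump(xs, ys, weights):
    """Find the stump with the lowest weighted error.

    A stump predicts `polarity` when x > threshold and `-polarity` otherwise.
    Returns (weighted_error, threshold, polarity).
    """
    best = None
    for thr in sorted(set(xs)):
        for polarity in (1, -1):
            err = sum(
                w
                for x, y, w in zip(xs, ys, weights)
                if (polarity if x > thr else -polarity) != y
            )
            if best is None or err < best[0]:
                best = (err, thr, polarity)
    return best


def adaboost(xs, ys, n_rounds):
    """Train n_rounds stumps, reweighting samples after each round."""
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(n_rounds):
        err, thr, pol = best_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)  # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, thr, pol))
        # Misclassified samples gain weight, correct ones lose weight.
        weights = [
            w * math.exp(-alpha * y * (pol if x > thr else -pol))
            for x, y, w in zip(xs, ys, weights)
        ]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, x):
    score = sum(a * (p if x > t else -p) for a, t, p in ensemble)
    return 1 if score >= 0 else -1


# Toy 1-D data (hypothetical) that no single stump can separate.
xs = [1, 2, 3, 4, 5, 6]
ys = [1, 1, -1, -1, 1, 1]

for n_trees in range(1, 6):
    model = adaboost(xs, ys, n_trees)
    acc = sum(predict(model, x) == y for x, y in zip(xs, ys)) / len(xs)
    print(n_trees, acc)
```

The printed values are training accuracies on the toy data; the lab asks for cross-validated performance instead.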

Question-2 (Bayes classification) :

1. Estimate the accuracy of Naive Bayes algorithm using 5-fold cross validation on the data set. Plot the ROC AUC curve for different values of parameters.

2. Use linear discriminant function to calculate the accuracy on the classification task with 80% training and 20% testing data.

3. Calculate the Bayes risk from the customized matrix of your choice.
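
One way to read step 3: given a loss matrix and the class posteriors for a sample, the conditional risk of each action is the posterior-weighted loss, and the Bayes decision picks the action with the smallest risk. The loss matrix and posteriors below are arbitrary illustrative values, and the function names are hypothetical.

```python
def conditional_risks(loss, posteriors):
    """Risk of each action: R(a_i | x) = sum_j loss[i][j] * P(class_j | x)."""
    return [sum(l * p for l, p in zip(row, posteriors)) for row in loss]


def bayes_decision(loss, posteriors):
    """Return (best_action_index, its_conditional_risk)."""
    risks = conditional_risks(loss, posteriors)
    best = min(range(len(risks)), key=risks.__getitem__)
    return best, risks[best]


# loss[i][j] is the cost of choosing action i when the true class is j.
# These numbers are made up for illustration.
loss = [
    [0.0, 5.0],
    [1.0, 0.0],
]
posteriors = [0.7, 0.3]

print(conditional_risks(loss, posteriors))
print(bayes_decision(loss, posteriors))
```

Averaging the minimum conditional risk over all samples gives an empirical estimate of the overall Bayes risk.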

Question 3: Visualisation in Bayesian Decision Theory

DATASET 1:

1. The height of a car and its cost are given. If the cost of a car is greater than 550, the label is 1; otherwise it is 0.

2. Create the labels from the given data.

3. Plot the distribution of samples using histogram.

4. Determine the prior probability for both the classes.

5. Determine the likelihood / class conditional probabilities for the classes. (Hint : Discretize the car heights into bins, you can use normalized histograms)

6. Plot the count of each unique element for each class. (Please mention in the report why this plot is different from the distribution)

7. Calculate the posterior probabilities P(C1|x) and P(C2|x) and plot them in a single graph.
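
The chain from labels to priors, binned likelihoods, and posteriors can be sketched as below. The heights, costs, and bin width are invented values (heights are given as integers in centimetres to keep the binning exact), and `posteriors_by_bin` is a hypothetical helper; only the cost threshold of 550 comes from the task description.

```python
from collections import Counter


def posteriors_by_bin(values, labels, bin_width):
    """Return {bin_index: {label: P(label | bin)}} using Bayes' rule."""
    n = len(values)
    bins = [int(v // bin_width) for v in values]
    class_counts = Counter(labels)
    # Prior P(C): fraction of samples in each class.
    prior = {c: k / n for c, k in class_counts.items()}
    # Likelihood P(bin | C): normalized histogram within each class.
    joint = Counter(zip(bins, labels))
    likelihood = {(b, c): k / class_counts[c] for (b, c), k in joint.items()}
    result = {}
    for b in sorted(set(bins)):
        unnormalized = {c: likelihood.get((b, c), 0.0) * prior[c] for c in prior}
        evidence = sum(unnormalized.values())
        result[b] = {c: u / evidence for c, u in unnormalized.items()}
    return result


# Toy data (hypothetical).
heights = [140, 150, 160, 160, 170, 180]
costs = [300, 400, 600, 500, 700, 800]
labels = [1 if cost > 550 else 0 for cost in costs]

for bin_index, post in posteriors_by_bin(heights, labels, bin_width=20).items():
    print(bin_index, post)
```

Plotting the two posterior values per bin against the bin centres gives the single graph asked for in the last step.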

DATASET 2:

Now, for the second dataset, there are two files, c1 and c2, which contain two features each for class 1 and class 2 respectively. Read the dataset and repeat all the above steps for Dataset 2.

Note : Plot the data distribution and the histogram of feature 1 and feature 2 in the X axis and Y axis respectively. The distribution of feature 1 will be along the top of X axis and feature 2 along the right of Y axis. An example is shown below.

Real Life Dataset:

Now it's time to visualise a real-life dataset. Take any one feature from the IRIS dataset along with the class labels. This dataset has three class labels. Extend all the visualisations mentioned previously to this dataset.

Lab-5 :- Text Analysis using Bayes Classification

Data Preparation:

1. Import necessary libraries

2. Load the data

3. Plot the count for each target

4. Print the unique keywords

5. Plot the count of each keyword

6. Visualize the correlation of the length of a tweet with its target

7. Print the null values in a column

8. Remove null values

9. Remove double spaces, hyphens and arrows, emojis, URLs, and any other non-English or special symbols

10. Replace wrong spellings with correct ones

11. Plot a word cloud of the real and fake target

12. Remove all columns except text and target

13. Split data into train and validation

14. Compute the Term Document matrix for the whole train dataset as well as for the two classes.

15. Find the frequency of words in class 0 and 1.

16. Do the numbers of unique words in target 0 and target 1 add up to the total number of unique words in the whole document? Why or why not?

17. Calculate the probability for each word in a given class.

18. Having calculated the probability of occurrence of each word in a class, we can now substitute the values into the Bayes equation. If a word from a new sentence does not occur in a class within the training set, the product of probabilities becomes zero. This problem can be solved with smoothing, such as Laplace smoothing. Use Bayes with Laplace smoothing to predict the probability for sentences in the validation set.

19. Print the confusion matrix with precision, recall and f1 score.
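
Steps 14 to 18 can be illustrated with a very small word-count Naive Bayes. This is a sketch under simplifying assumptions: tokenization is a plain lowercase whitespace split, unseen words are smoothed with the same denominator as known words, and the four training sentences are invented. The function names are hypothetical.

```python
import math
from collections import Counter


def train_nb(texts, targets):
    """Count words per class; returns (word_counts, class_counts, vocab)."""
    word_counts = {0: Counter(), 1: Counter()}
    class_counts = Counter(targets)
    for text, target in zip(texts, targets):
        word_counts[target].update(text.lower().split())
    vocab = set(word_counts[0]) | set(word_counts[1])
    return word_counts, class_counts, vocab


def log_posterior(text, target, model, alpha=1.0):
    """Unnormalized log P(target | text) with Laplace (add-alpha) smoothing."""
    word_counts, class_counts, vocab = model
    total_words = sum(word_counts[target].values())
    logp = math.log(class_counts[target] / sum(class_counts.values()))
    for word in text.lower().split():
        count = word_counts[target][word]  # 0 for words unseen in this class
        logp += math.log((count + alpha) / (total_words + alpha * len(vocab)))
    return logp


def predict_nb(text, model):
    return max((0, 1), key=lambda target: log_posterior(text, target, model))


# Toy training set (hypothetical tweets).
texts = [
    "forest fire near town",
    "fire evacuation ordered",
    "i love this song",
    "what a lovely day",
]
targets = [1, 1, 0, 0]

model = train_nb(texts, targets)
print(predict_nb("fire near forest", model))
print(predict_nb("lovely song", model))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied.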

Lab-6 :- Linear Regression

Build a linear regression model for the Medical Cost dataset. The dataset consists of the features age, sex, BMI (body mass index), children, smoker, and region, together with charges. You need to predict the individual medical costs billed by health insurance. The target variable is charges, and the remaining six variables (age, sex, BMI, children, smoker, and region) are the independent variables. The hypothesis function looks like

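$h_\theta(x_i) = \theta_0 + \theta_1\,\text{age} + \theta_2\,\text{sex} + \theta_3\,\text{bmi} + \theta_4\,\text{children} + \theta_5\,\text{smoker} + \theta_6\,\text{region}$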

Perform the following tasks for this dataset:-

1. Load the dataset and do exploratory data analysis.

2. Plot correlation between different variables and analyze whether there is a correlation between any pairs of variables or not.

4. Plot the distribution of the dependent variable and check for skewness (right or left skewed) in the distribution.

5. Convert this distribution into normal by applying natural log and plot it. (If the distribution is normal then skip this).

6. Convert categorical data into numbers. (You may choose one hot encoding or label encoding for that).

7. Split the data into training and testing sets with ratio 0.3.

8. Build a model using the linear regression equation $\theta = (X^T X)^{-1} X^T y$. (First add a feature $X_0 = 1$ to the original dataset.)

9. Build a linear regression model using the sklearn library. (No need to add $X_0 = 1$; sklearn will take care of it.)

10. Get the parameters of the models you built in steps 8 and 9, compare them, and print the comparison in tabular form. If the parameters do not match, analyze the reason(s) for this (they should match in the ideal case).

11. Get predictions from both models (step 8 and step 9).

12. Evaluate both models using the MSE (step 8 and step 9). (Write down the MSE equation for the model in step 8 and use the built-in MSE for the model in step 9.)

13. Plot the actual and the predicted values to check the relationship between the dependent and independent variables. (for both the models)
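
The closed-form model can be sketched with NumPy as below. The data is synthetic (generated from known coefficients so the result can be checked), not the Medical Cost dataset, and the helper names are hypothetical. The sketch solves the linear system rather than forming the inverse explicitly, which gives the same parameters for a well-conditioned problem and is usually more stable numerically.

```python
import numpy as np


def fit_normal_equation(X, y):
    """Least-squares parameters from theta = (X^T X)^{-1} X^T y."""
    X_b = np.column_stack([np.ones(len(X)), X])  # prepend the X0 = 1 feature
    # Solving (X^T X) theta = X^T y avoids computing the inverse directly.
    return np.linalg.solve(X_b.T @ X_b, X_b.T @ y)


def predict(theta, X):
    return np.column_stack([np.ones(len(X)), X]) @ theta


def mse(y_true, y_pred):
    """Mean squared error: average of squared residuals."""
    return float(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))


# Synthetic data: y = 1 + 2*x1 + 3*x2 with no noise.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
y = 1.0 + 2.0 * X[:, 0] + 3.0 * X[:, 1]

theta = fit_normal_equation(X, y)
print(theta)
print(mse(y, predict(theta, X)))
```

If one-hot encoding produces perfectly collinear columns, $X^T X$ becomes singular and this solve will fail; dropping one dummy column per category is a common remedy.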

Lab-7 :- Multi-Layer Perceptron, K-Means Clustering and Neural Network

The objective of this assignment is to learn to implement a Multi-Layer Perceptron (MLP) from scratch using Python. A tutorial has been provided for this. After implementing the MLP from scratch, compare it with sklearn's built-in implementation (resource-2), using the provided wheat seeds dataset.

Please go through the following blog to learn how to recognize handwritten digits using a neural network. There, the neural network is coded using the PyTorch library in Python.

Use the above code and report your observations based on the following:

(i) Change in loss function,

(ii) Change in learning rate, and

(iii) Change in Number of hidden layers

You may use the MNIST dataset or any dataset for Face Images or Flower Images or Iris dataset for this Question.

Implement k-means clustering. Analyse the clusters formed for various values of k. Display the centroids of the clusters. DO NOT USE A BUILT-IN ROUTINE for k-means clustering.
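
A from-scratch k-means (Lloyd's algorithm) can be sketched in pure Python as follows. The initialization here simply samples k data points with a fixed seed, which is one of several possible choices, and the six 2-D points are invented for illustration.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, n_iters=100, seed=0):
    """Alternate assignment and centroid update until assignments stop changing."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initial centroids: k distinct data points
    assignments = None
    for _ in range(n_iters):
        # Assignment step: each point goes to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, assignments


# Toy data (hypothetical): two well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, assignments = kmeans(points, k=2)
print(centroids)
print(assignments)
```

Running the function for several values of `k` and inspecting the returned centroids covers the analysis the question asks for.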

Lab-8 :- Dimensionality Reduction and Feature Selection

Using the data set, execute a PCA analysis using at least two dimensions of the data (note that the last column should not be used here). In your code, discuss/include the following items.

1. Standardize the data.

2. How many eigenvectors are required to preserve at least 90% of the data variation?

3. Look at the first eigenvector. What dimensions are the primary contributors to it (have the largest coefficients)? Are those dimensions negatively or positively correlated?

4. Show a plot of your transformed data using the first two eigenvectors.
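
The four PCA items above can be sketched with NumPy via an eigendecomposition of the covariance matrix of the standardized data. The small array below is made up, and `pca` and `n_components_for` are hypothetical helper names.

```python
import numpy as np


def pca(X):
    """Standardize X and return (Z, eigenvalues, eigenvectors, explained_ratio)."""
    X = np.asarray(X, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0)  # zero mean, unit variance per column
    cov = np.cov(Z, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]  # reorder to descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    explained = eigvals / eigvals.sum()
    return Z, eigvals, eigvecs, explained


def n_components_for(explained, target=0.90):
    """Smallest number of leading eigenvectors whose cumulative ratio reaches target."""
    return int(np.searchsorted(np.cumsum(explained), target) + 1)


# Toy data (hypothetical): three features, six samples.
X = [
    [2.5, 2.4, 1.0],
    [0.5, 0.7, 3.0],
    [2.2, 2.9, 2.0],
    [1.9, 2.2, 5.0],
    [3.1, 3.0, 4.0],
    [2.3, 2.7, 6.0],
]
Z, eigvals, eigvecs, explained = pca(X)
projected = Z @ eigvecs[:, :2]  # coordinates along the first two eigenvectors

print(explained)
print(n_components_for(explained))
print(eigvecs[:, 0])  # largest-magnitude entries are the main contributors
```

Entries of the first eigenvector that share a sign correspond to features that vary together along that component; opposite signs indicate features that move in opposite directions.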

For the aforementioned dataset, perform Linear Discriminant Analysis:

1. Compare the results of PCA and LDA.

2. Plot the distribution of samples using the first 2 principal components and the first 2 linear discriminants.

3. Learn a Bayes classifier using the original features and compare its performance with the features obtained in part (b).

Perform feature selection using any 2 methods studied in class and do the classification for the dataset using a classification algorithm of your choice. Do the following tasks:

1. Preprocess the data and perform exploratory data analysis.

2. Identify the features having high significance using both of the methods.

3. Calculate and compare the accuracy and F1 score by both the methods and with the classifier learned using all the features (without doing feature selection), and analyze which method performs the best and why.

4. Use Pearson Correlation and compute correlated features with a threshold of 70%.
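
The last step can be sketched as a pairwise scan over feature columns, keeping the pairs whose absolute Pearson correlation exceeds 0.7. Treating the threshold as applying to the absolute value is an assumption, and the column names and values below are invented.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def correlated_pairs(columns, threshold=0.7):
    """Return (name_a, name_b, r) for every pair with |r| above the threshold."""
    names = list(columns)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(columns[a], columns[b])
            if abs(r) > threshold:
                pairs.append((a, b, r))
    return pairs


# Toy feature table (hypothetical).
columns = {
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10.5],
    "c": [5, 1, 4, 2, 3],
}
print(correlated_pairs(columns))
```

From each highly correlated pair, one feature can usually be dropped without losing much information.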

Lab-9 :- Support Vector Machines

Problem 1 (Handwritten Digit Classification):

Your goal is to develop a Handwritten Digit Classification model. This model should take input as an image of a Handwritten Digit and classify it to one of the five classes {0,1,2,3,4}. To this end, you are supposed to work on the MNIST dataset. You were shown how the MNIST dataset is read and displayed in python in one of the labs. Now, perform the following experiments:

1. Use 70-20-10 split for training, validation and testing. Use the validation set for hyperparameter tuning by doing grid search, and report the classification accuracy on the test set.

2. Use nearest neighbour, perceptron and SVM classifiers for classifying handwritten digits of MNIST, and compare their performance.

3. Normalize the data by mean subtraction followed by standard deviation division. Redo the above experiments on normalized data and report the performance.

4. Implement any two from OVA/OVO/DAG, and compare the results.
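
Two of the steps above, the 70-20-10 split and the mean/standard-deviation normalization, can be sketched in pure Python. The sketch assumes the normalization statistics are computed on the training rows only and then reused for the validation and test rows, and that constant features (such as always-black pixels) are left unscaled; the helper names are hypothetical.

```python
import random
from statistics import mean, pstdev


def split_70_20_10(items, seed=0):
    """Shuffle and split into train, validation, and test parts."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = len(items) * 7 // 10
    n_val = len(items) * 2 // 10
    return items[:n_train], items[n_train:n_train + n_val], items[n_train + n_val:]


def fit_standardizer(rows):
    """Per-feature mean and standard deviation, computed from the given rows."""
    columns = list(zip(*rows))
    means = [mean(c) for c in columns]
    stds = [pstdev(c) or 1.0 for c in columns]  # constant features: divide by 1
    return means, stds


def standardize(rows, means, stds):
    """Mean subtraction followed by standard deviation division."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


train, val, test = split_70_20_10(range(100))
print(len(train), len(val), len(test))

# Tiny made-up feature rows; the second feature is constant.
rows = [[1, 5], [3, 5]]
means, stds = fit_standardizer(rows)
print(standardize(rows, means, stds))
```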

Problem 2: Use the “diabetes” dataset from the previous lab assignment. Split the dataset into training, validation and test sets (e.g., in 70:20:10 split, or 80:10:10 split).

On this dataset, evaluate the classification accuracy using the following classifiers:

1. SVM classifier (using a linear kernel)

2. SVM classifier (using a Polynomial kernel and a Gaussian kernel)

3. If your data is not linearly separable, you may use the soft-margin SVM formulation. You can use the built-in implementation of SVM in scikit-learn.

4. Compare and analyze the results obtained in different cases. During cross-validation, try different values of various hyper-parameters, such as the regularization hyper-parameter ‘C’ (e.g., by varying it in {0.0001, 0.001, ... , 1, 10, 100, 1000}), and the kernel function hyper-parameter(s).

5. Report the number of support vectors obtained in the final model in each case.

6. Perform an experiment to visualize the separating hyper-plane (in a 2-D space).

About

Pattern Recognition and Machine Learning based Assignments and Labs, under Prof. Richa Singh in Course CSL2050.
