Lending services companies allow individual investors to partially fund personal loans as well as buy and sell notes backing the loans on a secondary market. This data will be used to determine whether a borrower is creditworthy and should be issued a loan.
You will be using this data to create machine learning models to classify the risk level of given loans. Specifically, you will be comparing the Logistic Regression model and Random Forest Classifier.
The data is located in the Resources folder.
lending_data.csv
Import the data using Pandas.
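A minimal sketch of the import step. The path is assembled from the folder and file name given above; the target column name `loan_status` and the helper `load_lending_data` are assumptions made for illustration, so adjust them to match the actual CSV header.

```python
from pathlib import Path

import pandas as pd

# Assumed location, built from the "Resources" folder and file name above.
DATA_PATH = Path("Resources") / "lending_data.csv"


def load_lending_data(source=DATA_PATH, target="loan_status"):
    """Read the CSV and split it into features X and labels y.

    `target` is a hypothetical column name; change it to match the file.
    `source` may be a path or any file-like object accepted by pandas.
    """
    df = pd.read_csv(source)
    y = df[target]
    X = df.drop(columns=target)
    return X, y


# Only attempt the load when the file is actually present.
if DATA_PATH.exists():
    X, y = load_lending_data()
    print(X.shape)
    print(y.value_counts())
```

Printing `y.value_counts()` is a quick way to see how balanced the risk classes are before any model is fit.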
The following prediction was made as to whether a Logistic Regression model or a Random Forest model would perform better when fit to the given data.
This dataset is already preprocessed. It contains many duplicate rows, which is a little suspicious, but since identical loan records are theoretically possible, it is best not to drop any data. Since all of the data is numeric, Logistic Regression should perform well. I expect Random Forest to perform slightly better, since there are many features involved and the Random Forest method generally has the edge when comparing more variables.
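The observations behind this prediction (duplicate rows, all-numeric columns) could be checked with something like the sketch below. The helper `preprocessing_report` is hypothetical, and the tiny frame at the end is made-up illustration data, not rows from lending_data.csv.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def preprocessing_report(df):
    """Summarize the properties that matter for the prediction above."""
    return {
        "rows": len(df),
        # duplicated() flags every repeat of a row after its first occurrence
        "duplicate_rows": int(df.duplicated().sum()),
        "missing_values": int(df.isna().sum().sum()),
        "all_numeric": all(is_numeric_dtype(df[col]) for col in df.columns),
    }


# Made-up example frame, for illustration only.
example = pd.DataFrame({"a": [1, 1, 3], "b": [2.0, 2.0, 4.0]})
print(preprocessing_report(example))
```

Running the same function on the loaded lending data would show how many rows are exact repeats, without dropping any of them.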
A LogisticRegression model was created and fit to the data, and the model's score was printed. The same was done for a RandomForestClassifier. The following questions were considered.
- Which model performed better?
- How does that compare to your prediction?
Contrary to my prediction, logistic regression performed slightly better, but only by 0.02%. The random forest would likely perform better after tuning some of its parameters, but since both models score about 99% and logistic regression is much faster, the extra tuning is unlikely to be worthwhile in this case.
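The comparison described above can be sketched roughly as follows. This is not necessarily the exact procedure used: the train/test split, `max_iter=1000`, the `random_state` values, and the helper `compare_models` are all assumptions, and the demo data comes from scikit-learn's `make_classification` as a synthetic stand-in for lending_data.csv, so its scores will not match the ones reported here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def compare_models(X, y, random_state=1):
    """Fit both classifiers on one split and return their test accuracy."""
    # Assumption: a stratified hold-out split; the write-up does not say
    # how the data was divided before scoring.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, random_state=random_state, stratify=y
    )
    models = {
        "LogisticRegression": LogisticRegression(max_iter=1000),
        "RandomForestClassifier": RandomForestClassifier(random_state=random_state),
    }
    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        # score() returns mean accuracy for classifiers
        scores[name] = model.score(X_test, y_test)
    return scores


# Synthetic stand-in data, NOT the lending dataset.
X_demo, y_demo = make_classification(n_samples=500, n_features=7, random_state=1)
for name, score in compare_models(X_demo, y_demo).items():
    print(f"{name}: {score:.4f}")
```

Passing the real feature matrix and labels to `compare_models` in place of the synthetic data would print the two scores side by side for the comparison discussed above.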
- Loan Approval Dataset (2022). Data generated by Trilogy Education Services, a 2U, Inc. brand, and intended for educational purposes only.
- Assignment 19 Instructions