Mycological Risk Assessment - Machine Learning Approach.

Problem statement

Mycologists face the daunting challenge of quickly and accurately distinguishing edible mushrooms from poisonous mushrooms based on their physical attributes. To address this, the project aims to develop a robust machine learning model using the Mushroom dataset. This model seeks to provide mycologists with a reliable tool for on-the-spot mushroom classification during fieldwork, significantly reducing the risks associated with misidentification and enhancing safety measures for both researchers and the general populace.

PROJECT OVERVIEW

The project's focus is to train a machine learning model capable of accurately classifying mushrooms as edible or poisonous by analysing their physical characteristics. The model is designed to assist teams of mycologists in conducting risk assessments associated with fungal species.

BUSINESS UNDERSTANDING

Identifying edible vs. poisonous mushrooms presents challenges due to their varied traits. Leveraging machine learning on the Mushroom dataset's key features like cap shape and odor, this project aims to create a precise classification tool. It aims to boost safety and efficiency for mycologists and enthusiasts involved in mushroom analysis.

Components

Jupyter Notebook The Jupyter Notebook is my key deliverable and contains details of the approach, methodology, data cleaning, exploratory data analysis, pipelines, model building and validation used.

I recommend using nbviewer to view the Jupyter Notebook.

Presentation The presentation gives a high-level overview of our approach, findings and recommendations for non-technical stakeholders. It is aimed to be between 5 and 10 minutes long.
Data

The dataset can be found in the file "secondary_data.csv" in the Data folder, in this repository. It was originally provided in the following repository.

Technologies/ Packages

Python version: 3.6.9
Matplotlib version: 3.1.3
Seaborn version: 0.9.0
Pandas version: 0.25.1
Numpy version: 1.16.5
Statsmodels version: 0.10.1
Scikit-learn version: 0.21.2

To get started

Clone this repository - guidance.
Dataset can be found in the file "secondary_data.csv".
Check requirements in Technologies section above and download libraries if necessary.

1. Data Understanding

Here I loaded the data and got a brief introduction to the columns I was working with.

2. Exploratory Data Analysis (EDA)

Here we will explore the different features of the dataset to gain a better understanding of the data. We will use data vizualization to uncover trends and patterns. We will use Feature Engineering to create new features from existing ones and perform One-Hot Encoding on categorical variables that we will require for analysis.

Most houses are priced around a half million to a million dollars, while the most expensive houses imply the order of two million dollars and more

Overview of mushroom features

Categorical features include class, cap-shape, cap-surface, cap-color, does-bruise-or-bleed, gill-attachment, gill-spacing, gill-color, stem-root, stem-surface, stem-color, veil-type, veil-color, has-ring, ring-type, spore-print-color, habitat, season
Numerical variables include cap-diameter, stem-height, stem-width.

3. Data Preprocessing & Pipeline Setup

Here I worked on data cleaning, handled missing values, data transformation, handled duplicates, data reshaping and other processes to ensure that I have a clean, structured, and suitable format for analysis and modeling.

3. Base Model

Logistic Regression Model

After completing the data preprocessing steps involving feature scaling, one-hot encoding of categorical variables, and handling class imbalance using SMOTE, the dataset is now ready for model implementation.

We, begin with Logistic regression as a baseline model. Here are results for this model.

Accuracy: 0.8636
Precision: 0.8899627121223651
Recall: 0.8611929307805597
F1-score: 0.8753414917106396

4. Advance Models

Naive Bayes

Accuracy: 0.6225

        precision    recall  f1-score   support

     0       0.76      0.22      0.34     10848
     1       0.60      0.95      0.74     13580
     accuracy                           0.62     24428
     macro_avg       0.68      0.58      0.54     24428
     weighted_avg       0.67      0.62      0.56     24428

Random Forest

Accuracy: 1.0000

        precision    recall  f1-score   support

     0       1.00      1.00      1.00     10848
     1       1.00      1.00      1.00     13580
     accuracy                           1.00     24428
     macro_avg       1.00      1.00      1.00     24428
     weighted_avg       1.00      1.00      1.00     24428

Conclusions

Therefore, while both logistic regression and naive bayes showcased reasonable performances, the random forest model delivered exceptional results, demonstrating a superior ability to distinguish between edible and poisonous mushrooms in this dataset.

Contributor:

Name		Name	Last commit message	Last commit date
Latest commit History 6 Commits
dataset		dataset
README.md		README.md
index.ipynb		index.ipynb
mushroooms.png		mushroooms.png
notebook.pdf		notebook.pdf
presentation.pdf		presentation.pdf
shrooms.jpg		shrooms.jpg

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

dataset

dataset

README.md

README.md

index.ipynb

index.ipynb

mushroooms.png

mushroooms.png

notebook.pdf

notebook.pdf

presentation.pdf

presentation.pdf

shrooms.jpg

shrooms.jpg

Repository files navigation

Mycological Risk Assessment - Machine Learning Approach.

Problem statement

PROJECT OVERVIEW

BUSINESS UNDERSTANDING

Components

Technologies/ Packages

To get started

1. Data Understanding

2. Exploratory Data Analysis (EDA)

3. Data Preprocessing & Pipeline Setup

3. Base Model

Logistic Regression Model

4. Advance Models

Naive Bayes

Random Forest

Conclusions

Contributor:

About

Releases

Packages

Languages

MarwaBrian/Mycological-Risk-Assessment

Folders and files

Latest commit

History

Repository files navigation

Mycological Risk Assessment - Machine Learning Approach.

Problem statement

PROJECT OVERVIEW

BUSINESS UNDERSTANDING

Components

Technologies/ Packages

To get started

1. Data Understanding

2. Exploratory Data Analysis (EDA)

3. Data Preprocessing & Pipeline Setup

3. Base Model

Logistic Regression Model

4. Advance Models

Naive Bayes

Random Forest

Conclusions

Contributor:

About

Topics

Resources

Stars

Watchers

Forks

Languages