disease-prediction-model

A prediction model that uses genetic data for disease classification.

Objective

The goal of this project is to build a model hat can classify a disease using machine learning classifier.

Dataset

Data is extracted from a DNA microarray which measures the expression levels of large numbers of genes simultaneously. Samples in the datasets represent patients. For each patient 7070 genes expressions (values) are measured in order to classify the patient’s disease into one of the following cases: EPD, JPA, MED, MGL, RHB.

Data Preprocessing:

The data set was split into training and testing files ‘pp5i_train.gr.csv’ and ‘pp5i_test.gr.csv’ which was kept locally and loaded into the python source code using the pandas library. The library allows us to do data manipulations and analysis, used specially for reading data from csv files. Then the data is labelled and the fold difference is limited between 2 and 16000. The subset of top gene samples are extracted in the sets of 2,4,6,8,10,12,15,20,25, and 30 top genes with the highest absolute T-value.

Implementation in Python:

Prediction model is developed using different algorithms - Gaussian Naïve Bayes Classifier, K Nearest Neighbor Classifier, Extra Tree Classifier, Neural Network - Multi-Layer Perceptron and Decision Tree Classifier. The conceptual idea behind this classifier is to pick an algorithm and a particlaur subset of genepool so that we can make tweaks to it with various regularization schemes, this process improves the learning ability of the model in a gradual and additive fashion.

Accuracy:

The Extra Tree classifier is particularly effective at classifying this particular gene sample with optimal k value 25. The classification is accurate ~95% of the time (this is validated against a pre-defined validation dataset fed into the model during the train/test phase).

Visualisation Model Performance and Validation:

Results/Inference:

In this project, we developed and compared several machine learning classifiers for predicting disease using dataset collected from gene microarray. The classifiers are trained in the labelled training gene samples and predicted on the provided unlabeled test sample. The most efficient classifier among them was identified as Extra Tree Classifier with best accuracy rate. Based on the proposed classification model, the disease prediction can be done for any sample collected over the microarray and the patient can be diagnosed in a most efficient manner.

Please refer to the research paper(unpublished) Disease Prediction on Genetic Microarray Data.pdf for futher explanation.

Name		Name	Last commit message	Last commit date
Latest commit History 7 Commits
datasets		datasets
results		results
src		src
Disease Prediction on Genetic Microarray Data.pdf		Disease Prediction on Genetic Microarray Data.pdf
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

datasets

datasets

results

results

src

src

Disease Prediction on Genetic Microarray Data.pdf

Disease Prediction on Genetic Microarray Data.pdf

README.md

README.md

Repository files navigation

disease-prediction-model

Objective

Dataset

Data Preprocessing:

Implementation in Python:

Accuracy:

Visualisation Model Performance and Validation:

Results/Inference:

Contributions:

About

Releases

Packages

Languages

BhuvaneshRavi/disease-prediction-model

Folders and files

Latest commit

History

Repository files navigation

disease-prediction-model

Objective

Dataset

Data Preprocessing:

Implementation in Python:

Accuracy:

Visualisation Model Performance and Validation:

Results/Inference:

Contributions:

About

Topics

Resources

Stars

Watchers

Forks

Languages