Noise Detection and Removal on Cancer Dataset

This is my Scientific Initiation Project repository. The purpose of this work is to check how noise affects the performance of machine learning predictive models, specifically on genomic data related to cancer.

How it works

This script takes a sequence of datasets and may or may not balance the classes of the instances (all datasets used have only two classes). The two experiments (with and without class balancing) are run separately.
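
As a rough illustration, class balancing can be done by downsampling the majority class. The sketch below is not the project's actual code; it assumes a data frame whose label column is named 'Class'.

balance_classes <- function(data, label_col = "Class") {
  counts <- table(data[[label_col]])          # instances per class
  minority_size <- min(counts)                # size of the smallest class
  do.call(rbind, lapply(names(counts), function(lvl) {
    rows <- which(data[[label_col]] == lvl)
    data[sample(rows, minority_size), , drop = FALSE]   # keep a random subset of each class
  }))
}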

Then, artificial noise is injected into the dataset at a 0.2 ratio using a random methodology: each instance has the given fixed probability of having its class label exchanged.
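
A minimal sketch of this kind of random label flipping, again assuming a two-class data frame with a 'Class' column (the function name and arguments are illustrative, not the script's own):

add_label_noise <- function(data, label_col = "Class", noise_ratio = 0.2) {
  labels <- as.character(data[[label_col]])
  classes <- unique(labels)                    # exactly two classes expected
  flip <- runif(length(labels)) < noise_ratio  # each instance flips with probability 0.2
  labels[flip] <- ifelse(labels[flip] == classes[1], classes[2], classes[1])
  data[[label_col]] <- factor(labels, levels = classes)
  data
}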

The corrupted dataset is fed into each noise filter, one at a time, with the expectation that each filter removes at least the majority of the artificially injected noise, along with some noisy instances originally present in the dataset.
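
The sketch below shows how the filters could be applied through the NoiseFiltersR package, assuming its formula interface and the 'cleanData' component of the returned filter objects; the two-class demo data is only a stand-in for a corrupted cancer dataset.

library(NoiseFiltersR)

# Placeholder two-class data standing in for the corrupted dataset
noisy_data <- iris[iris$Species != "setosa", ]
noisy_data$Species <- droplevels(noisy_data$Species)
names(noisy_data)[names(noisy_data) == "Species"] <- "Class"

filters <- list(
  AENN  = function(d) AENN(Class ~ ., data = d),
  INFFC = function(d) INFFC(Class ~ ., data = d),
  HARF  = function(d) HARF(Class ~ ., data = d),
  ENG   = function(d) ENG(Class ~ ., data = d)
)

# 'cleanData' holds the instances each filter decided to keep
filtered_sets <- lapply(filters, function(f) f(noisy_data)$cleanData)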

To evaluate the performance of each noise filter, different predictive models are fit on the original dataset, the corrupted dataset, and the filtered dataset, producing three accuracies that are then compared.
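
A sketch of that comparison step, using a single learner (a random forest here) and placeholder names for the three training sets and a common test set:

library(randomForest)

accuracy_of <- function(train_data, test_data) {
  model <- randomForest(Class ~ ., data = train_data)
  mean(predict(model, newdata = test_data) == test_data$Class)   # test-set accuracy
}

compare_filter <- function(original_data, noisy_data, filtered_data, test_data) {
  c(original  = accuracy_of(original_data, test_data),
    corrupted = accuracy_of(noisy_data,    test_data),
    filtered  = accuracy_of(filtered_data, test_data))
}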

Technical details:

The experiment is reproducible because the random seed is fixed to the arbitrarily selected value '101010', which means all script runs should produce the same results. However, your results may not coincide with those found in the 'results' subdirectory, as they were collected between several script changes and, most importantly, across separate script runs.
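
In R, fixing the seed amounts to a single call (shown here in isolation, not as a quote from the script):

set.seed(101010)   # fixed seed, so repeated runs draw the same random numbers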

If you want to change some experiment parameters, such as the datasets, noise filters, classifiers, noise input ratio, or the number of cross-validation folds, just check out the './src/config.R' file and edit it by hand however you want. More specific configuration must be edited in the 'main.R' file, which shouldn't be done unless you're sure of what you're doing; do so at your own risk.
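
Purely as an illustration of the kind of settings such a file centralizes (these names are hypothetical, not the actual contents of './src/config.R'):

NOISE.RATIO <- 0.2                                   # probability of flipping each label
CV.FOLDS    <- 10                                    # cross-validation folds
RANDOM.SEED <- 101010                                # fixed seed for reproducibility
FILTERS     <- c("AENN", "INFFC", "HARF", "ENG")     # noise filters to run
CLASSIFIERS <- c("kNN", "randomForest", "svmLinear") # learners to evaluate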

How to run:

The whole experiment should take days or weeks to complete, and running it isn't recommended unless you want some metadata that is not already available in the 'results' subdirectory.

Just open an R session and type:

source('main.R')

The results will be automatically appended to the 'NoiseResults.dat' output file. If it does not exist, it will be created. Make sure the R session has all the permissions it needs on your machine.
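
For reference, appending a row of results in R can look like the sketch below; the actual column layout of 'NoiseResults.dat' is defined by the script, and the values here are dummies.

result_row <- data.frame(filter = "HARF", classifier = "randomForest", accuracy = NA)
write.table(result_row, file = "NoiseResults.dat", append = TRUE,
            row.names = FALSE, col.names = !file.exists("NoiseResults.dat"))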

Filters used:

You may find all these filters, along with all relevant information about them, at https://CRAN.R-project.org/package=NoiseFiltersR.

  • AENN ('All-k Edited Nearest Neighbors')
  • INFFC ('Iterative Noise Filter based on the Fusion of Classifiers')
  • HARF ('High Agreement Random Forest')
  • ENG ('Editing with Neighbor Graphs')

Classifiers used (R package):

  • k-Nearest Neighbors ('class')
  • Random Forest ('randomForest')
  • Support Vector Machine with Linear Kernel ('e1071')
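
A minimal sketch of how these three classifiers are typically called from their packages; 'train' and 'test' are placeholder data frames with numeric features and a factor column named 'Class'.

library(class)          # k-Nearest Neighbors
library(randomForest)   # Random Forest
library(e1071)          # SVM with linear kernel

features <- setdiff(names(train), "Class")

knn_pred  <- knn(train = train[, features], test = test[, features],
                 cl = train$Class, k = 5)

rf_model  <- randomForest(Class ~ ., data = train)
rf_pred   <- predict(rf_model, newdata = test)

svm_model <- svm(Class ~ ., data = train, kernel = "linear")
svm_pred  <- predict(svm_model, newdata = test)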
