URL Classification using NaiveBayes Data Mining.

This is a code base for my capstone Data Mining and Cybersecurity class. It is designed to make predictions on whether or not a URL is benign or malicious, without accessing the web page directly.

Code developed using Ubuntu 18.04 and Python 3.6

Description of Modules

loadAlexaDB.py

- downloads the top 1m webpages from Alexa and loads the csv to a sqlite database

extract_features.py

- reads in the url files (benign.csv and merged_malware.csv)
- processes these records through **vector_creator.py** to vectorize the url
- vector is written out to **combined_url.csv**

cleanAndDedupData.py.py

- reads the **combined_url.csv** data and checks for data issues that were found causing issues , and 
    corrects them or drops those records
- removed duplicate URLs from the file
- creates **cleaned_data.csv** for use in further processing

findBestK.py

- read in the **cleaned_data.csv** , and used SelectKBest feature selection, running n-times as determined by 
    the number of features on the file.  
- each of the k-best values is run through cross validation, to determine which would produce the best model. 
- the feature mask for the k-best that produces the best model is then written out, to be used to produce the full
    model and for using in the predictor module

trainFullBayesModel.py

- using the feature mask from **findBestK.py** , cross validation is run again to produce average statistics
- ROC curve, Confusion Matrix and accuracy scores are produces
- writes out files for false negative and false positive classifier errors
- model is saved to a pickle file, to be used again in the **predictor.py**

trainFullBayesModelAllFeatures.py

- runs the same as the trainFullBayesModel with the exception that it uses all of the features on the file
- also does not write out classifier errors

DecisionForest.py

- creates a classifier using the cleaned, deduped data using the Decision Tree model
- ROC curve, Confusion Matrix and accuracy scores are produces

randomForest.py

- creates a classifier using the cleaned, deduped data using the Random Forest model
- ROC curve, Confusion Matrix and accuracy scores are produces

predictor.py

- Asks for URL for testing; runs the url through **vector_creator.py** 
- returned vector is run throught he naive model and prediciton is produced.

dataAnalysis.py

- creates statistical informaiton from the benign and malicious groups of deduped data and writes out to a CSV
- counts the number of records on the original cleaned file and the dedupped file for comparison.

errorAnalysis.py

- uses the false positive and false negative classifier errors to perform analysis
- creates counts of the number of records from each of the files that fit criteria for the features
    to help determine where errors may be coming from 
- using analysis of those errors, addition csv files are created to further analyze the largest of the error for 
    trends 
- after analysis was done, records are removed for further testing and analysis.

featureAnalysis.py

- creates counts of the number of records from the deduped file that fit criteria for the features
    to allow for analysis

Name		Name	Last commit message	Last commit date
Latest commit History 127 Commits
OriginalData		OriginalData
Christopher Garner - Identification of Malicious URL’s using Data Mining -Final Report.pdf		Christopher Garner - Identification of Malicious URL’s using Data Mining -Final Report.pdf
FN_Alexa_Rank_1.csv		FN_Alexa_Rank_1.csv
FN_Country_EQ255.csv		FN_Country_EQ255.csv
FN_alexa_country.csv		FN_alexa_country.csv
FN_amazon_ancestry.csv		FN_amazon_ancestry.csv
FP_Alexa_Rank_0.csv		FP_Alexa_Rank_0.csv
FP_Country_EQ255.csv		FP_Country_EQ255.csv
FP_Country_NE255.csv		FP_Country_NE255.csv
README.md		README.md
ROC_plot.png		ROC_plot.png
ROC_plot_RForest.png		ROC_plot_RForest.png
ROC_plot_TREE.png		ROC_plot_TREE.png
alexa.db		alexa.db
benign.csv		benign.csv
benign_describe.csv		benign_describe.csv
bestFeatures.csv		bestFeatures.csv
classifierErrorsFN.csv		classifierErrorsFN.csv
classifierErrorsFP.csv		classifierErrorsFP.csv
cleanAndDedupData.py		cleanAndDedupData.py
cleaned_data.csv		cleaned_data.csv
cleaned_deduped.csv		cleaned_deduped.csv
combined_url.csv		combined_url.csv
confusion_matrix_plot.png		confusion_matrix_plot.png
confusion_matrix_plot_RForest.png		confusion_matrix_plot_RForest.png
confusion_matrix_plot_TREE.png		confusion_matrix_plot_TREE.png
countries.py		countries.py
dataAnalysis.py		dataAnalysis.py
decisionTree.py		decisionTree.py
dropped.csv		dropped.csv
entropy.py		entropy.py
errorAnalysis.py		errorAnalysis.py
extract_features.py		extract_features.py
featureAnalysis.py		featureAnalysis.py
featuresMask.csv		featuresMask.csv
fileHeader.py		fileHeader.py
findBestK.py		findBestK.py
forestPredictor.sav		forestPredictor.sav
invalid.csv		invalid.csv
kbest_accuracy.txt		kbest_accuracy.txt
loadAlexaDB.py		loadAlexaDB.py
malicious_describe.csv		malicious_describe.csv
merged_malware.csv		merged_malware.csv
naivePredictor.sav		naivePredictor.sav
names.py		names.py
predictor.py		predictor.py
randomForest.py		randomForest.py
requirements.txt		requirements.txt
to_drop.csv		to_drop.csv
top-1m.csv		top-1m.csv
trainFullBayesModel.py		trainFullBayesModel.py
trainFullBayesModelAllFeatures.py		trainFullBayesModelAllFeatures.py
treePredictor.sav		treePredictor.sav
tree_render		tree_render
vector_creator.py		vector_creator.py

garner007/url-classification-system-ml

Folders and files

Latest commit

History

Repository files navigation