anaungurean/Spam-Email-Classification

Study on Spam Email Classification Algorithms

Description

This repository contains the practical assignment for the "Machine Learning" course. The project investigates how well various classification algorithms suit the spam email detection problem, using the Ling-Spam dataset.

Requirements

1. Understanding the Dataset

  • Document the attributes and labels of the dataset, as well as the process of extracting them from the textual representation. Point out the cue in the file names (the "spm" prefix) that marks spam messages.
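
The extraction step described above can be sketched as follows. This is a minimal illustration, not the repository's actual code: the function names and the bag-of-words tokenisation (lower-cased alphabetic tokens) are assumptions; only the "spm" prefix comes from the requirement itself.

```python
from collections import Counter
from pathlib import Path
import re

SPAM_PREFIX = "spm"  # prefix named in the requirement above


def label_from_filename(path):
    """Return 1 (spam) if the file name starts with the spam prefix, else 0."""
    return 1 if Path(path).name.startswith(SPAM_PREFIX) else 0


def bag_of_words(text):
    """Assumed representation: counts of lower-cased alphabetic tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))
```

Each message then becomes a pair of word counts (the attributes) and a 0/1 label derived from its file name.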

2. Dataset Split

  • Use the 9 folders part1 to part9 for training and hold out part10 for testing, in each category (bare, lemm, stop, lemm_stop).
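
A minimal sketch of this split, assuming each category directory contains the sub-folders part1 to part10 with one message per file. The function name `split_by_part` is hypothetical.

```python
from pathlib import Path


def split_by_part(category_dir, test_part="part10"):
    """Return (train_files, test_files) for one category directory.

    Files in the folder named `test_part` go to the test set;
    files in every other sub-folder go to the training set.
    """
    train, test = [], []
    for part_dir in sorted(Path(category_dir).iterdir()):
        if not part_dir.is_dir():
            continue
        files = sorted(p for p in part_dir.iterdir() if p.is_file())
        (test if part_dir.name == test_part else train).extend(files)
    return train, test
```

Splitting by folder rather than by random sampling keeps the test set identical across runs and across the algorithms being compared.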

3. Algorithm Selection and Implementation

  • Choose and implement an algorithm, among those studied, that you consider suitable for solving the spam classification problem.
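
The README does not say which algorithm was chosen. Purely as an illustration of what such an implementation can look like, here is a sketch of multinomial Naive Bayes with Laplace smoothing, a common choice for text classification. The function names, the `alpha` parameter and the token-list input format are assumptions.

```python
import math
from collections import Counter


def train_nb(docs, labels, alpha=1.0):
    """Fit multinomial Naive Bayes; each doc is a list of tokens."""
    vocab = {w for d in docs for w in d}
    class_docs = Counter(labels)
    word_counts = {c: Counter() for c in class_docs}
    for d, y in zip(docs, labels):
        word_counts[y].update(d)
    model = {}
    for c in class_docs:
        # Laplace smoothing: add alpha to every word count in the vocabulary.
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_prior = math.log(class_docs[c] / len(labels))
        log_lik = {w: math.log((word_counts[c][w] + alpha) / total) for w in vocab}
        model[c] = (log_prior, log_lik)
    return model


def predict_nb(model, doc):
    """Return the class with the highest log-posterior; unseen words are ignored."""
    def score(c):
        log_prior, log_lik = model[c]
        return log_prior + sum(log_lik[w] for w in doc if w in log_lik)
    return max(model, key=score)
```

Working in log space avoids numerical underflow when a message contains many words.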

4. LaTeX Report

  • Justify the algorithm choice in a LaTeX report, both theoretically and experimentally. Include a comparison with other candidate algorithms.

5. Leave-One-Out Cross-Validation

  • Implement and present results using the Leave-One-Out cross-validation strategy, including a statistical graph.
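
Leave-One-Out cross-validation trains on all samples but one, tests on the held-out sample, and repeats this for every sample. The sketch below is generic and assumes a hypothetical `train_fn` callback that returns a `predict` function; the 1-nearest-neighbour trainer is only a toy stand-in for the real classifier.

```python
def leave_one_out_accuracy(samples, labels, train_fn):
    """Fraction of samples predicted correctly when each is held out in turn.

    train_fn(train_samples, train_labels) must return a callable predict(sample).
    """
    n = len(samples)
    correct = 0
    for i in range(n):
        # Train on everything except sample i.
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        predict = train_fn(train_x, train_y)
        correct += int(predict(samples[i]) == labels[i])
    return correct / n


def train_1nn(train_x, train_y):
    """Toy trainer for the demo: 1-nearest neighbour on plain numbers."""
    def predict(x):
        nearest = min(range(len(train_x)), key=lambda j: abs(train_x[j] - x))
        return train_y[nearest]
    return predict


xs = [0.0, 0.1, 0.2, 1.0, 1.1, 1.2]
ys = [0, 0, 0, 1, 1, 1]
print(leave_one_out_accuracy(xs, ys, train_1nn))  # 1.0
```

Because the model is retrained once per sample, the cost grows with the dataset size times the training cost, which is worth noting when reporting results.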

6. Algorithm Performance on Test Set

  • Add to the report a graph illustrating the accuracy the algorithm achieves on the test dataset. The accuracy should be significantly better than that of trivial strategies (random guessing or always predicting the same class). Include comparative graphs if you tested multiple algorithms.
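
To make the comparison with trivial strategies concrete, the two baselines can be computed as in the sketch below. The function names are hypothetical, and "random guessing" is assumed to mean picking a class uniformly at random.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def constant_class_baseline(train_labels, test_labels):
    """Accuracy of always predicting the most frequent training class."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return accuracy(test_labels, [majority] * len(test_labels))


def uniform_random_expected_accuracy(labels):
    """Expected accuracy of guessing uniformly among the observed classes."""
    return 1 / len(set(labels))
```

On an imbalanced dataset the constant-class baseline can already score well above 50%, so it is the more demanding of the two reference points.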

7. Additional Details

  • Explain any relevant experimental details, either in text or through graphs. Investigate and implement improved variants of the algorithm studied in the seminar in order to increase accuracy.