Plagiarism Detection Using Amazon SageMaker

This is the second deployment project in the MLE Nanodegree. Detecting plagiarism is an active area of research: the task is non-trivial, and the differences between paraphrased answers and original work are often subtle. In this project, a plagiarism detector examines a text file and performs binary classification, labeling the file as either plagiarized or not depending on how similar it is to a provided source text.

This project is broken down into three main notebooks:

Notebook 1: Data Exploration

Load in the corpus of plagiarism text data.
Explore the existing data features and the data distribution.
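
The exploration step above can be sketched roughly as follows. This is a minimal illustration, not the notebook's code: the metadata layout (a CSV with the columns File, Task and Category) and every row in the sample are assumptions made for the example, so the real corpus may differ.

```python
import csv
import io
from collections import Counter

# Hypothetical stand-in for the corpus metadata file. The column names
# (File, Task, Category) and all rows are assumptions for illustration only.
SAMPLE_CSV = """File,Task,Category
a1.txt,a,non
a2.txt,a,cut
b1.txt,b,light
b2.txt,b,non
orig_a.txt,a,orig
"""


def load_rows(handle):
    """Read metadata rows from any file-like object into a list of dicts."""
    return list(csv.DictReader(handle))


def distribution(rows, column):
    """Count how many rows fall under each distinct value of `column`."""
    return Counter(row[column] for row in rows)


rows = load_rows(io.StringIO(SAMPLE_CSV))
by_category = distribution(rows, "Category")
by_task = distribution(rows, "Task")

# Print each category with its share of the corpus to spot class imbalance.
for name, count in sorted(by_category.items()):
    print(f"{name}: {count} ({count / len(rows):.0%})")
```

To run this against a real file, pass an opened file handle to load_rows instead of the in-memory sample.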

Notebook 2: Feature Engineering

Clean and pre-process the text data.
Define features for comparing the similarity of an answer text and a source text, and extract similarity features.
Select "good" features by analyzing the correlations between different features.
Create train/test .csv files that hold the relevant features and class labels for train/test data points.
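
The README does not name the similarity features used, so the sketch below assumes two common choices for this kind of task: n-gram containment and a normalized longest common subsequence (LCS) over words. The function names, the tokenizer and the example sentences are all hypothetical and are not taken from the notebooks.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def ngrams(tokens, n):
    """Return the n-grams (as tuples) found in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def containment(answer, source, n=1):
    """Share of the answer's n-grams (counted with multiplicity) also found in the source."""
    answer_counts = Counter(ngrams(tokenize(answer), n))
    source_counts = Counter(ngrams(tokenize(source), n))
    total = sum(answer_counts.values())
    if total == 0:
        return 0.0
    shared = sum(min(count, source_counts[gram])
                 for gram, count in answer_counts.items())
    return shared / total


def lcs_norm(answer, source):
    """Word-level LCS length divided by the number of words in the answer."""
    a, s = tokenize(answer), tokenize(source)
    if not a:
        return 0.0
    prev = [0] * (len(s) + 1)  # LCS lengths for the previous answer word
    for word_a in a:
        cur = [0]
        for j, word_s in enumerate(s, start=1):
            if word_a == word_s:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1] / len(a)


# Made-up sentences, used only to exercise the functions.
answer = "the cat sat on the mat"
source = "the cat lay on the mat today"
features = {
    "c_1": containment(answer, source, n=1),
    "c_2": containment(answer, source, n=2),
    "lcs_word": lcs_norm(answer, source),
}
print(features)
```

Features like these tend to be strongly correlated with one another (for example, containment at adjacent n-gram sizes), which is why a correlation check is useful before choosing the columns to write to the train/test .csv files.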

Notebook 3: Train and Deploy Model in SageMaker

Upload train/test feature data to S3.
Define a binary classification model and a training script.
Train the model and deploy it using SageMaker.
Evaluate the deployed classifier.
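
The evaluation step can be sketched as a comparison of predicted labels with the held-out test labels. The example assumes the predictions have already been retrieved from the deployed endpoint as a list of 0/1 values, so it makes no SageMaker calls. The function names are hypothetical, and the label lists are made-up inputs, not results from this project.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels where 1 means plagiarized."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == 1 and truth == 1:
            tp += 1
        elif pred == 1 and truth == 0:
            fp += 1
        elif pred == 0 and truth == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def evaluate(y_true, y_pred):
    """Compute accuracy, precision and recall from two equal-length label lists."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Made-up labels for illustration only.
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]
metrics = evaluate(y_true, y_pred)
print(metrics)
```

Reporting precision and recall alongside accuracy helps when the classes are imbalanced, since a detector can reach high accuracy while still missing plagiarized files.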

Prerequisites

AWS Account
Experience with model development on Amazon SageMaker
Familiarity with Amazon S3

Setup instructions

cd SageMaker
git clone https://github.com/rohanjn98/plagiarism-detection-sagemaker.git
exit

