faizann24/Authorship-Attribution

Authorship Attribution with Machine Learning

Authorship Attribution with Random Forests and TFIDF Scores

Python 3.6

This repository contains the code for the blog post Large Scale Authorship Attribution with Machine Learning. It uses a Random Forest model with TFIDF scores as features to classify texts among n candidate authors.
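The approach described above can be sketched with scikit-learn as follows. This is a minimal illustration, not the code in attribution_model.py: the function name `build_attribution_pipeline`, the hyperparameters, and the toy corpus are all assumptions made for the example.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline


def build_attribution_pipeline(n_trees=100, seed=0):
    """Hypothetical helper: TF-IDF features feeding a random forest."""
    return make_pipeline(
        TfidfVectorizer(),  # turns raw text into TF-IDF weighted term vectors
        RandomForestClassifier(n_estimators=n_trees, random_state=seed),
    )


# Toy corpus invented for illustration; real data would be full articles.
texts = [
    "the market rallied as investors cheered the earnings report",
    "stocks fell sharply after the central bank raised rates",
    "the striker scored twice in the final minutes of the match",
    "the team clinched the title with a late goal in extra time",
]
authors = ["author_a", "author_a", "author_b", "author_b"]

model = build_attribution_pipeline()
model.fit(texts, authors)

# predict() returns one author label per input text.
prediction = model.predict(["the bank reported strong earnings this quarter"])
```

Wrapping the vectorizer and classifier in one pipeline means the TF-IDF vocabulary is learned only from the training texts, which avoids leaking information from held-out articles.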

Files Description

Path                                  Description
Authorship-Attribution                Main folder.
├  sample_data                        Folder containing data for authors.
│  └  authors_folders                 One folder for each author.
│     ├  authors_article_0.txt        First article of the author.
│     ├  authors_article_1.txt        Second article.
│     └  ... authors_article_n.txt    ... Last article.
└  attribution_model.py               Authorship attribution model.
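A data folder laid out this way (one sub-folder per author, one .txt file per article) could be read with a short loader like the one below. This is a sketch under assumptions: the name `load_corpus`, the UTF-8 decoding, and the choice to skip empty author folders are illustrative and not taken from attribution_model.py.

```python
from pathlib import Path


def load_corpus(data_folder):
    """Hypothetical loader: return {author_name: [article_text, ...]}.

    Assumes every sub-directory of data_folder is one author and every
    .txt file inside it is one article by that author.
    """
    corpus = {}
    for author_dir in sorted(Path(data_folder).iterdir()):
        if not author_dir.is_dir():
            continue  # ignore stray files next to the author folders
        articles = [
            path.read_text(encoding="utf-8", errors="ignore")
            for path in sorted(author_dir.glob("*.txt"))
        ]
        if articles:  # skip authors with no articles
            corpus[author_dir.name] = articles
    return corpus
```

Calling `load_corpus("sample_data")` would then give a dictionary keyed by author folder name, which can be flattened into parallel lists of texts and labels for training.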

Usage

Packages

You will need to install the following package to run the authorship attribution model.

  • Scikit-learn

How to run

To run the model, use the following command:

python3 attribution_model.py --articles_per_author 250 --authors_to_keep 5 --data_folder sample_data

The script takes three parameters as inputs:

  • articles_per_author: Number of articles to use per author. Valid range: 10 up to the maximum number of articles available for any author.
  • authors_to_keep: Number of authors to include in the attribution classifier. Valid range: 2 up to the total number of authors.
  • data_folder: Path to the data folder, which contains one directory per author.
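The three parameters above could be parsed and applied roughly as follows. The flag names come from the command shown earlier; everything else is an assumption for illustration, including the integer types, the range checks, and the hypothetical `select_authors` helper, which keeps the authors with the most articles. The actual selection logic in attribution_model.py may differ.

```python
import argparse


def parse_args(argv=None):
    """Parse the three documented flags (types and checks are assumed)."""
    parser = argparse.ArgumentParser(description="Authorship attribution sketch")
    parser.add_argument("--articles_per_author", type=int, required=True)
    parser.add_argument("--authors_to_keep", type=int, required=True)
    parser.add_argument("--data_folder", type=str, required=True)
    args = parser.parse_args(argv)
    # Lower bounds follow the ranges documented in the README.
    if args.articles_per_author < 10:
        parser.error("articles_per_author must be at least 10")
    if args.authors_to_keep < 2:
        parser.error("authors_to_keep must be at least 2")
    return args


def select_authors(corpus, authors_to_keep, articles_per_author):
    """Hypothetical helper: trim {author: [texts]} to the requested size.

    Assumption: only authors with enough articles are eligible, and the
    most prolific authors are kept first (ties broken by name).
    """
    eligible = {
        author: docs
        for author, docs in corpus.items()
        if len(docs) >= articles_per_author
    }
    ranked = sorted(eligible, key=lambda a: (-len(eligible[a]), a))
    return {a: eligible[a][:articles_per_author] for a in ranked[:authors_to_keep]}
```

Giving every kept author the same number of articles keeps the classes balanced, so classifier accuracy is not inflated by one dominant author.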

License

MIT

Copyright (c) 2020-present, Faizan Ahmad