Contains various NLP datasets built from 400 papers by the 20 most-cited authors in Machine Learning.
MLPA-400 is a multiclass, multi-label authorship attribution problem. See its own README.md and Weka for details.
MLP-400AV is a new, flexible dataset that uses the same data but for Authorship Verification. Because of its extensive API, it can be adapted to almost any NLP task.
The total size of the datasets is over 18 million characters for MLPA-400 and almost double that for MLPA-400AV.