
Group Pyxis - Project 1: Scalable Document Classification

This repository classifies documents from the Reuters Corpus, a set of news stories with multiple class labels, using Spark's Python API. It was completed for CSCI 8360: Data Science Practicum at the University of Georgia. The training data is over 1 gigabyte, and the testing data is roughly 117 MB, with over 80,000 documents/news stories.

The news stories are split into different categories. In this project, we focus only on the following four labels:

  1. CCAT: Corporate/Industrial
  2. ECAT: Economics
  3. GCAT: Government/Social
  4. MCAT: Markets

For documents with more than one label, we treat the document as if it were observed once for each of its CAT labels. In prediction, we predict only one label for each document.
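
The sketch below illustrates this label expansion with Spark's Python API. It is only a rough illustration, not the repository's code: the file names, the comma-separated label format, and the line-by-line alignment of the document and label files are assumptions.

from pyspark import SparkContext

TARGET_LABELS = {"CCAT", "ECAT", "GCAT", "MCAT"}

sc = SparkContext(appName="label-expansion-sketch")

# Pair each document with its label line by line number (assumed alignment).
docs = sc.textFile("X_train.txt").zipWithIndex().map(lambda p: (p[1], p[0]))
labels = sc.textFile("y_train.txt").zipWithIndex().map(lambda p: (p[1], p[0]))

# Keep only the four CAT labels and emit one (document, label) pair per label,
# so a document with two CAT labels contributes two training observations.
train = (docs.join(labels)
             .flatMap(lambda kv: [(kv[1][0], lab)
                                  for lab in kv[1][1].split(",")
                                  if lab.strip() in TARGET_LABELS]))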

Getting started

Prerequisites

This project uses Apache Spark. You'll need Spark installed on the target cluster, the SPARK_HOME environment variable set, and the Spark binaries on your system path. You also need to install the NLTK library for stemming.
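
NLTK can typically be installed with pip, for example:

pip install nltk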

Built with

  • Apache Spark (PySpark)
  • Python
  • NLTK

How to run

To get it running, you can use

spark-submit main.py [keyword arguments]

to run the main program; the output will be written to the path you specify. The program accepts several keyword arguments (an example invocation follows the list below):

  • -x: the path to the x_training file (documents). REQUIRED
  • -y: the path to the y_training file (labels). REQUIRED
  • -xtest: the path to the x_testing file (documents). REQUIRED
  • -st: the path to the stopword file(s). OPTIONAL, default value None.
  • -l: length of words to throw away. OPTIONAL, default value 2 (i.e. ignore all words with length 2 or less).
  • -o: the path for the output file output.txt. OPTIONAL, defaults to the same directory as main.py.
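
A typical invocation might look like the following; the file paths are placeholders, not files shipped with this repository:

spark-submit main.py -x data/X_train.txt -y data/y_train.txt -xtest data/X_test.txt -st stopwords.txt -l 2 -o output/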

If you find any problems, please open an issue!

Specific features

Here are some of the methods we use in this project (a rough preprocessing sketch follows the list):

  • Baseline: Naive Bayes model
  • removal of words whose lengths are less than or equal to 2
  • removal of stop words: we choose a long list of stop words (see here and find "a very long stopword list").
  • TF-IDF (term frequency inverse document frequency)
  • n-gram (2-gram; 3-gram or higher could be achieved by tuning the parameter in pre_processing.py)
  • stemming: we use NLTK's porter stemmer (see http://www.nltk.org/howto/stem.html)
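
As a rough illustration of the preprocessing steps above (a minimal sketch, not the code in pre_processing.py; the tokenization pattern and the tiny stand-in stopword set are assumptions):

import re
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
stopwords = {"the", "a", "an", "and", "of", "to"}  # stand-in for the long stopword list

def preprocess(text, min_keep_length=3):
    # Lowercase and tokenize on runs of letters.
    tokens = re.findall(r"[a-z]+", text.lower())
    # Drop stop words and words of length 2 or less (the -l 2 default),
    # then apply NLTK's Porter stemmer.
    return [stemmer.stem(t) for t in tokens
            if t not in stopwords and len(t) >= min_keep_length]

def bigrams(tokens):
    # 2-grams; 3-grams or higher follow the same pattern.
    return list(zip(tokens, tokens[1:]))

The stemmed tokens (and optionally the n-grams) would then be weighted with TF-IDF before being fed to the Naive Bayes model.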

Contributors (alphabetically sorted)

  • Layton Hayes, Institute of Artificial Intelligence, University of Georgia
  • Parya Jandaghi, Department of Computer Science, University of Georgia
  • Jeremy Shi, Institute of Artificial Intelligence, University of Georgia

See the contributors file for detailed contributions. We also thank Shannon Quinn for helpful instructions.

License

MIT

TODO

  • Tune the stopword list to improve accuracy.
  • Improve smoothing in TF-IDF.