
Chompsky - Wikia NLP Library

These are scripts that use data available on Wikia for natural language processing. Their results can be reused for a variety of purposes (e.g. improved search relevance, customer support, new features).

This is named after a goofy drawing I made one time of Noam Chomsky. And also because it chomps through text. Take your pick. This project in no way endorses Chomsky's views on linguistics. In fact, it flouts them.

Written for the Wikia December 2012 Hackathon

Dependencies

The scripts below rely on Solr (as the source of wiki text), MongoDB (as the entity store), Stanford CoreNLP, and ARKRef.

Scripts

  • get-sentiment.py -- Accesses text from Solr for a given wiki and namespace and computes the average sentiment (positive/negative) and objectivity (subjective/objective). A sketch of this computation follows the list.
  • get-named-entities.py -- Retrieves all named entities for a document.
  • summarize-doc.py -- Summarizes the text of one or more documents.
  • ark-coref.py -- Uses ARKRef to perform coreference resolution on a specific document.
  • get-readability.py -- Performs a number of readability tests on a provided document. A sketch of one such test also follows the list.
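
As an illustration, here is a minimal sketch of the kind of computation get-sentiment.py performs. It assumes TextBlob for the polarity/subjectivity scores and a local Solr instance; the endpoint and the field names (wid, ns, html) are assumptions, not the script's actual schema.

```python
import requests
from textblob import TextBlob

SOLR_URL = "http://localhost:8983/solr/select"  # hypothetical Solr endpoint


def average_sentiment(wiki_id, namespace, rows=100):
    """Average polarity and subjectivity over a wiki's documents.

    The Solr field names (wid, ns, html) are assumptions.
    """
    params = {
        "q": "wid:%d AND ns:%d" % (wiki_id, namespace),
        "fl": "html",
        "wt": "json",
        "rows": rows,
    }
    docs = requests.get(SOLR_URL, params=params).json()["response"]["docs"]
    polarity = subjectivity = 0.0
    for doc in docs:
        sentiment = TextBlob(doc.get("html", "")).sentiment
        polarity += sentiment.polarity          # -1.0 (negative) .. 1.0 (positive)
        subjectivity += sentiment.subjectivity  # 0.0 (objective) .. 1.0 (subjective)
    n = max(len(docs), 1)
    return polarity / n, subjectivity / n
```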
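
And a sketch of one common readability test, the Flesch reading ease score, which is the kind of test get-readability.py runs; the vowel-group syllable counter below is a rough heuristic, not necessarily the script's actual method.

```python
import re


def count_syllables(word):
    # Rough heuristic: count groups of consecutive vowels.
    return max(len(re.findall(r"[aeiouy]+", word.lower())), 1)


def flesch_reading_ease(text):
    """Flesch reading ease:
    206.835 - 1.015 * (words/sentences) - 84.6 * (syllables/words).
    Higher scores mean easier text (90+ very easy, below 30 very hard).
    """
    sentences = max(len(re.findall(r"[.!?]+", text)), 1)
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(count_syllables(w) for w in words)
    n_words = max(len(words), 1)
    return (206.835
            - 1.015 * (n_words / float(sentences))
            - 84.6 * (syllables / float(n_words)))
```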

The Coref Pipeline

This is a set of scripts responsible for extracting named entities from Solr text and storing them in MongoDB. It consists of:

  • coref/coref-write-files.py -- writes one raw text file per wiki into a folder named after the wiki host
  • coref/coref-attach-files.py -- feeds a batch of files to Stanford CoreNLP's parser (see the first sketch after this list)
  • coref/coref-transform-xml.py -- iterates over a directory, reads all XML files, and extracts, transforms, and loads the data into MongoDB (see the second sketch after this list)
  • coref/coref-etl-batch.py -- assigns the ETL scripts to particular directories for batch processing
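
Stanford CoreNLP is typically driven from the command line with a file list, producing one XML file per input; here is a sketch of that invocation via subprocess. The heap size, classpath location, and annotator set are assumptions, not necessarily what coref-attach-files.py uses.

```python
import subprocess


def run_corenlp_batch(filelist, output_dir, corenlp_dir="stanford-corenlp"):
    """Run Stanford CoreNLP over every file listed (one path per line)
    in `filelist`, writing one <name>.xml per input into `output_dir`."""
    subprocess.check_call([
        "java", "-Xmx3g", "-cp", corenlp_dir + "/*",
        "edu.stanford.nlp.pipeline.StanfordCoreNLP",
        "-annotators", "tokenize,ssplit,pos,lemma,ner,parse,dcoref",
        "-filelist", filelist,
        "-outputDirectory", output_dir,
    ])
```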
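
A minimal sketch of the transform/load step, assuming the CoreNLP XML output format: walk a directory of XML files, collapse contiguous runs of same-typed NER tokens into entity strings, and insert them into MongoDB. The database and collection names are hypothetical.

```python
import os
import xml.etree.ElementTree as ET

from pymongo import MongoClient

# Hypothetical database/collection names.
coll = MongoClient()["chompsky"]["entities"]


def extract_entities(xml_path):
    """Pull contiguous named-entity token runs out of a CoreNLP XML file."""
    root = ET.parse(xml_path).getroot()
    entities = []
    for sentence in root.iter("sentence"):
        words, current_type = [], "O"
        for token in sentence.iter("token"):
            word = token.findtext("word")
            ner = token.findtext("NER", default="O")
            if ner == current_type and ner != "O":
                words.append(word)  # extend the current entity run
            else:
                if current_type != "O":
                    entities.append({"text": " ".join(words), "type": current_type})
                words, current_type = [word], ner
        if current_type != "O":
            entities.append({"text": " ".join(words), "type": current_type})
    return entities


def etl_directory(directory):
    # Transform every CoreNLP XML file in a directory and load into MongoDB.
    for name in os.listdir(directory):
        if name.endswith(".xml"):
            path = os.path.join(directory, name)
            coll.insert_one({"doc": name, "entities": extract_entities(path)})
```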
