Skip to content
Simon Gray edited this page Oct 14, 2017 · 22 revisions

This is the Wiki for corenlp-clj, a wrapper for Stanford CoreNLP written in Clojure.

Mission

Stanford CoreNLP is a powerful tool for Natural Language Processing, but features a rather clunky design with lots of cruft built up over the years. This library seeks to apply a lighter and more functional style to its API, while still retaining direct use of the data structures found in the Java version.

A secondary goal of the project is to provide sensible documentation for newcomers to NLP. Stanford CoreNLP is not a beginner-friendly tool, but corenlp-clj aims to be just that while still remaining powerful.

Usage

(ns example-project.core
  (:require [corenlp-clj.core :refer [pipeline prerequisites]]
            [corenlp-clj.annotations :refer [sentences dependency-graph]]
            [corenlp-clj.loom.io :refer [view]]))

(def nlp (pipeline {"annotators" (prerequisites "depparse")}))

(view (->> "The dependencies of this sentence have been visualised using Graphviz."
           nlp
           sentences
           dependency-graph
           first))

Check out the dependency graph that was generated in this short example. For an introduction to corenlp-clj, please refer to the tutorial!

API design

Function naming

As a general rule, the names of annotation retrieval function reflect the output of said function, e.g. pos outputs a part-of-speech, text outputs some text, dependency-graph outputs a dependency graph, and sentences outputs a seq of sentences.

Unnecessary annotations and method names from Stanford CoreNLP that shadow others have not been re-implemented in corenlp-clj, e.g. value or word would retrieve the same annotation as text when applied to tokens.

The output of functions whose names end with an s will typically have seqable output. In the case of chaining annotation functions together this does not matter, as seqable function outputs are mapped automatically by the next function in the chain. This principle results in conceptually clear code from the input to the output. The dimensionality of the final array can be gauged by counting how many function names are in plural.

Current goal

The development of this library is driven right now by my own needs for sensible implementations of CoreNLP functionality related to parts-of-speech and dependency graphs in Chinese. The base pipeline is ready for any kind of annotation work -- all of which can be accessed using a set of common functions -- and I'm working on implementing specific functionality in the semgraph package at the moment.

Links worth checking out