StAn

In natural language processing, algorithms often require additional syntactic and semantic features, such as part-of-speech, named entity, and dependency tags, that are not readily available in most datasets. StAn provides a convenient way to annotate an existing dataset with these features, computed by Stanford CoreNLP.

Getting Started

Prerequisites

StAn uses either a local CoreNLP installation or an existing CoreNLP server. To use a local installation, download and unpack the latest version from the Stanford CoreNLP website.
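
Before pointing StAn at an existing server, it can help to confirm that the server answers annotation requests. The following is a minimal sketch that talks to the CoreNLP server's HTTP interface directly, using only the Python standard library. It is not part of StAn; the function names `build_corenlp_request` and `annotate`, and the default annotator list, are assumptions made for illustration.

```python
import json
import urllib.parse
import urllib.request


def build_corenlp_request(server_url, text,
                          annotators="tokenize,ssplit,pos,ner,depparse"):
    """Build a POST request asking a CoreNLP server for JSON annotations."""
    # The server reads its pipeline settings from the "properties" query parameter.
    properties = {"annotators": annotators, "outputFormat": "json"}
    query = urllib.parse.urlencode({"properties": json.dumps(properties)})
    url = server_url.rstrip("/") + "/?" + query
    return urllib.request.Request(
        url,
        data=text.encode("utf-8"),
        # Plain-text content type so the body is not treated as form data.
        headers={"Content-Type": "text/plain; charset=utf-8"},
        method="POST",
    )


def annotate(server_url, text, timeout=30):
    """Send text to the server and return the parsed JSON response."""
    request = build_corenlp_request(server_url, text)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Assuming a server is listening on `http://localhost:9000` (the CoreNLP server's default port), `annotate("http://localhost:9000", "Stanford is in California.")` should return a dictionary whose `"sentences"` entry holds the per-token annotations.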

Installing

With pip

TBD

From Source

Clone the repository and run:

pip install [--editable] .

Usage

For example, the following command annotates the SemEval 2010 Task 8 relation extraction dataset with POS, NER, and dependency information and saves it in JSONL format.

stan \
    --input-dir $INPUT_PATH/SemEval2010_task8_all_data/ \
    --output-dir $OUTPUT_PATH/ \
    --corenlp $PATH_TO_CORENLP_JAR_OR_SERVER_URL \
    --input-format semeval2010task8 \
    --output-format jsonl \
    --shuffle \
    --validation-size 0.1 \
    --n-jobs 4
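
Since JSONL stores one JSON object per line, the annotated output can be loaded with a few lines of Python. This is a minimal sketch; the helper name `read_jsonl` is hypothetical, and the names of the files StAn writes and the fields of each record are not specified here, so inspect the output directory for the actual layout.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Yield one parsed JSON object per non-empty line of a JSONL file."""
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)
```

For example, `records = list(read_jsonl(path_to_output_file))` loads all annotated examples from one output file into memory.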

Parameters:

input-dir: the directory containing the dataset or dataset files. StAn expects a specific structure for common datasets (e.g. SemEval 2010 Task 8). The format of the input is specified by input-format.
output-dir: the directory to store the annotated dataset. The format in which to save the dataset is specified by output-format.
corenlp: the path to the directory containing the CoreNLP jar file, or a URL pointing to an existing CoreNLP server.
input-format: the format of the input dataset, can be one of "semeval2010task8", "json" or "jsonl".
output-format: the format of the output dataset, can be one of "tacred", "json" or "jsonl".
shuffle: whether to shuffle the training dataset before splitting into train and validation (only if validation size > 0).
validation-size: the fraction of the training dataset to hold out for validation; only applied if > 0.
n-jobs: the number of threads to use for concurrent requests to CoreNLP.
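
The interaction between shuffle and validation-size can be illustrated with a short sketch. This is not StAn's actual implementation; the function name `split_train_validation`, the `seed` argument, and the choice to round the validation count down are assumptions made for illustration.

```python
import random


def split_train_validation(examples, validation_size=0.1, shuffle=True, seed=None):
    """Split examples into (train, validation) lists.

    Shuffling happens only when a validation set is requested,
    mirroring the parameter descriptions above.
    """
    if not 0 <= validation_size < 1:
        raise ValueError("validation_size must be in [0, 1)")
    examples = list(examples)
    if validation_size == 0:
        # No validation set requested: leave the data untouched.
        return examples, []
    if shuffle:
        random.Random(seed).shuffle(examples)
    # Assumption: the validation count is rounded down.
    n_validation = int(len(examples) * validation_size)
    n_train = len(examples) - n_validation
    return examples[:n_train], examples[n_train:]
```

With 100 examples and `validation_size=0.1`, this yields 90 training and 10 validation examples.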

Running the tests

The project uses pytest for unit tests and mypy for static type checking.

Unit tests

pytest -v tests/

Type checker and coding style tests

mypy stan --ignore-missing-imports

Built With

Stanford CoreNLP
stanford-corenlp - Python wrapper for Stanford CoreNLP

Authors

Christoph Alt

See also the list of contributors who participated in this project.

License

This project is licensed under the MIT License; see the LICENSE file for details.
