PyDKPro

PyDKPro provides a Python wrapper for the DKPro Core NLP framework. DKPro Core itself is based on the UIMA framework and written in Java. Interoperability is achieved via web services deployed as Docker containers. Once containerized, the services remain active until they are manually stopped or have been idle for a certain period of time.

Analysis results in DKPro Core are represented as CAS objects. Conversion between Java and Python data structures is based on dkpro-cassis. We also provide built-in support for the spaCy and NLTK formats, allowing for seamless integration.

PyDKPro is still under heavy development. Feedback is highly appreciated.

Demo Version

For demo purposes, different use cases are provided with working (mocked) examples in Examples/UseCases.ipynb.

System requirements

  • Python >=3.6
  • Git
  • mvn
  • Docker (please make sure it is running; see the check below)
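
You can verify that the Docker daemon is up and reachable with:

$ docker info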

Installation

Creating virtual environment

Install virtualenv if not already installed.

$ python -m pip install virtualenv

Create virtual environment. Replace [env_name] with a name of your choice.

$ mkdir [env_name]

$ virtualenv -p python3 [env_name]

or

$ python3 -m venv [env_name]

Activate created virtual environment.

For Windows:

$ [env_name]\Scripts\activate.bat

For Linux and Mac OS:

$ source [env_name]/bin/activate

Creating virtual environment using conda

Create virtual environment with conda. Replace [env_name] with a name of your choice.

$ conda create --name [env_name] python=3.6

To activate the environment (on Windows, macOS, and Unix):

$ conda activate [env_name]

Clone this repository

$ git clone https://github.com/zesch/pydkpro.git

Install dependencies using pip

$ cd pydkpro

$ python -m pip install -r requirements.txt

$ python -m spacy download en_core_web_sm

Features

  • Using DKPro Core components directly in Python
  • Conversion to spaCy
  • Conversion to NLTK

Usage

How to open examples notebook

$ cd Examples

$ jupyter notebook

Defining an NLP pipeline

A pipeline is built by adding DKPro Core components.

from pydkpro import Pipeline, Component
p = Pipeline(version="2.0.0", language='en')
p.add(Component().clearNlpSegmenter())
p.add(Component().stanfordPosTagger(variant='fast-caseless', printTagSet='false'))
p.build() # fire up the container web service

Note: All parameters are optional and default to the best-performing model versions.
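
For example, the pipeline above could be written with all component parameters left at their defaults (a sketch based on the components shown above):

p = Pipeline(version="2.0.0", language='en')
p.add(Component().clearNlpSegmenter())
p.add(Component().stanfordPosTagger())  # variant and printTagSet fall back to their defaults
p.build()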

Run the pipeline

Running the pipeline built above generates a CAS object for the provided string. The CAS object can be used to retrieve annotations such as tokens and POS tags.

cas = p.process('Backgammon is one of the oldest known board games.', language='en')

Note: A language detector is used if the language parameter is not provided.
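
For example, a sketch of the same call without the language parameter; the language would then be detected automatically:

cas = p.process('Backgammon is one of the oldest known board games.')  # language detected automatically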

To return all the tokens:

from pydkpro import DKProCoreTypeSystem as dts
cas.select(dts().token).as_text()

Output:

['Backgammon', 'is', 'one', 'of', 'the', 'oldest', 'known', 'board', 'games', '.']

To return all the POS tags:

cas.select(dts().token).get_pos()

Output:

['NNP', 'VBZ', 'NN', 'IN', 'DT', 'JJS', 'VBN', 'NN', 'NNS', '.']

Using UIMA CAS functionality

DKProCoreTypeSystem allows other type systems to be integrated, so that dkpro-cassis can be used with them. The generated CAS object provides UIMA CAS functionality. For example:

# add annotation
from pydkpro.cas import Cas
Token = dts().typesystem.get_type('de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token') # define dkpro token
cas = Cas(dts().typesystem)()
cas.sofa_string = "I like cheese ."
tokens = [
    Token(begin=0, end=1, id='0', pos='NNP'),
    Token(begin=2, end=6, id='1', pos='VBD'),
    Token(begin=7, end=13, id='2', pos='IN'),
    Token(begin=14, end=15, id='3', pos='.')
]


for token in tokens:
    cas.add_annotation(token)

CAS token attributes can be printed as follows:

print([x.get_covered_text() for x in cas.select_all()])
print([x.pos for x in cas.select_all()])

Output:

['I', 'like', 'cheese', '.']
['NNP', 'VBD', 'IN', '.']

Conversion from CAS to spaCy format and vice-versa

Generated CAS objects can also be converted to the spaCy format and back.

from pydkpro import To_spacy, From_spacy
cas = p.process('Backgammon is one of the oldest known board games.', language='en')


for token in To_spacy(cas)():
    print(token.text, token.tag_)

Conversion from spaCy

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
cas = From_spacy(doc)()
print(cas.select(dts().token).get_pos())

Conversion from CAS to NLTK format

NLTK uses a specific format for each type of preprocessing. Here is an example for POS tagging:

from pydkpro.external import To_nltk, From_nltk
print(To_nltk().tagger(cas))

Output:

[('Backgammon', 'NNP'), ('is', 'VBZ'), ('one', 'CD'), ('of', 'IN'), ('the', 'DT'), ('oldest', 'JJS'), ('known', 'VBN'), ('board', 'NN'), ('games', 'NNS'), ('.', '.')]

This output can then be used for further integration with other NLTK components:

import nltk
chunkGram = r"""Chunk: {<RB.?>*<VB.?>*<NNP>}"""
chunkParser = nltk.RegexpParser(chunkGram)
chunked = chunkParser.parse(To_nltk().tagger(cas))
print(chunked)

Output:

(S
  (Chunk Backgammon/NNP)
  is/VBZ
  one/CD
  of/IN
  the/DT
  oldest/JJS
  known/VBN
  board/NN
  games/NNS
  ./.)

Conversion from NLTK

PyDKPro also provides the reverse functionality, where a CAS object can be created from spaCy or NLTK output. In the following example, tokenization is performed with the NLTK tweet tokenizer, while POS tagging is done with the DKPro wrapper of the Stanford CoreNLP POS tagger using its fast.41 model:

from nltk.tokenize import TweetTokenizer
cas = From_nltk().tokenizer(TweetTokenizer().tokenize('Backgammon is one of the oldest known board games.'))

Cas processing

The PyDKPro pipeline can also process CAS objects directly, as demonstrated in the example below:

cas = p.process(cas)

# get tokens
print(cas.select(dts().token).as_text())

# get pos tags
print(cas.select(dts().token).get_pos())

Shortcut for running single components

A single component can also be run without the need to build a pipeline first:

tokenizer = Component().clearNlpSegmenter()

cas = tokenizer.process('I like playing cricket.')
print(cas.select(dts().token).as_text())

Output:

['I', 'like', 'playing', 'cricket', '.']

Working with list of strings

Multiple strings can also be processed as a list, where each element of the list is treated as a separate document.

str_list = ['Backgammon is one of the oldest known board games.', 'I like playing cricket.']
for text in str_list:
    cas = p.process(text)
    print(cas.select(dts().token).as_text())

Working with text documents

Pipelines can also be directly run on text documents:

from pydkpro.external import File2str

cas = p.process(File2str('test_data/input/test2.txt')())
print(cas.select(dts().token).as_text())

Working with multiple text documents

Multiple documents can also be processed by providing a list of document paths or document name matching patterns:

# documents available at different paths can be provided as a list
docs = ['test_data/input/1.txt', 'test_data/input/2.txt']
for doc in docs:
    p.process(File2str(doc)())
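
A name-matching pattern could, for instance, be expanded with Python's standard glob module (a sketch; the test_data/input path is taken from the example above):

import glob

# process every .txt document in the input directory
for doc in sorted(glob.glob('test_data/input/*.txt')):
    p.process(File2str(doc)())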

End collection process

The following command completes the pipeline's collection process (alternatively, the with scope operator can be used, as sketched below):

p.finish()
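
A with-based variant might look like the following sketch, assuming Pipeline supports the context manager protocol so that finish() is called automatically when the block ends:

with Pipeline(version="2.0.0", language='en') as p:
    p.add(Component().clearNlpSegmenter())
    p.add(Component().stanfordPosTagger())
    p.build()
    cas = p.process('Backgammon is one of the oldest known board games.')
    # finish() is called automatically at the end of the with block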
