Tokens with fancy quotes are being merged #16

cakelly · 2016-09-01T15:23:44Z

I am working on a comparison of tokenizers for microblog texts and am finding issues with NLP4J 1.1.3 (from http://nlp.mathcs.emory.edu/nlp4j/nlp4j-appassembler-1.1.3.tgz).

The first involves texts with fancy quotes, e.g. [ “@DevTheBarbie: ] | [ #Colorado’s ], which are being lumped into the same token as the Twitter tokens they precede or follow. The ’s in "#Colorado’s" is a possessive and should be a separate token. The same goes for the opening “ in "“@DevTheBarbie".
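
To make the expected segmentation concrete, here is a minimal Python sketch. It is only an illustration of the output I would expect, not NLP4J's tokenizer or its rules; the names `TOKEN_RE` and `tokenize` are hypothetical, and the pattern only covers the cases described above.

```python
import re

# Illustrative pattern only (an assumption, not NLP4J's actual rules).
# Alternatives are tried left to right at each position.
TOKEN_RE = re.compile(
    r"""
    [@#]\w+          # Twitter mention or hashtag, kept as one token
    | [’']s\b        # possessive clitic with a curly or straight apostrophe
    | [“”‘’"']       # a single quote character, curly or straight
    | \w+            # ordinary word
    | [^\w\s]        # any other single punctuation mark
    """,
    re.VERBOSE,
)


def tokenize(text):
    """Split text into tokens, separating fancy quotes from Twitter tokens."""
    return TOKEN_RE.findall(text)


print(tokenize("“@DevTheBarbie:"))  # ['“', '@DevTheBarbie', ':']
print(tokenize("#Colorado’s"))      # ['#Colorado', '’s']
```

The point is only that the curly quote and the possessive end up as their own tokens while the mention and hashtag stay intact.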

The online demo (http://nlp.mathcs.emory.edu:8080/nlp4j/NLP4JServlet) is handling these correctly, however.

I'm attaching the original input files, and the parses from NLP4J.
098.conll.txt
103.conll.txt

098.txt
103.txt

[This issue imported from emorynlp/nlp4j-tokenization#8]
