Incorrect handling of punctuation for tokenization in Latin pipeline #1207

nkprasad12 · 2023-02-22T14:17:36Z

Describe the bug
In some cases, the tokenizer in the Latin pipeline does not split a trailing ! from the preceding word, so the word and the punctuation mark are returned as a single token (e.g. fili!).

To Reproduce
Steps to reproduce the behavior:

1. Install Python 3.8.
2. Install CLTK 1.1.6 with pip.
3. In a script or REPL, run the following code:

(venv) nitin@nkprasad:~/Documents/code/morcus/morcus-net$ python
Python 3.8.10 (default, Nov 14 2022, 12:59:47) 
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import cltk
>>> nlp = cltk.NLP('lat')
‎𐤀 CLTK version '1.1.6'.
Pipeline for language 'Latin' (ISO: 'lat'): `LatinNormalizeProcess`, `LatinStanzaProcess`, `LatinEmbeddingsProcess`, `StopsProcess`, `LatinLexiconProcess`.

See the error:

>>> doc = nlp.analyze('Cautus esto, mi fili! Iam sequere me!')
>>> doc.morphosyntactic_features[4]
{}
>>> doc.pos[4]
'PUNCT'
>>> doc.tokens[4]
'fili!'

Expected behavior
fili should be a separate token from !. Because the two are not separated, we are not getting PoS tags or morphosyntactic features for fili (I assume the combined token is processed as PUNCT instead).
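
To make the expected split concrete, here is a minimal sketch using only the standard library. It does not call CLTK or Stanza and is not how either library tokenizes; `split_punctuation` is a hypothetical helper introduced purely to show the token boundaries I would expect for the sentence above.

```python
import re

# Hypothetical helper, NOT part of CLTK: treats each run of word
# characters as one token and each punctuation character as its own token.
_TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def split_punctuation(text):
    """Return words and punctuation marks as separate tokens."""
    return _TOKEN_RE.findall(text)


tokens = split_punctuation("Cautus esto, mi fili! Iam sequere me!")
print(tokens[4], tokens[5])  # expected boundaries: 'fili' and '!' are separate
```

With this splitting, index 4 holds fili and index 5 holds !, which are the boundaries I would expect doc.tokens to have as well.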

Desktop (please complete the following information):

OS and version: Ubuntu 20.04.05 LTS


nkprasad12 added the bug label Feb 22, 2023
