Incorrect handling of punctuation for tokenization in Latin pipeline #1207

Open
nkprasad12 opened this issue Feb 22, 2023 · 0 comments
Describe the bug

In some cases, the tokenizer for the Latin pipeline doesn't properly separate ! as a token.

To Reproduce
Steps to reproduce the behavior:

  1. Install Python version 3.8
  2. Install CLTK version 1.1.6 with pip
  3. In a script or REPL, run the following code … (include literal copy-paste)
(venv) nitin@nkprasad:~/Documents/code/morcus/morcus-net$ python
Python 3.8.10 (default, Nov 14 2022, 12:59:47) 
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import cltk
>>> nlp = cltk.NLP('lat')
‎𐤀 CLTK version '1.1.6'.
Pipeline for language 'Latin' (ISO: 'lat'): `LatinNormalizeProcess`, `LatinStanzaProcess`, `LatinEmbeddingsProcess`, `StopsProcess`, `LatinLexiconProcess`.
  4. See error (include literal copy-paste)
>>> doc = nlp.analyze('Cautus esto, mi fili! Iam sequere me!')
>>> doc.morphosyntactic_features[4]
{}
>>> doc.pos[4]
'PUNCT'
>>> doc.tokens[4]
'fili!'

Expected behavior
fili should be a separate token from !. Because the two are kept together as fili!, we are not getting morphosyntactic features or a correct PoS tag for fili (I assume the combined token is processed as PUNCT instead).
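
For reference, here is a minimal sketch of the token boundaries I would expect for the sentence above. It uses a plain regular expression, not CLTK or Stanza, and `split_punct` is a hypothetical helper name made up for this illustration; it is only meant to show the expected split, not how the pipeline should implement it.

```python
import re

# Hypothetical helper, not part of CLTK: a word is a run of word characters,
# and every other non-space character becomes its own token.
_TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def split_punct(text):
    """Return words and punctuation marks as separate tokens."""
    return _TOKEN_RE.findall(text)


tokens = split_punct("Cautus esto, mi fili! Iam sequere me!")
print(tokens[4], tokens[5])  # expected: the word and the "!" as two tokens
```

With a split like this, index 4 would hold fili and index 5 would hold !, so the tagger would see the noun on its own.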

Desktop (please complete the following information):

  • OS and version: Ubuntu 20.04.05 LTS
@nkprasad12 nkprasad12 added the bug label Feb 22, 2023