Troubleshoot Coptic in Stanza #1058

Open
kylepjohnson opened this issue Feb 13, 2021 · 0 comments

Coming from #634, which was finished by @AMR-KELEG.

Amir's stopword list is OK, in that StopsProcess does in fact do a 1:1 match on each token and marks it stops=True in the Word object.
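To make the 1:1 matching concrete, here is a minimal sketch of the behavior described above. The `Word` dataclass and `mark_stops` function are hypothetical stand-ins for illustration, not CLTK's actual classes, and the sample strings are chosen only as examples.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Word:
    """Hypothetical stand-in for a token object; not the real CLTK Word."""
    string: str
    stops: bool = False


def mark_stops(words: List[Word], stop_list: Iterable[str]) -> List[Word]:
    """Flag each word whose string exactly equals an entry in stop_list."""
    stop_set = set(stop_list)  # set lookup keeps the exact match cheap
    for word in words:
        word.stops = word.string in stop_set
    return words


# Illustrative tokens only; a real run would use the tokenizer's output.
words = mark_stops([Word("ⲁⲩⲱ"), Word("ⲣⲱⲙⲉ")], ["ⲁⲩⲱ"])
```

Because the comparison is an exact string match, it only works if the tokenizer emits tokens in the same form as the stopword list, which is why the tokenization problem below matters.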

However, the tokenization coming out of Stanza is clearly wrong. I bet this is because of the diacritics in the example texts we're using (here). So perhaps we need a different example, but we also need someone who knows the language to help with proper preprocessing, so that we avoid this in the future.

I propose a new Process at cltk/alphabet that removes things like diacritics and diaereses that a downstream NLP process does not expect. NormalizeTextProcess might work. We should also figure out whether to reuse the older character normalizer, cltk_normalize (ideally we would).
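As a rough sketch of what such a step could do, the snippet below strips combining marks using only Python's standard `unicodedata` module. The function name `strip_combining_marks` is hypothetical, and this is not the existing `cltk_normalize`. It assumes the diacritics are encoded as combining characters (Unicode category `Mn`), and whether every such mark should be dropped for Coptic still needs input from someone who knows the language.

```python
import unicodedata


def strip_combining_marks(text: str) -> str:
    """Remove nonspacing combining marks (Unicode category Mn) from text.

    Hypothetical helper for illustration; assumes diacritics are encoded
    as combining characters rather than as distinct base letters.
    """
    # Decompose so that any precomposed letter splits into base + marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Keep everything except nonspacing marks (diaeresis, overline, ...).
    stripped = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    # Recompose whatever remains into canonical composed form.
    return unicodedata.normalize("NFC", stripped)


# Coptic small letter alfa (U+2C81) followed by a combining diaeresis (U+0308).
cleaned = strip_combining_marks("\u2c81\u0308")
```

A step like this would run before the text is handed to Stanza, so the tokenizer only sees base letters.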
