Troubleshoot Coptic in Stanza #1058

kylepjohnson · 2021-02-13T18:21:34Z

Follow-up to #634, which was finished by @AMR-KELEG.

Amir's stopword list is OK, in that StopsProcess does in fact do a 1:1 match on a token and marks it stops=True in the Word object.

However, the tokenization coming out of Stanza is clearly wrong. I suspect this is because of the diacritics in the example texts we're using. So we may need a different example, but we also need someone who knows the language to help with proper preprocessing, to avoid this in the future.

I propose a new Process at cltk/alphabet that removes characters, such as diacritics and diaereses, that a downstream NLP process does not expect. NormalizeTextProcess might work as a name. We should also figure out whether to reuse the older character normalizer, cltk_normalize (ideally we would).
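To make the proposal concrete, here is a minimal sketch of the kind of normalization such a Process could wrap. It uses only the standard library; the function name strip_combining_marks is hypothetical and is not an existing CLTK or Stanza API. It also assumes that dropping every nonspacing mark (supralinear strokes, diaereses) is acceptable for Coptic, which someone who knows the language should confirm.

```python
import unicodedata


def strip_combining_marks(text: str) -> str:
    """Hypothetical helper (not a CLTK API): remove nonspacing combining marks.

    Assumption: the downstream tokenizer expects bare letters, so marks such
    as U+0305 (combining overline) and U+0308 (combining diaeresis) are dropped.
    """
    # Decompose so any precomposed characters split into base + combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Keep everything except characters in Unicode category Mn (nonspacing mark).
    stripped = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    # Recompose whatever remains into a canonical form.
    return unicodedata.normalize("NFC", stripped)


# Coptic small alfa + combining overline, Coptic small iauda + combining diaeresis
sample = "\u2c81\u0305\u2c93\u0308"
print(strip_combining_marks(sample))
```

Running this before tokenization would also keep stopword matching consistent, as long as the stopword list is normalized the same way.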

kylepjohnson added the bug label Feb 13, 2021

kylepjohnson mentioned this issue Feb 13, 2021

Add NormalizeProcess #1059

Closed

kylepjohnson added the version1.0 label Feb 13, 2021
