Amir's stopword list is OK, in that StopsProcess does in fact do a 1:1 match on a token and mark it stops=True in a Word object.
However, the tokenization coming out of Stanza is clearly wrong. I bet this is because of the diacritics on the example texts we're using (here). So perhaps we need a different example but also need someone who knows the language to help with proper preprocessing, so as to avoid this in the future.
I propose a new Process at cltk/alphabet that removes things like diacritics and diaereses that some other NLP process does not expect. NormalizeTextProcess might work. We should also figure out whether to use the older character normalizer, cltk_normalize, too (ideally we would).
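To make the proposal concrete, here is a minimal sketch of the kind of normalization such a Process could wrap. It uses only the standard library, and the function name strip_diacritics is hypothetical, not an existing CLTK API. It assumes that dropping every Unicode nonspacing mark (category Mn) after canonical decomposition is acceptable, which is exactly the sort of decision someone who knows the language should confirm: in some scripts this also changes letters whose precomposed form carries a mark.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove nonspacing combining marks (diacritics, diaereses) from text.

    Hypothetical helper for illustration only; not part of CLTK.
    """
    # NFD splits precomposed characters into a base character followed by
    # its combining marks, so the marks can be filtered out individually.
    decomposed = unicodedata.normalize("NFD", text)
    # Keep everything except nonspacing marks (general category "Mn").
    stripped = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    # Recompose whatever is left into canonical composed form.
    return unicodedata.normalize("NFC", stripped)


if __name__ == "__main__":
    # A Latin word with a diaeresis, written with an escape for clarity.
    print(strip_diacritics("na\u00efve"))
```

Running the normalizer before tokenization and stopword matching would let a 1:1 token match succeed on text that carries marks the stopword list does not. Whether it should run before Stanza, or only before the stopword lookup, is left open here.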
Coming from #634, which was finished by @AMR-KELEG.