Unicode regexp for NLP #599

Open
vitalydolgov opened this issue Dec 8, 2021 · 2 comments

@vitalydolgov

According to its documentation, the Str module, which you use heavily for NLP, does not support UTF-8 at all, so it seems unfit for the general task of text processing.

For example, Owl_nlp_utils.regexp_split is defined as Str.regexp "[ \t;,.'!?()’“”\\/&—\\-]+". Note that it contains special quotation marks, each 3 bytes long in UTF-8:

# String.length "“”";;
- : int = 6

This brings us to the problem:

# Str.split (Str.regexp "[“”]+") "рык";;
- : string list = ["Ñ"; "ык"]

This happens because Str matches byte by byte, and the second byte of р, \128, is also a byte of “:

# List.of_seq @@ String.to_seq "рык";;
- : char list = ['\209'; '\128'; '\209'; '\139'; '\208'; '\186']
# List.of_seq @@ String.to_seq "“";;
- : char list = ['\226'; '\128'; '\156']
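
The collision can also be checked with the standard library alone. The following is a minimal sketch; `bytes_of` and `shared_bytes` are hypothetical helper names introduced here for illustration and are not part of Owl or Str. It lists the bytes that a delimiter string and a text have in common, which are exactly the bytes a byte-oriented character class would wrongly match inside the text.

```ocaml
(* Hypothetical helper: the bytes of a string as integer codes. *)
let bytes_of s = List.map Char.code (List.of_seq (String.to_seq s))

(* Hypothetical helper: bytes that occur in both [delims] and [text],
   sorted and without duplicates. A byte-oriented character class built
   from [delims] would match each of these bytes inside [text]. *)
let shared_bytes delims text =
  let d = bytes_of delims in
  bytes_of text
  |> List.filter (fun b -> List.mem b d)
  |> List.sort_uniq compare

let () =
  (* Print every byte shared by the curly quotes and the Cyrillic word. *)
  shared_bytes "“”" "рык"
  |> List.iter (fun b -> Printf.printf "shared byte: \\%d\n" b)
```

An empty result for a given text would mean that this particular input is safe, but that depends on the input and cannot be relied on in general.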

To solve this I propose switching to Pcre or similar libraries, which accept Unicode regular expressions:

# Pcre.split ~rex:(Pcre.regexp "[“]+" ~flags:[`UTF8]) "рык“льва";;
- : string list = ["рык"; "льва"]
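
For comparison, here is a rough sketch of the same split done on code points rather than bytes, without any regexp library. It is not the Pcre approach proposed above, only an illustration of what "Unicode-aware" means here. It assumes OCaml 4.14 or later (for `String.get_utf_8_uchar` and `Uchar.utf_decode_*`), and `split_utf8` and `delims` are hypothetical names that do not exist in Owl.

```ocaml
(* Hypothetical helper: split [s] on any code point listed in [delims],
   decoding UTF-8 with the standard library (assumes OCaml >= 4.14).
   Runs of delimiters produce no empty tokens. *)
let split_utf8 ~delims s =
  let is_delim u = List.exists (Uchar.equal u) delims in
  let n = String.length s in
  let buf = Buffer.create n in
  (* Push the current token, if non-empty, onto the accumulator. *)
  let flush acc =
    if Buffer.length buf = 0 then acc
    else begin
      let tok = Buffer.contents buf in
      Buffer.clear buf;
      tok :: acc
    end
  in
  let rec go i acc =
    if i >= n then List.rev (flush acc)
    else
      let d = String.get_utf_8_uchar s i in
      (* Invalid sequences decode with a length of at least 1, so the
         loop always advances; their bytes are copied through unchanged. *)
      let len = Uchar.utf_decode_length d in
      let u = Uchar.utf_decode_uchar d in
      if is_delim u then go (i + len) (flush acc)
      else begin
        Buffer.add_substring buf s i len;
        go (i + len) acc
      end
  in
  go 0 []

(* U+201C and U+201D: the left and right curly double quotes. *)
let delims = List.map Uchar.of_int [0x201C; 0x201D]

let () =
  split_utf8 ~delims "рык“льва" |> List.iter print_endline
```

Because delimiters are compared as whole code points, the byte \128 inside р is never mistaken for part of a quotation mark.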

What do you think?

@mseri
Member

mseri commented Dec 8, 2021

This is my personal opinion, but we should move away from Str and use re, which is the recommended library and the fastest, provided it supports UTF-8 in some way.

Otherwise I agree with you that pcre could be a better fit.

@mseri
Member

mseri commented Dec 8, 2021

Unfortunately re does not support utf-8 at the moment: ocaml/ocaml-re#24
