
Handling of URIs #320

Open
ryanheise opened this issue Feb 22, 2024 · 1 comment

Comments

@ryanheise

URLs tend to make the language detector switch to English. For example:

最近、新しいウェブサイトを見つけました。
そのサイトはhttps://somerandomwebsite.com/hello.htmlで、興味深いコンテンツがいっぱいです。
特に「こんにちは.html」のセクションは非常に面白く、訪れる価値があります。
新しい発見があるかもしれませんので、ぜひチェックしてみてください。

Results:

JAPANESE: 最近、新しいウェブサイトを見つけました。
そのサイトは
ENGLISH: https://somerandomwebsite.com/hello.
JAPANESE: htmlで、興味深いコンテンツがいっぱいです。
特に「こんにちは.html」のセクションは非常に面白く、訪れる価値があります。
新しい発見があるかもしれませんので、ぜひチェックしてみてください。

Notice also that the end of the URL, .html, is separated from the beginning of the URL and classified as yet another language change.

The same also happens if just the domain name somerandomwebsite.com is referenced in the text.

Would it be reasonable for the language detector to treat URIs as "language neutral" stretches of text, perhaps assuming the language of the surrounding text? Treating URIs as atomic would also solve the issue of URIs being split by the language detector.

Note: It is also possible to do this by post-processing the results of Lingua. After receiving the start/end indices of each language segment from Lingua, I apply my URI regular expression to find the start/end indices of the URIs, and then modify the Lingua results accordingly.
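The post-processing step above can be sketched roughly as follows. This is a minimal illustration, not Lingua's actual API: the `Segment` struct is a hypothetical stand-in for Lingua's detection results (start/end byte indices plus a language label), and `find_uri_spans` is a crude stand-in for a real URI regular expression.

```rust
/// Hypothetical stand-in for a Lingua detection result:
/// byte range of a segment plus its detected language.
#[derive(Debug, Clone)]
struct Segment {
    start: usize,
    end: usize,
    language: String,
}

/// Very rough stand-in for a real URI regex: a span starts at "http://" or
/// "https://" and runs until the first character outside the URI character set.
fn find_uri_spans(text: &str) -> Vec<(usize, usize)> {
    const URI_CHARS: &str = "-._~:/?#[]@!$&'()*+,;=%";
    let mut spans = Vec::new();
    let mut i = 0;
    while let Some(pos) = text[i..].find("http") {
        let start = i + pos;
        let rest = &text[start..];
        if rest.starts_with("http://") || rest.starts_with("https://") {
            // Extend the span until a character that cannot appear in a URI.
            let len = rest
                .char_indices()
                .find(|&(_, c)| !c.is_ascii_alphanumeric() && !URI_CHARS.contains(c))
                .map(|(idx, _)| idx)
                .unwrap_or(rest.len());
            spans.push((start, start + len));
            i = start + len;
        } else {
            i = start + 4;
        }
    }
    spans
}

/// Give every segment that overlaps a URI the language of the preceding
/// segment (or the following one, if the URI leads the text), then coalesce
/// adjacent segments that ended up with the same language.
fn absorb_uri_segments(mut segments: Vec<Segment>, uris: &[(usize, usize)]) -> Vec<Segment> {
    for i in 0..segments.len() {
        let overlaps = uris
            .iter()
            .any(|&(s, e)| segments[i].start < e && s < segments[i].end);
        if overlaps {
            if i > 0 {
                segments[i].language = segments[i - 1].language.clone();
            } else if segments.len() > 1 {
                segments[i].language = segments[i + 1].language.clone();
            }
        }
    }
    let mut merged: Vec<Segment> = Vec::new();
    for seg in segments {
        match merged.last_mut() {
            Some(last) if last.language == seg.language && last.end == seg.start => {
                last.end = seg.end;
            }
            _ => merged.push(seg),
        }
    }
    merged
}
```

With the example from this issue, the ENGLISH segment covering the URL overlaps a detected URI span, so it is relabeled JAPANESE and coalesced with its neighbors into a single segment.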


serega commented Feb 26, 2024

I work with text that may contain URLs. I pre-process documents before feeding them into lingua-rs, using the linkify crate to find URL indices. Finding URLs is a tricky problem in its own right, and there are many ways to do it: linkify returns any string that is valid according to the specs, but there can be false positives. In addition, I validate domain names using the addr crate.
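One way the pre-processing step can work is to blank out each detected URL span before detection, so the detector never sees the URL while byte offsets in its output still line up with the original text. The sketch below assumes the URL spans are already known (e.g. as a linker such as linkify would report them); the masking itself uses only the standard library.

```rust
/// Replace each (start, end) byte span with ASCII spaces. One space per byte
/// keeps every index outside the spans identical between the masked text and
/// the original, so detection results can be mapped straight back.
/// Assumes spans are non-overlapping, sorted, and on char boundaries.
fn mask_spans(text: &str, spans: &[(usize, usize)]) -> String {
    let mut masked = String::with_capacity(text.len());
    let mut cursor = 0;
    for &(start, end) in spans {
        masked.push_str(&text[cursor..start]);
        masked.extend(std::iter::repeat(' ').take(end - start));
        cursor = end;
    }
    masked.push_str(&text[cursor..]);
    masked
}
```

The masked string is what gets handed to the language detector; the original text is kept for display and for slicing out the final segments.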
