
Handling of URIs #320

Open
ryanheise opened this issue Feb 22, 2024 · 1 comment

Comments

@ryanheise

URLs tend to make the language detector switch to English. For example:

最近、新しいウェブサイトを見つけました。
そのサイトはhttps://somerandomwebsite.com/hello.htmlで、興味深いコンテンツがいっぱいです。
特に「こんにちは.html」のセクションは非常に面白く、訪れる価値があります。
新しい発見があるかもしれませんので、ぜひチェックしてみてください。

Results:

JAPANESE: 最近、新しいウェブサイトを見つけました。
そのサイトは
ENGLISH: https://somerandomwebsite.com/hello.
JAPANESE: htmlで、興味深いコンテンツがいっぱいです。
特に「こんにちは.html」のセクションは非常に面白く、訪れる価値があります。
新しい発見があるかもしれませんので、ぜひチェックしてみてください。

Notice also that the end of the URL, .html, is separated from the beginning of the URL and classified as yet another language change.

The same also happens if just the domain name somerandomwebsite.com is referenced in the text.

Would it be reasonable for the language detector to treat URIs as "language neutral" stretches of text, perhaps assuming the language of the surrounding text? Treating URIs as atomic would also solve the issue of URIs being split by the language detector.

Note: It is also possible to do this by post-processing the results of Lingua. After receiving the start/end indices of each language segment from Lingua, I apply my URI regular expression to find the start/end indices of the URIs, and then modify the Lingua results accordingly.
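The post-processing step above can be sketched roughly as follows. This is a minimal illustration, not Lingua's actual API: the `Segment` struct is a hypothetical stand-in for Lingua's detection results (start/end byte indices plus a language label), and `find_uri_spans` is a crude stand-in for a real URI regular expression.

```rust
/// Hypothetical stand-in for a Lingua detection result:
/// byte range of a segment plus its detected language.
#[derive(Debug, Clone)]
struct Segment {
    start: usize,
    end: usize,
    language: String,
}

/// Very rough stand-in for a real URI regex: a span starts at "http://" or
/// "https://" and runs until the first character outside the URI character set.
fn find_uri_spans(text: &str) -> Vec<(usize, usize)> {
    const URI_CHARS: &str = "-._~:/?#[]@!$&'()*+,;=%";
    let mut spans = Vec::new();
    let mut i = 0;
    while let Some(pos) = text[i..].find("http") {
        let start = i + pos;
        let rest = &text[start..];
        if rest.starts_with("http://") || rest.starts_with("https://") {
            // Extend the span until a character that cannot appear in a URI.
            let len = rest
                .char_indices()
                .find(|&(_, c)| !c.is_ascii_alphanumeric() && !URI_CHARS.contains(c))
                .map(|(idx, _)| idx)
                .unwrap_or(rest.len());
            spans.push((start, start + len));
            i = start + len;
        } else {
            i = start + 4;
        }
    }
    spans
}

/// Give every segment that overlaps a URI the language of the preceding
/// segment (or the following one, if the URI leads the text), then coalesce
/// adjacent segments that ended up with the same language.
fn absorb_uri_segments(mut segments: Vec<Segment>, uris: &[(usize, usize)]) -> Vec<Segment> {
    for i in 0..segments.len() {
        let overlaps = uris
            .iter()
            .any(|&(s, e)| segments[i].start < e && s < segments[i].end);
        if overlaps {
            if i > 0 {
                segments[i].language = segments[i - 1].language.clone();
            } else if segments.len() > 1 {
                segments[i].language = segments[i + 1].language.clone();
            }
        }
    }
    let mut merged: Vec<Segment> = Vec::new();
    for seg in segments {
        match merged.last_mut() {
            Some(last) if last.language == seg.language && last.end == seg.start => {
                last.end = seg.end;
            }
            _ => merged.push(seg),
        }
    }
    merged
}
```

With the example from this issue, the ENGLISH segment covering the URL overlaps a detected URI span, so it is relabeled JAPANESE and coalesced with its neighbors into a single segment.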


serega commented Feb 26, 2024

I work with text that may contain URLs. I pre-process documents before feeding them into lingua-rs, using the linkify crate to find URL indices. Finding URLs is a tricky problem in its own right, and there are many ways to do it: linkify returns any string that is valid according to the specs, but there can be false positives. In addition, I validate domain names using the addr crate.
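One way the pre-processing step can work is to blank out each detected URL span before detection, so the detector never sees the URL while byte offsets in its output still line up with the original text. The sketch below assumes the URL spans are already known (e.g. as a linker such as linkify would report them); the masking itself uses only the standard library.

```rust
/// Replace each (start, end) byte span with ASCII spaces. One space per byte
/// keeps every index outside the spans identical between the masked text and
/// the original, so detection results can be mapped straight back.
/// Assumes spans are non-overlapping, sorted, and on char boundaries.
fn mask_spans(text: &str, spans: &[(usize, usize)]) -> String {
    let mut masked = String::with_capacity(text.len());
    let mut cursor = 0;
    for &(start, end) in spans {
        masked.push_str(&text[cursor..start]);
        masked.extend(std::iter::repeat(' ').take(end - start));
        cursor = end;
    }
    masked.push_str(&text[cursor..]);
    masked
}
```

The masked string is what gets handed to the language detector; the original text is kept for display and for slicing out the final segments.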
