question about domains #247

dataf3l · 2021-09-29T10:01:44Z

Hi guys, I love this library.

I have a question:
Sometimes I get domain names as text input, such as freizeit.com, toscanamare.com, or someexample.com. Notice that people don't separate the words in a domain name, as in "frei zeit" or "toscana mare".
When I use a tokenizer so that I can detect the language of the domain, the tokenizer requires me to provide a language, e.g. en.

Is there a library that can, in a multi-language fashion, split a string made of several concatenated words into sub-words, taking a best guess at the language before splitting, so that this library can then do a good job of detecting the language from the text?

I googled "multi-language text split" but I'm not finding good results. I thought maybe you guys have worked on this issue before.

Do you have any hints for me?

Bachstelze · 2021-10-14T10:14:52Z

You could try the SentencePiece model from multilingual language-processing pipelines. But such models work at the subword level, and you will have many possible combinations.
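
For comparison, a dictionary-based split is sometimes enough for short strings like domain names. The following is only a rough sketch of that idea, not something this library provides: the tiny word lists in `VOCABS` and the function names `segment` and `guess_split` are made up for illustration, and a real setup would need large per-language word lists. It tries each language's vocabulary, splits the name with dynamic programming, and keeps the language that covers the whole string with the fewest words.

```python
from typing import Dict, List, Optional, Set, Tuple

# Toy word lists, purely illustrative (an assumption, not real data);
# a real setup would load large per-language word lists.
VOCABS: Dict[str, Set[str]] = {
    "de": {"frei", "zeit", "haus"},
    "it": {"toscana", "mare", "casa"},
    "en": {"some", "example", "free", "time"},
}


def segment(text: str, vocab: Set[str]) -> Optional[List[str]]:
    """Split text into the fewest vocabulary words, or None if impossible."""
    # best[i] holds the shortest split found so far for text[:i].
    best: List[Optional[List[str]]] = [None] * (len(text) + 1)
    best[0] = []
    for end in range(1, len(text) + 1):
        for start in range(end):
            prefix = best[start]
            word = text[start:end]
            if prefix is not None and word in vocab:
                candidate = prefix + [word]
                current = best[end]
                if current is None or len(candidate) < len(current):
                    best[end] = candidate
    return best[len(text)]


def guess_split(
    domain: str, vocabs: Dict[str, Set[str]] = VOCABS
) -> Optional[Tuple[str, List[str]]]:
    """Return (language, words) for the best full split, or None."""
    # Simplification: only the first label is used, so "www." prefixes
    # and multi-part suffixes are not handled.
    name = domain.lower().split(".")[0]
    results = []
    for lang, vocab in vocabs.items():
        words = segment(name, vocab)
        if words is not None:
            results.append((len(words), lang, words))
    if not results:
        return None
    _, lang, words = min(results)  # fewest words wins
    return lang, words


if __name__ == "__main__":
    for d in ("freizeit.com", "toscanamare.com", "someexample.com"):
        print(d, guess_split(d))
```

The split words can then be joined with spaces and passed to the language detector. Names that fit several languages equally well will still be ambiguous, so the guess is best treated as a hint.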
