question about domains #247

Open
dataf3l opened this issue Sep 29, 2021 · 1 comment
Comments


dataf3l commented Sep 29, 2021

Hi guys, I love this library.

I have a question:
Sometimes I get domain names as text input, such as freizeit.com, toscanamare.com, or someexample.com. Notice that the words in a domain name are not separated, as they would be in "frei zeit" or "toscana mare".
When I use a tokenizer in order to detect the language of the domain, the tokenizer requires me to provide a language, e.g. en.

Is there a library that can, in a multi-language fashion, split a word that contains several words into sub-words by taking a best guess at the language before splitting, so that this library can do a good job of detecting the language from the text?

I googled "multi-language text split" but I'm not finding good results. I thought maybe you guys have worked on this issue before.

Do you have any hints for me?

@Bachstelze

You could try the SentencePiece model from a multilingual language processing pipeline. However, these models work at the subword level, so a single domain name can be split in many possible ways.
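
For comparison, here is a rough sketch of a different, simpler approach than SentencePiece: plain dictionary-based segmentation, tried once per language, keeping the language whose word list covers the name with the fewest words. Everything in it is an assumption for illustration: the `VOCABS` word lists are tiny toy sets, and `segment` and `guess_language_and_split` are hypothetical helper names, not part of this library or of any other package. A real setup would need full word lists per language and probably a frequency-based score instead of a word count.

```python
from typing import Dict, List, Optional, Set, Tuple

# Hypothetical toy vocabularies; real use would load full word lists.
VOCABS: Dict[str, Set[str]] = {
    "de": {"frei", "zeit", "haus"},
    "it": {"toscana", "mare", "casa"},
    "en": {"some", "example", "free", "time"},
}


def segment(text: str, vocab: Set[str]) -> Optional[List[str]]:
    """Split text into the fewest vocabulary words, or None if impossible."""
    n = len(text)
    # best[i] holds the shortest known segmentation of text[:i].
    best: List[Optional[List[str]]] = [None] * (n + 1)
    best[0] = []
    for end in range(1, n + 1):
        for start in range(end):
            prefix = best[start]
            word = text[start:end]
            if prefix is not None and word in vocab:
                candidate = prefix + [word]
                current = best[end]
                if current is None or len(candidate) < len(current):
                    best[end] = candidate
    return best[n]


def guess_language_and_split(domain: str) -> Optional[Tuple[str, List[str]]]:
    """Return (language, words) for the best-covering vocabulary, or None."""
    # Simplification: treat everything before the first dot as the name.
    label = domain.lower().split(".")[0]
    results = []
    for lang, vocab in VOCABS.items():
        words = segment(label, vocab)
        if words is not None:
            results.append((len(words), lang, words))
    if not results:
        return None
    # Fewest words wins; ties fall back to the language code's sort order.
    _, lang, words = min(results)
    return lang, words


if __name__ == "__main__":
    for name in ["freizeit.com", "toscanamare.com", "someexample.com"]:
        print(name, guess_language_and_split(name))
```

The guessed language code could then be passed to a tokenizer that needs one, or the split words could be joined with spaces and handed to a language detector. Names that mix languages, or that no word list covers completely, return `None` here and would need a fallback.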
