Training on Arabic language #115

lecidhugo · 2020-12-03T17:01:59Z

Hello,
Is there any documentation or guide on how to train on Arabic?
Is this possible? If yes, what are the requirements?

Thanks in advance,

kermitt2 · 2020-12-03T23:58:39Z

You can create the resources for a new language with https://github.com/kermitt2/grisp
The readme describes the process. It's a Hadoop process that is going to take a few hours.

Once done, you can start an environment for Arabic with entity-fishing; the knowledge base will be built automatically. Then you need to train a ranker and a selector model, as described here -> https://nerd.readthedocs.io/en/latest/train.html#training-with-wikipedia

Loading markupFull is the time-consuming part: it is the DB that stores all the article text content.

You don't need to create embeddings, if I remember well; it should work without them. However, they improve the disambiguation a bit. Creating them is also quite time-consuming (it should take about half a day for Arabic, given the number of articles).

There are 1,080,907 articles in Arabic, which is a pretty big number, so it should be doable and provide decent results.

lecidhugo · 2020-12-08T14:45:31Z

Thank you very much for your kind reply!
I will try to do it.

kermitt2 · 2022-05-03T21:15:25Z

Note that Arabic is now supported by default, with pre-trained models and KB resources available; see the documentation.
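For readers who want to check the Arabic support quickly, here is a minimal sketch of a disambiguation request with the language set to `ar`. It assumes an entity-fishing server running locally and reachable at `http://localhost:8090/service/disambiguate`, and that the service accepts a multipart form field named `query` containing a JSON object with `text` and `language.lang`. The helper names (`build_query`, `encode_multipart`, `disambiguate`) are made up for this illustration and are not part of entity-fishing; check the host, port and request format against the documentation of your installed version.

```python
import json
import urllib.request
import uuid

# Assumption: a local entity-fishing server on its usual port.
# Adjust the host and port to match your installation.
SERVICE_URL = "http://localhost:8090/service/disambiguate"


def build_query(text, lang="ar"):
    """Build the JSON query for a text in the given language (hypothetical helper)."""
    return {"text": text, "language": {"lang": lang}}


def encode_multipart(field, value):
    """Encode one form field as a multipart/form-data body (hypothetical helper)."""
    boundary = uuid.uuid4().hex
    body = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"\r\n\r\n'
        f"{value}\r\n"
        f"--{boundary}--\r\n"
    ).encode("utf-8")
    return body, f"multipart/form-data; boundary={boundary}"


def disambiguate(text, lang="ar", url=SERVICE_URL):
    """POST the query to the service and return the decoded JSON response."""
    # ensure_ascii=False keeps the Arabic text readable in the request body.
    payload = json.dumps(build_query(text, lang), ensure_ascii=False)
    body, content_type = encode_multipart("query", payload)
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

With a server running, a call such as `disambiguate("القاهرة عاصمة مصر")` should return a JSON object; the recognized entities are expected under an `entities` key, but verify the exact response fields against your version before relying on them.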

Lucaterre mentioned this issue Jul 19, 2022: Support for Danish Lucaterre/spacyfishing#9 (Open)

kermitt2 added the implemented label Jul 19, 2022
