Support for Danish #9

Open
MalteHB opened this issue Jul 18, 2022 · 2 comments
Labels
question Further information is requested

Comments

MalteHB commented Jul 18, 2022

I really love this library, and it would be awesome if support for the Danish Wikipedia were added.

What is needed for this to happen?

Owner

Lucaterre commented Jul 19, 2022

Hi @MalteHB !

First of all, thank you for your interest in this extension.

Currently, spaCy fishing relies on entity-fishing version 0.0.5, which supports 11 languages (English, French, German, Spanish, Italian, Arabic, Japanese, Chinese (Mandarin), Russian, Portuguese and Farsi). Unfortunately, Danish is not yet supported by entity-fishing.

Creating the resources for a new language depends strongly on how the entity-fishing tool evolves, and not (directly) on spaCy fishing.

However, if the Danish Wikipedia corpus is large enough, you can create the resources for a new language with the grisp tool and start a new entity-fishing instance for Danish. The full process for initializing a new language with grisp & entity-fishing is described here.

Feel free to open an issue on entity-fishing for more details on this process (maybe this language is already being considered or in progress?).
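
For readers who want to check a language before trying the extension, the coverage described in this comment can be written as a small lookup. This is only a sketch: the language list is the one given above for entity-fishing 0.0.5, the two-letter codes are the standard ISO 639-1 codes for those languages, and the names `SUPPORTED_LANGUAGES` and `is_supported` are hypothetical, not part of spaCy fishing or entity-fishing.

```python
# Languages listed in this thread for entity-fishing 0.0.5.
# Keys are ISO 639-1 codes; this table is an illustration, not library API.
SUPPORTED_LANGUAGES = {
    "en": "English",
    "fr": "French",
    "de": "German",
    "es": "Spanish",
    "it": "Italian",
    "ar": "Arabic",
    "ja": "Japanese",
    "zh": "Chinese (Mandarin)",
    "ru": "Russian",
    "pt": "Portuguese",
    "fa": "Farsi",
}


def is_supported(lang_code: str) -> bool:
    """Return True if the language code is in the list quoted in this thread."""
    return lang_code.lower() in SUPPORTED_LANGUAGES


if __name__ == "__main__":
    for code in ("en", "da"):
        status = "supported" if is_supported(code) else "not supported"
        print(f"{code}: {status}")
```

Danish (`da`) is absent from the table, which is the situation this issue is about.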

Lucaterre added the question label on Jul 19, 2022

kermitt2 commented Nov 7, 2022

Hi @MalteHB & @Lucaterre

(I am the developer of entity-fishing)

There is currently no plan to support Danish because the Danish Wikipedia is too small for decent entity disambiguation: it has 286,583 articles. In my experiments, below about 1M articles it becomes difficult to get adequate coverage of entities, enough statistics and enough disambiguation context examples. The resulting disambiguator would be too limited and inaccurate for practical use.

So I am currently focusing on languages with around 1M articles or more.

However, with some cross-lingual approaches, it might be possible in the future to support languages with limited Wikipedia size.
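
To make the size argument concrete, here is a minimal sketch of the rule of thumb stated in this comment. The 1M figure is the approximate threshold mentioned above and 286,583 is the Danish article count quoted in this thread; the function names and the idea of expressing the gap as a ratio are assumptions added for illustration, not something entity-fishing computes.

```python
# Approximate threshold quoted in this thread (about 1M articles).
MIN_ARTICLES = 1_000_000


def size_ratio(article_count: int, threshold: int = MIN_ARTICLES) -> float:
    """Fraction of the threshold that a Wikipedia edition reaches."""
    return article_count / threshold


def is_large_enough(article_count: int, threshold: int = MIN_ARTICLES) -> bool:
    """Apply the rule of thumb: at least `threshold` articles."""
    return article_count >= threshold


if __name__ == "__main__":
    danish_articles = 286_583  # figure quoted in this thread
    print(f"large enough: {is_large_enough(danish_articles)}")
    print(f"share of threshold: {size_ratio(danish_articles):.1%}")
```

By this heuristic the Danish Wikipedia reaches less than a third of the suggested size, which is why a cross-lingual approach is mentioned as the more likely route.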
