What would it take to support additional languages? #8

Open
abartov opened this issue Feb 15, 2021 · 1 comment
Comments

@abartov

abartov commented Feb 15, 2021

Hello, and thank you for your work on this!

I see two existing requests for additional languages, and I see the maintainer stated they have no time to implement them.

But perhaps you can indicate what it would take to support additional languages? If you could document the main requirements, I am sure people could be found to create pull requests and gradually increase the coverage of this very useful tool!

@faflo
Collaborator

faflo commented Feb 18, 2021

Hey! For additional languages, the main requirement is to run through the dumps server-side to create the initial data, and then to set up the pipeline that continuously listens to the respective event streams and pulls the revision texts. That is basically a repetition of what was done for the previous languages, but it still requires meaningful effort, for which I have no resources right now. There is also not much that can be contributed through pull requests there. In addition, the server we are running this on right now would probably not support more than 2 larger language editions without being upgraded.
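To make the "listening to the event streams" step concrete, here is a minimal sketch of picking one language edition's new revisions out of a server-sent event feed. It is not this project's pipeline. The stream URL, the `database` and `rev_id` field names, and the helper names `parse_sse` and `revisions_for` are assumptions for illustration, and the parser assumes each event's payload fits on a single `data:` line.

```python
import json
from typing import Iterable, Iterator

# Assumed endpoint (Wikimedia EventStreams); it is not named in this thread.
STREAM_URL = "https://stream.wikimedia.org/v2/stream/revision-create"


def parse_sse(lines: Iterable[str]) -> Iterator[dict]:
    """Yield the JSON object carried by each `data:` line of an SSE feed."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip `event:`, `id:`, comment and blank separator lines
        try:
            payload = json.loads(line[len("data:"):])
        except json.JSONDecodeError:
            continue  # ignore partial or malformed payloads
        if isinstance(payload, dict):
            yield payload


def revisions_for(lines: Iterable[str], database: str) -> Iterator[int]:
    """Yield revision ids for one wiki, e.g. database="dewiki" (assumed field names)."""
    for event in parse_sse(lines):
        if event.get("database") == database and "rev_id" in event:
            yield event["rev_id"]


# Offline demonstration with hand-written lines instead of a live connection.
sample = [
    "event: message",
    'data: {"database": "dewiki", "rev_id": 1}',
    "",
    'data: {"database": "enwiki", "rev_id": 2}',
    "data: not json",
    'data: {"database": "dewiki", "rev_id": 3}',
]
print(list(revisions_for(sample, "dewiki")))  # → [1, 3]
```

In a real setup the `lines` argument would come from a long-lived HTTP connection to the stream, and each revision id would then be used to pull the revision text. That part, plus the initial dump processing, is the server-side work described above.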

(A secondary task would be to actually test the output for additional languages, since tokenization is not 100% language-agnostic, although the rest of the algorithm is.)
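As an illustration of why tokenization needs per-language testing, here is a small sketch using a generic word-and-punctuation tokenizer. This is a hypothetical stand-in, not the tokenizer this project uses. It splits English text into words, but returns a Japanese phrase written without spaces as one token.

```python
import re


def simple_tokens(text: str) -> list[str]:
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


print(simple_tokens("the quick brown fox."))
# → ['the', 'quick', 'brown', 'fox', '.']

print(simple_tokens("素早い茶色の狐"))
# → ['素早い茶色の狐']  (no spaces, so the whole phrase is one token)
```

A token that spans a whole phrase would make word-level tracking of changes far coarser for such a language, which is the kind of effect that output testing would need to catch.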

So, bottom line: I have been thinking that it might be better to move the project to Wikimedia servers. The Wikimedia iOS team has also been reaching out, and it seems that the demand for more languages (and the increased number of requests) makes this a good idea.


2 participants