Proposal: New Language Models and Discussion on Norwegian Variants #324

Open
kareglazie opened this issue Mar 5, 2024 · 2 comments

@kareglazie

Hello,

Thank you for your great Lingua crate!

As part of our efforts to adapt Lingua for our production environment and requirements, we've been working on extending its language support. We believe these enhancements can also be beneficial for the wider Lingua community and would like to participate in mainstream development by contributing our changes.

Added Language Models

We have introduced models for the following languages:

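Accuracy in percent, for Lingua's low-accuracy and high-accuracy modes:

| Language | Avg (low) | Single words (low) | Word pairs (low) | Sentences (low) | Avg (high) | Single words (high) | Word pairs (high) | Sentences (high) |
|---|---|---|---|---|---|---|---|---|
| Amharic | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| Burmese | 99 | 100 | 100 | 99 | 100 | 100 | 100 | 100 |
| Chechen | 83 | 77 | 85 | 86 | 86 | 86 | 88 | 86 |
| Kyrgyz | 54 | 37 | 37 | 89 | 58 | 45 | 41 | 89 |
| Malayalam | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| Nepali | 35 | 13 | 26 | 66 | 41 | 21 | 29 | 72 |
| Pashto | 79 | 63 | 76 | 97 | 89 | 7 | 92 | 99 |
| Sanskrit | 40 | 19 | 34 | 67 | 56 | 37 | 49 | 82 |
| Sinhala | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| Sindhi | 66 | 49 | 60 | 89 | 87 | 73 | 89 | 98 |
| Tatar | 43 | 21 | 29 | 80 | 47 | 26 | 34 | 80 |
| Tajik | 79 | 65 | 73 | 98 | 89 | 81 | 85 | 99 |
| Turkmen | 28 | 44 | 16 | 23 | 30 | 48 | 17 | 23 |
| Uzbek | 90 | 82 | 88 | 99 | 96 | 92 | 97 | 99 |
| Lao | 99 | 100 | 100 | 99 | 99 | 99 | 100 | 99 |
| Khmer | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |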
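
The average columns look like the rounded mean of the single-word, word-pair, and sentence values. That is an assumption read off the numbers, not something stated in this issue. A minimal Rust sketch of a sanity check under that assumption follows; `ModeAccuracy` and its methods are hypothetical names for illustration and are not part of Lingua.

```rust
/// One half-row of the accuracy table (either the low- or the high-accuracy
/// mode), with all values in percent.
/// Hypothetical helper type for illustration; not part of the Lingua crate.
#[derive(Debug, Clone, Copy)]
pub struct ModeAccuracy {
    pub average: u32,
    pub single_words: u32,
    pub word_pairs: u32,
    pub sentences: u32,
}

impl ModeAccuracy {
    /// Mean of the three per-category values.
    /// Assumption: the reported average is derived this way.
    pub fn recomputed_average(&self) -> f64 {
        f64::from(self.single_words + self.word_pairs + self.sentences) / 3.0
    }

    /// True when the reported average lies within `tolerance` percentage
    /// points of the recomputed mean. Because every table value is rounded,
    /// a tolerance of 1.0 absorbs the accumulated rounding error.
    pub fn is_consistent(&self, tolerance: f64) -> bool {
        (f64::from(self.average) - self.recomputed_average()).abs() <= tolerance
    }
}
```

With a tolerance of one point, a row such as Chechen in low-accuracy mode (83 against 77, 85, 86) passes, while the Pashto high-accuracy values as listed (89 against 7, 92, 99) do not, so that single-word figure may be worth re-checking.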

Norwegian Language Model Consideration

Additionally, during our development, we identified the need to consolidate the Norwegian language models. Lingua currently supports Bokmål and Nynorsk as two separate languages. However, for our specific use case, a single Norwegian model proved to be more effective. Therefore, we've replaced Bokmål with a more general Norwegian model in our branch.

This change raises an important question for the Lingua project: Would there be interest in adding a unified Norwegian model alongside the existing Bokmål and Nynorsk models, or would you prefer to keep only the two distinct written forms, Bokmål and Nynorsk, as they are represented today? We're open to reverting our Norwegian model to separate Bokmål and Nynorsk models to align with your preferences.

Here's the link to our branch: https://github.com/kareglazie/lingua-rs/tree/new-langs

@pemistahl
Owner

Hi Svetlana,

Thank you for your effort to enhance my library with more languages. This is great. :) Can you please open a pull request? That makes it easier to review your changes and additions and to comment on them.

As for Norwegian, I prefer to treat Bokmål and Nynorsk separately because they are two different written standards of Norwegian. I want my library to be able to differentiate between them.
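
For callers who still want a single Norwegian label, one option is to keep the two models separate and merge the labels after detection. The sketch below illustrates that idea with local stand-in enums; `DetectedLanguage`, `CoarseLanguage`, and `collapse_norwegian` are hypothetical names and are not Lingua's own types or API.

```rust
/// Hypothetical stand-in for the label a detector returns.
/// This is NOT Lingua's `Language` enum; it only mirrors a few variants.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DetectedLanguage {
    Bokmal,
    Nynorsk,
    Danish,
    Swedish,
}

/// Coarser label set in which both written standards count as Norwegian.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CoarseLanguage {
    Norwegian,
    Danish,
    Swedish,
}

/// Maps a fine-grained detection result to the coarser label set.
/// Bokmål and Nynorsk both collapse to `Norwegian`; everything else
/// maps one-to-one.
pub fn collapse_norwegian(language: DetectedLanguage) -> CoarseLanguage {
    match language {
        DetectedLanguage::Bokmal | DetectedLanguage::Nynorsk => CoarseLanguage::Norwegian,
        DetectedLanguage::Danish => CoarseLanguage::Danish,
        DetectedLanguage::Swedish => CoarseLanguage::Swedish,
    }
}
```

This keeps the library able to tell the two variants apart while letting downstream code ignore the distinction where it does not matter.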

@kareglazie
Author

Hello! Thanks for your reply. I've opened the PR and removed the general Norwegian model, so there are now two separate variants again, as in your original crate.
