Find candidate for autonomic translator module #26

Open
vitonsky opened this issue Jul 28, 2021 · 16 comments
Labels
enhancement (New feature or request), help wanted (Extra attention is needed), important (Must do it)

Comments

@vitonsky
Collaborator

vitonsky commented Jul 28, 2021

Currently Linguist uses third-party services to translate text. This is not private, because users send their text to servers run by Google, Yandex, Bing, etc.

We need to let users translate offline, so that their text is never sent over the internet and is processed locally on their machine. This is technically possible right now with custom translators (we have a guide about it), but that is not an embedded solution; it is "for geeks only", so most users will not go this way.

Thus, we have to implement an embedded module for offline text translation.

Requirements for a candidate

  • High translation quality
  • High performance
  • WASM support, or maintainers who are ready to implement it
  • Many languages (at least 6)
  • Support for all translation directions between the supported languages
  • Reasonably small additional files (such as dictionaries or AI models)
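
To get a feel for what the "all directions" requirement implies, here is a minimal sketch that enumerates every ordered language pair. The function name and the language list are assumptions made up for illustration; they are not part of Linguist.

```typescript
// Hypothetical helper: lists every translation direction (ordered pair)
// between the supported languages. "en -> de" is distinct from "de -> en",
// so n languages give n * (n - 1) directions.
type Direction = { from: string; to: string };

function allDirections(languages: string[]): Direction[] {
  const directions: Direction[] = [];
  for (const from of languages) {
    for (const to of languages) {
      if (from !== to) directions.push({ from, to });
    }
  }
  return directions;
}

// Example language list (an assumption, chosen only for illustration).
const exampleLanguages = ["en", "de", "fr", "es", "ru", "ja"];
console.log(allDirections(exampleLanguages).length); // 30
```

With the minimum of 6 languages that is already 30 directions, so if a candidate needs one model file per direction, the size requirement and the directions requirement pull against each other.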
@vitonsky added the enhancement and important labels Jul 28, 2021
@vitonsky
Collaborator Author

New candidate LibreTranslate/LibreTranslate#120

@vitonsky vitonsky pinned this issue Jul 29, 2021
@vitonsky
Collaborator Author

vitonsky commented Aug 8, 2021

Research the OpenNMT/CTranslate2#528

@noaho

noaho commented Aug 9, 2021

Hi, Thank you for your project!

I think your expectations regarding the size and speed of translation engines might be a bit unrealistic. I use m2m_100 to translate subtitles with easynmt on a fast Ryzen box, and it takes up to 24 GB of RAM and only outputs about a sentence a second. The speed you'd need to make this plugin usable would probably require GPU acceleration, and I don't think you can download 5 GB models in a browser plugin and use 20+ GB of RAM.

There are alternatives that use less RAM and CPU, but at the expense of accuracy / language support. There's a good table here under the heading "Available Models": https://github.com/UKPLab/EasyNMT

However, easynmt does provide a REST API that you can host on a fast machine / cloud with Docker; it's under the title "Docker & REST-API" on that easynmt page. It would be awesome if you could support that!
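
As a rough sketch of what calling such a self-hosted server could look like, the helper below only builds the request URL and makes no network request. The `/translate` path and the `target_lang`, `source_lang` and `text` query parameters are assumptions based on the EasyNMT README and should be checked against it; the port and the function name are made up for the example.

```typescript
// Hypothetical request description for a self-hosted translation server.
interface TranslateQuery {
  text: string;
  targetLang: string;
  sourceLang?: string; // omitted: let the server detect the language
}

// Builds a GET URL. Parameter names are assumptions, see the note above.
function buildTranslateUrl(baseUrl: string, query: TranslateQuery): string {
  const params: Array<[string, string]> = [
    ["target_lang", query.targetLang],
    ["text", query.text],
  ];
  if (query.sourceLang !== undefined) {
    params.push(["source_lang", query.sourceLang]);
  }
  const queryString = params
    .map(([key, value]) => `${encodeURIComponent(key)}=${encodeURIComponent(value)}`)
    .join("&");
  // Drop trailing slashes so the path is not doubled.
  return `${baseUrl.replace(/\/+$/, "")}/translate?${queryString}`;
}

console.log(buildTranslateUrl("http://localhost:24080", { text: "Hallo Welt", targetLang: "en" }));
// http://localhost:24080/translate?target_lang=en&text=Hallo%20Welt
```

A translator module in the extension would then fetch this URL and read the translation from the JSON response, whose exact shape also needs to be taken from the server's documentation.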

Anyway take my recommendations with a grain of salt, I'm just an amateur in the world of ML trying to roll my own alternatives to Google :)

@vitonsky
Collaborator Author

vitonsky commented Aug 9, 2021

Thanks for your suggestion @noaho, I will look into it.

I'm not sure about the model sizes. For example, I heard that https://github.com/browsermt/bergamot-translator uses models of 20-80 MB.

At the moment I don't have enough machine learning knowledge to explore the code, and I won't have time to learn it in the near future, but when I do have time, I will work on this issue. I think it is possible to find a translator module that is powerful enough and whose resource requirements allow translating pages offline.

To avoid transferring your personal data, in the near future you could run your own translation server and write your own translator module that uses its local API. But the autonomic translator module must work even if you are an ordinary user with no internet connection who needs to translate a page offline (of course, you can already do this if you translated the page earlier and it is still in the cache).
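
The cache case mentioned above can be sketched as follows. This is a minimal in-memory illustration under assumed names (`TranslationCache` and its methods are hypothetical), not Linguist's actual cache implementation.

```typescript
// Hypothetical in-memory cache of finished translations, keyed by
// direction plus source text, so text translated once can be shown
// again without any network request.
class TranslationCache {
  // Prototype-less object, so only our own keys can ever be found.
  private readonly entries: Record<string, string> = Object.create(null);

  private static key(from: string, to: string, text: string): string {
    // JSON keeps the three parts unambiguous even if the text contains separators.
    return JSON.stringify([from, to, text]);
  }

  get(from: string, to: string, text: string): string | undefined {
    return this.entries[TranslationCache.key(from, to, text)];
  }

  set(from: string, to: string, text: string, translated: string): void {
    this.entries[TranslationCache.key(from, to, text)] = translated;
  }
}

const cache = new TranslationCache();
cache.set("de", "en", "Hallo Welt", "Hello world");
console.log(cache.get("de", "en", "Hallo Welt")); // Hello world
console.log(cache.get("de", "fr", "Hallo Welt")); // undefined
```

The second lookup shows the limit of this approach: a cache only helps for text and directions that were already translated while online, which is why an embedded offline module is still needed.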

Please, if you are in contact with people who understand this, put them in touch with us. Invite them to look at this repository and this issue; maybe they will be interested and can help me find the best candidate, or will even want to help maintain this project and build a translator module on WASM. I really think this translation plugin is the best; I made it because none of the others were good enough. My goal is to make translating pages and any other text in the browser easy, free and convenient, and maybe in the future to bring it to the desktop. Maybe someone is interested in helping maintain this project. Right now I need help with NMT.

@vitonsky added the help wanted label Aug 14, 2021
@vitonsky
Collaborator Author

vitonsky commented Jun 13, 2022

@vitonsky
Collaborator Author

vitonsky commented Jul 2, 2022

@jelmervdl

jelmervdl commented Jul 5, 2022

Hey, I noticed your tweets to the bergamot project!

If you want to try to implement a bergamot backend for Linguist, I'd suggest taking a look at the node.js test script. Unfortunately we don't have good documentation about the API yet. But this is the most concise example of how to use the wasm binary: https://github.com/browsermt/bergamot-translator/blob/main/wasm/node-test.js

You'll need the compiled wasm binary and the javascript helper functions: bergamot-translator-worker.wasm & bergamot-translator-worker.js. I'd suggest just grabbing them from the latest release for now.

If you do want to build it from source, I recommend using Docker, and running something like docker run --rm -v $(pwd):/src -w /src emscripten/emsdk:3.1.8 bash build-wasm.sh. But that doesn't give you anything that's not already in the release.

Mozilla maintains a list of models in their extension: modelRegistry.js. They also have code for downloading & loading the models in translationWorker.js but it's not as concise.

I'm maintaining an experimental version of the Firefox addon that's based on it. The meaty bits are WASMTranslationWorker.js and WASMTranslationHelper.js, which you can also use for inspiration on how to weave it all together. Do note that it uses a different model registry, namely that of translateLocally.

Speaking of translateLocally, that is another way to implement bergamot-translator! If you have translateLocally installed you can communicate with it through the native messaging api. Generally, it's much (much!) faster that way, and you don't have to worry about caching of the models or anything. Downside: you need translateLocally on the machine and you can't ship it as part of the extension.

You can use TLTranslationHelper.js or native_client.py as examples of how to use it. Note that translateLocally needs to know about your extension, otherwise Chrome/Firefox won't allow you to communicate with it. There is a bit of info about it in the README.md, but once this pull request is merged that should be a lot easier.
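
For context on the native messaging channel itself: the browser and the native application exchange UTF-8 encoded JSON messages over stdin/stdout, each preceded by a 32-bit length in native byte order. The sketch below frames and unframes one message. Little-endian byte order is an assumption (it holds on common desktop CPUs), and the message fields are invented for illustration; they are not translateLocally's actual protocol.

```typescript
// Frames a message the way the native messaging wire format expects:
// 4-byte length prefix followed by the UTF-8 encoded JSON payload.
function encodeNativeMessage(message: unknown): Uint8Array {
  const payload = new TextEncoder().encode(JSON.stringify(message));
  const framed = new Uint8Array(4 + payload.length);
  new DataView(framed.buffer).setUint32(0, payload.length, true); // true = little-endian (assumed)
  framed.set(payload, 4);
  return framed;
}

// Reads one framed message back into a JavaScript value.
function decodeNativeMessage(framed: Uint8Array): unknown {
  const view = new DataView(framed.buffer, framed.byteOffset, framed.byteLength);
  const length = view.getUint32(0, true);
  const payload = framed.subarray(4, 4 + length);
  return JSON.parse(new TextDecoder().decode(payload));
}

// Hypothetical request; the field names are made up for this example.
const request = { command: "Translate", text: "Hallo Welt" };
const bytes = encodeNativeMessage(request);
console.log(decodeNativeMessage(bytes)); // same object as `request`
```

Inside an extension the browser does this framing for you when you post a message to the native messaging port, so code like this is only needed on the native application side or in a test client that plays the role of the browser.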


Edit: One major difference I noticed between how Mozilla's translation addon is implemented and Linguist is that Linguist only translates text nodes, while Mozilla passes in chunks of HTML. This allows for the translator to move the HTML around to make sure the markup follows any word reordering.

For websites with little inline HTML, your implementation is probably reasonably sufficient. But it's something to keep in mind. Translating and then merging back the translated HTML is a bit of a challenge and I don't know how many of your translation backends will support it. Google does through their format key, but I don't know about the others.

@vitonsky
Collaborator Author

vitonsky commented Jul 5, 2022

@jelmervdl thank you for your responsiveness. I would be happy to integrate the Bergamot translator into Linguist! I have a few questions to get started.

Documentation

As developers, we have to rely on docs for interaction between code modules. This is especially important for distributed teams. Documentation is our contract: it ensures that we use a module as intended, and when the documented behavior does not match the real behavior, we can file a bug report to fix the difference.

What are the Bergamot project's plans for documentation? How soon will the developers be ready to maintain docs? Do such plans exist?

I am interested in these topics:

  • detailed instructions on how to use the WASM module (and probably notes about optimization and performance)
  • the API of the WASM module
  • detailed information about the existing AI models: status, features, performance, supported languages
  • information about the architecture of Bergamot: how it works, what the core modules do and how they interact, how to build it and how to train models
  • information about current problems that we have to solve

Models

Where can I find the AI models and information about them (features, supported languages, performance, notes)?

How to use it in a browser?

Are there any examples of how it works in a browser? Which browser features does Bergamot require to work?

What about speech processing?

I heard that the Bergamot project has speech processing features such as "text to speech" and "speech to text"; is that right? If so, what is the status of these features? Where can I find information and documentation about them?

I would like to make Linguist a full-featured autonomic translator (as an option at first and, if it works well, as the default), so these features would be useful too.

Could you explain your remarks about the performance of Linguist's interaction with the DOM?

I strive to make Linguist fast, so your opinion is important. You said that the current Linguist approach may be suboptimal for large web pages. Could you explain why, and what you would suggest to fix it?

Right now Linguist just takes the text of DOM nodes, translates it and replaces the values. This approach touches the page as little as possible and keeps references to DOM nodes, which JS frameworks may use to memoize rendering (if a DOM node is replaced, the framework will re-render the whole content of that node).
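
A minimal sketch of this in-place approach, under stated assumptions: `TextNodeLike` is a stand-in for a DOM Text node so the example runs outside a browser, and the synchronous `translate` callback with its toy dictionary is hypothetical (a real translator would be asynchronous). It is not Linguist's actual code.

```typescript
// Stand-in for a DOM Text node: only the property the sketch needs.
interface TextNodeLike {
  nodeValue: string | null;
}

// Hypothetical translate callback, synchronous for simplicity.
type TranslateFn = (text: string) => string;

// Replaces the text of each node without replacing the node itself,
// so references held elsewhere (e.g. by a JS framework) stay valid.
function translateInPlace(nodes: TextNodeLike[], translate: TranslateFn): void {
  for (const node of nodes) {
    const text = node.nodeValue;
    // Skip empty or whitespace-only nodes: nothing to translate.
    if (text === null || text.trim() === "") continue;
    node.nodeValue = translate(text); // same node object, new text
  }
}

// Toy dictionary standing in for a real translator.
const toyDictionary: Record<string, string> = { "Hallo": "Hello", "Welt": "World" };
const pageNodes: TextNodeLike[] = [{ nodeValue: "Hallo" }, { nodeValue: "  " }, { nodeValue: "Welt" }];
translateInPlace(pageNodes, (text) => toyDictionary[text] ?? text);
console.log(pageNodes.map((n) => n.nodeValue)); // "Hello", "  ", "World"
```

Because only `nodeValue` is assigned, the node objects in `pageNodes` are the same objects before and after translation.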

Feel free to share your view of the pros and cons.


I will explore your links; sorry if some questions have already been answered. Let's just collect and organize all the information we have about this to simplify navigation for me and other developers in the future.

@jelmervdl

Re documentation:

For bergamot-translator there is no up-to-date documentation covering the full scope of C++, Python or Node API. And I don't expect that to change in the short term.

bergamot-translator is glue code that allowed us to use the decoding part (translation using a given translation model) in the Firefox extension: it is a wrapper around marian-nmt, exposing a much simpler API. It is still a pretty low level API though. I've tried to document it as practically as possible in the test script.

Re models:

Right now these are the two parties I know of that are training efficient models, and info about them:

These ones are optimised for the quality vs (size + speed) trade-off. In theory any Marian model is compatible. The University of Helsinki has a massive number of them.

Training your own models is an option: https://github.com/mozilla/firefox-translations-training. It is a bit daunting though, and often requires specific changes per language and dataset. And some trial and error.

Re browser:

See my point about documentation, but also my earlier comment. There are two working extensions that use the WASM API, there is the WASM demo code in bergamot-translator, and the annotated test script I just linked. It should be sufficient to piece together how to use it.

Re speech processing:

This is not a feature of this project as far as I'm aware. I think your best bet for that right now is the Web Speech API. That's not guaranteed to be offline and private, unfortunately.

Re performance:

Sorry for the misunderstanding. It was not about performance, but quality. I was trying to explain that bergamot-translator can re-order words. For example, "Jeden Tag esse ich Pizza" translates to "Every day I eat pizza", and in "esse ich" and "I eat" the order is reversed.

Now imagine that there is HTML wrapping that bit: <em>Jeden Tag esse</em> ich Pizza. Linguist translates that in two steps: Jeden Tag esse and ich Pizza. And you will likely get something like "Every day eats" and "I pizza". Still understandable, but could be better in terms of quality.
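
To make the splitting concrete, here is a small sketch of what a text-node-based translator gets to see for that snippet, compared with the full sentence. The regex-based splitter is a deliberately naive assumption for illustration; a real page needs a proper DOM walk, and the function name is made up.

```typescript
// Naive tag splitter, for illustration only: returns the text fragments
// that sit between tags, which is roughly what translating text nodes
// one by one hands to the translator.
function textSegments(html: string): string[] {
  return html
    .split(/<[^>]+>/) // cut at every opening or closing tag
    .filter((segment) => segment.trim() !== "");
}

const sample = "<em>Jeden Tag esse</em> ich Pizza";

// Two independent fragments: "Jeden Tag esse" and " ich Pizza".
console.log(textSegments(sample));

// The sentence the translator would need in order to reorder words:
console.log(sample.replace(/<[^>]+>/g, "")); // Jeden Tag esse ich Pizza
```

Each fragment is translated without knowing about the other, so the verb and subject that need to swap places end up in different requests.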

bergamot-translator accepts the whole sentence (or multiple, it doesn't care) with markup and tries to maintain that markup as words are moved around:

image

But this has exactly the difficulty you mention: it now becomes tricky to match that translated HTML/text with the text nodes that are already in the DOM tree. The Firefox extension tries to do this, but that code is complicated and probably a lot slower than just translating text nodes.

Ignore the bit about translateLocally and native messaging for now. If translation speed is really important, that avenue is great. But even with the WASM module you can translate the visible bit of a page in seconds, and it doesn't require your users to download and install extra software.

@vitonsky
Collaborator Author

Facebook/Meta model: https://github.com/facebookresearch/fairseq/tree/nllb

200 languages, 300 GB; can't be used in a browser, just for research.

@BrightXiaoHan

https://github.com/ggerganov/whisper.cpp
image

We can achieve this goal if we can port open-source translation models to this library.

@vitonsky
Collaborator Author

vitonsky commented Mar 1, 2023

@BrightXiaoHan it looks interesting, thanks for the link, but this project seems to be about speech recognition, whereas in this issue we are trying to find projects for text translation without internet access. That's fine, your link is useful too, because we plan to implement speech input in the future, but if you have links relevant to this topic, please share them.

I will edit the issue to better describe the problem we are trying to solve.

@BrightXiaoHan

BrightXiaoHan commented Mar 1, 2023

@vitonsky
What I mean is that this library implements the Transformer model and its decoding process in a single C file, which can be easily adapted to WASM. It should be easily adaptable to machine translation models. There is also an example of implementing the GPT-2 model in their project.

@vitonsky
Collaborator Author

vitonsky commented Mar 1, 2023

@BrightXiaoHan oh, that is interesting then. I will look into it.

4 participants