WordDumb WrongTranslation Issue #134

Dorisking · 2023-06-23T17:22:47Z

Hi guys, I'm trying to use WordDumb to read Harry Potter, but I get a lot of wrong word meanings. For example, it explains "drills" as a type of strong cotton cloth instead of a hand tool, power tool, or machine with a rotating cutting tip used for making holes. It sometimes chooses a very rare and useless meaning of a word.
How can I adjust the translation settings? Is it related to the dictionary on the Kindle?

xxyzz · 2023-06-24T01:00:32Z

You can click the "other meanings" button to select the correct definition. This plugin matches each word to only one definition. You can also change the default meaning in the plugin's "customize Kindle word wise" window.

xxyzz · 2023-06-30T12:03:25Z

Maybe I could train a machine learning model to match each word to its gloss, and also match person names or locations to a Wikipedia summary or Wikidata item.

Vuizur · 2023-07-09T20:15:16Z

> Maybe I could train a machine learning model to match each word to its gloss, and also match person names or locations to a Wikipedia summary or Wikidata item.

It is a super interesting question. I randomly stumbled upon this problem for my thesis and tried using llama.cpp with an instruction-fine-tuned language model based on Llama, such as Wizard-Vicuna-7B. I simply gave it the task in this format:

Sentence: <sentence>
Question: Which definition of <word> is correct here?
1. <definition>
2. <another definition>
Answer only with a number.
Answer:
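
A minimal sketch of how the prompt format above could be filled in and how a reply could be mapped back to a definition. The function names `build_prompt` and `parse_answer` are hypothetical and not part of WordDumb or llama.cpp, and the call to the model itself is left out.

```python
import re


def build_prompt(sentence, word, definitions):
    """Fill the prompt template with a sentence, a word and its candidate definitions."""
    lines = [
        f"Sentence: {sentence}",
        f"Question: Which definition of {word} is correct here?",
    ]
    # Definitions are numbered from 1, as in the template.
    lines += [f"{i}. {d}" for i, d in enumerate(definitions, start=1)]
    lines += ["Answer only with a number.", "Answer:"]
    return "\n".join(lines)


def parse_answer(reply, n_definitions):
    """Return the 0-based index of the chosen definition, or None if the reply is unusable."""
    match = re.search(r"\d+", reply)
    if match is None:
        return None
    choice = int(match.group())
    return choice - 1 if 1 <= choice <= n_definitions else None
```

Returning `None` for a missing or out-of-range number lets the caller fall back to the default definition instead of trusting a malformed reply.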

I benchmarked it for Russian (copying in a WIP graphic):

[Figure: WIP benchmark graphic, not preserved]

Disclaimer: I benchmarked the association of words with etymologies, not with senses.

(The real accuracy is maybe 5 percent higher, since the test data has a few mistakes.)
So WV7 (Wizard-Vicuna-7B) runs on PCs with 8 GB RAM and Manticore 13B on PCs with 16 GB RAM. ChatGPT aced everything (except one example), but might be a bit too expensive.

In English the results will surely be better. The runtime will probably suck, though, but if the users are very patient it might be workable.

Of course, training a custom model, maybe with synthetic GPT-3.5/4 data, also looks pretty promising. But I have no idea.

This is maybe also interesting, but apparently only works for English (didn't test it): https://github.com/alvations/pywsd

xxyzz · 2023-07-17T15:20:18Z

I think I'll need to take a deep learning course first...

Using an existing model is an easier way to start, but the performance could be bad. Training a model might be unavoidable because the model needs to output customized data (Kindle Word Wise database ID or Wiktionary gloss). For the same reason, pywsd might not be suitable, or maybe I could replace the default gloss data it uses.

The ultimate goal is to find (or build) a model or library that could take a chunk of text and magically mark the words in it with the correct gloss and Wikipedia summary (the output data should also include the token offset location).
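
A rough sketch of what such output data could look like. Everything here is an assumption for illustration: the `Annotation` class, the `annotate` function and the `pick_gloss` callback are hypothetical names, character offsets stand in for the token offsets mentioned above, and the actual disambiguation model is replaced by a callback.

```python
import re
from dataclasses import dataclass


@dataclass
class Annotation:
    start: int    # character offset of the first character of the word
    end: int      # character offset one past the last character
    surface: str  # the word as it appears in the text
    gloss: str    # the gloss chosen for this occurrence


def annotate(text, pick_gloss):
    """Mark each word for which pick_gloss(word, text) returns a gloss.

    pick_gloss is a placeholder for a real WSD model; it returns None
    for words that should not be annotated.
    """
    annotations = []
    for m in re.finditer(r"\w+", text):
        gloss = pick_gloss(m.group(), text)
        if gloss is not None:
            annotations.append(Annotation(m.start(), m.end(), m.group(), gloss))
    return annotations
```

Keeping the offsets next to each gloss means the caller can place the annotation in the book text without searching for the word again.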

Vuizur · 2023-07-20T11:36:51Z

> Using an existing model is an easier way to start, but the performance could be bad. Training a model might be unavoidable because the model needs to output customized data (Kindle Word Wise database ID or Wiktionary gloss). For the same reason, pywsd might not be suitable, or maybe I could replace the default gloss data it uses.

I think large language models such as Llama would work out of the box, but be extremely slow. For WordDumb they would only be viable (but probably still a bit slow) if the user has a GPU with at least 8 GB VRAM, which probably almost nobody has. Unfortunately, Llama's multilingual skills are pretty mediocre compared to its English.

pywsd uses old-school algorithms. If I understood correctly, they could be applied to the Wiktionary data and might not even be too slow, but the accuracy will likely be garbage. (But I don't know a lot about this.)
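
As an illustration of the kind of old-school algorithm meant here, below is a simplified Lesk-style sketch that picks the gloss sharing the most words with the sentence. It does not use pywsd's API, and the stopword list, function names and example glosses are assumptions made up for this sketch.

```python
import re

# A tiny, hand-picked stopword list (an assumption; real systems use larger lists).
STOPWORDS = {"a", "an", "the", "of", "or", "for", "with", "in", "to", "and", "is", "that", "used"}


def tokenize(text):
    """Lowercase the text and return its set of non-stopword words."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}


def simplified_lesk(context, glosses):
    """Return the index of the gloss with the largest word overlap with the context.

    Ties (including zero overlap everywhere) fall back to the first gloss.
    """
    if not glosses:
        raise ValueError("glosses must not be empty")
    context_words = tokenize(context)
    overlaps = [len(context_words & tokenize(g)) for g in glosses]
    return max(range(len(glosses)), key=overlaps.__getitem__)
```

With the two "drills" glosses from the first comment, a sentence like "He needed a machine for making holes, so he bought drills." selects the tool sense, while "His firm made drills." has no overlap with either gloss and falls back to the first one. That fallback is one reason the accuracy of this approach can be poor on short sentences.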

> The ultimate goal is to find (or build) a model or library that could take a chunk of text and magically mark the words in it with the correct gloss and Wikipedia summary (the output data should also include the token offset location).

True. I tried asking GPT-4 to add a short translation in [brackets] after each word of a specific text, and it did what I asked. But it was still a bit buggy, and it will probably hallucinate a lot and give wrong answers with more exotic languages or rarer words.

It might only be a matter of time before something like this gets more viable. 👍

xxyzz · 2023-07-20T14:58:00Z

Using a large language model for WSD may be a bit overkill IMO. I found the EWISER library: https://github.com/SapienzaNLP/ewiser, and it also has a spaCy plugin. Their paper is more recent and I'll see how I could integrate their work. Looks like I have a lot to learn...

The university of the EWISER paper's authors also created babelfy.org, which has almost all the features I need, but it has an API limit (1000 per day).

xxyzz · 2023-08-28T14:47:51Z

I found the state-of-the-art WSD models here: https://paperswithcode.com/sota/word-sense-disambiguation-on-supervised. The best model is ConSeC: https://paperswithcode.com/paper/consec-word-sense-disambiguation-as

But I have never trained a model before and don't have a GPU, so this will take some time...
