Additional dot in translated text #5487

Open
gosha70 opened this issue Apr 21, 2024 · 0 comments
gosha70 commented Apr 21, 2024

❓ Questions and Help

What is your question?

For some English words, the model adds a period (.) to the end of the translation; for example: ok.
See the code below, which produces the following output:

Translating: 'ok'
Translation from eng_Latn to heb_Hebr: 'בסדר, בסדר.'
Translation from eng_Latn to rus_Cyrl: 'Хорошо'
Translation from eng_Latn to fra_Latn: 'Je suis d'accord.'
Translation from eng_Latn to kor_Hang: '괜찮아요'

Translating: 'Ok'
Translation from eng_Latn to heb_Hebr: 'בסדר.'
Translation from eng_Latn to rus_Cyrl: 'Хорошо.'
Translation from eng_Latn to fra_Latn: 'Je suis d'accord.'
Translation from eng_Latn to kor_Hang: '좋아'

Translating: 'OK'
Translation from eng_Latn to heb_Hebr: 'בסדר.'
Translation from eng_Latn to rus_Cyrl: 'Хорошо.'
Translation from eng_Latn to fra_Latn: 'Je suis d'accord.'
Translation from eng_Latn to kor_Hang: '괜찮아요'

Code

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from transformers import pipeline

model_name = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Build the pipeline once; src_lang/tgt_lang can be supplied per call,
# so it does not need to be reconstructed for every translation.
translator = pipeline(task='translation', model=model, tokenizer=tokenizer, max_length=400)

def translate(from_lang: str, to_lang: str, text: str):
    output = translator(text, src_lang=from_lang, tgt_lang=to_lang)
    translated_text = output[0]['translation_text']
    print(f"Translation from {from_lang} to {to_lang}: '{translated_text}'")


texts_to_translate = [
    "ok",
    "Ok",
    "OK"
]

to_langs = [
    "heb_Hebr",
    "rus_Cyrl",
    "fra_Latn",
    "kor_Hang"
]

for text in texts_to_translate:
    print(f"Translating: '{text}'")
    for lang in to_langs:
        translate(from_lang="eng_Latn", to_lang=lang, text=text)
    print("\n\n")  # blank lines between source texts, not after every language

What have you tried?

Tried adding a trailing space; sometimes it helps, but most often it does not.
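
A fallback that would sidestep generation entirely is post-processing: strip the dot whenever the source text itself had no sentence-final punctuation. A minimal sketch (strip_added_dot is a hypothetical helper, not part of transformers):

# Hypothetical helper: drop a trailing period the model added when the
# source text did not end with sentence-final punctuation itself.
def strip_added_dot(source: str, translation: str) -> str:
    if translation.endswith(".") and not source.rstrip().endswith((".", "!", "?")):
        return translation.rstrip(".")
    return translation

print(strip_added_dot("ok", "Хорошо."))  # -> 'Хорошо'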

Tried different combinations of hyperparameters (passed through the pipeline call, as in the sketch after this list); none of them made any difference:

  • max_length: Controls the maximum length of the generated sequence; if generation reaches max_length, the output is cut off. Increasing it might prevent context from being lost but can increase computation time.
  • num_beams: This parameter is used in beam search, which is a strategy for generating text where multiple translation paths are considered at each step. Increasing the number of beams can potentially improve the quality of the output at the cost of more computation.
  • temperature: This parameter controls randomness in the output generation. Lower temperatures make the model outputs more deterministic and conservative, while higher temperatures encourage more diversity but can also introduce more mistakes.
  • top_k: This parameter is used with sampling strategies, limiting the number of highest probability vocabulary tokens to be considered for each step. A lower top_k reduces randomness.
  • top_p (nucleus sampling): This sampling strategy involves selecting the smallest set of tokens whose cumulative probability exceeds the probability p. The model will then only consider this set of tokens for generating the next word. This can lead to more fluent and coherent text generation.
  • repetition_penalty: This parameter discourages the model from repeating the same line verbatim. Adjusting this can help in reducing redundancies in the translations.
  • length_penalty: Adjusts the length of the generated output. Setting this parameter can help if the model consistently generates too short or too long outputs.
  • no_repeat_ngram_size: This parameter prevents the repetition of n-grams. This can be useful to avoid repeated phrases or sentences, which is a common issue in generated text.
  • early_stopping: If set to True, generation will stop if all beam candidates reach the EOS token (end of sequence). This can save computation time without affecting the output quality significantly.
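
For concreteness, a sketch of how these parameters can be forwarded through the pipeline call (values are illustrative, not recommendations; temperature, top_k, and top_p only take effect when do_sample=True):

# Illustrative only: extra kwargs are forwarded to model.generate().
output = translator(
    text,
    num_beams=5,
    do_sample=True,        # required for temperature/top_k/top_p to apply
    temperature=0.7,
    top_k=50,
    top_p=0.9,
    repetition_penalty=1.2,
    length_penalty=1.0,
    no_repeat_ngram_size=3,
    early_stopping=True,
)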

What's your environment?

  • fairseq Version (e.g., 1.0 or main): N/A
  • PyTorch Version (e.g., 1.0): N/A
  • OS (e.g., Linux): macOS (MacBook Pro, Apple M3 Max)
  • How you installed fairseq (pip, source): N/A
  • Build command you used (if compiling from source): N/A
  • Python version: Python 3.11.8
  • CUDA/cuDNN version: N/A
  • GPU models and configuration: N/A
  • Any other relevant information:
    • transformers: 4.40.0