Machine translation - long sentences cause incomplete translation #32

Open
gregorybrooks opened this issue Nov 8, 2021 · 2 comments

@gregorybrooks

I'm translating English sentences into Farsi with mt5-base-parsinlu-translation_en_fa (from Hugging Face). Sentences longer than around 8 words produce a translation of only the first part of the sentence; the rest of the sentence is ignored. For example:

English sentences:

Terry's side fell to their second Premier League loss of the season at Loftus Road

Following a four-day hiatus, UN envoy Ismail Ould Cheikh Ahmed on Thursday will resume mediation efforts in the second round of Kuwait-hosted peace talks between Yemen’s warring rivals.

Mark Woods is a writer and broadcaster who has covered the NBA, and British basketball, for over a decade.

Translations:

طرفدار تری در فوتبال دوم فصل در لئوپوس رود به

پس از چهار روز توقف، سفیر سازمان ملل، ایمیل اولد شیخ

مارک ولز نویسنده و پخش کننده ای است که بیش از یک دهه

which according to Google Translate translates back to this:

More fans in the second football season in Leopard

After a four-day hiatus, the ambassador to the United Nations, Old Sheikh Sheikh

Mark Wells has been a writer and broadcaster for over a decade

I can't find any configuration settings that would limit the number of tokens being translated.
Here is my code:

#!/usr/bin/python3
# Read English lines from stdin, buffer them until an 'EOD' marker, then
# translate the batch and print the Farsi output; an 'EOF' marker exits.
import sys
from transformers import MT5ForConditionalGeneration, MT5Tokenizer
import torch

device = "cuda:0"

model_dir = sys.argv[1] + "persiannlp"
size = "base"
mname = f'{model_dir}/data/mt5-{size}-parsinlu-translation_en_fa'

tokenizer = MT5Tokenizer.from_pretrained(mname)
model = MT5ForConditionalGeneration.from_pretrained(mname)
model = model.to(device)

lines = []
for line in sys.stdin:
    line = line.strip()
    if line == 'EOD':
        # Tokenize the buffered batch and translate it in one generate() call.
        inputs = tokenizer(lines, return_tensors="pt", padding=True).to(device)
        translated = model.generate(**inputs)
        for t in translated:
            print(tokenizer.decode(t, skip_special_tokens=True))
        print('EOL')
        sys.stdout.flush()
        lines.clear()
    elif line.startswith('EOF'):
        sys.exit(0)
    else:
        lines.append(line)
sys.exit(0)
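
If the limit were coming from the decoding defaults rather than from the tokenizer, the place it would be overridden is the model.generate(...) call above. A minimal sketch, assuming a transformers release that accepts max_new_tokens (older versions take max_length instead); it reuses tokenizer, model, device, and lines from the script:

# Sketch only: pass an explicit length cap instead of relying on the default.
# max_new_tokens counts generated tokens only; 200 is an arbitrary illustration.
inputs = tokenizer(lines, return_tensors="pt", padding=True).to(device)
translated = model.generate(**inputs, max_new_tokens=200)
for t in translated:
    print(tokenizer.decode(t, skip_special_tokens=True))
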
@danyaljj

@gregorybrooks sorry for the delayed response! 👋
I am not sure what the root of the issue is, unfortunately.
I tried the online demo here and it seems to match your observations. I am honestly not sure why this is happening.
Just sharing my 2 cents:

  • Data: I think the training data does contain longer sentences (you should be able to verify this).
  • Decoding: it's possible that Hugging Face applies some default decoding strategy/configuration that we don't quite understand (see the sketch after this list).
  • Training: it might be that we made a mistake in training these models. If so, maybe it's worth training your own model from scratch and monitoring its behavior.
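
On the decoding point, one way to check which defaults are in play is to print the generation-related fields of the model config. A minimal sketch, assuming the Hub checkpoint name (substitute the local path from the script above if needed):

# Inspect the decoding defaults that generate() falls back to when no
# explicit arguments are passed (sketch; the checkpoint name is an assumption).
from transformers import MT5ForConditionalGeneration

model = MT5ForConditionalGeneration.from_pretrained(
    "persiannlp/mt5-base-parsinlu-translation_en_fa"
)
print(model.config.max_length)  # length cap used when max_length is not passed
print(model.config.num_beams)   # beam width used by default
print(model.config.do_sample)   # whether sampling is enabled by default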

@ali-abz

ali-abz commented Mar 26, 2022

Same problem when I try it on Google Colab.

from transformers import MT5ForConditionalGeneration, MT5Tokenizer
model_size = "small"
model_name = f"persiannlp/mt5-{model_size}-parsinlu-translation_en_fa"
tokenizer = MT5Tokenizer.from_pretrained(model_name)
model = MT5ForConditionalGeneration.from_pretrained(model_name)

def run_model(input_string, **generator_args):
    # Tokenize the input and print the token ids and the input length.
    input_ids = tokenizer.encode(input_string, return_tensors="pt")
    print(input_ids)
    print(len(input_ids[0]))
    # Generate with the default settings and print the output ids and length.
    res = model.generate(input_ids, **generator_args)
    print(res)
    print(len(res[0]))
    output = tokenizer.batch_decode(res, skip_special_tokens=True)
    print(output)

sent = "The Iran–Iraq War was a protracted armed conflict that began on 22 September 1980 with a full-scale invasion of Iran by neighbouring Iraq. The war lasted for almost eight years, and ended in a stalemate on 20 August 1988, when Iran accepted Resolution 598 of the United Nations Security Council."
run_model(sent)

Result:

tensor([[   486,  19255,   1326,  36986,   4576,    639,    259,    262,    731,
          99155,    345,    259, 178869,  31320,    533,    390,   2739,    351,
           1024,   3258,  17522,    514,    259,    262,   3622,    264,  31749,
            259, 154171,    304,  19255,    455,    259, 134309,    347,    259,
          36986,    260,    486,   2381,   3167,    345,    332,    259,    262,
          28746,  49889,   3127,    261,    305,    259,  57830,    281,    259,
            262,  28604,  79328,    351,    628,   3155,  18494,    261,    259,
           1909,  19255,  12004,    345,    259,  91698, 147677,    304,    287,
           4248,    259,  35577,  19004,  28996,    260,      1]])
79
tensor([[    0, 10948,  4379,   341,   259, 35125,   343,  2665,   259, 11783,
           376,   259, 22838,  7244, 85200, 33040,   376,  3418,   934,   509]])
20
['جنگ ایران و عراق، یک حمله طولانی مسلحانه بود که در']

You can see that the tokenizer is doing a good job, but the model is really limiting the output length. A workaround is to add max_length to the generate() arguments so it produces more tokens:

def run_model(input_string, **generator_args):
    input_ids = tokenizer.encode(input_string, return_tensors="pt")
    print(input_ids)
    print(len(input_ids[0]))
    # Explicit length cap: allow up to 100 generated tokens.
    res = model.generate(input_ids, max_length=100, **generator_args)
    print(res)
    print(len(res[0]))
    output = tokenizer.batch_decode(res, skip_special_tokens=True)
    print(output)

Result:

tensor([[   486,  19255,   1326,  36986,   4576,    639,    259,    262,    731,
          99155,    345,    259, 178869,  31320,    533,    390,   2739,    351,
           1024,   3258,  17522,    514,    259,    262,   3622,    264,  31749,
            259, 154171,    304,  19255,    455,    259, 134309,    347,    259,
          36986,    260,    486,   2381,   3167,    345,    332,    259,    262,
          28746,  49889,   3127,    261,    305,    259,  57830,    281,    259,
            262,  28604,  79328,    351,    628,   3155,  18494,    261,    259,
           1909,  19255,  12004,    345,    259,  91698, 147677,    304,    287,
           4248,    259,  35577,  19004,  28996,    260,      1]])
79
tensor([[     0,  10948,   4379,    341,    259,  35125,    343,   2665,    259,
          11783,    376,    259,  22838,   7244,  85200,  33040,    376,   3418,
            934,    509,   1024,  15140,    636,  68820,  18430, 122748,    768,
           2741, 130744,   8878,    572,    695,   4379,    554,    259,  13361,
            259,  35125,    259,  17213,   3164,    260,  10948,  22625,  59491,
            259,  37033,   3037,    259,  22838,  20275,   1555,    341,    509,
           3939,   2408,    259,  27895,  48129, 153840,    259,  26598,    259,
          14594,    343,    259,   5143,    406,   4379,    259,   9898,    259,
          13727,   1845,  14727,   6916,    572,    916,    259,  30887,   3716,
            260,      1]])
83
['جنگ ایران و عراق، یک حمله طولانی مسلحانه بود که در 22 سپتامبر ۲۰۸۰ با تهاجم کامل از ایران به توسط عراق شروع شد. جنگ تقریبا هشت سال طول کشید و در بیست اوت ۱۹۸۸ پایان یافت، وقتی ایران مجلس امنیت سازمان ملل را قبول کرد.']

max_length is None by default, so there should not be any limit to how many tokens the model generates; I am not sure why this problem exists in the first place.
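
A likely explanation, though this is an assumption based on how generate() behaved in the transformers releases current at the time (before GenerationConfig existed): when max_length is not passed, generate() falls back to model.config.max_length, and that config field defaults to 20, which matches the 20-token output above. Roughly:

# Paraphrase of the fallback inside generate() (sketch, not the actual source):
max_length = max_length if max_length is not None else model.config.max_length
# PretrainedConfig.max_length defaults to 20, i.e. exactly the truncated length seen above.

Passing max_length (or max_new_tokens on newer releases) explicitly, as in the workaround above, bypasses that default.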
