cannot translate the whole paragraph/sentences #5478

Open
junxu-ai opened this issue Apr 16, 2024 · 1 comment

junxu-ai commented Apr 16, 2024

🐛 Bug

When translating from eng_Latn to zho_Hant, parts of the input are always left untranslated. This does not happen with zho_Hans. Even yue_Hant gives better results than zho_Hant.

To Reproduce

Steps to reproduce the behavior (always include the command you ran):

import ctranslate2
import transformers

src_lang = "eng_Latn"
tgt_lang = "zho_Hant"

translator = ctranslate2.Translator("nllb-200-distilled-600M")
tokenizer = transformers.AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M", src_lang=src_lang)

content = """
A database of Chinese surnames and Chinese given names (1930-2008). This database contains nationwide frequency statistics of 1,806 Chinese surnames and 2,614 Chinese characters used in given names, covering about 1.2 billion Han Chinese population (96.8% of the Han Chinese household-registered population born from 1930 to 2008 and still alive in 2008). This package also contains a function for computing multiple features of Chinese surnames and Chinese given names for scientific research (e.g., name uniqueness, name gender, name valence, and name warmth/competence).
"""
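# Tokenize the source text into the subword tokens that CTranslate2 expects.
source = tokenizer.convert_ids_to_tokens(tokenizer.encode(content))
# The target language code is passed as the first target token.
target_prefix = [tgt_lang]
results = translator.translate_batch([source], target_prefix=[target_prefix])
# Take the best hypothesis and drop its leading language token.
target = results[0].hypotheses[0][1:]

print(tokenizer.decode(tokenizer.convert_tokens_to_ids(target)))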

In this case, the first sentence, "A database of Chinese surnames and Chinese given names (1930-2008).", is not translated. The same issue occurs when using transformers alone, without CTranslate2.
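To narrow down which sentence gets dropped, it may help to translate the paragraph one sentence at a time and compare the pieces with the full-paragraph output. The following is only a rough sketch: `split_sentences`, `translate_by_sentence` and `make_nllb_translator` are hypothetical helper names, the regex splitter is deliberately naive, and the transformers-only path assumes the usual `AutoModelForSeq2SeqLM` / `forced_bos_token_id` usage for NLLB. Whether per-sentence translation actually avoids the missing text is not something this report confirms.

```python
import re
from typing import Callable, List


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def translate_by_sentence(text: str, translate_one: Callable[[str], str]) -> List[str]:
    """Translate each sentence separately so a dropped sentence is easy to spot."""
    return [translate_one(sentence) for sentence in split_sentences(text)]


def make_nllb_translator(
    model_name: str = "facebook/nllb-200-distilled-600M",
    src_lang: str = "eng_Latn",
    tgt_lang: str = "zho_Hant",
) -> Callable[[str], str]:
    """Build a transformers-only translate function (downloads the model when called)."""
    import transformers  # imported lazily so the helpers above work without it

    tokenizer = transformers.AutoTokenizer.from_pretrained(model_name, src_lang=src_lang)
    model = transformers.AutoModelForSeq2SeqLM.from_pretrained(model_name)
    bos_id = tokenizer.convert_tokens_to_ids(tgt_lang)

    def translate_one(sentence: str) -> str:
        inputs = tokenizer(sentence, return_tensors="pt")
        # Force the decoder to start with the target language token.
        output = model.generate(**inputs, forced_bos_token_id=bos_id, max_new_tokens=256)
        return tokenizer.batch_decode(output, skip_special_tokens=True)[0]

    return translate_one
```

With the model available, `translate_by_sentence(content, make_nllb_translator())` would return one translation per input sentence, which can then be checked against the output for the whole paragraph.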

@junxu-ai

This also happens with nllb-200-distilled-1.3B.
