🐛 Bug
When translating from eng_Latn to zho_Hant, parts of the source text are consistently left untranslated. This does not happen with zho_Hans, and even yue_Hant gives better results than zho_Hant.
To Reproduce
Steps to reproduce the behavior (always include the command you ran):
import ctranslate2
import transformers

src_lang = "eng_Latn"
tgt_lang = "zho_Hant"

# Load the converted CTranslate2 model and the matching Hugging Face tokenizer.
translator = ctranslate2.Translator("nllb-200-distilled-600M")
tokenizer = transformers.AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M", src_lang=src_lang)

content = """
A database of Chinese surnames and Chinese given names (1930-2008). This database contains nationwide frequency statistics of 1,806 Chinese surnames and 2,614 Chinese characters used in given names, covering about 1.2 billion Han Chinese population (96.8% of the Han Chinese household-registered population born from 1930 to 2008 and still alive in 2008). This package also contains a function for computing multiple features of Chinese surnames and Chinese given names for scientific research (e.g., name uniqueness, name gender, name valence, and name warmth/competence).
"""

# Tokenize the whole paragraph as a single input sequence.
source = tokenizer.convert_ids_to_tokens(tokenizer.encode(content))
# Force decoding to start with the target language token.
target_prefix = [tgt_lang]
results = translator.translate_batch([source], target_prefix=[target_prefix])
# Drop the leading target language token from the best hypothesis.
target = results[0].hypotheses[0][1:]
print(tokenizer.decode(tokenizer.convert_tokens_to_ids(target)))
In this case, the first sentence, "A database of Chinese surnames and Chinese given names (1930-2008).", is not translated. The same issue occurs when using transformers alone, without CTranslate2.
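To narrow down which sentence gets dropped, it may help to translate the paragraph one sentence at a time and compare each source sentence with its output. The following is only a rough sketch: `split_sentences`, `translate_by_sentence`, and `make_ctranslate2_fn` are hypothetical helper names (not part of CTranslate2 or transformers), and the regex splitter is a naive assumption that happens to work for the paragraph above.

```python
import re
from typing import Callable, List, Tuple


def split_sentences(text: str) -> List[str]:
    """Naive splitter (assumption): break on whitespace that follows '.', '!' or '?'."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def translate_by_sentence(
    text: str, translate_fn: Callable[[str], str]
) -> List[Tuple[str, str]]:
    """Return (source sentence, translation) pairs so a dropped sentence is easy to spot."""
    return [(sentence, translate_fn(sentence)) for sentence in split_sentences(text)]


def make_ctranslate2_fn(translator, tokenizer, tgt_lang: str) -> Callable[[str], str]:
    """Hypothetical wrapper around the same calls used in the reproduction script."""

    def translate_fn(sentence: str) -> str:
        source = tokenizer.convert_ids_to_tokens(tokenizer.encode(sentence))
        results = translator.translate_batch([source], target_prefix=[[tgt_lang]])
        # Skip the leading target language token before decoding.
        target = results[0].hypotheses[0][1:]
        return tokenizer.decode(tokenizer.convert_tokens_to_ids(target))

    return translate_fn
```

With the objects from the script above, `translate_by_sentence(content, make_ctranslate2_fn(translator, tokenizer, tgt_lang))` would give one pair per sentence. If every sentence translates on its own, that would suggest the omission is tied to multi-sentence input rather than to the content of the first sentence.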