UnicodeDecodeError for multiple models #100

Open
nikit91 opened this issue Apr 30, 2024 · 0 comments
Hello,

I am getting the following UnicodeDecodeError:

  File "/usr/src/app/server.py", line 188, in <module>
    application = make_app(args)
  File "/usr/src/app/server.py", line 166, in make_app
    worker_pool = initialize_workers(services)
  File "/usr/src/app/server.py", line 147, in initialize_workers
    worker_pool[lang_pair] = TranslatorInterface(
  File "/usr/src/app/server.py", line 17, in __init__
    self.contentprocessor = ContentProcessor(
  File "/usr/src/app/content_processor.py", line 18, in __init__
    self.bpe_source = BPE(BPEcodes)
  File "/usr/src/app/apply_bpe.py", line 37, in __init__
    firstline = codes.readline()
  File "/usr/local/lib/python3.9/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc0 in position 54: invalid start byte

for the following models:

"it-en" : "https://object.pouta.csc.fi/OPUS-MT-models/it-en/opus-2019-12-18.zip" # SentencePiece
"ja-en" : "https://object.pouta.csc.fi/OPUS-MT-models/ja-en/opus-2019-12-18.zip" # SentencePiece
"id-en" : "https://object.pouta.csc.fi/OPUS-MT-models/id-en/opus-2019-12-18.zip" # SentencePiece
"bn-en" : "https://object.pouta.csc.fi/OPUS-MT-models/bn-en/opus-2020-02-11.zip" # SentencePiece
"et-en" : "https://object.pouta.csc.fi/OPUS-MT-models/et-en/opus-2019-12-18.zip" # SentencePiece
"lv-en" : "https://object.pouta.csc.fi/OPUS-MT-models/lv-en/opus-2019-12-18.zip" # SentencePiece
"th-en" : "https://object.pouta.csc.fi/OPUS-MT-models/th-en/opus-2020-01-16.zip" # SentencePiece
"uk-en" : "https://object.pouta.csc.fi/OPUS-MT-models/uk-en/opus-2020-01-16.zip" # SentencePiece

For all of them except "lv-en", the error goes away when I switch to the BPE model. However, according to the shared metrics, the SentencePiece models are the ones with better translation performance.
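A minimal diagnostic sketch, assuming (unconfirmed) that the failure comes from apply_bpe.py opening a non-text file, for example a binary SentencePiece model, as UTF-8 text. The helper name `looks_like_utf8_text` and the probe size are hypothetical and not part of the server code; the check only shows whether a given file can be read the way `codes.readline()` tries to read it.

```python
import codecs


def looks_like_utf8_text(path, probe_bytes=4096):
    """Return True if the first probe_bytes of the file decode as UTF-8.

    Hypothetical helper: a BPE codes file is plain text, so it should pass;
    a binary file containing bytes such as 0xc0 should fail.
    """
    with open(path, "rb") as fh:
        data = fh.read(probe_bytes)
    # An incremental decoder with final=False tolerates a multi-byte
    # character that is cut off at the end of the probe window.
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        decoder.decode(data, final=False)
    except UnicodeDecodeError:
        return False
    return True
```

Running this on each file extracted from a model zip would show which ones are binary. If the file handed to `BPE(...)` fails the check, the model probably needs the SentencePiece code path rather than the BPE one, but that is an assumption about how the server is configured.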

Please let me know if I am doing something wrong.
