Add a new option to allow manually setting encoding #269

Closed
wants to merge 3 commits into from

imgaojun
Contributor

@imgaojun imgaojun commented Dec 14, 2019

Resolves #270.
I cannot load a UTF-8 file when building my vocabulary or loading my dataset, because GBK is used as the default encoding on Windows. I added a new option that allows setting the encoding manually in PairedTextData.

$ python main.py 
Traceback (most recent call last):
  File "main.py", line 62, in <module>
    main()
  File "main.py", line 28, in main
    hparams=config_data.train, device=device)
  File "C:\Users\gaojun4ever\Miniconda3\lib\site-packages\texar\torch\data\data\paired_text_data.py", line 140, in __init__
    eos_token=src_hparams.eos_token)
  File "C:\Users\gaojun4ever\Miniconda3\lib\site-packages\texar\torch\data\vocabulary.py", line 103, in __init__
    = self.load(self._filename)
  File "C:\Users\gaojun4ever\Miniconda3\lib\site-packages\texar\torch\data\vocabulary.py", line 119, in load
    vocab = list(line.strip() for line in vocab_file)
  File "C:\Users\gaojun4ever\Miniconda3\lib\site-packages\texar\torch\data\vocabulary.py", line 119, in <genexpr>
    vocab = list(line.strip() for line in vocab_file)
UnicodeDecodeError: 'gbk' codec can't decode byte 0x8c in position 2: illegal multibyte sequence
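
The failure above can be reproduced on any platform by decoding UTF-8 bytes with the GBK codec explicitly. The following is a minimal sketch; the sample character and the `try_decode` helper are assumptions chosen for illustration and are not taken from the reporter's vocabulary file or from Texar.

```python
# Minimal sketch: UTF-8 encoded text is, in general, not valid GBK.
# The sample text is an illustrative assumption, not the reporter's data.
sample = "\u4e2d\n"  # one CJK character followed by a newline
utf8_bytes = sample.encode("utf-8")  # b'\xe4\xb8\xad\n'


def try_decode(data: bytes, encoding: str) -> str:
    """Return the decoded text, or a short description of the failure."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return f"failed with {encoding}: {exc.reason}"


utf8_result = try_decode(utf8_bytes, "utf-8")
gbk_result = try_decode(utf8_bytes, "gbk")

print("utf-8 round trip ok:", utf8_result == sample)
print(gbk_result)
```

The first two bytes happen to form a valid GBK pair, so the decoder only fails on the third byte, which matches the "position 2" reported in the traceback.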

@gpengzhi
Collaborator

Could you fix the CI error? CI is currently failing because the modified lines of code are too long (> 80 characters). Also, if we change the code, we need to update the corresponding documentation accordingly.

@huzecong
Collaborator

I guess we could add encoding arguments for functions where we perform file I/O.

Another question is, would it be helpful to set the default to UTF-8, regardless of platform and locale? IIRC the default encoding on Windows is locale dependent (e.g. GBK for Chinese systems).
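
A minimal sketch of the idea discussed here, assuming a hypothetical helper `load_vocab_lines` (it is not part of Texar's API): the `encoding` argument is passed through to `open()` and defaults to UTF-8, so the result no longer depends on the platform locale.

```python
import locale
import os
import tempfile
from typing import List


def load_vocab_lines(path: str, encoding: str = "utf-8") -> List[str]:
    """Hypothetical helper: read one token per line with an explicit encoding."""
    with open(path, encoding=encoding) as vocab_file:
        return [line.strip() for line in vocab_file]


# The locale's preferred encoding, which open() typically falls back to
# when no encoding is passed (for example GBK on a Chinese Windows system).
print("locale default:", locale.getpreferredencoding(False))

# Write a small UTF-8 vocabulary file and read it back with the helper.
with tempfile.TemporaryDirectory() as tmp_dir:
    vocab_path = os.path.join(tmp_dir, "vocab.txt")
    with open(vocab_path, "w", encoding="utf-8") as f:
        f.write("\u4f60\u597d\n\u4e16\u754c\n")
    tokens = load_vocab_lines(vocab_path)

print(len(tokens), "tokens loaded")
```

Callers with files in another encoding could still pass it explicitly, for example `load_vocab_lines(path, encoding="gbk")`.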

@imgaojun imgaojun closed this Dec 17, 2019
@imgaojun imgaojun deleted the default-encoding branch December 17, 2019 02:45
Development

Successfully merging this pull request may close these issues.

Encoding error on windows
4 participants