Fix encoding problem. #9

JiaxiangBU · 2020-09-02T10:12:00Z

Since the input and output text is in Chinese, I added an explicit encoding to the open() calls. Without it, I get the following error:

>> Synonyms load wordseg dict [D:\installed\miniconda\lib\site-packages\synonyms\data\vocab.txt] ...
Building prefix dict from D:\installed\miniconda\lib\site-packages\synonyms\data\vocab.txt ...
Loading model from cache C:\Users\LIJIAX~1\AppData\Local\Temp\jieba.u24e2f9dc467017ec363179dba6484c45.cache
Loading model cost 1.352 seconds.
Prefix dict has been built successfully.
>> Synonyms on loading stopwords [D:\installed\miniconda\lib\site-packages\synonyms\data\stopwords.txt] ...
>> Synonyms on loading vectors [D:\installed\miniconda\lib\site-packages\synonyms\data\words.vector] ...
D:\installed\miniconda\lib\site-packages\smart_open\smart_open_lib.py:254: UserWarning: This function is deprecated, use smart_open.open instead. See the
migration notes for details: https://github.com/RaRe-Technologies/smart_open/blob/master/README.rst#migrating-to-the-new-open-function
  'See the migration notes for details: %s' % _MIGRATION_NOTES_URL
Traceback (most recent call last):
  File "code/augment.py", line 54, in <module>
    gen_eda(args.input, output, alpha=alpha, num_aug=num_aug)
  File "code/augment.py", line 38, in gen_eda
    lines = open(train_orig, 'r').readlines()
UnicodeDecodeError: 'gbk' codec can't decode byte 0x80 in position 24: illegal multibyte sequence
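
For context, the traceback shows open() falling back to the platform default codec (GBK on this machine) because no encoding was passed. Below is a minimal sketch of the kind of change involved, assuming the data files are UTF-8. The helper names read_lines_utf8 and write_lines_utf8 are hypothetical and are not taken from code/augment.py.

```python
import os
import tempfile


def read_lines_utf8(path):
    # Pass the encoding explicitly so the result does not depend on the
    # platform default (assumption: the file on disk is UTF-8).
    with open(path, "r", encoding="utf-8") as f:
        return f.readlines()


def write_lines_utf8(path, lines):
    # Write with the same explicit encoding so the output can be read back.
    with open(path, "w", encoding="utf-8") as f:
        f.writelines(lines)


# Round-trip a small Chinese sample through a temporary file.
sample = ["0\t今天天气不错哦。\n", "1\t今天天气不行啊！不能出去玩了。\n"]
tmp_path = os.path.join(tempfile.mkdtemp(), "train.txt")
write_lines_utf8(tmp_path, sample)
round_trip = read_lines_utf8(tmp_path)
```

If the input file were saved in another encoding, that encoding would have to be passed instead of utf-8.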

I also found that the input must look like this, with no blank lines between records:

0	今天天气不错哦。
1	今天天气不行啊！不能出去玩了。
0	又是阳光明媚的一天！

instead of

0	今天天气不错哦。

1	今天天气不行啊！不能出去玩了。

0	又是阳光明媚的一天！

A blank line has no tab-separated second field, so parts[1] does not exist and the script fails with the following error:

Traceback (most recent call last):
  File "code/augment.py", line 54, in <module>
    gen_eda(args.input, output, alpha=alpha, num_aug=num_aug)
  File "code/augment.py", line 44, in gen_eda
    sentence = parts[1]
IndexError: list index out of range
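
As an alternative to requiring an input file without blank lines, the reader could skip them. This is a sketch only, assuming each record is a label and a sentence separated by a tab. The function name parse_labeled_lines is hypothetical and this PR does not make this change.

```python
def parse_labeled_lines(lines):
    """Return (label, sentence) pairs, ignoring blank or malformed lines."""
    pairs = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip():
            # Blank line: nothing to split, so skip it instead of indexing.
            continue
        parts = line.split("\t", 1)
        if len(parts) < 2:
            # No tab separator: skipped here (an assumption); raising an
            # error with the offending line would be another option.
            continue
        pairs.append((parts[0], parts[1]))
    return pairs


raw = [
    "0\t今天天气不错哦。\n",
    "\n",
    "1\t今天天气不行啊！不能出去玩了。\n",
    "\n",
    "0\t又是阳光明媚的一天！\n",
]
pairs = parse_labeled_lines(raw)
```

Here pairs holds three records and the two blank lines are dropped.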

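I fixed the encoding problem because the input and output here are both Chinese (non-English) text. I also found that train.txt must not contain blank lines between records.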

JiaxiangBU added 2 commits September 2, 2020 18:01

fix encoding problem. @gaowenxin85

b52b6e6

keep repo compact.

aa8ef67
