Fix encoding problem. #9

Open
wants to merge 2 commits into
base: master
JiaxiangBU

Since the input and output text is Chinese, I added an explicit encoding to the open calls. Without it, I get this error:

>> Synonyms load wordseg dict [D:\installed\miniconda\lib\site-packages\synonyms\data\vocab.txt] ...
Building prefix dict from D:\installed\miniconda\lib\site-packages\synonyms\data\vocab.txt ...
Loading model from cache C:\Users\LIJIAX~1\AppData\Local\Temp\jieba.u24e2f9dc467017ec363179dba6484c45.cache
Loading model cost 1.352 seconds.
Prefix dict has been built successfully.
>> Synonyms on loading stopwords [D:\installed\miniconda\lib\site-packages\synonyms\data\stopwords.txt] ...
>> Synonyms on loading vectors [D:\installed\miniconda\lib\site-packages\synonyms\data\words.vector] ...
D:\installed\miniconda\lib\site-packages\smart_open\smart_open_lib.py:254: UserWarning: This function is deprecated, use smart_open.open instead. See the
migration notes for details: https://github.com/RaRe-Technologies/smart_open/blob/master/README.rst#migrating-to-the-new-open-function
  'See the migration notes for details: %s' % _MIGRATION_NOTES_URL
Traceback (most recent call last):
  File "code/augment.py", line 54, in <module>
    gen_eda(args.input, output, alpha=alpha, num_aug=num_aug)
  File "code/augment.py", line 38, in gen_eda
    lines = open(train_orig, 'r').readlines()
UnicodeDecodeError: 'gbk' codec can't decode byte 0x80 in position 24: illegal multibyte sequence
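
The traceback shows open(train_orig, 'r') falling back to the platform default codec (gbk here). A minimal sketch of the kind of change described, assuming the data files are UTF-8 (the log does not state the file's actual encoding); the helper names read_lines and write_lines are hypothetical and not taken from augment.py:

```python
def read_lines(path, encoding="utf-8"):
    """Read all lines using an explicit encoding instead of the platform default."""
    # Assumption: the file is UTF-8. Pass a different encoding if it is not.
    with open(path, "r", encoding=encoding) as f:
        return f.readlines()


def write_lines(path, lines, encoding="utf-8"):
    """Write lines using the same explicit encoding so output round-trips."""
    with open(path, "w", encoding=encoding) as f:
        f.writelines(lines)
```

Using the same encoding for reading and writing keeps the augmented output readable regardless of the locale the script runs under.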

I also found that the input must be formatted like this, with no blank lines between records:

0	今天天气不错哦。
1	今天天气不行啊!不能出去玩了。
0	又是阳光明媚的一天!

instead of

0	今天天气不错哦。

1	今天天气不行啊!不能出去玩了。

0	又是阳光明媚的一天!

A blank line has no tab-separated second field, so parts[1] does not exist and the script fails with the following error:

Traceback (most recent call last):
  File "code/augment.py", line 54, in <module>
    gen_eda(args.input, output, alpha=alpha, num_aug=num_aug)
  File "code/augment.py", line 44, in gen_eda
    sentence = parts[1]
IndexError: list index out of range
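
One way to tolerate blank lines, rather than requiring the input to have none, is to skip them before indexing. This is a sketch only and is not part of this change; it assumes each record is a label and a sentence separated by a tab, as in the sample input, and the function name parse_labeled_lines is hypothetical:

```python
def parse_labeled_lines(lines):
    """Yield (label, sentence) pairs, skipping blank or malformed lines."""
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip():
            # Blank line: nothing to augment.
            continue
        parts = line.split("\t", 1)
        if len(parts) < 2:
            # No tab separator, so there is no sentence field to read.
            continue
        yield parts[0], parts[1]
```

Silently skipping malformed lines is a design choice; logging or raising a clearer error may be preferable when the data is expected to be clean.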

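In short: I fixed the encoding problem, since the input and output here are both Chinese (non-English) text. I also found that train.txt must not contain blank lines between records.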
