UTF-8 issues regarding UTF-8 conversion and others #226

Open
yuliu opened this issue Jul 21, 2019 · 1 comment

Comments

@yuliu
Member

yuliu commented Jul 21, 2019

Preface. I run a Discuz! forum with a Chinese user base and have written a converter for it. While working on the users module, I found that the Merge System does not convert usernames correctly. Looking into the encode_to_utf8() function in ./merge/resources/functions.php and check_for_duplicates() in ./merge/resources/modules/users, I found what may be causing the problem.

BTW, I've written a small script to show the problem visually. It simulates how the Merge System and MyBB use some of these functions, and assumes that you want conversion to UTF-8 and that the mb_* and iconv functions exist. For it to work, save the file in ANSI encoding in a text editor, not in UTF-8. Oh, I'm running PHP 5.5 right now.

Let's go on:

  • The UTF-8 conversion works in our demo because iconv is given the correct encoding of the original string.

  • The culprit in the UTF-8 encoding problem is that no encoding is passed to the mb_* functions when they handle a string whose encoding is not UTF-8. When no encoding is given, an mb_* function falls back to mb_internal_encoding(), which the Merge System sets to UTF-8, hard-coded in ./merge/index.php. encode_to_utf8() uses MyBB's my_strlen().

  • The duplicate-user check has the same problem: it uses MyBB's my_strtolower(), in which no encoding is given to an mb_* function.

  • A quick fix is to pass an encoding to the mb_* functions when calling them. But the encoding naming convention used by mbstring differs from iconv's, so we have to do a bit more work.
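To make the quick fix concrete, here is a minimal sketch of passing the encoding explicitly. The helper names strlen_in() and to_utf8() are hypothetical and are not part of the Merge System or MyBB; the sketch assumes the mbstring extension is loaded and uses CP936, the mbstring name for GBK.

```php
<?php
// Sketch only: strlen_in() and to_utf8() are hypothetical helpers,
// not Merge System or MyBB functions. Assumes mbstring is loaded.

// Mirror the hard-coded setting described above.
mb_internal_encoding('UTF-8');

// Count characters using the string's real encoding, not the internal default.
function strlen_in($string, $encoding)
{
    return mb_strlen($string, $encoding);
}

// Convert to UTF-8, naming the source encoding explicitly.
function to_utf8($string, $encoding)
{
    return mb_convert_encoding($string, 'UTF-8', $encoding);
}

// "中文" as GBK bytes, written as escapes so the file's own encoding does not matter.
$gbk_name = "\xD6\xD0\xCE\xC4";

var_dump(mb_check_encoding($gbk_name, 'UTF-8')); // are the raw bytes valid UTF-8?
var_dump(strlen_in($gbk_name, 'CP936'));         // length counted as GBK characters
var_dump(bin2hex(to_utf8($gbk_name, 'CP936')));  // UTF-8 bytes after conversion
```

Writing the sample string as byte escapes sidesteps the "save as ANSI" requirement of the demo script.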

However, I still can't understand why two different lowercase functions are applied to a username, as in:
my_strtolower($duplicate_user['username']) == strtolower($encoded_username)

Maybe this issue is caused by that, too. I've written my own versions of the affected functions in my converter's and module's classes, without modifying the base Merge System. I may open a pull request if we conclude that this is a wrong usage of mb_*.
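As an illustration of why mixing the two lowercase functions can matter, here is a small sketch. usernames_match() is a hypothetical helper, not an existing Merge System function, and the sample usernames are made up; it assumes mbstring is loaded and that both strings are already UTF-8.

```php
<?php
// Sketch only: usernames_match() is a hypothetical helper. Assumes mbstring is loaded.

// Lowercase both sides with the same function and the same explicit encoding.
function usernames_match($a, $b, $encoding = 'UTF-8')
{
    return mb_strtolower($a, $encoding) === mb_strtolower($b, $encoding);
}

$existing = "\xC3\x9CNAL"; // "ÜNAL" in UTF-8
$incoming = "\xC3\xBCnal"; // "ünal" in UTF-8

// Consistent comparison: both sides go through mb_strtolower().
var_dump(usernames_match($existing, $incoming));

// Mixed comparison in the style of the line quoted above. strtolower() works
// byte by byte, so the two sides may disagree on non-ASCII letters.
var_dump(mb_strtolower($incoming, 'UTF-8') == strtolower($existing));
```

How strtolower() treats bytes above 127 depends on the PHP version and locale, so the second result is not something to rely on.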

@yuliu
Member Author

yuliu commented Jul 22, 2019

I've put together a list of encodings across MySQL charsets, iconv encodings and mbstring encodings here; the list is incomplete.

Taking Chinese character encodings as an example, their names and support status differ across these pieces of software. Maybe we can move some of the encoding detection into the board converter's class; otherwise users have to write their own encode_to_utf8/strlen/strtolower functions when the encoding goes wrong.
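One possible shape for such a lookup, as a rough sketch: encoding_names_for() and its table are hypothetical, and the few entries shown are an illustrative subset that has not been checked against the list mentioned above. It assumes mbstring is loaded, and only calls iconv when that extension is available.

```php
<?php
// Sketch only: the table and encoding_names_for() are hypothetical, and the
// entries are an illustrative subset, not the full list referred to above.
function encoding_names_for($mysql_charset)
{
    static $map = array(
        'gbk'     => array('iconv' => 'GBK',    'mbstring' => 'CP936'),
        'gb2312'  => array('iconv' => 'GB2312', 'mbstring' => 'EUC-CN'),
        'big5'    => array('iconv' => 'BIG5',   'mbstring' => 'BIG-5'),
        'utf8'    => array('iconv' => 'UTF-8',  'mbstring' => 'UTF-8'),
        'utf8mb4' => array('iconv' => 'UTF-8',  'mbstring' => 'UTF-8'),
    );
    $key = strtolower($mysql_charset);
    // Unknown charsets return null so the caller can fall back or report an error.
    return isset($map[$key]) ? $map[$key] : null;
}

$names = encoding_names_for('gbk');
$gbk   = "\xD6\xD0\xCE\xC4"; // "中文" as GBK bytes

if ($names !== null) {
    // Each library gets the name it understands.
    var_dump(mb_strlen($gbk, $names['mbstring']));
    if (function_exists('iconv')) {
        var_dump(bin2hex(iconv($names['iconv'], 'UTF-8', $gbk)));
    }
}
```

A board converter's class could hold a table like this for the charsets its source forum actually uses, which keeps the naming differences out of the shared functions.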

Well, it's understandable that the Merge System cannot handle languages other than English well. But it's still a great converter system, and that's why I chose MyBB rather than phpBB/SMF/... .
