Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Some of the "sjis" characters are not returned correctly #1464

Open
sergb213 opened this issue Mar 18, 2024 · 4 comments
Open

Some of the "sjis" characters are not returned correctly #1464

sergb213 opened this issue Mar 18, 2024 · 4 comments

Comments

@sergb213
Copy link

sergb213 commented Mar 18, 2024

Software versions
MySQL 5.7.34, MySQL 8.0.22

Describe the bug
Some of the CHARSET=sjis characters (Japanese) are returned as question marks when they are read by MySqlConnector. Exactly the same code works fine with the official MySql.Data.MySqlClient connector.

Exception
n/a

Code sample

The following app prints this when it is compiled with the official MySQL connector (using MySql.Data.MySqlClient namespace) as expected:

id = 1, name = 髙 (first char = 0x9ad9), expected 髙, is expected = True
Done

but the 髙 character gets replaced with question mark when using MySqlConnector namespace:

id = 1, name = ? (first char = 0x3f), expected 髙, is expected = False
Done
static void Main(string[] args)
{
    Console.OutputEncoding = Encoding.UTF8;
    Encoding cp932 = CodePagesEncodingProvider.Instance.GetEncoding(932) ?? throw new Exception("Failed to create cp932 encoding");
    string expected = cp932.GetString(new byte[] { 0xfb, 0xfc });

    using var connection = new MySqlConnection("Server=localhost;Password=***;User ID=root;");
    connection.Open();

    using var command = connection.CreateCommand();
    command.CommandText = "select id, name from a.test";

    using var dr = command.ExecuteReader();
    while (dr.Read())
    {
        Console.WriteLine(
            "id = {0}, name = {1} (first char = 0x{2:x}), expected {3}, is expected = {4}",
            dr[0],
            dr[1],
            (uint)((string)dr[1])[0],
            expected,
            expected == (string)dr[1]);
    }

    Console.WriteLine("Done");
}

Expected behavior
A U+9AD9 character (髙) is expected to be returned, however we get the regular question mark placeholder U+003F (?)

Additional context
MySqlConnector seems to configure the connection to receive all content in Unicode, and there must be a bug in MySQL because it sends the question mark in this case. I can see the 0x3f character in the buffer under the debugger.

However, the official connector receives the 0xFB, 0xFC bytes, and it converts them to the correct 髙 character. The same behavior can be achieved by setting the character_set_results variable as

command.CommandText = "set character_set_results=sjis";
command.ExecuteNonQuery();

but then we still get the question mark in the end because we try to read it as UTF8 string and this sequence does not encode a UTF8 character:

This is how I created the test data:

C:\exe\mysql-5.7.34-winx64\bin
$ chcp 932
Active code page: 932

C:\exe\mysql-5.7.34-winx64\bin
$ mysql -h localhost -u root -p
. . . . . . .
mysql> create table test_sjis(id int primary key, name varchar(99)) default charset=sjis;
Query OK, 0 rows affected (0.03 sec)

mysql> insert into test_sjis values(1, 0xfbfc);
Query OK, 1 row affected (0.00 sec)

mysql> select id, name, hex(name) from test_sjis;
+----+------+-----------+
| id | name | hex(name) |
+----+------+-----------+
|  1 | 髙   | FBFC      |
+----+------+-----------+
1 row in set (0.00 sec)

mysql>
@bgrainger
Copy link
Member

MySqlConnector seems to configure the connection to receive all content in Unicode

Yes, this is by design. MySqlConnector delegates all character set conversions to MySQL Server and only transmits UTF-8 on the wire.

there must be a bug in MySQL because it sends the question mark in this case

That sounds likely. Have you reported it at bugs.mysql.com?

@sergb213
Copy link
Author

sergb213 commented Apr 1, 2024

I have not reported a bug at mysql.com but I looked at it a bit further, and it seems that 0xFBFC character code is invalid in sjis encoding because:

  1. "The sjis character set corresponds to the Shift_JIS character set defined by IANA, which supports JIS X0201 and JIS X0208 characters" (https://dev.mysql.com/doc/refman/8.0/en/charset-cp932.html)
  2. The two-byte characters are covered by JIS X0208 standard which does not use the 0xFB prefix according to this: https://en.wikipedia.org/wiki/Shift_JIS

Then, MySQL uses a simple static arrays for mapping the sjis characters to Unicode, and the region with 0xFB.. prefix is also empty: see ctype-sjis.cc, this is probably why it does not convert them to utf8mb4 correctly.

However, customers use the sjis character set to store these extended characters, and MySQL alows that, and then these characters can be retrieved with the official dotnet connector as shown above, but not with MySqlConnector.

@bgrainger
Copy link
Member

This seems like a difference between Shift JIS (https://en.wikipedia.org/wiki/Shift_JIS) and CP932 (https://en.wikipedia.org/wiki/Code_page_932_(Microsoft_Windows)).

Arguably it's a bug to define a DB column as using sjis but store cp932 characters in it: if the DB doesn't understand them, how will it collate them correctly? Can you use the cp932 character set in the database instead? https://dev.mysql.com/doc/refman/8.0/en/charset-cp932.html

@sergb213
Copy link
Author

sergb213 commented Apr 2, 2024

Customers use our software with their own databases, and we do not control the data structure

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

No branches or pull requests

2 participants