[ticket/16985] Fix MYSQLi bug - Incorrect string value for non-BMP chars #6384

lionel-rowe · 2022-04-13T18:28:36Z

Fixes PHPBB3-16985 and will likely also fix a class of various related bugs, such as CUSTDB-826.

utf8_encode_ucr is idempotent for any string that doesn't contain unencoded non-BMP characters, so double-encoding won't be an issue for any strings already encoded using it.

Checklist:

Correct branch: master for new features; 3.3.x for fixes
Tests pass
Code follows coding guidelines: master and 3.3.x
Commit follows commit message format

Tracker ticket (set the ticket ID to your ticket ID):

https://tracker.phpbb.com/browse/PHPBB3-16985

PHPBB3-16985

3D-I · 2022-04-14T05:35:52Z

Invalid, see the tracker.

lionel-rowe · 2022-04-14T09:19:14Z

Invalid, see the tracker.

Still valid, see my response in the tracker.

lionel-rowe · 2022-04-25T17:11:15Z

How do I move this forward?

DavidIQ · 2022-04-25T17:15:06Z

phpBB/phpbb/db/driver/mysqli.php

-		return @mysqli_real_escape_string($this->db_connect_id, $msg);
+		return @mysqli_real_escape_string(
+			$this->db_connect_id,
+			utf8_encode_ucr($msg)


This is assuming the intention for all usage and all extensions that already use this function. What exactly happens when this function has already been called?

Because utf8_encode_ucr only modifies non-BMP unicode characters, and HTML entities are ASCII only (i.e. will always lie well within the BMP), that means it's idempotent for multiple calls. In other words, utf8_encode_ucr(utf8_encode_ucr(utf8_encode_ucr($str))) will always yield the same result as utf8_encode_ucr($str).

See this playground on repl.it (functions copied directly from includes/utf/utf_tools.php):

https://replit.com/@lionel_rowe/utf8encodeucr

Not convinced this is the right approach. It is arbitrarily deciding how everyone should store their data and will likely produce unexpected results. Probably better to do specific implementations where needed.

Probably better to do specific implementations where needed.

As it has been done so far, let's say there are no bugs yet uncovered or there are some areas that have not fallbacks. Everything has been covered until proven otherwise.

arbitrarily deciding how everyone should store their data

I wholeheartedly agree in principle, but unfortunately that ship has sailed long ago, back when utf8_encode_ucr was introduced into the codebase in the first place, and probably even before that — the vast majority of string fields in phpBB are already (unnecessarily) stored as HTML, which should really be handled at the templating layer. For example, if I create a user with the username "<jimmy>", this is stored in the database as <jimmy>. All this PR does is ensure that (for example) username "jimmy💩" would also be stored as jimmy💩 to avoid MYSQLi throwing errors all over the place, which I've already encountered multiple times (I use a lot of emojis 😂).

Worst case with this PR is that a user occasionally sees strings looking like "💩", as opposed to the current worst case, which is that they get full-page error messages and potential data loss (e.g. the action they were performing is lost).

If this was merged along with #6377, even that relatively minor tradeoff could also be eliminated.

I think there are many less-than-ideal things about the phpBB codebase, as surely everyone will agree.

Ok, you're putting new logic into a function that is core to the product (dbal) and have so far justified it with "the function exists so it should be used in a spot that affects any and all code that uses it (that ship has sailed)", "if it breaks someone's code then so be it (user will sometimes see HTML encoded representations of the text)", and "the function wasn't used here because the code base is less than ideal".

While these aren't really good reasons/arguments I realize this PR is for master anyways so it's probably ok as a potentially breaking change for the next major version given the expectation that some extensions may need some changes anyways.

@DavidIQ Do you think it is a good idea to have Emoji in usernames or other parts of the code? I don't think so as we have already discussed this and put fallbacks in place.

#5556

This cannot be a general fixing where the text parser is not already otherwise present, we have already considered all possibilities. If a new deficiency is discovered then the same function should be used (or its counterpart for HTML). This PR does not make logical sense.

if I create a user with the username "<jimmy>", this is stored in the database as <jimmy>. All this PR does is ensure that (for example) username "jimmy💩" would also be stored as jimmy💩 to avoid MYSQLi throwing errors all over the place

#5556 - We do not want Emoji in usernames and some other places.

It does not make sense, please try first. There are multiple configurations for usernames at registration time. I would suggest that you learn more about phpBB at the UI level. Names like <jimmy> are possible. https://www.phpbb.com/community/memberlist.php?first_char=other#memberlist

So do not confuse what is stored in the database with what is being rendered at runtime (html).

@3D-I the username thing was just an example. Also, if an admin wants to allow users to have emojis in their username, why shouldn't that be allowed?

@DavidIQ Ok, you're putting new logic into a function that is core to the product (dbal) and have so far justified it with "the function exists so it should be used in a spot that affects any and all code that uses it (that ship has sailed)", "if it breaks someone's code then so be it (user will sometimes see HTML encoded representations of the text)", and "the function wasn't used here because the code base is less than ideal".

What I'm saying is that any ways this breaks code will be less severe than what it fixed, i.e. it's better to occasionally show raw HTML entities to users than to occasionally show full-page MYSQLi errors to users that may also cause them to lose work.

However, I accept that there may be a better way to fix this, so I'll have a bit more of a think about this one.

3D-I · 2022-04-25T23:46:25Z

@JoshyPHP What do you think about that? Thanks.

JoshyPHP · 2022-04-26T14:53:46Z

I don't have any skin in the game but I would advise against using HTML to escape plain text, as it changes the nature of the data. If the data is already HTML then sure, it's just another valid representation. Otherwise the change is too impactful, especially considering this is a workaround to fix at the application level an issue that exists at the server level.

IMO, it would be better to catch errors related to MySQL encoding and direct the user to a KB article on how to migrate to utf8mb4.

3D-I · 2022-04-26T14:55:39Z

Thanks @JoshyPHP

lionel-rowe · 2022-04-26T15:00:35Z

@JoshyPHP IMO, it would be better to catch errors related to MySQL encoding and direct the user to a KB article on how to migrate to utf8mb4.

Unfortunately using utf8mb4 for the database doesn't fix it — the issue exists at the MYSQLi driver level. However, I accept now that this fix probably isn't the best way to go. I'll investigate more see if I can find a way to "convince" MYSQLi that the underlying database is indeed using utf8mb4 and therefore it doesn't have to freak out when it sees those scary-looking codepoints above U+FFFF.

JoshyPHP · 2022-04-26T15:22:23Z

As far as I can tell, it should work fine as long as MySQL is configured for utf8mb4. On my local MariaDB install, this is what I use:

[mysqld]
character-set-server=utf8mb4
character-set-client-handshake=false

I don't know of an easy way to trigger that error on phpBB so I only tested it in the client directly.

$ mysql
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 7
Server version: 10.6.5-MariaDB-log Source distribution

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]> CREATE DATABASE test;
Query OK, 1 row affected (0.000 sec)

MariaDB [(none)]> USE test;
Database changed
MariaDB [test]> CREATE TABLE x SELECT '🍰' AS col;
Query OK, 1 row affected (0.007 sec)
Records: 1  Duplicates: 0  Warnings: 0

MariaDB [test]> SELECT * FROM x\G
*************************** 1. row ***************************
col: 🍰
1 row in set (0.000 sec)

lionel-rowe · 2022-04-26T16:09:28Z

@JoshyPHP As far as I can tell, it should work fine as long as MySQL is configured for utf8mb4.

It's actually a really flaky error — after more testing, I think I've isolated it to PM titles for which the string length with surrogate pairs used is ≥ 62 and which contain 1 or more non-BMP chars. So for example:

1. "💩___________________________________________________________" length 61 — OK
2. "💩ℵ__________________________________________________________" length 61 — OK
3. "💩____________________________________________________________" length 62 — error
4. "______________________________________________________________________" length 70 but no non-BMP chars — OK

Note that the standard UTF-8 byte length of 1 and 2 are 63 and 65, respectively, so the UTF-8 byte length doesn't seem to be the deciding factor.

lionel-rowe · 2022-04-26T20:49:41Z

OK, looks like my initial assumption that this is a MYSQLi issue was wrong — I think this might be actual root of the problem:

diff --git a/phpBB/phpbb/db/tools/tools.php b/phpBB/phpbb/db/tools/tools.php
index 1250a8901d..cc696b6c52 100644
--- a/phpBB/phpbb/db/tools/tools.php
+++ b/phpBB/phpbb/db/tools/tools.php
@@ -328,7 +328,7 @@ class tools implements tools_interface
 		{
 			case 'mysql_41':
 				// make sure the table is in UTF-8 mode
-				$table_sql .= "\n) CHARACTER SET `utf8` COLLATE `utf8_bin`;";
+				$table_sql .= "\n) CHARACTER SET `utf8mb4` COLLATE `utf8mb4_general_ci`;";
 				$statements[] = $table_sql;
 			break;

Despite using utf8mb4 at the database level, phpBB overrides that at the table level inside the dbtools::sql_create_table function. D'oh!

[ticket/16985] Fix MYSQLi bug - Incorrect string value for non-BMP chars

18a63b6

PHPBB3-16985

DavidIQ reviewed Apr 25, 2022

View reviewed changes

lionel-rowe marked this pull request as draft April 28, 2022 14:15

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[ticket/16985] Fix MYSQLi bug - Incorrect string value for non-BMP chars #6384

[ticket/16985] Fix MYSQLi bug - Incorrect string value for non-BMP chars #6384

lionel-rowe commented Apr 13, 2022

3D-I commented Apr 14, 2022

lionel-rowe commented Apr 14, 2022

lionel-rowe commented Apr 25, 2022

DavidIQ Apr 25, 2022

lionel-rowe Apr 25, 2022

DavidIQ Apr 25, 2022

3D-I Apr 25, 2022

lionel-rowe Apr 25, 2022

DavidIQ Apr 26, 2022

3D-I Apr 26, 2022 •

edited

3D-I Apr 26, 2022 •

edited

lionel-rowe Apr 26, 2022

lionel-rowe Apr 26, 2022

3D-I commented Apr 25, 2022

JoshyPHP commented Apr 26, 2022 •

edited

3D-I commented Apr 26, 2022

lionel-rowe commented Apr 26, 2022 •

edited

JoshyPHP commented Apr 26, 2022

lionel-rowe commented Apr 26, 2022 •

edited

lionel-rowe commented Apr 26, 2022 •

edited

[ticket/16985] Fix MYSQLi bug - Incorrect string value for non-BMP chars #6384

Are you sure you want to change the base?

[ticket/16985] Fix MYSQLi bug - Incorrect string value for non-BMP chars #6384

Conversation

lionel-rowe commented Apr 13, 2022

3D-I commented Apr 14, 2022

lionel-rowe commented Apr 14, 2022

lionel-rowe commented Apr 25, 2022

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

3D-I Apr 26, 2022 • edited

Choose a reason for hiding this comment

3D-I Apr 26, 2022 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

3D-I commented Apr 25, 2022

JoshyPHP commented Apr 26, 2022 • edited

3D-I commented Apr 26, 2022

lionel-rowe commented Apr 26, 2022 • edited

JoshyPHP commented Apr 26, 2022

lionel-rowe commented Apr 26, 2022 • edited

lionel-rowe commented Apr 26, 2022 • edited

3D-I Apr 26, 2022 •

edited

3D-I Apr 26, 2022 •

edited

JoshyPHP commented Apr 26, 2022 •

edited

lionel-rowe commented Apr 26, 2022 •

edited

lionel-rowe commented Apr 26, 2022 •

edited

lionel-rowe commented Apr 26, 2022 •

edited