Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ticket/16985] Fix MYSQLi bug - Incorrect string value for non-BMP chars #6384

Draft
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

lionel-rowe
Copy link
Contributor

Fixes PHPBB3-16985 and will likely also fix a class of various related bugs, such as CUSTDB-826.

utf8_encode_ucr is idempotent for any string that doesn't contain unencoded non-BMP characters, so double-encoding won't be an issue for any strings already encoded using it.

Checklist:

  • Correct branch: master for new features; 3.3.x for fixes
  • Tests pass
  • Code follows coding guidelines: master and 3.3.x
  • Commit follows commit message format

Tracker ticket (set the ticket ID to your ticket ID):

https://tracker.phpbb.com/browse/PHPBB3-16985

@3D-I
Copy link
Contributor

3D-I commented Apr 14, 2022

Invalid, see the tracker.

@lionel-rowe
Copy link
Contributor Author

Invalid, see the tracker.

Still valid, see my response in the tracker.

@lionel-rowe
Copy link
Contributor Author

How do I move this forward?

return @mysqli_real_escape_string($this->db_connect_id, $msg);
return @mysqli_real_escape_string(
$this->db_connect_id,
utf8_encode_ucr($msg)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is assuming the intention for all usage and all extensions that already use this function. What exactly happens when this function has already been called?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because utf8_encode_ucr only modifies non-BMP unicode characters, and HTML entities are ASCII only (i.e. will always lie well within the BMP), that means it's idempotent for multiple calls. In other words, utf8_encode_ucr(utf8_encode_ucr(utf8_encode_ucr($str))) will always yield the same result as utf8_encode_ucr($str).

See this playground on repl.it (functions copied directly from includes/utf/utf_tools.php):

https://replit.com/@lionel_rowe/utf8encodeucr

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not convinced this is the right approach. It is arbitrarily deciding how everyone should store their data and will likely produce unexpected results. Probably better to do specific implementations where needed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably better to do specific implementations where needed.

As it has been done so far, let's say there are no bugs yet uncovered or there are some areas that have not fallbacks. Everything has been covered until proven otherwise.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

arbitrarily deciding how everyone should store their data

I wholeheartedly agree in principle, but unfortunately that ship has sailed long ago, back when utf8_encode_ucr was introduced into the codebase in the first place, and probably even before that — the vast majority of string fields in phpBB are already (unnecessarily) stored as HTML, which should really be handled at the templating layer. For example, if I create a user with the username "<jimmy>", this is stored in the database as &lt;jimmy&gt;. All this PR does is ensure that (for example) username "jimmy💩" would also be stored as jimmy&#128169; to avoid MYSQLi throwing errors all over the place, which I've already encountered multiple times (I use a lot of emojis 😂).

Worst case with this PR is that a user occasionally sees strings looking like "&#128169;", as opposed to the current worst case, which is that they get full-page error messages and potential data loss (e.g. the action they were performing is lost).

If this was merged along with #6377, even that relatively minor tradeoff could also be eliminated.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there are many less-than-ideal things about the phpBB codebase, as surely everyone will agree.

Ok, you're putting new logic into a function that is core to the product (dbal) and have so far justified it with "the function exists so it should be used in a spot that affects any and all code that uses it (that ship has sailed)", "if it breaks someone's code then so be it (user will sometimes see HTML encoded representations of the text)", and "the function wasn't used here because the code base is less than ideal".

While these aren't really good reasons/arguments I realize this PR is for master anyways so it's probably ok as a potentially breaking change for the next major version given the expectation that some extensions may need some changes anyways.

Copy link
Contributor

@3D-I 3D-I Apr 26, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@DavidIQ Do you think it is a good idea to have Emoji in usernames or other parts of the code? I don't think so as we have already discussed this and put fallbacks in place.

#5556

This cannot be a general fixing where the text parser is not already otherwise present, we have already considered all possibilities. If a new deficiency is discovered then the same function should be used (or its counterpart for HTML). This PR does not make logical sense.

Copy link
Contributor

@3D-I 3D-I Apr 26, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if I create a user with the username "<jimmy>", this is stored in the database as &lt;jimmy&gt;. All this PR does is ensure that (for example) username "jimmy💩" would also be stored as jimmy&#128169; to avoid MYSQLi throwing errors all over the place

#5556 - We do not want Emoji in usernames and some other places.

It does not make sense, please try first. There are multiple configurations for usernames at registration time. I would suggest that you learn more about phpBB at the UI level. Names like <jimmy> are possible. https://www.phpbb.com/community/memberlist.php?first_char=other#memberlist

image

So do not confuse what is stored in the database with what is being rendered at runtime (html).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@3D-I the username thing was just an example. Also, if an admin wants to allow users to have emojis in their username, why shouldn't that be allowed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@DavidIQ Ok, you're putting new logic into a function that is core to the product (dbal) and have so far justified it with "the function exists so it should be used in a spot that affects any and all code that uses it (that ship has sailed)", "if it breaks someone's code then so be it (user will sometimes see HTML encoded representations of the text)", and "the function wasn't used here because the code base is less than ideal".

What I'm saying is that any ways this breaks code will be less severe than what it fixed, i.e. it's better to occasionally show raw HTML entities to users than to occasionally show full-page MYSQLi errors to users that may also cause them to lose work.

However, I accept that there may be a better way to fix this, so I'll have a bit more of a think about this one.

@3D-I
Copy link
Contributor

3D-I commented Apr 25, 2022

@JoshyPHP What do you think about that? Thanks.

@JoshyPHP
Copy link
Contributor

JoshyPHP commented Apr 26, 2022

I don't have any skin in the game but I would advise against using HTML to escape plain text, as it changes the nature of the data. If the data is already HTML then sure, it's just another valid representation. Otherwise the change is too impactful, especially considering this is a workaround to fix at the application level an issue that exists at the server level.

IMO, it would be better to catch errors related to MySQL encoding and direct the user to a KB article on how to migrate to utf8mb4.

@3D-I
Copy link
Contributor

3D-I commented Apr 26, 2022

Thanks @JoshyPHP

@lionel-rowe
Copy link
Contributor Author

lionel-rowe commented Apr 26, 2022

@JoshyPHP IMO, it would be better to catch errors related to MySQL encoding and direct the user to a KB article on how to migrate to utf8mb4.

Unfortunately using utf8mb4 for the database doesn't fix it — the issue exists at the MYSQLi driver level. However, I accept now that this fix probably isn't the best way to go. I'll investigate more see if I can find a way to "convince" MYSQLi that the underlying database is indeed using utf8mb4 and therefore it doesn't have to freak out when it sees those scary-looking codepoints above U+FFFF.

@JoshyPHP
Copy link
Contributor

As far as I can tell, it should work fine as long as MySQL is configured for utf8mb4. On my local MariaDB install, this is what I use:

[mysqld]
character-set-server=utf8mb4
character-set-client-handshake=false

I don't know of an easy way to trigger that error on phpBB so I only tested it in the client directly.

$ mysql
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 7
Server version: 10.6.5-MariaDB-log Source distribution

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]> CREATE DATABASE test;
Query OK, 1 row affected (0.000 sec)

MariaDB [(none)]> USE test;
Database changed
MariaDB [test]> CREATE TABLE x SELECT '🍰' AS col;
Query OK, 1 row affected (0.007 sec)
Records: 1  Duplicates: 0  Warnings: 0

MariaDB [test]> SELECT * FROM x\G
*************************** 1. row ***************************
col: 🍰
1 row in set (0.000 sec)

@lionel-rowe
Copy link
Contributor Author

lionel-rowe commented Apr 26, 2022

@JoshyPHP As far as I can tell, it should work fine as long as MySQL is configured for utf8mb4.

It's actually a really flaky error — after more testing, I think I've isolated it to PM titles for which the string length with surrogate pairs used is ≥ 62 and which contain 1 or more non-BMP chars. So for example:

1. "💩___________________________________________________________" length 61 — OK
2. "💩ℵ__________________________________________________________" length 61 — OK
3. "💩____________________________________________________________" length 62 — error
4. "______________________________________________________________________" length 70 but no non-BMP chars — OK

Note that the standard UTF-8 byte length of 1 and 2 are 63 and 65, respectively, so the UTF-8 byte length doesn't seem to be the deciding factor.

@lionel-rowe
Copy link
Contributor Author

lionel-rowe commented Apr 26, 2022

OK, looks like my initial assumption that this is a MYSQLi issue was wrong — I think this might be actual root of the problem:

diff --git a/phpBB/phpbb/db/tools/tools.php b/phpBB/phpbb/db/tools/tools.php
index 1250a8901d..cc696b6c52 100644
--- a/phpBB/phpbb/db/tools/tools.php
+++ b/phpBB/phpbb/db/tools/tools.php
@@ -328,7 +328,7 @@ class tools implements tools_interface
 		{
 			case 'mysql_41':
 				// make sure the table is in UTF-8 mode
-				$table_sql .= "\n) CHARACTER SET `utf8` COLLATE `utf8_bin`;";
+				$table_sql .= "\n) CHARACTER SET `utf8mb4` COLLATE `utf8mb4_general_ci`;";
 				$statements[] = $table_sql;
 			break;

Despite using utf8mb4 at the database level, phpBB overrides that at the table level inside the dbtools::sql_create_table function. D'oh!

@lionel-rowe lionel-rowe marked this pull request as draft April 28, 2022 14:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants