New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ticket/16985] Fix MYSQLi bug - Incorrect string value for non-BMP chars #6384
base: master
Are you sure you want to change the base?
Conversation
Invalid, see the tracker. |
Still valid, see my response in the tracker. |
How do I move this forward? |
return @mysqli_real_escape_string($this->db_connect_id, $msg); | ||
return @mysqli_real_escape_string( | ||
$this->db_connect_id, | ||
utf8_encode_ucr($msg) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is assuming the intention for all usage and all extensions that already use this function. What exactly happens when this function has already been called?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because utf8_encode_ucr
only modifies non-BMP unicode characters, and HTML entities are ASCII only (i.e. will always lie well within the BMP), that means it's idempotent for multiple calls. In other words, utf8_encode_ucr(utf8_encode_ucr(utf8_encode_ucr($str)))
will always yield the same result as utf8_encode_ucr($str)
.
See this playground on repl.it (functions copied directly from includes/utf/utf_tools.php
):
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not convinced this is the right approach. It is arbitrarily deciding how everyone should store their data and will likely produce unexpected results. Probably better to do specific implementations where needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably better to do specific implementations where needed.
As it has been done so far, let's say there are no bugs yet uncovered or there are some areas that have not fallbacks. Everything has been covered until proven otherwise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
arbitrarily deciding how everyone should store their data
I wholeheartedly agree in principle, but unfortunately that ship has sailed long ago, back when utf8_encode_ucr
was introduced into the codebase in the first place, and probably even before that — the vast majority of string fields in phpBB are already (unnecessarily) stored as HTML, which should really be handled at the templating layer. For example, if I create a user with the username "<jimmy>", this is stored in the database as <jimmy>
. All this PR does is ensure that (for example) username "jimmy💩" would also be stored as jimmy💩
to avoid MYSQLi throwing errors all over the place, which I've already encountered multiple times (I use a lot of emojis 😂).
Worst case with this PR is that a user occasionally sees strings looking like "💩", as opposed to the current worst case, which is that they get full-page error messages and potential data loss (e.g. the action they were performing is lost).
If this was merged along with #6377, even that relatively minor tradeoff could also be eliminated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there are many less-than-ideal things about the phpBB codebase, as surely everyone will agree.
Ok, you're putting new logic into a function that is core to the product (dbal) and have so far justified it with "the function exists so it should be used in a spot that affects any and all code that uses it (that ship has sailed)", "if it breaks someone's code then so be it (user will sometimes see HTML encoded representations of the text)", and "the function wasn't used here because the code base is less than ideal".
While these aren't really good reasons/arguments I realize this PR is for master anyways so it's probably ok as a potentially breaking change for the next major version given the expectation that some extensions may need some changes anyways.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@DavidIQ Do you think it is a good idea to have Emoji in usernames or other parts of the code? I don't think so as we have already discussed this and put fallbacks in place.
This cannot be a general fixing where the text parser is not already otherwise present, we have already considered all possibilities. If a new deficiency is discovered then the same function should be used (or its counterpart for HTML). This PR does not make logical sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if I create a user with the username "
<jimmy>
", this is stored in the database as<jimmy>
. All this PR does is ensure that (for example) username "jimmy💩" would also be stored asjimmy💩
to avoid MYSQLi throwing errors all over the place
#5556 - We do not want Emoji in usernames and some other places.
It does not make sense, please try first. There are multiple configurations for usernames at registration time. I would suggest that you learn more about phpBB at the UI level. Names like <jimmy>
are possible. https://www.phpbb.com/community/memberlist.php?first_char=other#memberlist
So do not confuse what is stored in the database with what is being rendered at runtime (html).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@3D-I the username thing was just an example. Also, if an admin wants to allow users to have emojis in their username, why shouldn't that be allowed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@DavidIQ Ok, you're putting new logic into a function that is core to the product (dbal) and have so far justified it with "the function exists so it should be used in a spot that affects any and all code that uses it (that ship has sailed)", "if it breaks someone's code then so be it (user will sometimes see HTML encoded representations of the text)", and "the function wasn't used here because the code base is less than ideal".
What I'm saying is that any ways this breaks code will be less severe than what it fixed, i.e. it's better to occasionally show raw HTML entities to users than to occasionally show full-page MYSQLi errors to users that may also cause them to lose work.
However, I accept that there may be a better way to fix this, so I'll have a bit more of a think about this one.
@JoshyPHP What do you think about that? Thanks. |
I don't have any skin in the game but I would advise against using HTML to escape plain text, as it changes the nature of the data. If the data is already HTML then sure, it's just another valid representation. Otherwise the change is too impactful, especially considering this is a workaround to fix at the application level an issue that exists at the server level. IMO, it would be better to catch errors related to MySQL encoding and direct the user to a KB article on how to migrate to |
Thanks @JoshyPHP |
Unfortunately using |
As far as I can tell, it should work fine as long as MySQL is configured for
I don't know of an easy way to trigger that error on phpBB so I only tested it in the client directly.
|
It's actually a really flaky error — after more testing, I think I've isolated it to PM titles for which the string length with surrogate pairs used is ≥ 62 and which contain 1 or more non-BMP chars. So for example:
Note that the standard UTF-8 byte length of 1 and 2 are 63 and 65, respectively, so the UTF-8 byte length doesn't seem to be the deciding factor. |
OK, looks like my initial assumption that this is a MYSQLi issue was wrong — I think this might be actual root of the problem: diff --git a/phpBB/phpbb/db/tools/tools.php b/phpBB/phpbb/db/tools/tools.php
index 1250a8901d..cc696b6c52 100644
--- a/phpBB/phpbb/db/tools/tools.php
+++ b/phpBB/phpbb/db/tools/tools.php
@@ -328,7 +328,7 @@ class tools implements tools_interface
{
case 'mysql_41':
// make sure the table is in UTF-8 mode
- $table_sql .= "\n) CHARACTER SET `utf8` COLLATE `utf8_bin`;";
+ $table_sql .= "\n) CHARACTER SET `utf8mb4` COLLATE `utf8mb4_general_ci`;";
$statements[] = $table_sql;
break; Despite using |
Fixes PHPBB3-16985 and will likely also fix a class of various related bugs, such as CUSTDB-826.
utf8_encode_ucr
is idempotent for any string that doesn't contain unencoded non-BMP characters, so double-encoding won't be an issue for any strings already encoded using it.Checklist:
Tracker ticket (set the ticket ID to your ticket ID):
https://tracker.phpbb.com/browse/PHPBB3-16985