Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mb_strlen returns 0 instead of 1 for the char chr(254) #415

Open
alexchuin opened this issue Oct 18, 2022 · 7 comments
Open

mb_strlen returns 0 instead of 1 for the char chr(254) #415

alexchuin opened this issue Oct 18, 2022 · 7 comments

Comments

@alexchuin
Copy link

Hi,

With the polyfill, var_dump(mb_strlen(chr(254))) return 0.
With the php8.0-mbstring extension, var_dump(mb_strlen(chr(254))) return 1;

versions : * v1.26.0

Thanks,
Alex

@yannouche34490
Copy link

Hi,
I’ve made some tests and iconv_strlen, who’s used by mb_strlen, fails for every chr() from 128 to 255. This plus function’s return type int probably casting false value returned by iconv_strlen to 0 explains thé issue.

I’ve found a trick by using utf8_encode, but it will deprecated with php 8.2.

@stof
Copy link
Member

stof commented Jan 27, 2023

utf8_encode is not a solution to your problem. It just breaks other cases (the name of that function does not describe what it does)

@stof
Copy link
Member

stof commented Jan 27, 2023

@alexchuin the issue is that chr(254) is not a valid UTF-8 string. So trying to compute its length in the UTF-8 encoding does not make any sense.
Could it be that when you try that with the actual mbstring, you have a php.ini when the default encoding used by mbstring is not UTF-8 but something else ?

@yannouche34490
Copy link

utf8_encode is not a solution to your problem. It just breaks other cases (the name of that function does not describe what it does)

Yes, I know, that’s why I didn’t wrote a PR. My point is maybe somerhing from ut8_encode source code is « portable » in this context to handle edge-cases like this.

@stof
Copy link
Member

stof commented Jan 27, 2023

@yannouche34490 your code using utf8_encode is not fixing the issue. It only computes the length of a different string.

@yannouche34490
Copy link

@yannouche34490 your code using utf8_encode is not fixing the issue. It only computes the length of a different string.

Yes, we have to find a proper way to detect these cases.

@stof
Copy link
Member

stof commented Jan 27, 2023

@yannouche34490 utf8_encode is not the tool for that either. This function is really convert_latin1_to_utf8 (which is precisely why it is deprecated).
And if you really want to work with latin1, then use strlen($string) (as latin1 is not a multibyte encoding) or mb_strlen($string, 'ISO-8859-1').

The issue here is that it looks like mb_strlen does not validate that the string is valid in the encoding it uses when computing the length, while iconv_strlen does. So passing an invalid UTF-8 string produces a different behavior in the polyfill than in the extension. we would have to implement a fallback for cases where iconv fails to have an actual polyfill.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants