Strange Contains and IndexOf handling of "\0" in .NET 5.0 #46569

Closed
xanatos opened this issue Jan 5, 2021 · 7 comments
Labels
area-System.Globalization untriaged New issue has not been triaged by the area owner

Comments

@xanatos

xanatos commented Jan 5, 2021

The new ICU handling of strings seems to have a problem with "\0" in .NET 5.0

Console.WriteLine($"Ordinal Contains null char {"test".Contains("\0", StringComparison.Ordinal)}");
Console.WriteLine($"OrdinalIgnoreCase Contains null char {"test".Contains("\0", StringComparison.OrdinalIgnoreCase)}");

Console.WriteLine($"CurrentCulture Contains null char {"test".Contains("\0", StringComparison.CurrentCulture)}");
Console.WriteLine($"CurrentCultureIgnoreCase Contains null char {"test".Contains("\0", StringComparison.CurrentCultureIgnoreCase)}");

Console.WriteLine($"InvariantCulture Contains null char {"test".Contains("\0", StringComparison.InvariantCulture)}");
Console.WriteLine($"InvariantCultureIgnoreCase Contains null char {"test".Contains("\0", StringComparison.InvariantCultureIgnoreCase)}");

Console.WriteLine($"Ordinal IndexOf null char {"test".IndexOf("\0", StringComparison.Ordinal)}");
Console.WriteLine($"OrdinalIgnoreCase IndexOf null char {"test".IndexOf("\0", StringComparison.OrdinalIgnoreCase)}");

Console.WriteLine($"CurrentCulture IndexOf null char {"test".IndexOf("\0", StringComparison.CurrentCulture)}");
Console.WriteLine($"CurrentCultureIgnoreCase IndexOf null char {"test".IndexOf("\0", StringComparison.CurrentCultureIgnoreCase)}");

Console.WriteLine($"InvariantCulture IndexOf null char {"test".IndexOf("\0", StringComparison.InvariantCulture)}");
Console.WriteLine($"InvariantCultureIgnoreCase IndexOf null char {"test".IndexOf("\0", StringComparison.InvariantCultureIgnoreCase)}");

I expect all of these to write false and -1, but the CurrentCulture, CurrentCultureIgnoreCase, InvariantCulture and InvariantCultureIgnoreCase variants return true and 0. This is a breaking change from 3.1 and quite illogical, considering that \0 is a "normal" character in .NET. I've noticed that if I use "\0test" the results are the same (true and 0).

@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added area-System.Globalization untriaged New issue has not been triaged by the area owner labels Jan 5, 2021
@ghost

ghost commented Jan 5, 2021

Tagging subscribers to this area: @tarekgh, @safern, @krwq
See info in area-owners.md if you want to be subscribed.

@xanatos xanatos changed the title Contains and IndexOf handling of "\0" in .NET 5.0 Strange Contains and IndexOf handling of "\0" in .NET 5.0 Jan 5, 2021
@benaadams
Member

/cc @GrabYourPitchforks @tarekgh

@safern
Member

safern commented Jan 6, 2021

This is by design in ICU: "\0" is a weightless character there, as discussed in this issue: #4673 (comment)

This has been the behavior in .NET Core on Unix systems since .NET Core 2.0, and as of .NET 5.0 we decided to use ICU by default on Windows as well, to bring behavior on par across all OSs.

You can look at the doc https://docs.microsoft.com/en-us/dotnet/standard/globalization-localization/globalization-icu to learn more about the move to ICU. The doc explains how to switch back to NLS behavior if you need to (however, that is not recommended, as NLS will be legacy in the long term).

Also, see #43956, which is meant to make this change less painful for .NET 6.0, which is our LTS.

There is also a long thread that might help you understand some of the implications of, and motivation for, the breaking change: #43736 (comment)

I'm going to close this issue. Please let us know if you have more questions, and thank you for opening the issue.

@safern safern closed this as completed Jan 6, 2021
@tarekgh
Member

tarekgh commented Jan 6, 2021

Just to add to what @safern mentioned:

Unicode collation has some characters that are ignored during cultural collation operations. Think of it as if these characters do not exist in the string at all. The null character \0 is one of them. You can consult the Unicode standard for the whole list of ignored characters here: https://www.unicode.org/charts/collation/chart_Ignored.html.

For searching for such control characters, we always recommend using an ordinal operation.
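As a minimal sketch of that recommendation: the helper name `IndexOfLiteral` below is hypothetical and only wraps the standard `string.IndexOf(string, StringComparison)` overload with `StringComparison.Ordinal`, so that "\0" has to literally occur in the text to be found.

```csharp
using System;

public static class OrdinalSearch
{
    // Hypothetical helper: ordinal search compares UTF-16 code units,
    // so no character is treated as ignorable.
    public static int IndexOfLiteral(string text, string value) =>
        text.IndexOf(value, StringComparison.Ordinal);

    public static void Main()
    {
        Console.WriteLine(IndexOfLiteral("test", "\0"));    // -1
        Console.WriteLine(IndexOfLiteral("te\0st", "\0"));  // 2
        Console.WriteLine("test".Contains("\0", StringComparison.Ordinal)); // False
    }
}
```

The same calls with a culture-sensitive comparison are the ones reported at the top of this issue as returning 0 and true under ICU.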

@xanatos feel free to send any questions if you think anything here is unclear, and thanks for reporting the issue.

@xanatos
Author

xanatos commented Jan 6, 2021

No, now it is quite clear. The chart is a little misleading in the glyphs shown for the 0080-009F block, because it shows glyphs that in truth have been remapped. So by looking at the chart it seems that 0080 is the Euro symbol, but in truth the Euro symbol is 20AC and 0080 is a control character (the first thing I thought while looking at the chart was: why does the Unicode team think the European Euro is less important than the American Dollar 🤣). But what they did has a certain logic. Sadly, now to make string comparisons you'll need a master's degree in Unicode Technologies, but that is another problem.

@benaadams
Member

For a single-char control code, wouldn't it be easier to use single quotes to pick the char overload, which defaults to Ordinal? (It is also faster.)

"test".IndexOf('\0')
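A small sketch of the char-overload suggestion follows. The wrapper `IndexOfNull` is a hypothetical name used only for illustration; `string.IndexOf(char)` and `string.Contains(char)` are the standard overloads, and both compare code units rather than applying collation rules.

```csharp
using System;

public static class CharOverloadDemo
{
    // Hypothetical helper: the char overload of IndexOf is an ordinal search.
    public static int IndexOfNull(string text) => text.IndexOf('\0');

    public static void Main()
    {
        Console.WriteLine(IndexOfNull("test"));    // -1
        Console.WriteLine(IndexOfNull("te\0st"));  // 2
        Console.WriteLine("test".Contains('\0'));  // False
    }
}
```

Note that the string overload `IndexOf(string)` without a `StringComparison` argument is culture-sensitive, so `"test".IndexOf("\0")` is subject to the ICU behavior discussed in this thread, whereas `Contains(string)` without a comparison argument is ordinal.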

@tarekgh
Member

tarekgh commented Jan 6, 2021

> So by looking at the chart it seems that 0080 is the Euro symbol, but in truth the Euro Symbol is 20AC and 0080 is a control character (the first thing I thought while looking at the chart was: why the Unicode team thinks the European Euro is less important than the American Dollar 🤣). But what they did has a certain logic.

I agree the chart can be confusing if you look at it without previous knowledge, but at least it lists all the Unicode code points that are ignored, which makes it easy to check the behavior of such characters. I believe the chart uses the euro sign at 0x80 because, when the euro sign was initially introduced, it was required to be supported in most codepages (not only Unicode). For most codepages, the character 0x80 is the euro sign. Here is an example: https://en.wikipedia.org/wiki/Windows-1252. But I still agree the chart is confusing.

> Sadly now to make string comparisons you'll need a master's degree in Unicode Technologies, but that is another problem.

Linguistic operations can be very surprising for many languages, especially if you are not familiar with those languages. That is why you need to be conscious when doing such operations. You don't have to be an expert at all, but you do need to evaluate the scenario that uses the operation. For example, if you are displaying a sorted list of strings in your app UI, it makes sense to use the linguistic operation regardless of how much you know about the details, because the list will be sorted according to the user's expectations. Search operations that look for specific literal characters should be ordinal.
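To illustrate that split, here is a rough sketch that sorts with a linguistic comparer for display and searches ordinally for a literal character. The names `SortForDisplay` and `ContainsLiteral` are hypothetical; only `StringComparer.CurrentCulture` and `StringComparison.Ordinal` come from the framework. The sorted order depends on the current culture, so no expected output is shown for it.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SortVersusSearch
{
    // Hypothetical helper: linguistic ordering, suited to lists shown to users.
    public static List<string> SortForDisplay(IEnumerable<string> items) =>
        items.OrderBy(s => s, StringComparer.CurrentCulture).ToList();

    // Hypothetical helper: literal search, where every code unit counts.
    public static bool ContainsLiteral(string text, string value) =>
        text.Contains(value, StringComparison.Ordinal);

    public static void Main()
    {
        var names = new[] { "cherry", "Banana", "apple" };
        Console.WriteLine(string.Join(", ", SortForDisplay(names)));

        Console.WriteLine(ContainsLiteral("Banana", "\0"));    // False
        Console.WriteLine(ContainsLiteral("Bana\0na", "\0"));  // True
    }
}
```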

Feel free to send any more questions if you have any.

5 participants