Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Directional confusability: _א1 and _1א not detected (unicode 15.1's new TR55, and updated TR39) #12929

Open
mrluc opened this issue Sep 13, 2023 · 9 comments

Comments

@mrluc
Copy link
Contributor

mrluc commented Sep 13, 2023

Elixir and Erlang/OTP versions

otp 25, elixir 14

Operating system

osx

Current behavior

An issue to kick off convo -- I found a few implications for us upon reading 15.'s new technical report on source code, TR55 and the updated TR39.

As the title says -- in visual studio code and Emacs, I see the same thing I see in the issue title: what look like identical tokens. I can't paste this code into a selectable code sample here, because this editor does wacky things to it! But I'm pasting an image below.

image

Compiles and runs fine despite the tokens looking confusable.

Expected behavior

Should fail or warn as the variable names are visually confusable -- only when selecting text/moving cursor do you see the order change. I'll be curious to see if this survives in the issue title or does anything weird there, actually!

We would need to implement the brand new definition of Confusability -- LTR confusability in this case -- to catch it.

Found from reading a TR55 example -- they classify levels of confusability detection (we'd be Type 1, I guess), but they mention this case about Rust's detection:

Industry example: The Rust compiler implements type II confusable detection, by flagging any confusable identifiers within a crate, with the exception that it uses confusability rather than LTR-confusability (so that it would fail to diagnose the confusability of the identifiers aא1 and a1א).

There's a lot in there that's relevant, actually, in addition to big changes to TR39. This is the group of people/technical report that's actually thinking closest to 1:1 about the whole experience for PL creators <> users writing in other languages.

  • For instance, they describe a less-restrictive-while-still-secure notion of 'confusable chunk detection' that we may want to take a look at, because of the better UX it gives.
  • Similarly they mention whole languages that are excluded without an addition to the General Security Profile for zero-width joiner characters -- ie you can't write the Persian word for 'update' in Elixir currently as an identifier (and other langs ofc).
  • @josevalim someone might want to make an issue to look through TR55 and the updated TR39 and see where we stand -- we could make an issue for that and assign it to me if we wanted.

There's some stuff we may disagree on

  • for instance, tucked away in both TR55 and TR39 is now a claim that all programming languages must allow direction to change within identifiers, ie a single variable where the direction you're reading changes. Which I'm not sure everyone would agree on, kinda feels like a language design question? Maybe everyone's doing it wrong. 😆

I'd have bandwidth to look into it after EOM probably; I'm doing a decently painful project that ships around then.

@josevalim
Copy link
Member

josevalim commented Sep 14, 2023

Hi @mrluc, thank you for a great initial description. My initial notes:

  1. We should protect against LTR confusability indeed. An initial PR is very welcome!

  2. I would still prefer to hold on introducing support for ZWJC

  3. I am happy with introducing chunked mixed-script detection. However, aliases do not allow unicode, and fooBAR variables are uncommon in Elixir, so I would only support chunking around the _ character (which is simple and clear)

  4. I could not find the bits about allowing direction changes within identifiers. Do they mention it in general, as part of general confusion or mixed-script confusion? If it is part of chunking, then it is alright as long as it respects the _ boundary. Otherwise I will need to learn more. References are welcome. :)

Feel free to add/remove/disagree. Thank you!

@mrluc
Copy link
Contributor Author

mrluc commented Sep 20, 2023

Very cool. I'd love to contribute when I can circle back to this, maybe via sketching out a reference implementation PR.

As an aside -- the project/product I'm doing right now is an absolute joy, thanks in large part to the many person-years of ❤️ that the Elixir team has put into things. LiveView, Nx, Vix, eVision, and Image are the gifts I'm most thankful for this morning.

Re: notes above:

I would still prefer to hold on introducing support for ZWJC

^^^ super 👍 atm

I would only support chunking around the _ character (which is simple and clear)

Wonderful, yeah that would likely make things much simpler.

[...bidi in identifiers] If it is part of chunking, then it is alright as long as it respect the _ boundary

^^^ yes, they only say it in that context I believe -- in the context of identifiers using 'chunked mixed script detection'.

Planning to follow up in Oct-Nov if nobody dives in before then

@josevalim
Copy link
Member

Hi @mrluc, a quick ping in case you still plan to tackle this. :)

@mrluc
Copy link
Contributor Author

mrluc commented Feb 15, 2024 via email

@kipcole9
Copy link
Contributor

I'm willing to take a stab at this if there are no other takers.

@kipcole9
Copy link
Contributor

I could not find the bits about allowing direction changes within identifiers

Perhaps @mrluc is referring to section 5.1.6 in TR55?

Implementations should not prohibit the use of the directional formatting characters; they are useful in ensuring the correct display of bidirectional text, as illustrated in this document. However, in order to avoid disruption when the code is displayed as plain text, it may be useful to warn when the effect of the explicit directional formatting character extends across atoms. The algorithm described in Section 5.2, Conversion to Plain Text, includes such a diagnostic.

@mrluc
Copy link
Contributor Author

mrluc commented Feb 15, 2024

Example from the updated/now-way-more-complex UTS 39: "LTR, and RTL, and FS confusability should be used when it is inappropriate to enforce that strings be single-script, or at least single-directionality; this is the case in programming language identifiers. See Section 5.1, Confusability Mitigation Diagnostics, in Unicode Technical Standard #55, Unicode Source Code Handling [UTS55]." -- from uts39/tr39

@mrluc
Copy link
Contributor Author

mrluc commented Feb 16, 2024

To fix this issue, I think I remember would mostly consist of implementing the new confusability skeleton algorithm, "bidiSkeleton", in the updated uts 39. It doesn't look as trivial any more, haha, and they call out that it's costlier to run, but there are some points that are hopeful for an implementation:

  • as they point out, most identifiers will be all one direction, and in that case bidiSkeleton(X) == skeleton(X)
  • also there's a potential benefit in that I think that confusability checks are the only check we do as its own full pass, instead of while tokenizing 😄 so it may be more self-contained.

That would catch the example mentioned above.

But then, separately, there's that question of changing what script mixing Elixir allows in identifiers, if we want to follow the new suggestions that imply loosening the Restriction Level, which would allow eg. http_सर्वर.

Currently, we only allow 'Highly Restrictive' script mixing in identifiers, which basically allows only the combinations of Latin script + some of several asian languages. That's why, in the example for this issue, I could only use non-hebrew characters that are shared by all scriptsets (_ and 1).

The new versions of UTS 39 and 55 express the strong opinion that it is inappropriate to limit identifiers to single-script or single-direction in programming language identifiers, which equates to a strong claim that 'no programming language should be Highly Restrictive'; an example from UTS 39:

Example from the updated/now-way-more-complex UTS 39: "LTR, and RTL, and FS confusability should be used when it is inappropriate to enforce that strings be single-script, or at least single-directionality; this is the case in programming language identifiers. See Section 5.1, Confusability Mitigation Diagnostics, in Unicode Technical Standard #55, Unicode Source Code Handling [UTS55]." -- from uts39/tr39

They're careful to say, in UTS 55 when defining chunk confusability in identifiers, that the guidance is for "if you're looser than Highly Restrictive, and also X and Y are true ..." -- but on the other hand, as you see above, they strongly word things. It gives to understand "you're restricting people from doing things that aren't actually dangerous, just because they're from Russia, India, or any other non-Highly-Restrictive-blessed script mixing region!" 😆 The horror.

They're right about that of course; the question is how to do it. I think we landed on Highly Restrictive because it was a clearly-defined 'safe' restriction level to start our compliance from, which at the same time didn't impose any additional effort on maybe 4-5B people (latin script users + specified asian script users) -- others can still write identifiers in their languages, but can't mix with Latin letters. As they point out, it's better to come up with a better definition of "what's confusing?" instead of the hammer of disallowing the script mixing for some and allowing it for other scriptsets.

So, what's better than what we've got now? This new stuff:

Mixed-script detection, as described in Unicode Technical Standard #39, Unicode Security Mechanisms [UTS39], should not directly be applied to computer language identifiers; indeed, it is often expected to mix scripts in these identifiers, because they may refer to technical terms in a different script than the one used for the bulk of the program. For instance, a Russian HTTP server may use the identifier HTTPЗапрос (HTTPRequest).

Instead, identifiers should be subdivided into visually recognizable chunks based either on both case and punctuation; one can then ensure that these chunks are either single-script, or are visibly mixed-script (in which case the reader is not misled about the string being single-script).

It's cool, do the same check, but for "chunks" within an identifier, instead of the whole identifier.

They introduce the concept of 'identifier chunks', and mention that there are 'chunk' separators within identifiers that we humans can parse, and that a machine can recognize, and thus that there are identifiers that as a whole string are 'Mixed Script' (accumulated intersection of scriptsets is empty if accumulated over every character in the string), but that are not actually dangerous from a script mixing perspective, because each separate chunk does not mix scripts within itself!

To modify one of their examples, http_सर्वर is 'mixed script', but not 'confusing'.

  • However, I note that their suggestions would treat these two differently: microᖯ and micro_ᖯ. One's 'confusing', because it's a separate chunk, one's not. That's a small issue, though, because normal confusability should catch the case of an all-latin micro_b appearing in scope and clashing.
  • Also I love specs -- they define the word 'confusing' for these purposes.

Wrap-up

As I understand it, when we fix the issue that originated this thread, probably by supporting bidiSkeleton, we would be compliant with the new version of UTS 39 again.

We'd still be using Highly Restrictive! Which is 😱 now, maybe. So then, if we wanted to move from Highly Restrictive to relying on chunked identifier confusability, that'd mean implementing:

UTS55-C4 An implementation claiming to implement Mixed-Script Detection in Identifier Chunks shall do so in accordance with the specifications in Section 5.1.2.2, Mixed-Script Detection in Identifier Chunks.

And that may be able to be done, literally, by just resetting the scriptset every time we encounter an underscore when parsing an identifier! 😆 That'd be amazing, but I'd doubt it will be that simple.

As a side note -- I vastly prefer josevalim's recco of underscores as separators for this, for a reason I didn't see mentioned as a risk in the UTS:

  • Don't forget, some characters can invade their neighbor's space! And that's a kind of confusability that I don't believe there are usable protections for: multi-character confusables. For instance, not the best example because it was the first thing I could think of, adding a thai vowel over this l่ might be overlooked and confused with a normal l. Today we catch it:

    Mixed-script identifiers are not supported for security reasons. 'l่' is made of the following scripts:
    
      \u006C l {Latin}
      \u0E48 ่ {Thai}
    

cc @josevalim @kipcole9 -- also kipcole ty so much for your work on so many useful libraries 🥲

@josevalim
Copy link
Member

I am still of the opinion of being conservative, because it is clear the best practices are still evolving and we don’t want to end up in a place we have to revert recommendations. To recap:

I am happy with introducing chunked mixed-script detection. However, aliases do not allow unicode, and fooBAR variables are uncommon in Elixir, so I would only support chunking around the _ character (which is simple and clear)

and perhaps use the same for bidi, but that can come later. Anyone against?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

No branches or pull requests

3 participants