Tracking issue for non-ASCII identifiers (feature "non_ascii_idents") #28979

Closed
DemiMarie opened this issue Oct 12, 2015 · 59 comments
Labels
B-unstable Blocker: Implemented in the nightly compiler and unstable. C-tracking-issue Category: A tracking issue for an RFC or an unstable feature. P-low Low priority T-lang Relevant to the language team, which will review and decide on the PR/issue.

Comments

@DemiMarie
Contributor

Non-ASCII identifiers are currently feature gated. Handling of them should be fixed and the feature gate removed.

@steveklabnik
Member

/cc @rust-lang/lang

@pnkfelix pnkfelix added the T-lang Relevant to the language team, which will review and decide on the PR/issue. label Oct 29, 2015
@pnkfelix
Member

nominating

@nrc
Member

nrc commented Oct 30, 2015

cc @SimonSapin

Apparently we implement this: http://www.unicode.org/reports/tr31/ or something like it.

I would like to see this stabilised, but it will take some work to persuade ourselves that we are doing the right thing.

@SimonSapin
Contributor

I have no idea what the right thing is here. In addition to Unicode recommendations, we might want to look at what other languages actually do, and what related bug reports or criticism they get. Or was this already done when the feature was first introduced?

@petrochenkov
Contributor

@SimonSapin
C and C++ use http://unicode.org/reports/tr31/#Alternative_Identifier_Syntax (with some minor restrictions) and I haven't seen any complaints about it on isocpp forums or issue lists :)
Overview of the problem: http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1518.htm
Implementation in Clang: http://llvm.org/viewvc/llvm-project/cfe/trunk/lib/Lex/UnicodeCharSets.h?view=markup
cc #4928

There's also a problem with normalization of identifiers and with mapping Unicode mod names to filesystem names (on OS X, IIRC); here it is: #2253. (In the worst case, non-inline mods and extern crates can be forced to be ASCII.)

@pnkfelix
Member

pnkfelix commented Nov 1, 2015

Yes, #2253 is the big issue I know of that makes me worry about premature stabilization of non-ASCII identifiers.

(The discussion there is more broad and arguably could be forked off into two threads; e.g. we could take one normalization path for identifiers and another for string literal contents.)

@pnkfelix
Member

pnkfelix commented Nov 1, 2015

We may want to migrate this discussion to the RFCs repo, e.g. at rust-lang/rfcs#802

@bstrie
Contributor

bstrie commented Nov 4, 2015

I agree that this is a feature that deserves to be put through the RFC process.

@aturon aturon changed the title Fix non-ASCII identifiers Tracking issue for non-ASCII identifiers (feature "non_ascii_idents") Nov 4, 2015
@aturon aturon added the B-unstable Blocker: Implemented in the nightly compiler and unstable. label Nov 4, 2015
@aturon
Member

aturon commented Nov 4, 2015

I've repurposed this issue to track stabilization (or deprecation, etc) of the non_ascii_idents feature gate.

@nikomatsakis
Contributor

After discussion in the lang team meeting, we decided that yes, an RFC would be the proper way forward here. We need something that collects the solutions from other languages, analyzes their pros/cons, and suggests the appropriate choice for Rust. This is controversial and complex enough that it should be brought to the community at large -- especially as many of us hacking on Rust on a daily basis don't have a lot of experience with non-ASCII anyhow.

@nikomatsakis
Contributor

triage: P-low

Marking as low as there is no RFC at present and hence no actionable content.

@rust-highfive rust-highfive added P-low Low priority and removed I-nominated labels Nov 5, 2015
@huonw
Member

huonw commented Nov 5, 2015

cc #7539

@kberov
Contributor

kberov commented Jan 8, 2017

In JavaScript, Perl 5 and Perl 6 this feature is available.
JavaScript (Firefox 50)

function Слово(стойност) {
  this.стойност = стойност;
}
var здрасти = new Слово("Здравей, свят");
console.log(здрасти.стойност) //Здравей, свят

Perl >=5.12

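use v5.12;  # enables `say`
use utf8;   # the source is UTF-8, so identifiers may be non-ASCII
binmode(STDOUT, ':encoding(UTF-8)');  # print wide characters without warnings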
{
  package Слово;
  sub new {
    my $self = bless {}, shift;
    $self->{стойност} = shift;
    $self
  }
};
my $здрасти = Слово->new("здравей, свят");
say ucfirst($здрасти->{стойност}); #Здравей, свят

Perl 6 (this is not just the next version of Perl; it is a new language)

class Слово {
  has $.стойност;
}

my $здрасти = Слово.new(стойност => 'здравей, свят');
say $здрасти.стойност.tc; #Здравей, свят

I would be happy to see it in Rust too.

@SimonSapin
Contributor

For what it’s worth identifiers in ECMAScript 2015 are based on the Default Identifier Syntax from Unicode Standard Annex #31.

Perl with use utf8; uses the regexp below, with XID_Start and XID_Continue presumably also from UAX #31.

/ (?[ ( \p{Word} & \p{XID_Start} ) + [_] ])
        (?[ ( \p{Word} & \p{XID_Continue} ) ]) *    /x

@kberov
Contributor

kberov commented Jan 8, 2017

Yes! Thanks @SimonSapin!

@SimonSapin
Contributor

For Python it’s <XID_Start> <XID_Continue>*.

So it looks like many programming languages that allow non-ASCII identifiers are based on the same standard, but in the details they each do something slightly different…
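
As a rough illustration of the Python rule mentioned above, the built-in `str.isidentifier()` can be used to probe which strings the language accepts as identifiers. This is only a sketch: the helper name `is_python_identifier` and the sample strings are chosen here for illustration, and the answers reflect Python's rule, not what Rust does or will do.

```python
import keyword


def is_python_identifier(name: str) -> bool:
    """True if `name` is a valid Python identifier and not a reserved keyword."""
    return name.isidentifier() and not keyword.iskeyword(name)


# Sample strings picked for illustration only.
samples = ["стойност", "_x1", "1abc", "a-b", "∅", "class"]
results = {s: is_python_identifier(s) for s in samples}

for sample, ok in results.items():
    print(f"{sample!r}: {ok}")
```

Note that a math symbol such as ∅ is rejected under this rule, since it is a symbol and not a letter.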

@mjbshaw
Contributor

mjbshaw commented Feb 8, 2017

I would personally love to see support for math-related identifiers. For example, ∅ (and set operators, like ∩ and ∪). Translating equations from research papers/specifications into code is often a terrible process resulting in verbose and difficult to read code. Being able to use the same identifiers in the code that are in the paper's math equations would simplify implementation and would make the code easier to check and compare against the paper's equations.

@DoumanAsh

What's the point of this feature exactly? Aside from making it possible to create a truly ugly mix of different languages in your code (English is the only truly international language), it gives no benefit to the language functionality-wise. Or is it support for Unicode for the sake of supporting Unicode?

@steveklabnik
Member

I'd like to cross-link this comment: #4928 (comment)

@gnzlbg
Contributor

gnzlbg commented Jan 17, 2018

I haven't seen the possibility of enabling homoglyph-based attacks mentioned here (if somebody already mentioned them, please ignore the noise), but I just filed a clippy issue to request a lint that warns on code like this:

#![feature(non_ascii_idents)]
fn main() {
    let a = 2;
    let а = 3;
    assert_eq!(a, 2);  // OK
    assert_eq!(а, 3);  // OK
}

In a nutshell, the two `a`s are different Unicode characters (the first is Latin, the second Cyrillic), so the second let binding does not shadow the first one, and both asserts pass. The playground doesn't seem to support Unicode identifiers, so the only way to try this is locally; it works for me.

This "feature" can be used to introduce exploits in Rust programs that are harder to detect, in particular given that shadowing let bindings are considered idiomatic Rust by many, myself included.

P.S.: this "feature" might be useful in underhanded Rust contests, although that #![feature(non_ascii_idents)] should raise some eyebrows :)
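
The same effect can be reproduced outside Rust. The sketch below uses Python's standard `unicodedata` module to show that the Latin and Cyrillic letters are distinct code points, and that they remain distinct names even after NFKC normalization (which Python applies to identifiers). The variable names are illustrative, not part of any proposal.

```python
import unicodedata

latin_a = "a"          # U+0061
cyrillic_a = "\u0430"  # U+0430, renders almost identically to Latin "a"

# The two characters have different Unicode names.
names = (unicodedata.name(latin_a), unicodedata.name(cyrillic_a))

# Normalization does not merge them: homoglyphs are not canonical
# or compatibility equivalents of each other.
stays_distinct = unicodedata.normalize("NFKC", cyrillic_a) != latin_a

# Binding both names creates two separate variables, mirroring the
# Rust snippet above.
namespace = {}
exec(f"{latin_a} = 2\n{cyrillic_a} = 3", namespace)

print(names, stays_distinct, namespace[latin_a], namespace[cyrillic_a])
```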

@ketsuban
Contributor

@gnzlbg I believe there's already some support for confusables detection to stop people swapping out your semicolons for Greek question marks and such, but I don't know if it applies to identifiers. If it does, then that solves that problem; if it doesn't, at least we have the tooling to do it ready to go.

I'm a little concerned that this is a candidate for being closed and the code removed from the compiler because it's not had significant movement for a while and requires an RFC. I care a fair amount about Rust being a language of the 21st century, which means Unicode, and about Rust being friendly to non-English-speaking programmers. What I lack is the ability to actually write an RFC.

@gnzlbg
Contributor

gnzlbg commented Jan 18, 2018

@ketsuban

I believe there's already some support for confusables detection to stop people swapping out your semicolons for Greek question marks and such, but I don't know if it applies to identifiers.

Yes. As @oli-obk suggested in the clippy issue, if the Rust implementation just used the latest official confusables list:

http://www.unicode.org/Public/security/revision-06/confusables.txt

then homoglyph-based attacks could be prevented. The list would need to be kept in sync, but that can be automated as part of the build system.
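
A minimal sketch of how such a check might work, assuming a lookup table derived from the confusables list. The table below is a tiny hand-picked subset written for illustration, not the real data file, and `skeleton` and `confusable_pairs` are hypothetical helper names. A real implementation would follow the full skeleton algorithm from Unicode's security mechanisms report, which also involves normalization steps omitted here.

```python
# Hypothetical, hand-picked subset; the real data lives in confusables.txt.
CONFUSABLE_TO_PROTOTYPE = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
}


def skeleton(ident: str) -> str:
    """Replace each confusable character with its prototype."""
    return "".join(CONFUSABLE_TO_PROTOTYPE.get(ch, ch) for ch in ident)


def confusable_pairs(idents):
    """Return (first_seen, later) pairs of distinct identifiers that share a skeleton."""
    seen = {}
    pairs = []
    for ident in idents:
        key = skeleton(ident)
        if key in seen and seen[key] != ident:
            pairs.append((seen[key], ident))
        else:
            seen.setdefault(key, ident)
    return pairs


print(confusable_pairs(["a", "\u0430", "b"]))
```

A lint built this way would only flag identifiers that collide within the same scope or crate, so legitimate Cyrillic-only code would not be affected.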

@gnzlbg
Contributor

gnzlbg commented Jan 18, 2018

@ketsuban

If you care about this, there are other languages that support unicode in their identifiers, and these languages have processes similar to the RFC process. You could start by checking those. Who knows, maybe you can just merge them together with the feedback in this issue, and get a pre-RFC in the internals forum going? From that point on, it is just about incorporating/arguing feedback with others, and before you know it you will have an RFC ready.

Mark-Simulacrum pushed a commit to Mark-Simulacrum/rust that referenced this issue May 17, 2018
Mark-Simulacrum added a commit to Mark-Simulacrum/rust that referenced this issue May 17, 2018
Fix grammar documentation wrt Unicode identifiers

The grammar defines identifiers in terms of XID_start and XID_continue,
but this is referring to the unstable non_ascii_idents feature.
The documentation implies that non_ascii_idents is forthcoming, but this
is left over from pre-1.0 documentation; in reality, non_ascii_idents
has been without even an RFC for several years now, and will not be
stabilized anytime soon. Furthermore, according to the tracking issue at
rust-lang#28979 , it's highly
questionable whether or not this feature will use XID_start or
XID_continue even when or if non_ascii_idents is stabilized.
This commit fixes this by respecifying identifiers as the usual
[a-zA-Z_][a-zA-Z0-9_]*
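
For reference, the ASCII-only pattern quoted in the commit message can be exercised directly. This is a sketch in Python for illustration: `is_ascii_ident` is a hypothetical helper, and it checks only the regular expression itself, not keywords or any other rule the Rust lexer applies.

```python
import re

# The pattern from the commit message, anchored by using fullmatch below.
ASCII_IDENT = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")


def is_ascii_ident(text: str) -> bool:
    """True if the whole string matches the ASCII identifier pattern."""
    return ASCII_IDENT.fullmatch(text) is not None


for candidate in ["foo_1", "_tmp", "1foo", "стойност", ""]:
    print(f"{candidate!r}: {is_ascii_ident(candidate)}")
```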
@mitsuhiko
Contributor

In a way I hope we stick with ASCII identifiers forever. Handling Unicode identifiers is such a massive interoperability pain. One of the more bizarre consequences of NFKC mapping is that things like this map to the same identifier:

>>> ℌ = 1
>>> H
1
>>> Ⅸ = 42
>>> IX
42
>>> ℕ = 23
>>> N
23
>>> import math
>>> ℯ = math.e
>>> e
2.718281828459045
>>> ℤ = 2
>>> Z
2
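
The folding behind this behaviour can be checked with the standard `unicodedata` module. In the sketch below, the specific compatibility characters are chosen for illustration; each one is a letter-like symbol that NFKC maps onto plain ASCII letters.

```python
import unicodedata

# Compatibility characters (illustrative choices) and the ASCII text
# that NFKC normalization turns them into.
compat_chars = {
    "\u210c": "H",   # BLACK-LETTER CAPITAL H
    "\u2168": "IX",  # ROMAN NUMERAL NINE
    "\u2115": "N",   # DOUBLE-STRUCK CAPITAL N
    "\u212f": "e",   # SCRIPT SMALL E
    "\u2124": "Z",   # DOUBLE-STRUCK CAPITAL Z
}

folded = {ch: unicodedata.normalize("NFKC", ch) for ch in compat_chars}

for ch, ascii_form in folded.items():
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} -> {ascii_form}")
```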

@Serentty
Contributor

@mitsuhiko The real world has that kind of pain. We can't just ignore this problem because it's hard to deal with and involves a feature that you personally have no use for.

@Ixrec
Contributor

Ixrec commented Jul 28, 2018

Also, the current RFC explicitly proposes NFC over NFKC, after a lot of discussion about examples very similar to those.
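
To make the NFC versus NFKC distinction concrete, here is a small sketch using Python's `unicodedata`. The sample characters are illustrative assumptions. NFC only unifies canonically equivalent sequences (such as a letter plus a combining accent), while NFKC additionally folds compatibility characters such as ligatures.

```python
import unicodedata

decomposed = "e\u0301"  # "e" followed by COMBINING ACUTE ACCENT
composed = "\u00e9"     # LATIN SMALL LETTER E WITH ACUTE
ligature = "\ufb01"     # LATIN SMALL LIGATURE FI

# NFC merges the two spellings of the accented letter...
nfc_unifies_accents = unicodedata.normalize("NFC", decomposed) == composed

# ...but leaves the compatibility ligature untouched,
nfc_keeps_ligature = unicodedata.normalize("NFC", ligature) == ligature

# whereas NFKC rewrites it to the two ASCII letters "fi".
nfkc_folds_ligature = unicodedata.normalize("NFKC", ligature) == "fi"

print(nfc_unifies_accents, nfc_keeps_ligature, nfkc_folds_ligature)
```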

@Centril
Contributor

Centril commented Oct 29, 2018

Closing in favor of #55467.

@Centril Centril closed this as completed Oct 29, 2018
kennytm pushed a commit to kennytm/rust that referenced this issue Dec 29, 2018
kennytm added a commit to kennytm/rust that referenced this issue Dec 29, 2018
Update references to closed issue

Issue rust-lang#28979 was closed with a link to rust-lang#55467.
JohnHeitmann pushed a commit to JohnHeitmann/rust that referenced this issue Jan 5, 2019
@arcturusannamalai

It's 2021 and we should be more inclusive in language design and in what's allowed in identifiers. Coming from the Python world, Python 3's support for Unicode identifiers, function names, and module names is truly great progress, and I wish this for the Rust community as well.

@steveklabnik
Member

@arcturusannamalai please see the final comment here, this work is still ongoing, and including that is the plan.

@sanmai-NL

It's 2021 and we should be more inclusive in language design and in what's allowed in identifiers. Coming from the Python world, Python 3's support for Unicode identifiers, function names, and module names is truly great progress, and I wish this for the Rust community as well.

CPython 3's support for non-ASCII identifiers is pretty spotty, variable and hard to determine. See https://tjol.eu/blog/unicode-identifiers.html

One interesting problem is that with Unicode changes between versions, valid identifiers in Python source code can become invalid.

@arcturusannamalai

I'm actually comfortable programming in 1s and 0s too; I just happen to want non-ASCII. But English suffices for systems-level work.

@steveklabnik
Member

Just to be clear, this feature landed in stable Rust in 1.53.0, almost two years ago: #83799
