Incorrect forced encoding for Regexp with a Unicode property/script #2620

Open
andrykonchin opened this issue Mar 22, 2024 · 1 comment
Labels
bug Something isn't working

andrykonchin commented Mar 22, 2024

For instance, the following regexps have UTF-8 encoding in CRuby:

/\p{Arabic}/.encoding # => #<Encoding:UTF-8>
/\p{L}/.encoding # => #<Encoding:UTF-8>

but Prism sets the forced_us_ascii_encoding flag on both of them:

bin/parse -e '/\p{L}/'
@ ProgramNode (location: (1,0)-(1,7))
├── locals: []
└── statements:
    @ StatementsNode (location: (1,0)-(1,7))
    └── body: (length: 1)
        └── @ RegularExpressionNode (location: (1,0)-(1,7))
            ├── flags: forced_us_ascii_encoding
            ├── opening_loc: (1,0)-(1,1) = "/"
            ├── content_loc: (1,1)-(1,6) = "\\p{L}"
            ├── closing_loc: (1,6)-(1,7) = "/"
            └── unescaped: "\\p{L}"

and

bin/parse -e '/\p{Arabic}/'
@ ProgramNode (location: (1,0)-(1,12))
├── locals: []
└── statements:
    @ StatementsNode (location: (1,0)-(1,12))
    └── body: (length: 1)
        └── @ RegularExpressionNode (location: (1,0)-(1,12))
            ├── flags: forced_us_ascii_encoding
            ├── opening_loc: (1,0)-(1,1) = "/"
            ├── content_loc: (1,1)-(1,11) = "\\p{Arabic}"
            ├── closing_loc: (1,11)-(1,12) = "/"
            └── unescaped: "\\p{Arabic}"

Related issue - #1997
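
For comparison, here is a minimal sketch of the CRuby behaviour described above. It assumes the default UTF-8 source encoding, and the variable names are only illustrative. An ASCII-only literal is included as a baseline, since that is the case where a US-ASCII encoding is expected.

```ruby
# Assumes this file uses the default UTF-8 source encoding.
ascii_only = /abc/         # baseline: no Unicode property
property   = /\p{L}/       # Unicode general category
script     = /\p{Arabic}/  # Unicode script

[ascii_only, property, script].each do |re|
  # fixed_encoding? reports whether the regexp is pinned to its encoding
  puts "#{re.inspect}: encoding=#{re.encoding}, fixed=#{re.fixed_encoding?}"
end

puts ascii_only.encoding # => US-ASCII
puts property.encoding   # => UTF-8
puts script.encoding     # => UTF-8
```

So a parser that wants to match CRuby here would presumably need to treat a \p{...} escape as forcing the source encoding rather than US-ASCII.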

andrykonchin commented

As a side note, CRuby raises a SyntaxError when the source encoding isn't UTF-8:

# encoding: US-ASCII

puts /\p{Arabic}/.encoding
# => test.rb:3: invalid character property name {Arabic}: /\p{Arabic}/ (SyntaxError)
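
The same dependency on encoding can be probed at runtime without a magic comment. This is only a sketch: compile_with is a hypothetical helper, and it goes through Regexp.new rather than a regexp literal, so a failure would surface as a RegexpError instead of a SyntaxError.

```ruby
# Hypothetical helper: compiles `source` tagged with `encoding` and returns
# the resulting regexp's encoding, or the RegexpError raised while compiling.
def compile_with(source, encoding)
  Regexp.new(source.dup.force_encoding(encoding)).encoding
rescue RegexpError => e
  e
end

pattern = '\p{Arabic}'

# With a UTF-8 pattern string the script property should be recognised.
puts compile_with(pattern, Encoding::UTF_8).inspect

# With a US-ASCII pattern string the property name is expected to be rejected.
puts compile_with(pattern, Encoding::US_ASCII).inspect
```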

andrykonchin changed the title from "Incorrect forced encoding for Regexp with Unicode property/script" to "Incorrect forced encoding for Regexp with a Unicode property/script" on Mar 22, 2024
kddnewton added the bug label on Mar 25, 2024
kddnewton added this to the Other unblocked milestone on Mar 27, 2024