Implement Stage 3 proposal Intl.Locale #5675

jackhorton · 2018-09-07T23:55:14Z

There are still a few things to do, but I wanted to get this out the door before heading home for the day:

Combine langtagToParts and parseLangtag
Implement minimize/maximize
Get rid of my silly inline bubblesort in favor of JavascriptArray::EntrySort
Add more tests
Update CanonicalizeLocaleList to accept Intl.Locale object
(Discuss) put this behind a feature flag

This currently passes 160/184 Intl.Locale spec tests. The 24 tests that do not pass are all cases where we shell out to ICU, which behaves in ways that do not comply with the Intl spec (mostly related to the differences between platform.normalizeLanguageTag and CanonicalizeLanguageTag).

/cc @littledan

jackhorton · 2018-09-10T01:23:52Z

Interesting, looks like ICU 55 doesn't minimize und-Hant to und-Hant, while at least 57 (I believe that's what's installed on the Macs) and 61 locally do.
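
For comparing ICU versions on this point, a small probe like the one below can be run in any engine that ships Intl.Locale. It is only a sketch: the helper name `likelySubtags` is hypothetical, it uses nothing beyond the standard `minimize()`, `maximize()`, and `toString()` methods, and the result for und-Hant depends on the ICU/CLDR data the engine is built against, so no expected output is shown for it.

```javascript
// Hypothetical helper (not part of this PR): report the canonical, minimized,
// and maximized forms of a language tag via the standard Intl.Locale API.
function likelySubtags(tag) {
    const locale = new Intl.Locale(tag);
    return {
        canonical: locale.toString(),
        minimized: locale.minimize().toString(),
        maximized: locale.maximize().toString(),
    };
}

// The und-Hant result is the one that differed between ICU versions above.
console.log(likelySubtags("und-Hant"));
console.log(likelySubtags("en"));
```

Running the same script against builds linked to different ICU versions makes this kind of data difference easy to spot.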

littledan · 2018-09-10T09:45:15Z

Maybe we should share some notes across implementations about where it is or isn't possible to reuse ICU's locale processing; cc @srl295 @jungshik @gsathya

jackhorton · 2018-09-10T15:44:41Z

Also see the comments on #5674 with @jefgen. ICU canonicalization code does a lot that Intl doesn't really care about, but its handling of UTS35/RFC5646-style canonicalization is a bit all over the place. I believe SpiderMonkey implements its own locale processing, and I am not sure what JSC does. I looked into doing the locale processing manually and it wouldn't be too difficult (at least, not on top of all of the processing that we already do in the abstract operations), so I don't know if it's worth getting the code into ICU if the only people worrying about it are Intl implementers.

sethbrenith · 2018-09-10T15:59:15Z

lib/Runtime/Library/InJavascript/Intl.js

-        const LANG_TAG_RE         = new RegExp(`^${LANG_TAG}$`,      'i'); // [1] language; [2] script; [3] region; [4] variants; [5] extensions;
+                let unicodeExtensionsEnd;
+                for (unicodeExtensionsEnd = unicodeExtensionStart + 1; unicodeExtensionsEnd < extensionParts.length && extensionParts[unicodeExtensionsEnd].length > 1; unicodeExtensionsEnd++) {
+                    // do nothing, we just want k to equal the index of the next element whose length is 1


What are you referring to with k? I don't see any variables with that name (which is nice, by the way).

oops, renamed the variable without updating the comment.
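
The loop in the diff above scans forward until it reaches the next subtag of length 1, that is, the next singleton. A standalone sketch of that scan follows; the function name `findExtensionEnd` and the sample tag are assumptions made for illustration, not code from this PR.

```javascript
// Hypothetical helper: given the subtags of a tag's extension section and the
// index of a singleton such as "u", return the index of the next subtag whose
// length is 1 (or parts.length if there is none).
function findExtensionEnd(parts, singletonIndex) {
    let end = singletonIndex + 1;
    while (end < parts.length && parts[end].length > 1) {
        end++;
    }
    return end;
}

// Sample input (an assumption): a Unicode extension followed by a private-use section.
const extensionParts = "u-ca-gregory-nu-latn-x-private".split("-");
const start = extensionParts.indexOf("u");
const end = findExtensionEnd(extensionParts, start);

// Subtags that belong to the -u- extension.
console.log(extensionParts.slice(start + 1, end));
```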

Maybe we should share some notes across implementations about where it is or isn't possible to reuse ICU's locale processing; c

v8 is about to make a switch to ICU API from regexes to validate input tags. There are two groups of issues:

ICU's handling of grandfathered tags and deprecated region/language codes is outdated. This is a data issue (the latest version of the IANA language tag registry should be used). I've filed a series of bugs against ICU. I'm assigned to them and I do have patches (that have been applied to v8's copy of ICU since this spring).

I also discovered a couple of bugs and put up a PR against ICU (it's approved, but not yet merged).

With the above two issues resolved, v8 will make a switch to ICU with a couple of extra pre/post-processing steps.

I spoke too early. What I wrote above is mainly structure-validation and canonicalization. Min/max also work fine.

Last I had checked V8 (not sure which version) it was still converting unicode attributes to unicode keywords with value = "yes", and converting en-GB-oed to en-GB-x-oed rather than en-GB-oxendict. Chakra does this as well as a direct result of allowing ICU to do the canonicalization, so I assumed V8 used ICU in the same way that we did. In other words, I thought both V8 and Chakra suffered primarily from data issues because of ICU, not structure issues.

Thank you for reminding me of the 'yes' issue. That's still outstanding. It's https://unicode-org.atlassian.net/browse/ICU-13730.

In the case of en-GB-oed, it's fixed at least in getCanonicalLocales(), along with a number of cases arising from data. ICU upstream hasn't been fixed, though. [1]

var loc1 = Intl.getCanonicalLocales("en-GB-oed")
undefined
loc1
["en-GB-oxendict"]

Let me check Intl.Locale.

[1] https://unicode-org.atlassian.net/browse/ICU-13721
13719, 13720, 13723, 13726 are other bugs about the data update. Perhaps I'd better consolidate them all into one and make a PR (the v8/Chromium patch is https://cs.chromium.org/chromium/src/third_party/icu/patches/locid_map.patch).

Intl.Locale is also fixed (as it should be):

$ d8 --harmony-locale
d8> new Intl.Locale("en-gb-oed").toString()
"en-GB-oxendict"

As for the structural validity, my PR for https://unicode-org.atlassian.net/browse/ICU-20098 was just merged to the ICU tot. When I replaced custom regular expressions for BCP 47 structural validity in v8 with ICU uloc_forLanguageTag/uloc_toLanguageTag with the above PR applied locally, at least there's no regression and one failing test begins to pass.
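
From script, the outcome of a structural validity check is observable as a RangeError from the Intl entry points. The sketch below shows that observable behavior only; it does not call uloc_forLanguageTag or uloc_toLanguageTag, and the helper name `isStructurallyValid` is hypothetical.

```javascript
// Hypothetical helper: treat a tag as structurally valid if the engine's
// Intl.getCanonicalLocales accepts it. Which tags are accepted depends on the
// engine and the spec edition it implements.
function isStructurallyValid(tag) {
    try {
        Intl.getCanonicalLocales(tag);
        return true;
    } catch (e) {
        if (e instanceof RangeError) {
            return false;
        }
        throw e; // anything else is unexpected, so do not swallow it
    }
}

for (const tag of ["en-US", "de-Latn-DE-u-nu-latn", "en_US", "en-"]) {
    console.log(tag, isStructurallyValid(tag));
}
```

A check like this is handy for confirming that swapping the validation backend causes no regressions across a list of known-good and known-bad tags.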

Ah, that's great! The Chromium patch + ICU PR covers all of the cases I tested that ICU didn't handle correctly except for three (plus the -yes issue):

The UTS35 Likely Subtags algorithm notes that sh -> sr_Latn and mo -> ro_MD, but in my testing ICU didn't handle that. The UTS page is unclear about where that data comes from -- it's not in the subtag registry, and I don't know enough about the CLDR layout to find it there, either.

und-Arab-AF maximizes to ar-Arab-AF, which seems wrong -- Wikipedia says the two primary languages of Afghanistan are Dari (prs) and Pashto (ps), not Arabic (ar). I am not sure if the Arabic script bit of the tag is causing the language to be Arabic, but the above Likely Subtags algorithm says und-Arab-AF should be maximized to fa-Arab-AF (fa == Persian), which also seems more reasonable since at least Dari is a Persian dialect.

The UTS35 page also mentions that when the script is Zzzz or the region is ZZ, it should be removed from the tag entirely, but ICU seems to accept it.

I can file ICU issues for any/all three if they're actually incorrect behavior.
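
The three cases above can be collected into one probe so the behavior can be compared across engines and ICU versions. This is a sketch under assumptions: the helper name `probeTags` and the en-Zzzz / en-ZZ sample tags are illustrative choices, and the output depends on the engine's ICU/CLDR data, so none is listed here.

```javascript
// Hypothetical helper: for each tag, record the canonical form and the result
// of maximize() as reported by the engine's Intl.Locale.
function probeTags(tags) {
    return tags.map((tag) => {
        const locale = new Intl.Locale(tag);
        return {
            tag,
            canonical: locale.toString(),
            maximized: locale.maximize().toString(),
        };
    });
}

// sh and mo: deprecated codes mentioned in UTS35's Likely Subtags section.
// und-Arab-AF: the UTS35 example expects fa-Arab-AF.
// en-Zzzz and en-ZZ: sample tags carrying the "unknown" script and region codes.
const rows = probeTags(["sh", "mo", "und-Arab-AF", "en-Zzzz", "en-ZZ"]);
for (const row of rows) {
    console.log(`${row.tag} -> ${row.canonical} | maximized: ${row.maximized}`);
}
```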

Ms2ger · 2018-09-12T12:45:45Z

I see you've written some tests here; can you submit them upstream or put a license on them that'll allow me to?

jackhorton · 2018-09-12T18:18:55Z

Does the MIT license that we use preclude that? I asked some people internally and it wasn't clear, but I will keep asking around. Additionally, I only looked at the test262 cases that failed during development, but from what I can tell everything in these tests that isn't in test262 came from my interpretation of UTS35's Likely Subtags section. I personally would want to resolve some of the questions and commented out bits of the tests I wrote before porting them to test262.

dilijev · 2018-09-14T05:46:21Z

Will take a look

dilijev · 2018-09-17T23:01:01Z

test/Intl/Locale.js

+            // the UTS35 example says the maximized version should be fa-Arab-AF?
+            test("und-Arab-AF", "und-Arab-AF", "und-Arab-AF", "ar-Arab-AF");
+
+            // Chakra performs incorrect canonicalization, so the following cases don't pass.


Re "Chakra": make it clearer that this is actually because of ICU's logic.

dilijev · 2018-09-17T23:14:27Z

lib/Parser/rterrors.h

@@ -257,6 +257,7 @@ RT_ERROR_MSG(JSERR_MissingCurrencyCode, 5123, "", "Currency code was not specifi
 RT_ERROR_MSG(JSERR_InvalidDate, 5124, "", "Invalid Date", kjstRangeError, 0)
 RT_ERROR_MSG(JSERR_IntlNotAvailable, 5125, "", "Intl is not available.", kjstTypeError, 0)
 RT_ERROR_MSG(JSERR_IntlNotImplemented, 5126, "", "Intl operation '%s' is not implemented.", kjstTypeError, 0)
+RT_ERROR_MSG(JSERR_InvalidPrivateOrGrandfatheredTag, 5127, "", "The arguments provided to Intl.Locale form an invalid privateuse or grandfathered language tag", kjstRangeError, 0)


Re "invalid privateuse or grandfathered language tag":

nit: This language confuses me, but it might make sense with the algo, or we can fix the message later -- so meh

Any suggestions? From the spec, I would suggest it's fairly clear -- there are specific cases where the spec says "if tag matches the grandfathered production or the privateuse production, throw a RangeError".

dilijev · 2018-09-17T23:21:04Z

lib/Runtime/Library/InJavascript/Intl.js

@@ -122,6 +122,16 @@
        concat(array, ...els) { return callInstanceFunc(platform.builtInJavascriptArrayEntryConcat, array, ...els); },
        filter(array, func) { return callInstanceFunc(platform.builtInJavascriptArrayEntryFilter, array, func); },
        unique(array) { return _.filter(array, (v, i) => _.arrayIndexOf(array, v) === i); },
+        any(array, func) {


@jdalton FYI lodash method reimplementation :P
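
The diff only shows the first line of the new helper. A plain-JavaScript stand-in with the same `any(array, func)` shape might look like the sketch below. This is an assumption about its behavior based on the name; judging by the neighbouring helpers, the real implementation in Intl.js presumably goes through the platform built-ins rather than a bare loop.

```javascript
// Hypothetical stand-in for the `any(array, func)` helper: returns true as
// soon as func returns a truthy value for some element, otherwise false.
function any(array, func) {
    for (let i = 0; i < array.length; i++) {
        if (func(array[i], i, array)) {
            return true;
        }
    }
    return false;
}

console.log(any(["en", "u", "ca"], (part) => part.length === 1));
console.log(any([], () => true));
```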

dilijev · 2018-09-17T23:27:26Z

lib/Runtime/Library/InJavascript/Intl.js

+    const scriptREString = `\\b(?:${ALPHA}{4})\\b`;                                     // script        = 4ALPHA
+    const extlangREString = `\\b(?:${ALPHA}{3}\\b(?:-${ALPHA}{3}){0,2})\\b`;            // extlang       = 3ALPHA *2("-" 3ALPHA)
+
+    const languageREString = '\\b(?:'       +                                         // language      =


nit indentation
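
For readers unfamiliar with the pattern in this diff, the regex source strings mirror the RFC 5646 ABNF productions quoted in the trailing comments. A minimal sketch of how such strings can be exercised follows; the value of `ALPHA` and the helper `matchesWhole` are assumptions for illustration, and only the two template strings are taken from the diff.

```javascript
// Assumption: ALPHA is a character class for ASCII letters.
const ALPHA = "[a-zA-Z]";

// Copied from the diff above.
const scriptREString = `\\b(?:${ALPHA}{4})\\b`;                          // script  = 4ALPHA
const extlangREString = `\\b(?:${ALPHA}{3}\\b(?:-${ALPHA}{3}){0,2})\\b`; // extlang = 3ALPHA *2("-" 3ALPHA)

// Hypothetical helper: anchor a source string so it must match the whole input.
function matchesWhole(source, subtag) {
    return new RegExp(`^${source}$`).test(subtag);
}

console.log(matchesWhole(scriptREString, "Latn"));
console.log(matchesWhole(extlangREString, "abc-def"));
```

Keeping the pieces as strings lets them be composed into larger patterns (such as the language production) before a single RegExp is built.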

jackhorton added the Intl-ICU label Sep 7, 2018

jackhorton requested review from dilijev and jefgen September 7, 2018 23:55

jackhorton force-pushed the intl/locale branch 2 times, most recently from f4d5cec to 6a69de3 Compare September 8, 2018 08:31

jackhorton added the Bytecode-Update This PR updates bytecode and will cause merge conflicts with other PRs with this label label Sep 8, 2018

jackhorton added 2 commits September 9, 2018 14:00

Implement Intl.Locale

33b8cad

Update bytecode

5c3e427

jackhorton force-pushed the intl/locale branch from 6a69de3 to 5c3e427 Compare September 9, 2018 21:00

sethbrenith reviewed Sep 10, 2018

sethbrenith approved these changes Sep 10, 2018

dilijev reviewed Sep 17, 2018

dilijev approved these changes Sep 18, 2018

jahzielv mentioned this pull request May 2, 2019

Intl.Locale updates mdn/browser-compat-data#4052

Merged

19 tasks

jackhorton commented Sep 7, 2018 •

edited by wyrichte

jackhorton commented Sep 10, 2018

littledan commented Sep 10, 2018

jackhorton commented Sep 10, 2018

sethbrenith Sep 10, 2018

jackhorton Sep 10, 2018

jungshik Sep 11, 2018 •

edited

jungshik Sep 11, 2018

jackhorton Sep 11, 2018

jungshik Sep 11, 2018

jungshik Sep 11, 2018

jungshik Sep 11, 2018

jackhorton Sep 11, 2018

sethbrenith left a comment

Ms2ger commented Sep 12, 2018

jackhorton commented Sep 12, 2018

dilijev commented Sep 14, 2018

dilijev Sep 17, 2018

dilijev Sep 17, 2018

jackhorton Sep 18, 2018

dilijev Sep 17, 2018

dilijev Sep 17, 2018

dilijev left a comment
