Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement Stage 3 proposal Intl.Locale #5675

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

jackhorton
Copy link
Contributor

@jackhorton jackhorton commented Sep 7, 2018

There are still a few things to do, but I wanted to get this out the door before heading home for the day:

  • Combine langtagToParts and parseLangtag
  • Implement minimize/maximize
  • Get rid of my silly inline bubblesort in favor of JavascriptArray::EntrySort
  • Add more tests
  • Update CanonicalizeLocaleList to accept Intl.Locale object
  • (Discuss) put this behind a feature flag

This currently passes 160/184 Intl.Locale spec tests. The 24 tests that do not pass are all cases where we shell out to ICU which does non-Intl-spec-compliant behavior (mostly related to the differences between platform.normalizeLanguageTag and CanonicalizeLanguageTag

/cc @littledan

@jackhorton jackhorton force-pushed the intl/locale branch 2 times, most recently from f4d5cec to 6a69de3 Compare September 8, 2018 08:31
@jackhorton jackhorton added the Bytecode-Update This PR updates bytecode and will cause merge conflicts with other PRs with this label label Sep 8, 2018
@jackhorton
Copy link
Contributor Author

Interesting, looks like ICU 55 doesnt minimize und-Hant to und-Hant, while at least 57 (i believe whats installed on the Macs) and 61 locally do.

@littledan
Copy link

Maybe we should share some notes across implementations about where it is or isn't possible to reuse ICU's locale processing; cc @srl295 @jungshik @gsathya

@jackhorton
Copy link
Contributor Author

Also see comments of #5674 with @jefgen. ICU canonicalization code does a lot that Intl doesn't really care about, but its handling of UTS35/RFC5646-style canonicalization is a bit all over the place. I believe spidermonkey implements their own locale processing, and I am not sure what jsc does. I looked into doing the locale processing manually and it wouldn't be too difficult (at least, not on top of all of the processing that we already do in the abstract operations), so I don't know if its worth getting the code into ICU if the only people worrying about it are Intl implementers.

const LANG_TAG_RE = new RegExp(`^${LANG_TAG}$`, 'i'); // [1] language; [2] script; [3] region; [4] variants; [5] extensions;
let unicodeExtensionsEnd;
for (unicodeExtensionsEnd = unicodeExtensionStart + 1; unicodeExtensionsEnd < extensionParts.length && extensionParts[unicodeExtensionsEnd].length > 1; unicodeExtensionsEnd++) {
// do nothing, we just want k to equal the index of the next element whose length is 1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what are you referring to with k? I don't see any variables with that name (which is nice, by the way).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oops, renamed the variable without updating the comment.

Copy link

@jungshik jungshik Sep 11, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should share some notes across implementations about where it is or isn't possible to reuse ICU's locale processing; c

v8 is about to make a switch to ICU API from regexes to validate input tags. There are two groups of issues:

  1. ICU's handling of grandfathered tags and deprecated region/language code is outdated. This is a data issue (the latest version of the IANA language tag registry should be used) . I've filed a series of bugs against the ICU. I'm assigned to them and I do have patches (that have been applied to v8's copy of ICU since this spring)

  2. I also discovered a couple of bugs and put up a PR against the ICU (it's approved, but not yet merged).

With the above two issues resolved, v8 will make a switch to ICU with a couple of extra pre/post-processing.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I spoke too early. What I wrote above is mainly structure-validation and canonicalization. Min/max also work fine.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Last I had checked V8 (not sure which version) it was still converting unicode attributes to unicode keywords with value = "yes", and converting en-GB-oed to en-GB-x-oed rather than en-GB-oxendict. Chakra does this as well as a direct result of allowing ICU to do the canonicalization, so I assumed V8 used ICU in the same way that we did. In other words, I thought both V8 and Chakra suffered primarily from data issues because of ICU, not structure issues.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for reminding me of 'yes' issue. That's still outstanding. It's
https://unicode-org.atlassian.net/browse/ICU-13730 .

In case of en-GB-oed, it's fixed at least in geCanonicalLocales() along with a number of cases arising from data. ICU upstream hasn't been fixed, though. [1]

var loc1=Intl.getCanonicalLocales("en-GB-oed")
undefined
loc1
["en-GB-oxendict"]

Let me check Intl.Locale.

[1] https://unicode-org.atlassian.net/browse/ICU-13721
13719, 13720, 13723, 13726 are other bugs about the date update. Perhaps, I'd better consolidate them all into one and make a PR ( the v8/Chromium patch is
https://cs.chromium.org/chromium/src/third_party/icu/patches/locid_map.patch ).

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Intl.locale is also fixed (as it should be):

$ d8 --harmony-locale
d8> new Intl.Locale("en-gb-oed").toString()
"en-GB-oxendict"

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As for the structural validity, my PR for https://unicode-org.atlassian.net/browse/ICU-20098 was just merged to the ICU tot. When I replaced custom regular expressions for BCP 47 structural validity in v8 with ICU uloc_forLanguageTag/uloc_toLanguageTag with the above PR applied locally, at least there's no regression and one failing test begins to pass.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, that's great! The Chromium patch + ICU PR covers all of the cases I tested that ICU didn't handle correctly except for three (plus the -yes issue):

  1. The UTS35 Likely Subtags algorithm notes that sh -> sr_Latn and mo -> ro_MD, but in my testing ICU didn't handle that. The UTS page is unclear about where that data comes from -- its not in the subtag registry, and I don't know enough about the CLDR layout to find it there, either.
  2. und-Arab-AF maximizes to ar-Arab-AF, which seems wrong -- Wikipedia says the two primary languages of Afghanistan are Dari (prs) and Pashto (ps), not Arabic (ar). I am not sure if the Arabic script bit of the tag is causing the language to be Arabic, but the above Likely Subtags algorithm says und-Arab-AF should be maximized to fa-Arab-AF (fa == Persian), which also seems more reasonable since at least Dari is a Persian dialect.
  3. The UTS35 page also mentions that when the script is Zzzz or the region is ZZ, it should be removed from the tag entirely, but ICU seems to accept it.

I can file ICU issues for any/all three if theyre actually incorrect behavior.

Copy link
Contributor

@sethbrenith sethbrenith left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

@Ms2ger
Copy link

Ms2ger commented Sep 12, 2018

I see you've written some tests here; can you submit them upstream or put a license on them that'll allow me to?

@jackhorton
Copy link
Contributor Author

Does the MIT license that we use preclude that? I asked some people internally and it wasn't clear, but I will keep asking around. Additionally, I only looked at the test262 cases that failed during development, but from what I can tell everything in these tests that isn't in test262 came from my interpretation of UTS35's Likely Subtags section. I personally would want to resolve some of the questions and commented out bits of the tests I wrote before porting them to test262.

@dilijev
Copy link
Contributor

dilijev commented Sep 14, 2018

Will take a look

// the UTS35 example says the maximized version should be fa-Arab-AF?
test("und-Arab-AF", "und-Arab-AF", "und-Arab-AF", "ar-Arab-AF");

// Chakra performs incorrect canonicalization, so the following cases don't pass.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Chakra [](start = 15, length = 6)

make it more clear that this is actually because of ICU's logic

@@ -257,6 +257,7 @@ RT_ERROR_MSG(JSERR_MissingCurrencyCode, 5123, "", "Currency code was not specifi
RT_ERROR_MSG(JSERR_InvalidDate, 5124, "", "Invalid Date", kjstRangeError, 0)
RT_ERROR_MSG(JSERR_IntlNotAvailable, 5125, "", "Intl is not available.", kjstTypeError, 0)
RT_ERROR_MSG(JSERR_IntlNotImplemented, 5126, "", "Intl operation '%s' is not implemented.", kjstTypeError, 0)
RT_ERROR_MSG(JSERR_InvalidPrivateOrGrandfatheredTag, 5127, "", "The arguments provided to Intl.Locale form an invalid privateuse or grandfathered language tag", kjstRangeError, 0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

invalid privateuse or grandfathered language tag [](start = 110, length = 48)

nit: This language confuses me, but it might make sense with the algo, or we can fix the message later -- so meh

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any suggestion? From the spec, I would suggest its fairly clear -- there are specific cases where the spec says "if tag matches the grandfathered production or the privateuse production, throw a RangeError"

@@ -122,6 +122,16 @@
concat(array, ...els) { return callInstanceFunc(platform.builtInJavascriptArrayEntryConcat, array, ...els); },
filter(array, func) { return callInstanceFunc(platform.builtInJavascriptArrayEntryFilter, array, func); },
unique(array) { return _.filter(array, (v, i) => _.arrayIndexOf(array, v) === i); },
any(array, func) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any(array, func) { [](start = 8, length = 18)

@jdalton FYI lodash method reimplementation :P

const scriptREString = `\\b(?:${ALPHA}{4})\\b`; // script = 4ALPHA
const extlangREString = `\\b(?:${ALPHA}{3}\\b(?:-${ALPHA}{3}){0,2})\\b`; // extlang = 3ALPHA *2("-" 3ALPHA)

const languageREString = '\\b(?:' + // language =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[](start = 84, length = 2)

nit indentation

Copy link
Contributor

@dilijev dilijev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bytecode-Update This PR updates bytecode and will cause merge conflicts with other PRs with this label Intl-ICU
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants