Fix Tokenizer.prototype.tokenizeFrom string length after normalizing #1628

brandon-gong · 2021-11-07T03:17:55Z

This pull request addresses #1627, in which I was getting strange bugs on a particular character.

Currently tokenizeFrom normalizes the given source string to Unicode Normalization Form C, but stores the original string's length in a separate variable, this.len. However, calling .normalize() on a string can change its length, so this.len needs to reflect the length of the normalized string to avoid lexing errors.
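A minimal sketch of the mismatch, for illustration only. The `findClosingQuote` helper is hypothetical and is not Pyret's tokenizer code; it only stands in for any scan that walks the normalized string while bounded by a separately stored length. U+0958 is used because it is excluded from composition, so NFC rewrites it as two code units.

```javascript
// U+0958 (DEVANAGARI LETTER QA) is a composition exclusion, so NFC
// rewrites it as U+0915 U+093C: one UTF-16 code unit becomes two.
const original = '"\u0958"';
const normalized = original.normalize("NFC");

console.log(original.length);   // 3
console.log(normalized.length); // 4

// Hypothetical scanner (an assumption, not the real lexer): walks `str`
// from index 1 up to `len`, looking for the closing double quote.
function findClosingQuote(str, len) {
  for (let i = 1; i < len; i++) {
    if (str[i] === '"') return i;
  }
  return -1; // ran out of input before the string was closed
}

// Bounded by the original length, the scan stops one unit too early.
const staleLen = findClosingQuote(normalized, original.length);
// Bounded by the normalized length, the closing quote is found.
const freshLen = findClosingQuote(normalized, normalized.length);

console.log(staleLen); // -1
console.log(freshLen); // 3
```

The length can also shrink: `"e\u0301".normalize("NFC")` composes into the single code unit `"\u00e9"`, so a stored original length would then overshoot the normalized string.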

This issue turns out to be pretty prevalent, and I've found a slew of characters that cause the same error in Pyret right now. Below is a small sample of them that I found with a small script, but there are a lot more (even common characters with accent marks, like é, may have this issue).

'̈́क़ख़ग़ज़ड़ढ़फ़य़ড়ঢ়য়ਲ਼ਸ਼ਖ਼ਗ਼ਜ਼ਫ਼ଡ଼ଢ଼གྷཌྷདྷབྷཛྷཀྵཱཱིུྲྀླཱྀྀྒྷྜྷྡྷྦྷྫྷྐྵ⫝̸𤋮𢡊𢡄𣏕𥉉𥳐𧻓יִײַשׁשׂשּׁשּׂאַאָאּבּגּדּהּוּזּטּיּךּכּלּמּנּסּףּפּצּקּרּשּתּוֹבֿכֿפֿ𝅗𝅥𝅘𝅥𝅘𝅥𝅮𝅘𝅥𝅯𝅘𝅥𝅰𝅘𝅥𝅱𝅘𝅥𝅲𝆹𝅥𝆺𝅥𝆹𝅥𝅮𝆺𝅥𝅮𝆹𝅥𝅯𝆺𝅥𝅯 क़ ख़ ग़ ज़ ड़ ढ़ फ़ य़ ড় ঢ় য় ਲ਼ ਸ਼ ਖ਼ ਗ਼ ਜ਼ ਫ਼ ଡ଼ ଢ଼ གྷ ཌྷ དྷ བྷ ཛྷ ཀྵ ཱི ཱུ ྲྀ ླྀ ཱྀ ྒྷ ྜྷ ྡྷ ྦྷ ྫྷ ྐྵ ⫝̸ 𤋮 𢡊 𢡄 𣏕 𥉉 𥳐 𧻓 יִ ײַ שׁ שׂ שּׁ שּׂ אַ אָ אּ בּ גּ דּ הּ וּ זּ טּ יּ ךּ כּ לּ מּ נּ סּ ףּ פּ צּ קּ רּ שּ תּ וֹ בֿ כֿ פֿ 𝅗𝅥 𝅘𝅥 𝅘𝅥𝅮 𝅘𝅥𝅯 𝅘𝅥𝅰 𝅘𝅥𝅱 𝅘𝅥𝅲 𝆹𝅥 𝆺𝅥 𝆹𝅥𝅮 𝆺𝅥𝅮 𝆹𝅥𝅯 𝆺𝅥𝅯

In addition to fixing this issue by simply having this.len reflect the normalized string's length, I've also written two tests in the areas that I've found this to be an issue, namely block comments and string literals. If they're misplaced / unnecessary / not enough I can certainly change them!

blerner · 2021-11-07T12:53:55Z

Thanks! As you saw in the comment on the line you changed -- the length property here isn't unicode-aware, for sure. I'm pretty sure I used the original length of the string because the lexer iterates character-by-character, and needs to supply source locations to tokens in such a way that sourcestring.substring(start, end) correctly extracts the entirety of the token, and at least at the time I was writing the lexer, the normalized length was wrong.
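One way to see the tension around source locations is that offsets are only valid for the string they were measured against. The sketch below is an illustration under assumptions: the sample line and the use of `indexOf` to locate the token are made up for the example and are not how the lexer finds tokens.

```javascript
// A made-up source line containing a string literal with U+0958.
const original = 'x = "\u0958" # c';
const normalized = original.normalize("NFC");

// Token boundaries measured against the normalized text.
const start = normalized.indexOf('"');
const end = normalized.indexOf('"', start + 1) + 1;

// Against the normalized text, the offsets extract the whole token.
const fromNormalized = normalized.substring(start, end);
// Against the original text, the same offsets run one unit past the
// closing quote and pick up the following space.
const fromOriginal = original.substring(start, end);

console.log(JSON.stringify(fromNormalized)); // "\"क़\"" (quote, U+0915, U+093C, quote)
console.log(JSON.stringify(fromOriginal));   // "\"क़\" " (quote, U+0958, quote, space)
```

Whichever string the locations refer to, every consumer of those locations has to slice that same string for substring(start, end) to return the token.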

This is a particularly fiddly property to get right (see https://hsivonen.fi/string-length/ for an amusing example), and I mostly just punted on this when originally writing the lexer. Pyret gets this example weird, since CodeMirror doesn't handle the characters consistently with how they're output, either:

I genuinely don't know what the best thing to do here is.

brandon-gong · 2021-11-07T16:29:29Z

Thanks for the article, I thought it was an interesting read!

I don't understand the consequences of normalizing well enough to argue seriously one way or the other. From my newbie point of view, though, the current behavior (these characters causing errors in code) can cause unnecessary confusion, especially for students less comfortable with Pyret: they're far more likely to believe they made a mistake somewhere in their code than to suspect some magic character sitting in a string or comment. (I'm speaking from personal experience 😅).

Also, I'm not sure how/why it changed since you wrote the lexer, but it looks like using the original length causes substring to not extract the whole token now, as that normalized-length-2 character causes Pyret to not parse string-length all the way? This screenshot is from code.pyret.org this morning:

Separately, I'm also curious -- what happens if we don't deal with normalizing the string at all? Everything still passes with make test on my computer, but I'm not sure if problems come up with CodeMirror or the browser doing anything to text input. This has the added benefit of allowing for string-length("𢡊") == 1, which is less of a "lie" than quietly normalizing it and saying string-length("𢡊") == 2. And maybe we avoid some Unicode quirks as well.

Anyway, I'm definitely out of my area of expertise here. Thanks so much for taking the time to review my pull request!

brandon-gong added 2 commits November 6, 2021 22:51

Fix tokenizeFrom string length changes on normalize + add tests

46231e5

fix mismatched whitespace in tests

a0d327f
