Disable P8SCII unescaping to fix mangling of emoji characters #106

simonwulf · 2022-08-10T00:04:59Z

TL;DR

This PR addresses an issue where any emoji in the input Lua script is replaced by a garbled sequence of characters. The proposed solution is to remove picotool's current handling of P8SCII escape sequences, which does not seem to function as intended.

The Details

I encountered an issue where any use of the 🅾️ emoji in my Lua script would be replaced by "ユか✽ゆヤま◆" after building a .p8 cart with picotool. The issue seems to stem from P8SCII being treated as an encoding in itself. In practice, this treatment boils down to two steps:

1. When parsing a string literal, the lexer replaces any numerical P8SCII escape sequence it encounters with a byte of the specified value, seemingly hoping that this results in a "pure" P8SCII string.
2. Later, the P8 formatter calls lua.p8scii_to_unicode, which seems meant to convert all P8SCII characters in the passed string to their utf-8 counterparts. The formatter assumes, at this point, that the Lua script is P8SCII encoded. As a side note, this substitution routine runs on the entire script and not just on the string tokens that had their escape sequences converted by the lexer in step 1.
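
To make the lexer step concrete, here is a minimal Python sketch of what replacing numeric escapes with raw bytes does to a utf-8 string. The function name `unescape_numeric` and the sample string are hypothetical and are not taken from picotool's code; the sketch only assumes Lua-style decimal escapes of the form `\ddd`.

```python
import re

def unescape_numeric(body: bytes) -> bytes:
    """Hypothetical stand-in for the lexer step: replace \\ddd escapes with raw bytes.

    Simplified: it does not handle an escaped backslash in front of the digits.
    """
    def to_byte(match):
        value = int(match.group(1))
        # Values above 255 do not fit in one byte; leave those escapes as written.
        return bytes([value]) if value <= 255 else match.group(0)
    return re.sub(rb"\\(\d{1,3})", to_byte, body)

# A string literal body as it sits in a utf-8 source file: one escape, one emoji.
# "\U0001F17E\uFE0F" is the O-button emoji followed by its variation selector.
source = "press \\143 or \U0001F17E\uFE0F".encode("utf-8")
mixed = unescape_numeric(source)

# The lone 0x8F byte is not valid utf-8 on its own, so the result is neither
# clean utf-8 nor clean P8SCII.
try:
    mixed.decode("utf-8")
    is_valid_utf8 = True
except UnicodeDecodeError:
    is_valid_utf8 = False

print(mixed)
print(is_valid_utf8)
```

The output is a byte string in which the escape has become a single raw byte while the emoji keeps its multi-byte utf-8 form, which is the mixed state the rest of this description refers to.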

Both of the above steps have inherent issues:

1. Replacing P8SCII escape sequences with their corresponding byte values does not turn the whole input string into a P8SCII encoded string, because the majority of the string retains its original encoding (utf-8). What we end up with instead is a mix of utf-8 and P8SCII.
2. The assumption that the passed string is P8SCII encoded is incorrect. It is, in fact, mostly utf-8 with a few dashes of P8SCII encoded characters as a result of step 1. When this conversion routine encounters the seven-byte utf-8 sequence for 🅾️, it replaces each of the seven bytes with a new utf-8 character, resulting in "ユか✽ゆヤま◆".
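
The per-byte substitution described above can be reproduced with a short Python sketch. The table below is not picotool's real table: it contains only the seven entries implied by the garbled output quoted in this PR, and `naive_p8scii_to_unicode` is a hypothetical stand-in for the conversion routine.

```python
# Partial byte -> character table, inferred only from the garbled output
# quoted in this PR (an assumption, not picotool's actual data).
PARTIAL_P8SCII = {
    0xF0: "ユ", 0x9F: "か", 0x85: "✽", 0xBE: "ゆ",
    0xEF: "ヤ", 0xB8: "ま", 0x8F: "◆",
}

def naive_p8scii_to_unicode(data: bytes) -> str:
    """Hypothetical converter that treats every byte as one P8SCII character."""
    return "".join(PARTIAL_P8SCII.get(b, chr(b)) for b in data)

# The O-button emoji is U+1F17E followed by the variation selector U+FE0F.
emoji = "\U0001F17E\uFE0F"
raw = emoji.encode("utf-8")   # 4 bytes for U+1F17E plus 3 bytes for U+FE0F

mangled = naive_p8scii_to_unicode(raw)
print(len(raw), mangled)
```

Each of the seven utf-8 bytes is looked up on its own, so one emoji turns into seven unrelated characters.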

Future Improvements

I would argue against treating P8SCII as a text encoding, instead merely treating it as a collection of escape sequences that hold a special meaning when passed to Pico-8's print function and passing them through unchanged. If pre-interpreting these escape sequences is still a desired feature, I'd suggest it be done in one go when parsing or writing the string tokens instead of passing through an intermediate format.
There are probably additional code paths or data structures that are made dead by this change and could be removed.
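
As a rough illustration of the one-pass alternative suggested above, the following Python sketch converts escapes inside a single, already decoded string token and never produces an intermediate byte mix. The names `P8SCII_SUBSET` and `unescape_token` are hypothetical, and the two table entries are assumptions inferred from the garbled output quoted in this PR.

```python
import re

# Assumed subset of the P8SCII table (inferred from this PR's example output).
P8SCII_SUBSET = {0x85: "✽", 0x8F: "◆"}

def unescape_token(text: str) -> str:
    """Replace \\ddd escapes inside one decoded string token, in a single pass."""
    def repl(match):
        value = int(match.group(1))
        # Codes missing from the table pass through unchanged as escape text.
        return P8SCII_SUBSET.get(value, match.group(0))
    return re.sub(r"\\(\d{1,3})", repl, text)

token = "hit \\143 or \U0001F17E\uFE0F"   # escape plus the O-button emoji
print(unescape_token(token))
```

Because the function works on decoded text rather than raw bytes, the emoji is left untouched and unknown escapes survive as written.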


simonwulf added 3 commits July 19, 2022 23:47

Disable all unescaping of P8SCII escaped characters

8a8c745

...in order to preserve utf-8 emoji characters

Add test to ensure that emojis are preserved through p8 formatting

50f8744

Remove test connected to removed unescaping functionality

5f1576e

simonwulf mentioned this pull request Aug 10, 2022

Preserving of P8SCII Control Codes #89

Open
