
Save/load local file from code editor corrupts non-ASCII characters #283

ticalc-travis opened this issue Sep 3, 2023 · 6 comments

@ticalc-travis (Contributor) commented Sep 3, 2023:

I discovered today that non-ASCII UTF-8 characters do not survive round trips through the IDE's save and load to file functions, instead getting silently corrupted in a non-reversible manner. Although the Espruino may not have full Unicode support, I think Unicode code comments at least ought to be considered legitimate, and the IDE will mangle those irreparably if they include characters in other scripts/languages.

Specifically, in Firefox, both save and load will corrupt the characters. In Chrome/Chromium, UTF-8 files appear to load correctly but not save without corruption.

For Chrome/Chromium, I believe ticalc-travis@af96174 would fix the save issue (disclaimer: only tested briefly). The problem here is that it's trying to convert the text to a Blob by instantiating a Uint8Array and trying to write each byte individually in a loop using String.charCodeAt, which for Unicode strings may return values above 255. Trying to store those as Uint8 will naturally mangle them. Passing the string itself directly to the Blob constructor instead should preserve the encoding.
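To illustrate the difference (a simplified sketch, not the IDE's actual code): clipping charCodeAt values into a Uint8Array loses information for any character above U+00FF, whereas handing the string straight to the Blob constructor lets the platform encode it as UTF-8.

```javascript
// Simplified sketch of the two approaches (not the IDE's actual code).

// Broken: charCodeAt can return values > 255, which are silently
// truncated when stored into a Uint8Array.
function toBytesBroken(str) {
  const bytes = new Uint8Array(str.length);
  for (let i = 0; i < str.length; i++) {
    bytes[i] = str.charCodeAt(i); // "€" is 8364; stored as 8364 & 0xFF = 172
  }
  return bytes;
}

// Fixed: pass the string directly; the Blob constructor encodes it as UTF-8.
function toBlobFixed(str) {
  return new Blob([str], { type: "text/plain" });
}
```

"€" is one JavaScript character but three UTF-8 bytes; the broken path produces a single corrupted byte, while the Blob correctly reports a size of 3.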

For Firefox/non-Chrome, there is a separate code path for save/load which, as far as I can tell, calls into code in the EspruinoTools repo, specifically fileOpenDialog and fileSaveDialog. This code has the same issue described above, but there's a comment in fileOpenDialog that suggests it's specifically trying to avoid UTF-8 interpretation. I'm not sure why, as I can't imagine how arbitrarily clipping Unicode points to a 0…255 range would be useful. Am I missing some context? At any rate, since that's in a separate repository, I'm not sure how to live-test changes to this code with the IDE.

@gfwilliams (Member) commented:

> Specifically, in Firefox,

How do you communicate with the device using this? It doesn't have Web Bluetooth... Are you using the Remote IDE or something like that?

The issue we have is that there really are binary files on the Bangle.js too. An image, for instance, is binary - it may contain byte sequences that look like UTF-8 multi-byte sequences, and if it was loaded and interpreted as UTF-8 it would be corrupted.

What we actually want to do is avoid UTF8 completely. We want to take the binary data, byte for byte, straight out of the device (which should be received as base64 to avoid any confusion) and then to write that, byte for byte, into a file.

I really don't think that starting to interpret files as UTF-8 is the right solution here, because it's entirely possible for something like an image to contain an invalid UTF-8 byte sequence, in which case the download of the file would actually fail.
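The byte-for-byte approach described above could be sketched like this (an assumed shape, not EspruinoTools' actual code): base64 from the device decodes to raw bytes, and wrapping those bytes (rather than a string) in a Blob avoids any charset interpretation, so invalid UTF-8 can't break anything.

```javascript
// Sketch of a binary-safe path (assumed shape, not EspruinoTools' code):
// base64 in, raw bytes out, no text decoding at any point.
function base64ToBytes(b64) {
  const bin = atob(b64); // each char of the result is one byte value 0..255
  const bytes = new Uint8Array(bin.length);
  for (let i = 0; i < bin.length; i++) bytes[i] = bin.charCodeAt(i);
  return bytes;
}

// Wrapping a Uint8Array (not a string) in a Blob stores the bytes as-is,
// so an invalid UTF-8 sequence can't corrupt or fail the download.
function bytesToBlob(bytes) {
  return new Blob([bytes], { type: "application/octet-stream" });
}
```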

The other thing to bear in mind is that Bangle.js supports UTF8, but not all other boards do at the moment - there isn't enough flash memory for them to do so. So when uploading a JS file, if the IDE is going to start letting UTF8 through it might be that it'll have to try and figure out when it can and can't do it (maybe process.env.UTF8=true needs to be added to the Bangles).

@ticalc-travis (Contributor, Author) commented Sep 4, 2023:

This is just about saving and loading source files between the code editor in the IDE to files on the local computer running the IDE, not sending and receiving them from the device (which I actually didn't think to test). Sorry if I didn't make that clear. (My mention of the Espruino might have been a bit misplaced; I just wasn't sure if any of the code I was looking at could have somehow been reused for transfers to/from the device.)

The point about preserving binary byte sequences makes sense, but that's not what's happening when saving a source file written in the IDE. The code takes a multi-byte character and tries to rewrite it into a single byte, which doesn't work and actually corrupts the byte sequence rather than preserving it. As a result, if you load a source file with some UTF-8 characters from disk into the IDE and then save it back, the file on disk no longer matches the original byte for byte. The code seems to assume charCodeAt always returns individual byte values, but that isn't guaranteed for non-ASCII strings: charCodeAt actually returns UTF-16 code units rather than bytes, so a multi-byte character can come back as a single value >255, which the routine then tries to shove into a single slot of the Uint8 array anyway.
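A quick round-trip check makes the distinction concrete (plain standard web APIs, nothing IDE-specific):

```javascript
// charCodeAt returns UTF-16 code units, not bytes: "á" is a single
// code unit (225) but encodes to two UTF-8 bytes (0xC3 0xA1).
const original = "// comentário";                 // non-ASCII comment text
const utf8 = new TextEncoder().encode(original);  // real UTF-8 bytes
const roundTripped = new TextDecoder().decode(utf8);
// roundTripped === original: a proper encode/decode pair is lossless,
// unlike clipping code units into single bytes.
```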

@ticalc-travis ticalc-travis changed the title Save/load file corrupts non-ASCII characters Save/load local file from code editor corrupts non-ASCII characters Sep 4, 2023
@gfwilliams (Member) commented:

Ahh, gotcha. Sorry for the confusion.

Then yes, absolutely - we should be able to load a file with non-ASCII chars into CodeMirror, save it, and have it be identical (apart from formatting like spaces on the ends of lines, etc). I think the fileXDialog functions should take/return strings with char codes >255 without issue - when preparing the code to send to Espruino that may get changed around a bit, but I'm pretty sure those functions are fine with handling/converting non-ASCII chars.

The change ticalc-travis@af96174 looks good - are you sure Blob defaults to UTF8, or do we need to be explicit about it?

In terms of testing EspruinoTools changes, you just need to fork that repo, make changes, and then change your EspruinoWebIDE repo to use that submodule (or maybe just downloading it locally and hosting it from there is easier!)

@ticalc-travis (Contributor, Author) commented:

The sites I've looked at, including MDN and the W3C spec, all say that Blob() expects a well-formed Unicode string and will accept UTF-16 JS strings and internally reencode them as UTF-8. In fact, it doesn't appear to support anything else as an option; you'd have to use another library and do your own encoding/decoding to an array, I think. But the ticalc-travis@af96174 commit seems to work fine saving UTF-8 source files from CodeMirror on my end. I'll have to look into the EspruinoTools parts.
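For reference, per the File API spec, string parts passed to the Blob constructor are always encoded as UTF-8; there is no option to choose another encoding. A quick check of the equivalence:

```javascript
// A string part and its explicit UTF-8 encoding produce the same bytes.
const text = "héllo";
const implicit = new Blob([text]);                           // spec: UTF-8
const explicit = new Blob([new TextEncoder().encode(text)]); // manual UTF-8
// Both blobs are 6 bytes: "é" takes two UTF-8 bytes.
```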

I went ahead and tested what happens when I transfer files to the Bangle.js 2 device I have, by saving a source file to Storage and then viewing the hex dump of the resulting file. In this case it simply appears to translate the source to ISO-8859-1, replacing any unsupported characters with “?”. (It doesn't seem to support the additional Windows-1252 characters like curly quotes and the Euro sign, just base 8859-1.) Assuming this is the default character set that most of the on-device software expects, that seems reasonable. It's a bit awkward, though, that searching the Espruino/Bangle.js docs for keywords like “encoding” and “character set”, I couldn't find any specific confirmation of which encodings are expected where, especially for things like the built-in Bangle fonts. That would be important to know if I want to use extended characters in apps on the device. But I'm assuming 8859-1, or possibly Windows-1252?

There's a setting for encoding strings as UTF-8 in the Communication section of the IDE. I noticed that it seems to replace any non-ASCII characters within string literals in source code with backslash hex escapes of their UTF-8 bytes in order to preserve them. That also looks reasonable and smart to do, in case anyone really wants to process the strings directly in that form on the device.
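A hypothetical sketch of that kind of escaping (the function name and exact escape format are my assumptions, not the IDE's actual code):

```javascript
// Replace each non-ASCII character with \xNN escapes of its UTF-8 bytes,
// leaving the emitted source pure ASCII.
function escapeNonAscii(str) {
  const bytes = new TextEncoder().encode(str);
  let out = "";
  for (const b of bytes) {
    out += b < 0x80
      ? String.fromCharCode(b)          // plain ASCII passes through
      : "\\x" + b.toString(16).toUpperCase().padStart(2, "0");
  }
  return out;
}
// escapeNonAscii("€") yields the ASCII string \xE2\x82\xAC
```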

@gfwilliams (Member) commented:

Ok, that sounds great then - I've just merged in those changes.

What happens if you upload direct to storage (by clicking the Storage icon and uploading from there)? Does that mess with characters too?

What you say about the string literals/etc sounds about right when uploading JS, but I'd hope that when uploading files via the Storage window we would upload them byte-by-byte?

... of course the other part to this is I'm not sure the Terminal itself supports UTF8 either - so when uploading JS potentially the state machine for char decode in the IDE would need fixing too so we could properly see UTF8 chars being printed.
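A minimal streaming UTF-8 decoder of the kind such a state machine might use (an assumed sketch, not the IDE's actual terminal code): bytes arrive one at a time, and the decoder buffers multi-byte sequences until they complete.

```javascript
// Minimal streaming UTF-8 decoder sketch (assumed shape, not the IDE's
// actual state machine): push bytes one at a time, get strings out.
class Utf8Stream {
  constructor() { this.pending = []; this.need = 0; }
  push(byte) {
    if (this.need === 0) {
      if (byte < 0x80) return String.fromCharCode(byte); // plain ASCII
      if ((byte & 0xE0) === 0xC0) this.need = 1;        // 2-byte sequence
      else if ((byte & 0xF0) === 0xE0) this.need = 2;   // 3-byte sequence
      else if ((byte & 0xF8) === 0xF0) this.need = 3;   // 4-byte sequence
      else return "\uFFFD";                             // invalid lead byte
      this.pending = [byte];
      return "";
    }
    this.pending.push(byte);
    if (--this.need === 0) {
      const s = new TextDecoder().decode(new Uint8Array(this.pending));
      this.pending = [];
      return s;
    }
    return ""; // sequence not complete yet
  }
}
```

Feeding it the three bytes of "€" (0xE2 0x82 0xAC) yields empty strings until the final byte completes the character.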

@ticalc-travis (Contributor, Author) commented:

If I send raw files using the Storage menu, it seems to be fine. Write it to flash, view it, save it back out: the bytes stay unmodified, and the output file matches the input. This stays the same regardless of the “encode strings as Unicode” setting, as expected.

As for the terminal, it doesn't look like it supports UTF-8 right now. Regardless of the Unicode setting, it appears to interpret printed strings as Latin-1/8859-1, so UTF-8-encoded strings do not display correctly.
