[shared] [webassembly] pyexec_event_repl_process_char unable to understand unicode #14255

Open
2 tasks done
WebReflection opened this issue Apr 5, 2024 · 3 comments
@WebReflection

Checks

  • I agree to follow the MicroPython Code of Conduct to ensure a safe and respectful space for everyone.

  • I've searched for existing issues matching this bug, and didn't find any.

Port, board and/or hardware

webassembly, linux shell

MicroPython version

latest

Reproduction

Open a MicroPython REPL or visit this page (which is half patched, but not fully): https://webreflection.github.io/coincident/test/micropython.html

Then type the following into it:

print("µpython")

On a native shell you'll see python instead of µpython. On the Web REPL you see even less, because the character count goes off due to replProcessChar (even the Asyncify one). And this is just the tip of the iceberg: now try a combined emoji:

print("👩‍❤️‍👨")

... see emptiness or awkward results ...

Most emoji are indeed broken out of the box unless you read them through input(...):

fam = input("> ")
# type 👩‍❤️‍👨
print(fam) # 👩‍❤️‍👨
fam # '\U0001f469\u200d\u2764\ufe0f\u200d\U0001f468'
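For reference, here is a small sketch (run on CPython, not in the affected REPL) of how many UTF-8 bytes a terminal sends for each of these strings. It assumes the µ typed above is U+00B5 MICRO SIGN; the emoji code points are copied from the repr shown above. `utf8_summary` is a hypothetical helper name used only for illustration.

```python
# Assumption: the "µ" in the report is U+00B5 (MICRO SIGN).
micro = "\u00b5"
# Code points copied from the repr of `fam` shown above.
fam = "\U0001f469\u200d\u2764\ufe0f\u200d\U0001f468"


def utf8_summary(text):
    """Return (code point count, UTF-8 byte count, hex dump of the bytes)."""
    encoded = text.encode("utf-8")
    return len(text), len(encoded), encoded.hex(" ")


print(utf8_summary(micro))  # (1, 2, 'c2 b5')
points, nbytes, dump = utf8_summary(fam)
print(points, nbytes)       # 6 20
print(dump)
```

So one visible emoji here arrives as 20 separate bytes, none of which is ASCII, and any code that processes input one byte at a time has to cope with that.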

Coincidentally, if you explicitly enter "REPL paste mode" (\5), you can paste anything you like, then exit (\4), and all the pasted code is processed without issues, just like the input(...) case.

A related PR, pyscript/pyscript#2018, fixes at least the output side, but it cannot fix what users type in the terminal, because replProcessChar drops characters along the way. (It has no line buffer, but the behaviour is the same with one; as far as I can tell, the issue is in the code behind replProcessChar.)

Expected behaviour

If I type the following in the REPL, I expect things to just work and output the correct result:

print("µpython")
# µpython
print("👩‍❤️‍👨")
# 👩‍❤️‍👨

Observed behaviour

If I type the following in the REPL, this happens instead:

print("µpython")
# python or thon
print("👩‍❤️‍👨")
# ... nothing, awkward state

Additional Information

Pinging @dpgeorge, as I've already done on Discord, but this looks and feels like a broader issue with the REPL because it's possible to reproduce it via the native Linux port.

@felixdoerre
Contributor

felixdoerre commented Apr 7, 2024

because it's possible to reproduce it via native Linux port.

I think I can elaborate on what does not work and what is missing: the "readline" functionality implemented in MicroPython does not handle Unicode characters at all (which may also explain why "raw" mode and "paste" mode work). You can see it in this implementation: after escape sequences are handled, any non-ASCII character is ignored (and that is what the first few bytes of those emoji look like):

https://github.com/micropython/micropython/blob/master/shared/readline/readline.c#L279
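The effect described above can be modelled in a few lines of Python. This is only a simplified sketch of the behaviour, not a port of readline.c: the function name `drop_non_printable_ascii` is hypothetical, and the printable range 32..126 is an assumption about what the check accepts.

```python
def drop_non_printable_ascii(data: bytes) -> bytes:
    """Keep only bytes in the printable ASCII range 32..126 (assumed range)."""
    return bytes(b for b in data if 32 <= b <= 126)


# What a UTF-8 terminal sends when the user types print("µpython"),
# assuming µ is U+00B5.
typed = 'print("\u00b5python")'.encode("utf-8")
kept = drop_non_printable_ascii(typed)

print(typed)  # b'print("\xc2\xb5python")'
print(kept)   # b'print("python")'
```

Both bytes of µ (0xC2 and 0xB5) fall outside the accepted range, so the line that reaches the compiler is `print("python")`, which matches the output reported on the native shell.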

I'm not sure exactly where the webassembly port interfaces with the rest of MicroPython, because it seems I cannot interact with readline, even though the call path I traced seems to lead there. For example, it does not react to Ctrl-R or Tab completion; Ctrl-C seems to work, so maybe that is implemented explicitly and the other special characters are never passed through?

Extending that statement to parse multi-byte UTF-8 characters and handle them correctly would probably solve this problem at its root.
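One possible shape for such an extension, sketched in Python rather than C: accumulate bytes until a full UTF-8 sequence has arrived, then hand the complete character to the line editor. The class name `Utf8Accumulator` and its `feed` method are hypothetical and nothing like them exists in MicroPython; this only illustrates the byte-level logic.

```python
class Utf8Accumulator:
    """Collect bytes one at a time and emit whole UTF-8 characters."""

    def __init__(self):
        self._pending = bytearray()
        self._needed = 0  # continuation bytes still expected

    def feed(self, byte):
        """Return a str when a character is complete, otherwise None."""
        if self._needed == 0:
            if byte < 0x80:
                return chr(byte)          # plain ASCII, complete at once
            if 0xC2 <= byte <= 0xDF:
                self._needed = 1          # lead byte of a 2-byte sequence
            elif 0xE0 <= byte <= 0xEF:
                self._needed = 2          # lead byte of a 3-byte sequence
            elif 0xF0 <= byte <= 0xF4:
                self._needed = 3          # lead byte of a 4-byte sequence
            else:
                return None               # stray or invalid byte: drop it
            self._pending = bytearray([byte])
            return None
        if 0x80 <= byte <= 0xBF:
            self._pending.append(byte)
            self._needed -= 1
            if self._needed == 0:
                text = self._pending.decode("utf-8", errors="replace")
                self._pending = bytearray()
                return text
            return None
        # Sequence was cut short: discard it and treat this byte as a new start.
        self._pending = bytearray()
        self._needed = 0
        return self.feed(byte)


acc = Utf8Accumulator()
typed = 'print("\u00b5python")'.encode("utf-8")
chars = [c for c in (acc.feed(b) for b in typed) if c is not None]
print("".join(chars) == 'print("\u00b5python")')  # True
```

A real fix would also have to decide how the cursor position and line redraw count such characters, which this sketch leaves out.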

(I'm also interested in unicode input working in other instances of the micropython-REPL)

@dpgeorge
Member

dpgeorge commented Apr 7, 2024

I think this is a duplicate of #2789.

@WebReflection
Author

WebReflection commented Apr 8, 2024

Apologies, I wasn't sure it was strictly REPL related, but @felixdoerre explained it well (with code) and @dpgeorge has known about this for 6+ years (I was able to debug as far as the pyexec_event_repl_process_char behind replProcessChar, but no further).

If you feel like closing it, I will update the related issues to point at that 2017 issue, but I hope that handling at least UTF-8, without caring much about arrows and deletion, would be a very welcome first step: we can tell our users those are known limitations, but we can't really tell them "please just speak English or see surprises in your live/REPL code". Thanks for understanding and hopefully moving that old issue forward incrementally 🙏
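To illustrate why deletion is a separate, harder step than accepting UTF-8 input: in a byte-oriented line buffer, backspace has to remove a whole encoded character, not a single byte. A minimal sketch, assuming the buffer is a `bytearray` of valid UTF-8; the helper name `delete_last_char` is hypothetical.

```python
def delete_last_char(buf: bytearray) -> None:
    """Remove the last UTF-8 encoded code point from buf, in place."""
    # Continuation bytes look like 10xxxxxx; strip them first...
    while buf and (buf[-1] & 0xC0) == 0x80:
        buf.pop()
    # ...then remove the lead byte (or the single ASCII byte).
    if buf:
        buf.pop()


line = bytearray("a\u00b5".encode("utf-8"))  # b'a\xc2\xb5'
delete_last_char(line)
print(line)  # bytearray(b'a')

fam = bytearray("\U0001f469\u200d\u2764\ufe0f\u200d\U0001f468".encode("utf-8"))
delete_last_char(fam)
print(len(fam.decode("utf-8")))  # 5
```

Even then, one backspace removes only the last code point of the combined emoji, leaving five of the six behind, so deleting what the user sees as one character needs more work than byte-level handling alone.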


P.S. for @felixdoerre: it is possible that the editor or the browser intercepts those Ctrl+X chars unless we explicitly call preventDefault on all combinations, so that's more on us than on the WASM REPL.
