Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Should non-ASCII contents in .asci blocks be rejected? #257

Open
Fuuzetsu opened this issue May 10, 2023 · 4 comments
Open

Should non-ASCII contents in .asci blocks be rejected? #257

Fuuzetsu opened this issue May 10, 2023 · 4 comments

Comments

@Fuuzetsu
Copy link

I've made an issue at Decompollaborate/spimdisasm#121 which shows that there are sometimes .asciz blocks emitted with non-ASCII characters in them.

m2c then happily takes them and outputs bytes from UTF-8 encoding (as far as I can tell).

That is, given asciz "奩" it will happily spit out e5 a5 a9 00. A brief grep brings me to parse_ascii_directive which at the very top says something about being wrong w.r.t. encodings: I guess this is exactly the issue it's talking about? I see few lines later a very explicit c.encode("utf-8"). Is this what MIPS assemblers would usually do or would they just interpret the whole input as ASCII to start with? I don't know.

@Fuuzetsu
Copy link
Author

It was pointed out to me in linked ticket that this is a valid use-case of .asciz. So it's presumably also a valid use-case for m2c.

I guess however it's not the end of the story: m2c should presumably be aware of the original encoding.

So:

  • m2c presumably needs some flag and encoding support that it uses in parse_ascii_directive.
  • splat needs to use the same string_encoding for m2c as it used for spimdisasm.

So I guess first m2c would have to make the encoding more flexible and then splat would need to make use of the new flag?

@simonlindholm
Copy link
Collaborator

I think assemblers typically read asm byte by byte; m2c should perhaps be refactored to do the same. I think the current behavior here isn't too bad though: for ASCII and UTF-8 it correctly preserves bytes, while for anything else it will cleanly fail with an error. And if you have something non-UTF-8 in your source code in 2023 I'd argue you're doing something wrong, and you should be using \x escapes for offending string literals. (Otherwise random tools will break, like git and decomp.me.) I suspect the comment in parse_ascii_directive may be wrong and the code actually correct.

The other place where string encodings come up is in the output layer, for string literals. Here we decode bytes as UTF-8 with fallback to \x escapes, which isn't necessarily correct, though non-UTF-8 successfully parsing as UTF-8 at least ought to be fairly rare... After this we have Python printing to stdout, the stdout copied into source code somehow, maybe git transforming the encoding in the process, and then the compiler eventually compiling the source, each step of which could result in encoding bugs. (But with UTF-8 it ought to be mostly reliable.)

So I think I could possibly be convinced to take a PR that adds a string encoding cmdline flag that works like you describe if there's a concrete use case, but I'm also pretty happy with the current behavior. A flag just for disabling the "test if it's UTF-8" behavior of the string literal formatting may be a better choice, if anything.

@Fuuzetsu
Copy link
Author

And if you have something non-UTF-8 in your source code in 2023 I'd argue you're doing something wrong, and you should be using \x escapes for offending string literals.

Normally I'd agree but all this started with splatting out a 25+ year old game so if the original strings in that are not UTF-8 (in my case it turned out that they were so that's nice), there's not much choice, right? Unless whatever original encoding gets processed and then output as UTF-8 or something so that everything downstream just works?

@simonlindholm
Copy link
Collaborator

There is choice. You can either \x-escape non-ASCII chars in your strings to keep all .s in ASCII (maybe with strings transcoded into UTF-8 in a comment on the same line for ease of understandability), or do some fancy transcode-on-the-fly as part of the build chain (https://github.com/simonlindholm/asm-processor enables doing this for N64 games).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants