VIA/Zhaoxin Padlock: 64-bit montmul, rep xsha512, GMI, partial decode #279

Open
tremalrik opened this issue Dec 6, 2021 · 3 comments
Labels
A-decoder Area: Decoder C-enhancement Category: Enhancement of existing features

Comments

@tremalrik

Having gotten hold of a box with a Zhaoxin KX-6580 CPU, I decided to do a whole bunch of testing on its PadLock functionality. (Zhaoxin is a Chinese x86 CPU vendor, formed as a joint venture between VIA and the Shanghai municipal government; its designs are mostly a continuation of the VIA C3/C7/Nano series cores and are aimed mainly at the Chinese market, but have started showing up elsewhere.) In doing so, I've made a number of findings about various undocumented and underdocumented features. The ones most relevant for disassembly tools like, say, Zydis, so far appear to be:

  • The rep montmul instruction takes, much to my surprise, a mandatory 67h address size prefix in 64-bit mode (!!). I observed this because the sequence f3 0f a6 c0 consistently produces a #UD exception, while something like f3 67 0f a6 c0 does not. The issue appears to be that rep montmul takes a pointer in rSI to a data structure containing 5 pointers to the various buffers the instruction needs. This data structure does not appear to have ever been updated to work with 64-bit pointers, so the 67h prefix is needed to force 32-bit addressing. This makes the instruction fairly inconvenient to set up, since the structure and all its buffers must reside in the bottom 4GB of virtual address space. Once that is done, the variant with the 67h prefix (but not the one without) executes a Montgomery multiply just fine.

  • The instruction encoding f3 0f a6 e0 is a seemingly undocumented instruction to accelerate SHA-512 hashing. In my testing, it appears to take the following arguments:

    • rCX = number of 128-byte blocks to hash
    • ES:rSI = pointer to source data
    • ES:rDI = pointer to a 64-byte digest to update

    I haven't been able to find this instruction documented anywhere, but OpenSSL clearly knows about it (see https://github.com/openssl/openssl/blob/master/engines/asm/e_padlock-x86.pl , line 597), referring to it as rep xsha512. The instruction encoding f3 0f a6 d8 also appears to be an alias of this instruction.
  • The instruction encoding f3 0f a6 e8 is a Zhaoxin-specific "GMI" instruction: ccs_hash. This instruction is documented ( https://github.com/ZXOpenSource/OpenSSL-ZX-GMI/blob/master/GMI%20User%20Manual%20V1.0.pdf , in Chinese, but it gets pretty readable after a trip through Google Translate) as providing support for the Chinese SM3 hashing algorithm. In my testing, it also provides undocumented support for SHA-1/256/512, which can be selected by setting rBX to values in the range 0x10 to 0x15.

  • The instruction encoding f3 0f a7 f0 is another Zhaoxin-specific "GMI" instruction: ccs_encrypt. This instruction is documented to provide support for the Chinese SM4 encryption algorithm - it also provides undocumented support for AES-128/192/256 that can be obtained by setting rAX to values in the range 0x10 to 0x15.

  • The instruction encodings f3 0f a6 f0 and f3 0f a6 f8 are undocumented and I haven't been able to figure out what they might do. They produce a #GP exception for every set of arguments I have tried passing them, suggesting that they either expect a really odd input data format or are privileged instructions.

  • At least on this specific CPU, the xstore instruction accepts the repne prefix and treats it as a synonym for rep: f2 0f a7 c0 produces the same output as I would expect from rep xstore (f3 0f a7 c0). None of the other Padlock instructions accept this prefix (they #UD). The instruction encoding f3 0f a7 f8 appears to be an alias of rep xstore, but it doesn't accept repne.

  • From what I can find, all of the instructions in the Padlock space (0f a6 c0-ff and 0f a7 c0-ff) exhibit partial decode, where the bottom 3 bits of the last byte of the instruction are ignored - e.g. f3 0f a7 f7 is accepted as a valid instruction and behaves identically to f3 0f a7 f0.
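
The partial decode behaviour and the aliases listed above can be sketched as a small lookup, purely as an illustration and not as Zydis code. The table contains only the encodings mentioned in this issue; the names `PADLOCK_TABLE` and `classify_padlock` are hypothetical, and the sketch assumes the bottom 3 bits of the last byte can simply be masked off before lookup, as observed on the KX-6580.

```python
# Legacy prefixes that were seen in front of Padlock encodings in this issue.
LEGACY_PREFIXES = {0x66, 0x67, 0xF2, 0xF3}

# (opcode byte, last byte with low 3 bits cleared) -> name.
# Limited to the encodings discussed in this issue; "unknown" marks the two
# encodings whose behaviour was not identified.
PADLOCK_TABLE = {
    (0xA6, 0xC0): "montmul",
    (0xA6, 0xD8): "xsha512",      # reported alias of e0
    (0xA6, 0xE0): "xsha512",
    (0xA6, 0xE8): "ccs_hash",
    (0xA6, 0xF0): "unknown",
    (0xA6, 0xF8): "unknown",
    (0xA7, 0xC0): "xstore",
    (0xA7, 0xF0): "ccs_encrypt",
    (0xA7, 0xF8): "xstore",       # reported alias of rep xstore
}


def classify_padlock(encoding: bytes):
    """Return (prefix list, name) for a Padlock-space encoding, else None."""
    i = 0
    while i < len(encoding) and encoding[i] in LEGACY_PREFIXES:
        i += 1
    prefixes, rest = list(encoding[:i]), encoding[i:]
    # Padlock space: 0f a6 c0-ff and 0f a7 c0-ff.
    if len(rest) != 3 or rest[0] != 0x0F or rest[1] not in (0xA6, 0xA7):
        return None
    if rest[2] < 0xC0:
        return None
    # Partial decode: the bottom 3 bits of the last byte are ignored.
    key = (rest[1], rest[2] & 0xF8)
    return prefixes, PADLOCK_TABLE.get(key, "not covered here")
```

For example, `classify_padlock(bytes.fromhex("f30fa7f7"))` resolves to the same entry as f3 0f a7 f0. The prefixes are returned separately because their legality differs per instruction (67h for montmul, repne for xstore).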

@flobernd
Member

flobernd commented Dec 7, 2021

Good findings :-)

Maybe we can add a ZYDIS_DECODER_MODE_ZHAOXIN to support some of the Zhaoxin special cases.

@tremalrik
Author

Such a decoder mode makes sense for the two GMI instructions at least.

I'm much less sure whether any of the other items I've found are truly Zhaoxin-specific, though. Christopher Domas's Sandsifter tool ran into the partial decode behavior of xstore on two different VIA processors ( https://github.com/xoreaxeaxeax/sandsifter/blob/master/references/domas_breaking_the_x86_isa_wp.pdf , page 5), and the rep xsha512 instruction is listed in a mailing list item from as far back as 2011 ( https://mail.gnu.org/archive/html/gnutls-commit/2011-11/msg00085.html ).

@tremalrik
Author

I've done a bit more testing, and made a few more minor findings:

  • rep montmul, in addition to lacking support for 64-bit addressing, also appears to lack support for 16-bit addressing. As such, the instruction requires the 67h address size override prefix in 16-bit mode (including real mode), or else it will #UD. Conversely, in 32-bit mode, the 67h prefix is not allowed and causes #UD if used.
  • rep montmul takes, in ES:ESI, a pointer to a data structure. Zydis currently reports this as a 4-byte memory operand; its actual accessed size (as measured by placing it next to an unmapped memory page) is 24 bytes.
  • Many of the Padlock instructions are officially documented as causing an Invalid Instruction Exception (#UD) if the operand size prefix 66h is used. This does not check out in my testing: I've been able to get every Padlock instruction to run with the 66h prefix, and it does not appear to have any discernible effect on the execution of any of them.
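
Taken together, the montmul observations amount to "the effective address size must be 32 bits". A minimal sketch of that rule follows, assuming the standard x86 behaviour of the 67h prefix (it toggles 16 to 32, 32 to 16, and 64 to 32); the function name `montmul_addressing_ok` is hypothetical and not part of Zydis.

```python
def montmul_addressing_ok(mode_bits: int, has_67h: bool) -> bool:
    """Return True if rep montmul is expected to execute rather than #UD.

    mode_bits is the default address size of the current mode (16, 32 or 64).
    The rule modelled here is the one observed in this issue: the instruction
    only works with 32-bit effective addressing.
    """
    if mode_bits not in (16, 32, 64):
        raise ValueError("mode_bits must be 16, 32 or 64")
    if has_67h:
        # Standard effect of the 67h address size override prefix.
        effective = {16: 32, 32: 16, 64: 32}[mode_bits]
    else:
        effective = mode_bits
    return effective == 32
```

A decoder could use a check like this to flag f3 0f a6 c0 as invalid in 64-bit and 16-bit modes while accepting f3 67 0f a6 c0 there. The 66h prefix is deliberately left out of the check, since it had no observable effect in testing.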
