[RFC] Intrinsic implementation #116

Open. Wants to merge 23 commits into base: v2.1

@fsfod fsfod commented Dec 4, 2015

This is an implementation of #39, limited to x86/x64 with the Windows and Linux ABIs for the time being.
There are some working toy examples of the current API in test/test.lua and test/intrinsic_spec.lua. JIT support for vector registers will be left as NYI because it needs various changes to the JIT's systems. If you are feeling brave, you can try out an experimental branch with JIT vector support.

An intrinsic can be either a single machine instruction that LuaJIT might have some specialized understanding of, or an opaque blob of one or more machine instructions that may be user supplied. Intrinsics behave like callable functions in the interpreter. Their argument order is the same as the order in which the input registers were declared in the register list.

API
ffi.cdef([[int32_t popcnt(int32_t n) __mcode("f30fb8rM");]])
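For readers unfamiliar with the encoding: F3 0F B8 /r is the x86 encoding of POPCNT r32, r/m32, so the hex part of the opcode string above is the instruction's fixed bytes. The C sketch below shows how those bytes combine with a register-direct ModRM byte. The function names are made up for illustration, and it does not model how the PR's opcode flags drive ModRM generation.

```c
#include <stddef.h>
#include <stdint.h>

/* Build a register-direct ModRM byte: mod=11, reg field, rm field.
   Only the low three bits of each register number are used (no REX). */
uint8_t modrm_reg_direct(unsigned reg, unsigned rm)
{
  return (uint8_t)(0xC0u | ((reg & 7u) << 3) | (rm & 7u));
}

/* Write the bytes of `popcnt reg, rm` (32-bit operands) into out.
   Returns the number of bytes written. Hypothetical helper, not PR code. */
size_t emit_popcnt32(uint8_t *out, unsigned reg, unsigned rm)
{
  out[0] = 0xF3;                        /* mandatory prefix */
  out[1] = 0x0F;                        /* two-byte opcode escape */
  out[2] = 0xB8;                        /* POPCNT opcode */
  out[3] = modrm_reg_direct(reg, rm);   /* operands */
  return 4;
}
```

With eax as register 0 and ecx as register 1, `emit_popcnt32(buf, 0, 1)` yields F3 0F B8 C1, which is `popcnt eax, ecx`.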

Declaring a vector opcode intrinsic with an immediate control byte

ffi.cdef([[
  typedef float float4 __attribute__((__vector_size__(16)));
  float4 shufps_rev(float4 v1, float4 v2) __mcode("0FC6rMU", 0x1b);
]])
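To make the control byte concrete: 0F C6 /r ib is SHUFPS, and each 2-bit field of the immediate selects one source lane. The following is a plain C reference model of the instruction's lane selection, not code from this PR; `shufps_model` is a hypothetical name. It assumes v1 maps to the destination operand and v2 to the source operand.

```c
#include <stdint.h>

/* Reference model of SHUFPS lane selection.
   Result lanes 0 and 1 are picked from a, lanes 2 and 3 from b,
   each by a 2-bit field of imm (bits 1:0, 3:2, 5:4, 7:6). */
void shufps_model(const float a[4], const float b[4], uint8_t imm,
                  float out[4])
{
  /* Read all lanes first so out may alias a or b. */
  float r0 = a[imm & 3];
  float r1 = a[(imm >> 2) & 3];
  float r2 = b[(imm >> 4) & 3];
  float r3 = b[(imm >> 6) & 3];
  out[0] = r0;
  out[1] = r1;
  out[2] = r2;
  out[3] = r3;
}
```

With imm = 0x1b the selected lanes are a[3], a[2], b[1], b[0], so passing the same vector for both arguments reverses it, which is presumably where the name shufps_rev comes from.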

Declaring an opcode with both a prefix and an immediate byte that takes an address and has memory side effects.

ffi.cdef[[void atomicadd1(int32_t* nptr) __mcode("830mIUPS", 0xF0, 0x01);]]
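For comparison, 0xF0 is the x86 LOCK prefix and 83 /0 ib is ADD r/m32, imm8, so, assuming the `0` after `83` in the opcode string is the /0 opcode extension, the declaration above amounts to `lock add dword [nptr], 1`. A rough portable equivalent in C11 looks like this (`atomicadd1_model` is a hypothetical name, not part of the PR):

```c
#include <stdatomic.h>
#include <stdint.h>

/* Atomically add 1 to *nptr with sequentially consistent ordering.
   Assumption: this matches the intent of the LOCK-prefixed ADD above.
   The previous value returned by atomic_fetch_add is discarded. */
void atomicadd1_model(_Atomic int32_t *nptr)
{
  atomic_fetch_add(nptr, 1);
}
```

On x86 a compiler is free to lower this to the same locked add, since the old value is unused.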

Running intrinsics in the interpreter

To allow calling intrinsics in the interpreter, an internal wrapper function is generated using part of the existing JIT engine. In theory the full JIT engine could be used by generating IR instead of using the raw emit system, but that would probably require lots of fixes in places where it is assumed that the code is being generated for a trace.

The wrapper is called with two pointers: the first is the input context structure, which contains the values of the input registers (or, for vectors, pointers to the values), and the second is the Lua stack to write the results to. After the intrinsic's code in the wrapper has run, the wrapper writes output registers directly to the Lua stack if they are 32-bit signed numbers; otherwise it copies the output registers into cdata that was created on the Lua stack before the wrapper was called.
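The output rule described above can be modelled in a few lines of C. This is only a sketch: `Slot` and `store_output` are hypothetical stand-ins and do not reflect LuaJIT's TValue layout or the generated wrapper code.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for one Lua stack slot. */
typedef struct Slot {
  double num;    /* used when the output is a 32-bit signed integer */
  void *cdata;   /* pre-created buffer for any other output type */
} Slot;

/* Store one output register value into its stack slot.
   reg points at the raw register contents, size is their byte length. */
void store_output(Slot *slot, const void *reg, size_t size, int is_int32)
{
  if (is_int32) {
    int32_t v;
    memcpy(&v, reg, sizeof v);
    slot->num = (double)v;            /* written directly as a number */
  } else {
    memcpy(slot->cdata, reg, size);   /* copied into the existing cdata */
  }
}
```

The point of pre-creating the cdata is that the wrapper itself never has to allocate: it only performs stores and copies.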

Intrinsics in the JIT

Three new IR instructions are added for intrinsics:

  • IR_INTRN represents the intrinsic. It is considered to have side effects with respect to DCE and to be a potential load for DSE. If the intrinsic is flagged as having memory side effects, a memory barrier (XBAR) is emitted by the intrinsic trace recorder after the intrinsic. op1 points to an ordered chain of CARG opcodes holding the values of the input registers and op2 points to the intrinsic ctype id.
  • IR_ASMRET represents any extra output registers of an intrinsic and applies any register shuffling needed for fixed registers. op1 points to the ASM op that the output value belongs to. ASMRET is skipped for the first dynamic output register of opcode intrinsics.
    op2 (literal) holds the fixed register id that the output value gets written to.
    ASMRETs for fixed registers have matching register hints set in the register hint prepass.
  • IR_ASMEND is used as the tail of the linked list of ASMRET instructions and as the starting IR instruction when emitting an intrinsic with multiple output registers.

Design notes

The mcode API/system was generalized to allow more than one mcode area, since the existing JIT one is flushed when a full trace flush happens, while the generated wrappers need to stay around until the state is closed. In theory the FFI callback stubs could also live in this mcode area instead of in fixed-size memory.
Currently, arguments passed to an intrinsic in the interpreter are handled using a data-driven approach in which they are converted and packed into a context, like the one the FFI system uses to call C functions. If the input values were treated as strongly typed (a direct ctype id match for cdata, or a built-in Lua type), the wrapper could skip saving and loading input values through the context by loading the values directly off the Lua stack and moving them into registers.
Currently the only way to express the memory side effects of an intrinsic is XBAR, when all that might be needed is a fake store of a particular size that the pointer aliasing system understands. Also see the previous discussion of how s/l/mfence could work.

Tasks
  • Signed 32-bit input and output registers.
  • double/float input and output registers.
  • Box non-32-bit integer/fp output registers into cdata.
  • 128/256-bit vector registers (interpreter only).
  • Handle mcode area reallocations when generating a wrapper.
  • Reuse parts of the FFI C parsing system to fully declare an intrinsic in string form (__mcode(opstring, prefix, immediate)).
  • Dynamic ModRM generation for input/output registers of single opcode intrinsics that are flagged to support it.
  • Option to treat large user intrinsics as callable functions in the JIT with a custom register calling convention and known modified registers.
  • Support for an immediate control byte when constructing opcode intrinsics.

@fsfod fsfod force-pushed the intrinsicpr branch 4 times, most recently from 2ed3d2e to ef61fa2 Compare December 23, 2015 17:46
@fsfod fsfod force-pushed the intrinsicpr branch 3 times, most recently from 563a59c to fccea5a Compare January 8, 2016 19:32
@fsfod fsfod force-pushed the intrinsicpr branch 3 times, most recently from 072540e to b5f446d Compare January 19, 2016 15:59
@fsfod fsfod force-pushed the intrinsicpr branch 2 times, most recently from d17c803 to 0312429 Compare January 27, 2016 04:56
@fsfod fsfod changed the title [WIP, RFC] Intrinsic implementation [RFC] Intrinsic implementation Jan 27, 2016
@fsfod fsfod force-pushed the intrinsicpr branch 4 times, most recently from 99c5915 to 2102433 Compare February 5, 2016 07:25
@fsfod fsfod force-pushed the intrinsicpr branch 6 times, most recently from 8a198e6 to 97f8c97 Compare February 11, 2016 07:02
@fsfod fsfod force-pushed the intrinsicpr branch 2 times, most recently from 5bc3850 to f57039e Compare March 28, 2016 15:36
…_tv by using a special cast flag(CCF_INTRINS_ARG) for intrinsic vector arguments
…abled DCE of intrinsics

Intrinsics are now assumed to have no side effects unless flagged with either memory side effects (S) or non-memory side effects (s)
…trinsics that have no side effects and are not forced indirect ModRM which could be a load or store
…us ways.

Fix wrappers truncating GCobj pointers in GC64 mode when loading them from the stack to store output registers into cdata.
Fix the stack for intrinsics not being adjusted correctly in their interpreter wrapper when it uses the RID_DISPATCH register on GC64, because RSET_GPR does not contain it
…g RID_DISPATCH

Make RID_DISPATCH an unallocatable register for intrinsics when building as GC64.

Fix trying to evict RID_DISPATCH for LJ_GC64 builds on x64 for intrinsics and
add some asserts that we never try to again.
Don't set register hints for intrinsic input/output registers that are RID_DISPATCH.
Restore RID_DISPATCH first when handling output registers and defer it
till last for input registers of intrinsics in the JIT.
…s to allow pointer based intrinsics to work in both 64 bit and 32 bit with the same definition.