Extended multiply horizontal add instruction #382

omnisip · 2020-10-13T01:45:26Z

Introduction

This proposal introduces an extended horizontal multiply-and-add instruction that is used extensively in colorspace conversion and in the implementation of encoders and decoders for video processing. It mirrors the proposal @Maratyszcza put forth in #127 by adding an additional instruction for u8 -> i16 conversion. It maps to 3 instructions on ARM64 and 4 on ARMv7-A with NEON. It is very similar to pmaddubsw on x86, except that pmaddubsw performs unsigned-by-signed multiplication, whereas this instruction performs unsigned-by-unsigned multiplication.
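To make the intended behaviour concrete, here is a minimal scalar sketch in C. It is inferred from the lowerings in this PR rather than from any normative text: each output lane is the sum of two adjacent unsigned byte products, and the sum wraps modulo 2^16. The function name `dot_i8x16_u_ref` is hypothetical.

```c
#include <stdint.h>

/* Hypothetical scalar model of i16x8.dot_i8x16_u, inferred from the
 * lowerings in this PR (not a normative definition).
 * Each 16-bit output lane is x[2i]*y[2i] + x[2i+1]*y[2i+1], with the
 * inputs treated as unsigned bytes and the sum wrapping modulo 2^16. */
void dot_i8x16_u_ref(const uint8_t x[16], const uint8_t y[16],
                     uint16_t out[8]) {
    for (int i = 0; i < 8; i++) {
        uint32_t even = (uint32_t)x[2 * i] * y[2 * i];
        uint32_t odd  = (uint32_t)x[2 * i + 1] * y[2 * i + 1];
        out[i] = (uint16_t)(even + odd); /* truncation gives the wraparound */
    }
}
```

Note that a single u8 x u8 product always fits in 16 bits (at most 65025), but the sum of two products can exceed 65535, so the wraparound is observable for large inputs.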

Applications

Mapping to Common Instruction Sets

This section illustrates how the new WebAssembly instructions can be lowered on common instruction sets. However, these patterns are provided only for convenience; compliant WebAssembly implementations do not have to follow the same code generation patterns.

x86/x86-64 processors with AVX instruction set

i16x8.dot_i8x16_u
- y = i16x8.dot_i8x16_u(x, y) is lowered to:

        vmovdqa xmm2, [wasm_splat_i16x8(0x00ff)]
        vpand   xmm3, xmm_x, xmm2
        vpand   xmm2, xmm_y, xmm2
        vpmullw xmm2, xmm2, xmm3
        vpsrlw  xmm_x, xmm_x, 8 # if register clobbering is a concern, replace the first operand
        vpsrlw  xmm_y, xmm_y, 8 # of each of these right shifts with a xmm temporary
        vpmullw xmm_out, xmm_x, xmm_y
        vpaddw  xmm_out, xmm2, xmm_out
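The mask-and-shift strategy used by the x86 lowering can be sketched per 16-bit lane in C. This is only an illustration of the arithmetic, assuming little-endian packing of two bytes per lane; the helper name `dot_lane_mask_shift` is hypothetical.

```c
#include <stdint.h>

/* Hypothetical per-lane model of the x86 mask-and-shift lowering.
 * wx and wy each hold two packed unsigned bytes: the even-indexed
 * byte in bits 0..7 and the odd-indexed byte in bits 8..15. */
uint16_t dot_lane_mask_shift(uint16_t wx, uint16_t wy) {
    /* pand + pmullw: product of the even (low) bytes */
    uint16_t even = (uint16_t)((wx & 0x00ffu) * (wy & 0x00ffu));
    /* psrlw + pmullw: product of the odd (high) bytes */
    uint16_t odd = (uint16_t)((wx >> 8) * (wy >> 8));
    /* paddw: 16-bit add with wraparound */
    return (uint16_t)(even + odd);
}
```

Because each byte product is at most 65025, the low 16 bits kept by pmullw are the full product, which is why the masked and shifted multiplies are exact before the final add.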

x86/x86-64 processors with SSE2 instruction set

i16x8.dot_i8x16_u
- y = i16x8.dot_i8x16_u(x, y) is lowered to:

        movdqa  xmm3, [wasm_splat_i16x8(0x00ff)]
        movdqa  xmm2, xmm_x
        pand    xmm2, xmm3     # low bytes of x
        pand    xmm3, xmm_y    # low bytes of y
        pmullw  xmm2, xmm3     # products of the low bytes
        psrlw   xmm_x, 8 # Use movdqa with xmm_x and xmm_y here
        psrlw   xmm_y, 8 # if it's unsafe to overwrite the input values.
        pmullw  xmm_x, xmm_y   # products of the high bytes
        paddw   xmm2, xmm_x    # sum of the two products
        movdqa  xmm_x, xmm2

ARM64 processors

i16x8.dot_i8x16_u
- y = i16x8.dot_i8x16_u(x, y) is lowered to:

        umull   v2.8h, v0.8b, v1.8b
        umull2  v0.8h, v0.16b, v1.16b
        addp    v0.8h, v2.8h, v0.8h
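The ARM lowering follows a different shape: widen-multiply all sixteen byte pairs first, then add adjacent products pairwise. A rough C sketch of that two-step structure is below; the function name `dot_widen_pairwise` is hypothetical and the code only models the arithmetic, not register allocation.

```c
#include <stdint.h>

/* Hypothetical model of the widening-multiply + pairwise-add lowering. */
void dot_widen_pairwise(const uint8_t x[16], const uint8_t y[16],
                        uint16_t out[8]) {
    uint16_t prod[16];

    /* umull / umull2: sixteen u8 x u8 -> u16 products (always exact) */
    for (int i = 0; i < 16; i++) {
        prod[i] = (uint16_t)(x[i] * y[i]);
    }

    /* addp: add adjacent 16-bit products, wrapping modulo 2^16 */
    for (int i = 0; i < 8; i++) {
        out[i] = (uint16_t)(prod[2 * i] + prod[2 * i + 1]);
    }
}
```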

ARMv7 processors with NEON instruction set

i16x8.dot_i8x16_u
- y = i16x8.dot_i8x16_u(x, y) is lowered to:

        vmull.u8        q10, d18, d17
        vmull.u8        q8, d19, d16
        vpadd.i16       d19, d20, d21
        vpadd.i16       d18, d16, d17

Maratyszcza · 2020-10-13T04:21:32Z

The SSSE3 lowering doesn't match the others. PMADDUBSW does unsigned-by-signed multiplication.

omnisip · 2020-10-13T04:45:18Z

The SSSE3 lowering doesn't match the others. PMADDUBSW does unsigned-by-signed multiplication.

You're totally right. Nice catch.

omnisip · 2020-10-13T19:08:52Z

@Maratyszcza I think that fixes it... While testing, I was stunned by how challenging pmaddubsw is to work with. It treats its two operands differently (one as unsigned, the other as signed), so unless both operands are guaranteed to be 127 or less, the ordering matters and the results will differ.
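The asymmetry can be illustrated with a small scalar model of one pmaddubsw output lane. This is a sketch based on the documented behaviour of the instruction (first operand unsigned, second operand signed, signed-saturating pairwise add); the names `as_signed8` and `pmaddubsw_lane` are hypothetical.

```c
#include <stdint.h>

/* Reinterpret a byte as a signed 8-bit value without relying on
 * implementation-defined narrowing conversions. */
static int32_t as_signed8(uint8_t v) {
    return v >= 128 ? (int32_t)v - 256 : (int32_t)v;
}

/* Hypothetical model of one 16-bit output lane of pmaddubsw:
 * the a bytes are unsigned, the b bytes are signed, and the sum of
 * the two products saturates to the int16_t range. */
int16_t pmaddubsw_lane(uint8_t a0, uint8_t a1, uint8_t b0, uint8_t b1) {
    int32_t sum = (int32_t)a0 * as_signed8(b0) + (int32_t)a1 * as_signed8(b1);
    if (sum > INT16_MAX) sum = INT16_MAX;
    if (sum < INT16_MIN) sum = INT16_MIN;
    return (int16_t)sum;
}
```

Under this model, `pmaddubsw_lane(200, 0, 1, 0)` yields 200, while swapping the operands, `pmaddubsw_lane(1, 0, 200, 0)`, yields -56 because 200 is reinterpreted as a signed byte. When every byte is 127 or less the two orderings agree.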

Maratyszcza · 2020-10-13T19:31:14Z

By analogy with #127, this instruction should be named i16x8.dot_i8x16_u

omnisip · 2020-10-15T16:58:39Z

By analogy with #127, this instruction should be named i16x8.dot_i8x16_u

Haven't forgotten about this. Will take care of it today.

omnisip · 2020-11-18T18:47:56Z

This proposal is efficient on ARM64 but not on x64. The original objective was to see whether pmaddubsw could be implemented portably, since it would have provided a way to do the RGBA conversions on x64 chips in 1-2 ops. Unfortunately, the behavior of pmaddubsw isn't portable, and the workarounds required to make it work are less efficient than expanding the integer types, multiplying, and adding. As a result, I'd like to withdraw this proposal in favor of the integer sign/zero extension proposal. If someone else has a need for this proposal before standardization, please write back here. For documentation on the issues with pmaddubsw, please see this thread on Stack Overflow.

omnisip added 2 commits October 13, 2020 01:32

extended multiply horizontal add instruction

6b12c6b

fix nomenclature

4d0a236

Update nomenclature to match @Maratyszcza proposal in WebAssembly#127

e699ff6

tlively added the post SIMD MVP label Feb 2, 2021

hi-ogawa mentioned this pull request Feb 21, 2021

Classical Evaluation Improved, but Search is no longer "Tuned" for it. [Regression on Classical-only Search] official-stockfish/Stockfish#3365

Closed

akirilov-arm mentioned this pull request Jun 30, 2021

Add extend-add-pairwise instructions x64 bytecodealliance/wasmtime#3031

Merged
