This repository has been archived by the owner on Sep 2, 2023. It is now read-only.

Add facilities for masked vector load/store #108

Open
mbitsnbites opened this issue Aug 28, 2020 · 1 comment

Comments

@mbitsnbites
Member

mbitsnbites commented Aug 28, 2020

Masked vector stores are a very useful feature (e.g. think about 3D frustum culling). It is not clear how they should be implemented in the MRISC32 ISA, but using one vector register as a mask register during the store (essentially treating it as a byte select mask) would probably do the trick.

The mask

We should be able to utilize the fact that we already need three vector register file read ports for other instructions (SEL (bitwise select) and FMA (fused multiply-accumulate)), and use the register file read ports as follows:

  • 1R (scalar) for the base address.
  • 1R (vector) for the data to store.
  • 1R (vector, optional) for the address offset (for scatter store).
  • 1R (vector) for the mask register.

The mask register could be interpreted as a byte mask, allowing it to act on half-words and bytes as well as full words. Thus any s[cc][.h|.b] instruction (along with other logical instructions such as and/or/xor) can be used to produce a valid store mask.
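To make the byte-mask interpretation concrete, here is a minimal software model of a masked, stride-based word store. It is only a sketch under stated assumptions: the function name `masked_store_words` is hypothetical, memory is modeled as a little-endian `bytearray`, and a byte lane counts as enabled when its mask byte is non-zero (compare-style instructions would normally produce 0xFF or 0x00 per lane).

```python
def masked_store_words(memory, base, stride, data, mask):
    """Model of a masked word store with a per-byte select mask.

    Assumptions (not from the ISA spec): little-endian byte order, and a
    byte is written only if the corresponding mask byte is non-zero.
    """
    for i, (word, m) in enumerate(zip(data, mask)):
        addr = base + i * stride
        for b in range(4):
            if (m >> (8 * b)) & 0xFF:
                memory[addr + b] = (word >> (8 * b)) & 0xFF


mem = bytearray(16)
# Element 0: all four bytes enabled. Element 1: only the low half-word enabled,
# as a half-word compare might produce.
masked_store_words(mem, 0, 4, [0x11223344, 0xAABBCCDD], [0xFFFFFFFF, 0x0000FFFF])
```

With this model, a mask produced at byte, half-word, or word granularity drives the same store logic, which is the point of treating the mask as a byte mask.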

The instruction(s)

There are several alternatives for how to encode & interpret the special "store vector register with mask" instruction. For instance:

  1. Dedicate one (or more?) vector register as a mask register, and call it VM. Just like the scalar VL register, it can be used as a regular register by all instructions (and a HW implementation may choose to keep a separate copy of the relevant bits of the VM register in order to avoid using a regular register read port). The special store-with-mask instruction would then implicitly use that register as the mask. E.g.:
    a. stw/m v6, [r8, #4] ; Store v6 to address r8, with stride 4, and use vm as the mask register
  2. Split the 32 vector registers into 16+16 registers, where one half (e.g. the odd registers, or registers v16-v31) is implicitly used as mask registers for the store instruction, e.g. as follows:
    a. stw/m v6:v7, [r8, #4] ; Store v6 to address r8, with stride 4, and use v7 as the mask register

We could also repurpose the folding vector mode (VM=01) for masking, since folding is not supported by load/store instructions anyway.

Outstanding questions

  • For stride based stores, should the address be incremented for non-stored elements?
    • There are probably pros and cons with both variants, but incrementing the address for both stored and non-stored elements is simpler to implement in hardware, and has a nice logic to it - especially if we want to support masked bytes / half-words too.
    • For scatter stores, the solution is trivial (ignore the addresses for the non-stored elements).
  • Do we need/want to support other sizes than word (i.e. do we need masked versions of sth and stb in addition to stw)?
  • Do we need/want to use masking for other instructions than store?
    • All instructions that could raise exceptions (or produce "undefined"/NaN results) are candidates, e.g. load (could raise a page fault) and div/mod/fdiv (division by zero).
    • If implemented, what should the result be for masked elements? Preserving the old value could mean that we need to read the original value from the register file (or somewhere else), adding register file read ports and forwarding dependencies. Just zeroing the result is simpler, and probably the right thing (TM) for some instructions (e.g. loads).
    • In a serial (as opposed to parallel) hardware vector implementation, skipping elements can gain performance (elements that are not computed are zero-cycle no-ops). One example is the Convex C series machines from the 1980s, apparently. On the other hand, adding support for skipping elements in a parallel implementation is more work and does not improve performance.
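
The first outstanding question (whether the address advances for non-stored elements) can be illustrated with a small model that lists the address each element would use under both variants. This is a sketch only; `store_addresses` and its `advance_on_masked` flag are hypothetical names used for illustration, not part of any proposed encoding.

```python
def store_addresses(base, stride, mask, advance_on_masked):
    """Return the address used by each element (None if masked off).

    advance_on_masked=True:  address depends only on the element index.
    advance_on_masked=False: stored elements are packed back to back.
    """
    addrs = []
    addr = base
    for enabled in mask:
        addrs.append(addr if enabled else None)
        if enabled or advance_on_masked:
            addr += stride
    return addrs


mask = [True, False, True, True]
sparse = store_addresses(0x100, 4, mask, advance_on_masked=True)
packed = store_addresses(0x100, 4, mask, advance_on_masked=False)
```

In the first variant every address is `base + i * stride`, so all addresses can be computed independently of the mask. The second variant behaves like a compressing store: each address depends on how many earlier elements were enabled, which is one reason it would likely be harder to implement in hardware.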
@mbitsnbites
Member Author

mbitsnbites commented Jan 30, 2021

Idea: We don't use the folding vector mode for load/store. Can we repurpose it for masking? E.g. sth/m v3, [r7, v1*2]

Only works for gather-scatter though (although stride based load/store can easily be emulated using gather-scatter in combination with LDEA to produce the address stride).
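
As a rough illustration of that emulation, the sketch below models a masked half-word scatter store and feeds it an offset vector that encodes a fixed stride. Everything here is an assumption for illustration: the function name is hypothetical, the offset vector is built in plain Python as a stand-in for whatever LDEA would produce, memory is little-endian, and the mask is interpreted per byte (non-zero mask byte means enabled).

```python
def masked_scatter_store_half(memory, base, offsets, scale, data, mask):
    """Model of a masked half-word scatter store: [base + offset * scale].

    Only the low two bytes of each element (and of each mask element) are
    considered, since a half-word store writes 16 bits per element.
    """
    for off, value, m in zip(offsets, data, mask):
        addr = base + off * scale
        for b in range(2):
            if (m >> (8 * b)) & 0xFF:
                memory[addr + b] = (value >> (8 * b)) & 0xFF


mem = bytearray(24)
# A stride of 3 half-words expressed as per-element offsets 0, 3, 6, 9.
offsets = [i * 3 for i in range(4)]
data = [0x1111, 0x2222, 0x3333, 0x4444]
mask = [0xFFFFFFFF, 0x00000000, 0xFFFFFFFF, 0xFFFFFFFF]
masked_scatter_store_half(mem, 0, offsets, 2, data, mask)
```

Because the offsets are computed per element index, the masked-off element simply leaves a hole at its address, which matches the "increment for non-stored elements" behavior discussed earlier.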

@mbitsnbites mbitsnbites changed the title Add facilities for masked vector store Add facilities for masked vector load/store Jan 30, 2021