Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Push constant proposal #4612

Open
wants to merge 5 commits into
base: main
Choose a base branch
from
Open

Conversation

shaoboyan
Copy link
Contributor

Push constants proposal. Related issue #75

Copy link
Contributor

github-actions bot commented Apr 30, 2024

Previews, as seen when this build job started (f15e36a):
WebGPU webgpu.idl | Explainer | Correspondence Reference
WGSL grammar.js | wgsl.lalr.txt

dictionary GPUPipelineLayoutDescriptor
: GPUObjectDescriptorBase {
required sequence<GPUBindGroupLayout> bindGroupLayouts;
uint32_t pushConstantsSize = 0;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be explicit what size this is (byte size? number of 32-bit parameters?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Byte size.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe pushConstantBytes?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change to immediateDataRangeByteSize :)

uint32_t pushConstantsSize = 0;
};
```
two pipeline layouts are defined to be “compatible for push constants” if they were created with identical push constant size. It means push constants values can share between pipeline layouts that are compatible for push constants.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume it's a compile error for a module to use push constants outside of its defined size range?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My thoughts is that we follow out-of-bounds access here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can validate that the shader module uses no more than the push constants allowed by the module so we should be fine. Also dynamic indexing of push constants is not possible so in the shader we know that all accesses are in bounds.

| --- | --- | --- | --- | --- | --- | --- |
| `var<push_constant>` | Immutable | [Module](https://www.w3.org/TR/WGSL/#module-scope) | [Concrete](https://www.w3.org/TR/WGSL/#type-concrete) [constructible](https://www.w3.org/TR/WGSL/#constructible) [host-shareable](https://www.w3.org/TR/WGSL/#host-shareable) excludes [array types](https://www.w3.org/TR/WGSL/#array-types) and [structure types](https://www.w3.org/TR/WGSL/#struct-types) contains array members | Disallowed | | Yes. [uniform buffer](https://www.w3.org/TR/WGSL/#uniform-buffer) |

NOTE: Each [entry point](https://www.w3.org/TR/WGSL/#entry-point) can statically use at most one push constant variable.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why, exactly?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To make implementation a simpler. So we don't need to introduce @offset in wgsl to describe multiple push constant variable.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we want to enforce a single push constant interface we should consider an alternate design of making it a parameter to the entry point (identified via an attribute). If you did consider it, what did you see as the draw backs?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I personally would prefer push constants to be like bindings, and not tied to a specific entry point. I tend to keep my bindings/push constants in a separate file I can reuse across passes like so.

HLSL seems to have requirements around having push constants be structs anyways gfx-rs/wgpu#5571

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was my strawman in #75 (comment) considered at all? If we assume 4-byte elements and alignment, I think this could become a lot easier.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@magcius in that proposal, where are @parameter variable declared? At the global scope of as function arguments? It seems very similar to specifying an offset, except that it stays in "words" instead of bytes. It'd be nice to support smaller data types in push constants eventually so IMHO it's still be better to act on bytes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we want to enforce a single push constant interface we should consider an alternate design of making it a parameter to the entry point (identified via an attribute). If you did consider it, what did you see as the draw backs?

I agree it is an optional design. Personally, I'm slightly prefer current address space mode proposal because:

  • It defines a global scope variable (like uniform), so that the helper functions could use it directly. ( instead of accepting attributes as parameter, if we pass push constant variable as entry point function parameter)
  • From API aspect, pushConstantBytes is related with PipelineLayout rather than ShaderModule. Take it as a special binding resource is natural.


```javascript
interface GPUCommandEncoder {
void setPushConstantU32(uint32_t offset, uint32_t value);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What units is offset in? If bytes, are there any restrictions? If I do setPushConstantU32(1, 0x12345678);, will it overwrite bytes 1, 2, 3, 4 and leave byte 0 alone? Should there be alignment restrictions?

Copy link
Contributor Author

@shaoboyan shaoboyan Apr 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Same as bufferOffset )
It is in bytes, and it should be multiple of 4 bytes. And the alignment should follow this(https://gpuweb.github.io/gpuweb/wgsl/#alignment-and-size). So it means a the pushconstantSize = variable sizes + padding sizes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about f16? Should it just be set via a packed shared buffer?

Copy link
Contributor Author

@shaoboyan shaoboyan May 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

F16 should be set via packed shared buffer, or setPushConstantU32 so that we don't change the bits.

void setPushConstantU32(uint32_t offset, uint32_t value);
void setPushConstantI32(uint32_t offset, int32_t value);
void setPushConstantF32(uin32_t offset, float32_t value);
void setPushConstants(uint32_t offset, uint32_t count, ArrayBuffer data);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should probably take a AllowSharedBufferSource

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it could.


| Address space | Sharing among invocations | Default access mode | Notes |
| --- | --- | --- | --- |
| `push_constant` | Invocations in [shader stage](https://www.w3.org/TR/WGSL/#shader-stages) | [read](https://www.w3.org/TR/WGSL/#access-read) | For [uniform buffer](https://www.w3.org/TR/WGSL/#uniform-buffer) variables exclude [array types](https://www.w3.org/TR/WGSL/#array-types) variable or [structure types](https://www.w3.org/TR/WGSL/#struct-types) variable contains [array types](https://www.w3.org/TR/WGSL/#array-types) attributes |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the uniform buffer reference a copy paste error?

Separately, are arrays excluded to avoid Vulkan's default uniform access requirements? It will seem arbitrary to users unfamiliar with Vulkan. Is this a better resolution than requiring non-uniform indexing (probably)? Is it better than having a uniformity requirement?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the uniform buffer reference a copy paste error?

No, I think uniform buffer variable defined in wgsl aligns with what I want with push constant. I think push constant variables should be host-shareable constructible type, and must satisfy the address space layout constraints.

Separately, are arrays excluded to avoid Vulkan's default uniform access requirements?

Not only Vulkan, but also DX12. DX12 says Arrays are not permitted in constant buffers that get mapped onto root constants since dynamic indexing in the root signature space is not supported (https://learn.microsoft.com/en-us/windows/win32/direct3d12/using-constants-directly-in-the-root-signature)

| --- | --- | --- | --- | --- | --- | --- |
| `var<push_constant>` | Immutable | [Module](https://www.w3.org/TR/WGSL/#module-scope) | [Concrete](https://www.w3.org/TR/WGSL/#type-concrete) [constructible](https://www.w3.org/TR/WGSL/#constructible) [host-shareable](https://www.w3.org/TR/WGSL/#host-shareable) excludes [array types](https://www.w3.org/TR/WGSL/#array-types) and [structure types](https://www.w3.org/TR/WGSL/#struct-types) contains array members | Disallowed | | Yes. [uniform buffer](https://www.w3.org/TR/WGSL/#uniform-buffer) |

NOTE: Each [entry point](https://www.w3.org/TR/WGSL/#entry-point) can statically use at most one push constant variable.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we want to enforce a single push constant interface we should consider an alternate design of making it a parameter to the entry point (identified via an attribute). If you did consider it, what did you see as the draw backs?


```javascript
interface GPUCommandEncoder {
void setPushConstantU32(uint32_t offset, uint32_t value);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about f16? Should it just be set via a packed shared buffer?

Four new functions in `GPUCommandEncoder`.

```javascript
interface GPUCommandEncoder {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It feels like these commands should either take a pipeline or be on render/compute encoders.

Copy link
Contributor Author

@shaoboyan shaoboyan May 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On render/compute encoders. And that's why I tend to add these APIs in GPUCommandEncoder.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As it stands wouldn't all pipelines need the same push constants. That's more what I was getting at of needing to associate the command with a particular pipeline.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GPUCommandEncoder is the top-level encoder. If you want to add these methods to render/compute encoders you need to add them to GPUBindingCommandsMixin (assuming they be used in render bundles).

We could associate push constant values with a particular pipeline by passing them as part of setPipeline() (kind of parallel to overrides in createPipeline()), but since IIUC setPushConstant() is often going to be called multiple times per setPipeline(), I think it makes more sense to make it parallel setBindGroup(): there are separate states for the current {pipeline, bind groups, push constants} and we capture all the current states each time we hit a draw/dispatch call.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As it stands wouldn't all pipelines need the same push constants. That's more what I was getting at of needing to associate the command with a particular pipeline.

Yes. My expected use case is:

passEncoder.setPipeline().
passEncoder.setBindGroup();
passEncoder.setPushConstant();
...
// setBindGroup();
// setPushConstant();

If you want to add these methods to render/compute encoders you need to add them to GPUBindingCommandsMixin (assuming they be used in render bundles).

Thanks! That's the place I expected.

proposals/push-constants.md Outdated Show resolved Hide resolved
@kdashg kdashg added the wgsl WebGPU Shading Language Issues label Apr 30, 2024
@kainino0x kainino0x added api WebGPU API proposal labels Apr 30, 2024
@shaoboyan shaoboyan requested a review from toji May 7, 2024 08:09
@mwyrzykowski mwyrzykowski self-requested a review May 8, 2024 23:29
@kdashg kdashg added this to the Milestone 2 milestone May 8, 2024
void setPushConstantU32(uint32_t offset, uint32_t value);
void setPushConstantI32(uint32_t offset, int32_t value);
void setPushConstantF32(uin32_t offset, float32_t value);
void setPushConstants(uint32_t offset, uint32_t count, AllowSharedBufferSource data);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Kelsey points out we would want to add a dataOffset and dataCount (like other entry points that take BufferSources) so we can point directly into a larger array buffer instead of copying it into a new ArrayBuffer).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the API shape to void setImmediateDataRange(uint32_t rangeOffset, AllowSharedBufferSource data, optional dataOffset, optional size);

void setPushConstantU32(uint32_t offset, uint32_t value);
void setPushConstantI32(uint32_t offset, int32_t value);
void setPushConstantF32(uin32_t offset, float32_t value);
void setPushConstants(uint32_t offset, uint32_t count, AllowSharedBufferSource data);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general this looks good, but I think we only need the last signature taking an AllowSharedBufferSource and not the other overloads (U32, I32, F32, etc).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. But tend to keep these U32, I32, F32 for convenience , especially we tend to support small amount of data uploading.

@mwyrzykowski
Copy link

I think this will have impact on the already complex matrix of GPUSupportedLimits.

Have we considered making setVertexBuffer take an ArrayBufferSource? Then we don't need to change the limits.

I didn't take a look at the D3D or Vulkan backends yet so maybe that is the motivation? On Metal, this is as simple as calling setVertexBytes: / setFragmentBytes:

@litherum
Copy link
Contributor

litherum commented May 9, 2024

Assuming that push constants would be implemented using D3D12's root constants...

If I'm not mistaken, D3D12 has a fixed (per-device?) amount of space in the root signature. In shipping WebGPU, the WebGPU implementation gets to decide how to use that space.

For example, if the WebGPU Pipeline Layout includes 2 textures, the WebGPU implementation can place those 2 texture descriptors directly inside the D3D12 root signature. Or, alternatively, the WebGPU implementation can choose to put both texture descriptors into a descriptor table, and make the root signature point to the descriptor table. The WebGPU implementation has freedom here to use this fixed root signature space as it desires.

Okay, now let's add in support for push constants. Push constants would also live directly within the fixed root signature space. But what if all the root signature space is already occupied (e.g. by those texture descriptors from the above paragraph)?

Even if WebGPU only allows a single push constant, that doesn't actually solve the problem. The root signature space may be 100% occupied by descriptors, leaving no space at all for push constants.

There are a variety of potential solutions to this problem:

  • Make push constants space allocation be best-effort. If there isn't space within the root signature, fail to set the push constant. Different devices will fail in different situations.
  • Tell every implementation that it must leave X bytes of extra space in the root signature for push constants, where "X" is a literal value spelled out in the spec. This is unfortunate because it pessimizes - the app might not need all X bytes, and the remaining bytes might be better spent on descriptors
  • Have the app request how much push constant space it needs, within the WebGPU Pipeline Layout, and add platform-agnostic validation rules to make sure the app doesn't ask for too much space. Inventing such validation rules will be tricky, because the rules would have to balance the freedom of WebGPU implementations to place descriptors freely, with the desire to allow applications to request as much push constant memory as they want/need. Another reason I don't like this option is that, if the application requests too much push constant space and pipeline layout creation fails, that doesn't mean they can't access data of the requested size, but it instead just means they can't access data of the requested size as push constants.
  • (My favorite option): Let push constants be an implementation detail, rather than an explicit part of the WebGPU API, and treat the root signature space as just another part of the pipeline layout.
    • As @mwyrzykowski suggests, the group could add a new method to GPUBindingCommandsMixin (as a sibling to setBindGroup()) that accepts immediate data (in the form of an ArrayBuffer) rather than a BindGroup. (Or, maybe, alternatively, this could be an overload of createBindGroup() that takes immediate data.)
    • The group could even add a maxBindingSize field to GPUBufferBindingLayout, which would allow the the WebGPU pipeline creation routine can know whether the data is small enough that it can guarantee that the data fits into push constant space.
    • Limits accounting doesn't change at all, which is appealing

@litherum
Copy link
Contributor

litherum commented May 9, 2024

@Kangz
Copy link
Contributor

Kangz commented May 14, 2024

GPU Web WG 2024-05-08 Pacific-time
  • SY: Added a new address space called “push_constant”. Similar to uniform but:
    • Each entry point can only use one push_constant binding, for easier layout.
    • Cannot index arrays in push_constant space.
  • SY: Question of how large the space can be. We need to reserve some space for the implementation. Limit is small on backends (e.g. root descriptor in d3d12)
  • SY: Also question of how to upload push constants. Proposed to add where setBindGroup is available (render pass, compute pass, and render bundle encoders).
  • SY: Improved performance over uniform buffer
  • KG: Mozilla needs some time to review this
  • SY: Alan is asking about lifting the restriction for one push_constant per entry point.
  • MM: Isn’t the amount of available space dependent on the number of bind groups (which also take space in the root descriptor)?
    • SY: Yes, d3d12 64 DWORDs. Even a small limit is useful, though prefer to make it larger
    • MM: If you add accounting that bind groups + push constants is under some limit (think this is a good idea) then the mapping from bind groups to root constants needs to be described in our spec. Restricts implementations ability to hoist push constants in/out of tables.
    • SY: Would you mind leaving a comment on the PR?
    • MM: Sure
    • MW: Is there a reason we are designing this proposal instead of setVertexBuffer with dynamic data or similar?
    • MM: 3 or 4 years ago we discussed this; we ended up on the trail of whether push constants was actually faster than putting the data in a buffer. Why can’t it be polyfilled by the application. Not sure if performance was proven to be better. Then the discussion veered off into whether, if we’re going to specify push constants, should we say there are no limits on push constants, and the implementation would only use backend push constants if it was small enough. Otherwise silently promote to a buffer.
    • MW: Do think the performance is better (at least in metal backend for small sizes). Question is why introduce new comment instead of allowing attaching an ArrayBuffer as a vertex buffer.
      • Other thoughts:
        • I mean we could have either:
        • void setPushConstants(uint32_t offset, uint32_t count, AllowSharedBufferSource data);
        • or
        • setVertexBuffer(GPUIndex32 slot, AllowSharedBufferSource data, optional GPUSize64 offset, optional GPUSize64 size);
        • The latter seems simpler so we don't have to rework the GPUSupportedLimits as this will impact those. The backend internally can decide whether or not to create a buffer or use push constants.
        • There are also implications where if this is supported by GPURenderBundle, then we would likely create a buffer in any case
    • MM: In SPIR-V, shader needs to know whether it’s reading from a push constant. Would need to know at compile time where it’s pulling the data from.
    • KG: For why we have setPushConstantU32/I32/F32 instead of just the AllowSharedBufferSource overload, still can be debated.
  • KG: (This issue here is M2, so making this M2 also)
  • KG: We will take a closer look.

@shaoboyan
Copy link
Contributor Author

Let push constants be an implementation detail, rather than an explicit part of the WebGPU API, and treat the root signature space as just another part of the pipeline layout.

Just point out possible implementation issues I observed:

  • On Metal this works fine since it is super align with metal API SetVertexBytes() and so on. But it might have some problem for Vulkan and D3D12. Both Vulkan and D3D12 defines push constants or root constants in pipelineLayout. When calling SetPushConstants(), we need to ensure the pipeline has such push constants range. I think this implementation hints that we need to allocate remain root space as root constants and we need to always keep a internal binding point for each bind group to handle the case that "User setPushConstants() but we don't have enough space, so fallback to a buffer".

@mwyrzykowski
Copy link

I think this implementation hints that we need to allocate remain root space as root constants and we need to always keep a internal binding point for each bind group to handle the case that "User setPushConstants() but we don't have enough space, so fallback to a buffer".

It sounds like D3D and Vulkan can be optimized to not use a buffer when the push constants are small enough and there is space. With Metal, we always need to reserve a buffer slot for the push constants. Though it would be possible to map them all into one buffer slot.

Abstracting this concept seems to further justify that the term 'push constants' does not necessarily need to exist in WebGPU. Rather, the D3D and Vulkan backends can implement small, dynamic setBindGroup calls into push constants and on Metal, we would use set[Vertex/Fragment]Bytes unless the data is large, then we would make a buffer

@JMS55
Copy link

JMS55 commented May 15, 2024

If you tie push constants to specifically render pipelines, then you can't use them for compute shaders (and later, other types of pipelines), which would be unfortunate.

@litherum
Copy link
Contributor

@JMS55

Yeah, people in this thread are using a variety of different terminology. Above, when people say "setVertexBytes()" or "setFragmentBytes()," I believe they are referring to the Metal functions of the same name, as an example of a function that takes immediate data (rather than a method taking resources or bind groups). Even in Metal, there's a way to supply immediate data to compute pipelines.

@litherum
Copy link
Contributor

tl;dr: Thinking deeper, I think the only way it can work in Vulkan is either to just not add the feature at all, or add an explicit API for it.

I think the biggest unanswered question here regarding "let push constants be an implementation detail" is this: Consider such a program:

let renderPassEncoder = commandEncoder.beginRenderPass(...);
renderPassEncoder.setPipeline(...);
renderPassEncoder.setImmediateData(new Int32Array([42])); // Unclear how this will interact with bind groups, but let's assume that problem is surmountable...
renderPassEncoder.draw(...);
renderPassEncoder.setImmediateData(new Int32Array(/* a very long array */));
renderPassEncoder.draw(...);

The question is: What should the code in the shader be to handle both of these cases? In the first case, we'd want to promote the 42 to a push constant (probably? AFAICT it hasn't been demonstrated whether push constants are actually consistently better to use cross-platform), but in the second case, the data is too big to fit in push constants. Both situations would (probably) need to be supported by a single shader - setPipeline() is only called once in the above code sample.

In SPIR-V, the text of the shader is actually different depending on whether or not you're accessing a push constant or not. For example, in GLSL (which I'm using here just as a readable form of SPIR-V), you'd either have this:

layout(push_constant) uniform Constants {
	int pushConstantData;
};

Or this:

layout(set = 0, binding = 0) uniform Constants {
	int uboData;
};

If we operate under the constraint that no amount of recompilation is allowed to happen while commands are being recorded into the render pass - not even supplying specialization constants - then the only way this can work is if both blocks are present in the shader, and each access to the data is guarded by a flag - in software - to determine whether the data was bound to pushConstantData or uboData. Adding this selection in software probably defeats the whole purpose of trying to use push constants in the first place. It's probably a dealbreaker. (But its performance should probably be measured nevertheless.)

Assuming it is in fact a dealbreaker, it seems like the only way this could ever work (again, assuming that the committee does want to make it work, which I'm still not convinced has been sufficiently justified) would be to come up with a system where the implementation can guarantee that the immediate data being set by the app will fit into push constant space. This is why I suggested a maxBindingSize field within the pipeline layout. Such a field could allow the implementation to know that the data provided by the application will fit into push constants.

But there's even an additional problem - what about the following program:

let renderPassEncoder = commandEncoder.beginRenderPass(...);
renderPassEncoder.setPipeline(...);
renderPassEncoder.setImmediateData(new Int32Array([42]));
renderPassEncoder.draw(...);
renderPassEncoder.setBindGroup(/* clobbering the immediate data above */);
renderPassEncoder.draw(...);

How should this work, assuming no amount of recompilation is acceptable? Imagine the bind group contains a buffer whose contents is populated by a previous compute shader - the CPU has never seen this data, and therefore such data cannot be hoisted to push constants. There would have to be some way, at the API level, to distinguish the immediate data from the bind group data, so that the shader wouldn't have to handle both cases in software at runtime.

I think the only way this could work is to add a new type of descriptor in the bind group layout. That way, an entry in the bind group layout could either refer to a buffer, or refer to immediate data, but not both. Any data set by immediates for one draw call would be incapable of being clobbered by a GPU-populated buffer for a subsequent draw call - thereby allowing the shader compiler to know that it's safe to only emit the layout(push_constant) block above.

So then that begs the question: What should this new type of descriptor in the bind group layout be named? Perhaps it should be named ... "push constant"...

@magcius
Copy link

magcius commented May 15, 2024

This is turning into a bit of a bikeshed opportunity, but I'll add my two cents to the jar:

  • I do not want shader recompilation at command buffer runtime, after pipelines have been created. This would be a very unfortunate thing to require implementations to do, after we spent the effort to put PSOs into the spec. Expensive synchronous recompilations cause stalls that we are trying to avoid in WebGPU.
  • The big value I see in push constants is boot-strapping indices into other buffers or bind groups; this is where I would expect to see them come in the most handy.
  • Just because D3D12 offers an ability to embed some maximum number of root constants, does not mean that going all the way up to that number is necessarily free. My proposal was limited to 4 u32's for a reason.
  • Making push constants transparent to the application would be extra difficult to support uses that mix both bind groups and push constants. See below for more details.
  • Push constants might be especially relevant when talking about newer tools like "bindless", where shaders index arrays of resources like textures or buffers.

To give a bit of an unprompted history lesson and awareness of how these features are implemented in practice...

The simplified AMD interpretation of push constants (not applicable to all drivers and hardware, just giving one mental model): The hardware has some general-purpose 32-bit scalar registers (SGPRs); these registers are active storage for threads, but also can be initialized with user constants set up by the hardware that launches threads. Some of these registers are used by the driver itself for internal purposes (e.g. the "base draw offset" feature from Vulkan would get passed by an SGPR), some of them are used by the driver to implement binding models (D3D12 CBVs might get mapped directly to an SGPR), and some of them are used to store memory addresses to binding tables (e.g. if you have a lot of textures, the driver will allocate a block of memory which has the texture descriptors, and store a pointer to the table in an SGPR; these are D3D12's "descriptor tables").

The D3D12 idea was that all root signature parameters (root constants, CBVs, descriptor tables) would map to SGPRs; however, as the driver sometimes needs some for its own purposes, this isn't a 1:1 mapping, and sometimes the driver has to jettison some of the parameters to a memory block and do some finagling to rearrange all of the binding updates. Note that since the usage of an SGPR is arbitrary (it can contain a memory address or a value, just depending on how it's used), the usage needs to be determined at root signature time; in practice, the AMD compiler effectively reserves a few driver SGPRs that work across all platforms.

So, in order to make "transparent push constants" work, we can't simply replace the first N bytes of a uniform buffer of a bind group with immediate data; using it with a bind group will require a pointer in the SGPRs, while using it with push constants will require the values in SGPRs. Instead, we'll need syntax that lets us optimistically compile the first N push constants to SGPRs, and then the rest get jettisoned to a memory buffer. This is enabled a bit more by the extra pipeline layout field so we know up-front what will fit, but it still seems more complicated than just having a small fixed-size limit in the specification. We can always expand to support bigger sizes through emulation later.

@litherum
Copy link
Contributor

litherum commented May 15, 2024

The big value I see in push constants is boot-strapping indices into other buffers or bind groups

shaders index arrays of resources like textures or buffers.

I think we have to be careful here - if the biggest use case is simply "make some indices available to a shader" then a simple API convenience method would be sufficient, without changing the binding model or adding push constants at all.

@litherum
Copy link
Contributor

(I just realized that we already have that simple API convenience method: GPUQueue.writeBuffer(). I think the burden of proof needs to be "why push constants in addition to writeBuffer()?"

@shaoboyan
Copy link
Contributor Author

shaoboyan commented May 17, 2024

I think the burden of proof needs to be "why push constants in addition to writeBuffer()?"

A quick thoughts is that developer don't need to call CreateBuffer() and manage that buffer to hold these small amount of data.

@magcius
Copy link

magcius commented May 17, 2024

I think the burden of proof needs to be "why push constants in addition to writeBuffer()?"

writeBuffer cannot be used while inside of a render pass, as it executes at the queue level

@litherum
Copy link
Contributor

Aha! You’re right.

So I guess the alternative would be “something like writeBuffer but works in a render pass”

@magcius
Copy link

magcius commented May 17, 2024

There is no way to support writeBuffer-but-it-works-in-a-render-pass easily in D3D12 and Vulkan. You can support a limited amount of data being sent immediately to the shader through D3D12 root constants and Vulkan push constants, but that limit is fairly small, and it requires special shader changes.

One alternative to push constants on Vulkan is dynamic uniform buffer offsets, but those have heavy limitations on alignment that can be detrimental.

If you go down the end of this road, you end up at OpenGL uniforms/uniform buffers, and D3D11 constant buffers, both of which have numerous, well-documented pitfalls and performance hazards. The easiest and safest way to use D3D11 constant buffers were with the DISCARD/NOOVERWRITE flags which effectively turned them into our dynamic uniform buffers. They were replaced in D3D12/Vulkan.

if the biggest use case is simply "make some indices available to a shader" then a simple API convenience method would be sufficient, without changing the binding model or adding push constants at all.

What kind of features are you thinking of?

dictionary GPUPipelineLayoutDescriptor
: GPUObjectDescriptorBase {
required sequence<GPUBindGroupLayout> bindGroupLayouts;
uint32_t pushConstantsByte = 0;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is pushConstantsByte? Should this be pushConstantsSize?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, maybe pushConstantsBytes? The previous comments is that we don't know the pushConstantSize is defined as bytes or number.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changing to immediateDataRangeByteSize :)

@@ -0,0 +1,73 @@
# PushConstants

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we really use the term 'push constants' which seems like a non-universal, backend specific name as opposed to something like 'immediate data'?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

@mwyrzykowski
Copy link

There is no way to support writeBuffer-but-it-works-in-a-render-pass easily in D3D12 and Vulkan. You can support a limited amount of data being sent immediately to the shader through D3D12 root constants and Vulkan push constants, but that limit is fairly small, and it requires special shader changes.

We could also make setBindGroup take an ArrayBuffer and document that the corresponding pipeline's immediate / push constant size in bytes must be >= size in bytes of the given ArrayBuffer or ArrayBuffers bound using this immediate mode.

I didn't check the D3D specifics but that seems like it would work in Vulkan and Metal

@shaoboyan
Copy link
Contributor Author

We could also make setBindGroup take an ArrayBuffer and document that the corresponding pipeline's immediate / push constant size in bytes must be >= size in bytes of the given ArrayBuffer or ArrayBuffers bound using this immediate mode.

My understanding is that the benefit of this is we don't introduce new API name?
Because this seems identical to void setPushConstants(uint32_t offset, uint32_t count, ArrayBuffer data);.

@Kangz
Copy link
Contributor

Kangz commented May 20, 2024

Recapping a bunch of the discussion above (hopefully the non-controversial points):

  • We see the need to pass in a few per-draw indices in the shader to bootstrap access to other resources. Dynamic uniform buffers are the current solution but a bit too expensive (due to the alignment requirement) and annoying to use.
  • The constraint on no pipeline recompilation means that something has to be declared at pipeline compilation time, likely in the pipeline layout.
  • Push constant as a name is a Vulkan-ism. We'll need to bikeshed the name at some point.

The "immediate" terminology @litherum used is nice because it is related to what the developer intends to do (immediately set this piece of data) (also similar to immediates in assembly instructions like "move") instead of a potential hardware mechanism (push constants).

IMHO it would be better to have immediates set directly instead of via an aggregate command like setBindGroup because there are likely many cases (alluded to by @magcius) where no other resource needs to change, just information as to where to index in these resources. We are also fairly constrained by Vulkan because the WebGPU spec uses the 4 guaranteed Vulkan descriptor set slots, so we can't really use a new one just for immediates.

@mwyrzykowski isn't your implementation of WebGPU using set*Bytes already? In Dawn we needed to pass some resource sizes, viewport bounds and a few other things in there. It would be possible to have a buffer slot used for both the developer's immediates and the implementation's immediates (by concatenating all of them).

@mwyrzykowski
Copy link

@mwyrzykowski isn't your implementation of WebGPU using set*Bytes already? In Dawn we needed to pass some resource sizes, viewport bounds and a few other things in there. It would be possible to have a buffer slot used for both the developer's immediates and the implementation's immediates (by concatenating all of them).

Yes we already call setBytes/setVertexBytes/setFragmentBytes. This is a good point, we could reuse the slot assuming this is limited to a certain number of bytes (like 1024 or less). So then no impact to the limits 👍

I agree the term 'immediate' is very much preferred over 'push constants'.

@litherum
Copy link
Contributor

  • The constraint on no pipeline recompilation means that something has to be declared at pipeline compilation time, likely in the pipeline layout.

I don't think this has been proven yet. This would be true if:

  1. Performance was the motivation for push constants, rather than convenience, and
  2. Dynamically selecting between push constants and buffers in software in the shader was demonstrated to be too slow.

@litherum
Copy link
Contributor

litherum commented May 20, 2024

There is no way to support writeBuffer-but-it-works-in-a-render-pass easily in D3D12 and Vulkan.

if the biggest use case is simply "make some indices available to a shader" then a simple API convenience method would be sufficient, without changing the binding model or adding push constants at all.

What kind of features are you thinking of?

You're right that, from within a render pass, it's impossible* to update a buffer in a well-defined way (where "well-defined" means: a set of known draw calls from early in the render pass don't see the update, and a set of draw calls from later in the render pass do see the update).

(*For some definition of "impossible.")

Therefore, if the goal was just convenience, it would have to work like this:

let renderPassEncoder = commandEncoder.beginRenderPass(...);
renderPassEncoder.setPipeline(...);
let newBindGroup = device.createBindGroup(..., new Int32Array([42]), other, resources, etc); // immediate data gets either put into temp buffer storage (which IIRC all implementations already use) and/or push constants
renderPassEncoder.setBindGroup(0, newBindGroup); // gets turned into both vkCmdBindDescriptorSets() and vkCmdPushConstants()
renderPassEncoder.draw(...);
let secondNewBindGroup = device.copyBindGroupWithUpdate(newBindGroup, new Int32Array([43])); // only update the immediate data
renderPassEncoder.setBindGroup(0, secondNewBindGroup); // gets turned into _just_ vkCmdPushConstants(), because the previously-bound Vulkan descriptor set is still correct
renderPassEncoder.draw(...);

(Note that the second call to setBindGroup() only ends up setting push constants, which solves @Kangz's concern "there are likely many cases ... where no other resource needs to change.")

@Kangz
Copy link
Contributor

Kangz commented May 21, 2024

I'm not sure that copyBindGroupWithUpdate is that much more convenient than setImmediateU32(0, 42) TBH. Also won't you need to make placeholder value in bindgroup creation in most cases since indices into that bindgroup are not known? (or specify it is defaulted to 0)

@litherum
Copy link
Contributor

I'm not sure that copyBindGroupWithUpdate is that much more convenient than setImmediateU32(0, 42) TBH.

Sorry, I don't think I was being clear. From where I stand, it looks like there are these possible paths forward:

  1. (The null hypothesis) Implementations make no changes. Don't use push constants.
  2. Don't make any change to the WebGPU spec, but implementations opportunistically and silently promote specific calling patterns to use push constants under-the-hood. Implementations compile shaders with both codepaths (push constants and traditional buffers) with a software flag to switch between them at runtime.
  3. Same as (2), but add additional "immediate" methods to WebGPU, with their semantics defined using the existing binding model (and therefore can always be implemented on top of traditional buffers), but whose semantics make it clear that the provided data would be good candidates to (silently!) hoist to push constants.
  4. Same as (3), but make the "immediate" methods only work with a new kind of binding layout type specifically for immediate data. If the data provided is too large, silently fall back to using buffers under the hood. Shaders still have to be compiled with both runtime-switchable code paths, but the API implementation gets a little simpler (by essentially treating buffers and immediate data as namespaced separately - rather than both ultimately being backed by buffers).
  5. Same as (4), but limit the size of immediate data so that it's guaranteed to fit into push constants. Shaders only need to be compiled with a single path. Add a bunch of validation rules to guarantee the immediate data can coexist with the rest of the pipeline layout.

If performance is the goal (e.g. performance with push constants is shown to be consistently better than buffers on a variety of vendors' cards), and it's demonstrated that a software flag to switch between both codepaths is too slow, then option 5 is the only acceptable option.

However, if convenience is the goal, or a software codepath flag is fast enough, options 2-4 are better than options 0 or 5.

@magcius
Copy link

magcius commented May 29, 2024

Don't make any change to the WebGPU spec, but implementations opportunistically and silently promote specific calling patterns to use push constants under-the-hood.

If e.g. Chrome does this but WebKit does not, this means that performance can vary rapidly between vendor, leading to a real issue with performance portability. Making portable web applications that run acceptably across vendors is already hard enough.

While I don't think there's any spec guarantee that createShaderModule/createRenderPipeline only compiles a single shader, I think it's a reasonable assumption. Having web vendors spend precious startup time compiling multiple variations of the same shader "just in case" seems silly when I could instead tell the API what I want.

From that perspective, options 2, 3, and 4 are unattractive to me as a web developer and would be more inconvenient to me than options 0 and 5, because now I have multiple ways of doing something and the performance characteristics can change not just based on the GPU vendor, but also the browser vendor as well.

@litherum
Copy link
Contributor

litherum commented May 29, 2024

If e.g. Chrome does this but WebKit does not, this means that performance can vary rapidly between vendor, leading to a real issue with performance portability.

This seems to assume that performance is the motivation for push constants, which I'm not sure is true. But, if it is, you are correct, so long as the spec doesn't describe what the hoisting heuristic is (which the spec could do!)

I don't think there's any spec guarantee that createShaderModule/createRenderPipeline only compiles a single shader

None of the options I listed involve a single call to createShaderModule()/createRenderPipeline() compiling multiple shaders. I'm describing a single shader being compiled, which contains both layout(push_constant) uniform Constants1 { stuff } constants1 and layout(set = 0, binding = 0) uniform Constants2 { stuff } constants2 with accesses to stuff being guarded with a software flag.

the performance characteristics can change not just based on the GPU vendor, but also the browser vendor as well.

I think this is reductive. The amount of push constants space in Vulkan will be different than the amount of root constant space in D3D12 (which will be different than any potential driver optimizations of setVertexBytes() in Metal would use). If the spec limits WebGPU push constants to only fit into the smaller of the two sizes, implementations would be leaving either performance, or memory, on the table. I understand the desire for portable performance, but that also needs to be weighed against going as fast as the underlying API allows for.

@shaoboyan
Copy link
Contributor Author

This seems to assume that performance is the motivation for push constants, which I'm not sure is true

I'll run my RootConstant Sample on various vendors GPUs to see the results.

Sorry, I don't think I was being clear. From where I stand, it looks like there are these possible paths forward:

I'm still concern about the implementation burdens we take. Especially on Vulkan backend we don't have extra descriptor set slot for implicit uploading buffer. And a switch-branch in shader has concerns too.

I think this feature doesn't want to encourage developers to uploading large data. That's why I'm not a fan of 3, 4. I don't want to provide a method that hints people "We can upload large amount of data in render/compute passes". To avoid confusing developers about "which uploading approach should I choose ?", a LIMIT helps hint "Immediate data" is only for small amount of data, even we do a silently fallback in implementation in final state (For some large limit size).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api WebGPU API proposal wgsl WebGPU Shading Language Issues
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

10 participants