CUDA-PT


Unidirectional path tracing implemented in CUDA, making use of C++17 traits and templated wherever possible.

This will be benchmarked against AdaPT, as well as CPU-based renderers such as pbrt-v3 (generic accelerators) and Tungsten (Intel Embree).

Since I have no intention of making this an extensive project (like AdaPT, which takes care of all the user-facing aspects) and I am doing this mainly to challenge myself with more difficult parallel program design, this repo will not be as user friendly and its scalability will be far worse than AdaPT's. I will try to keep the chores minimal and focus on heterogeneous program design.

  • Toy CUDA depth renderer with profiling:

  • Unidirectional path tracing with AABB culling: full traversal without spatial partitioning. At this stage, shared memory and constant memory are made use of, and a special kind of variant is employed (std::variant is not supported by CUDA; std::visit either crashes or is rejected by the compiler). This version of UDPT can be 3-8x faster than my AdaPT renderer (Taichi lang, JIT CUDA backend). A minimal AABB slab-test sketch is given after the images below.

(Images: Depth Renderer | Unidirectional PT)
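
A minimal sketch of the slab test used for AABB culling before the brute-force primitive traversal; the `Ray` and `AABB` layouts here are illustrative assumptions, not the repo's actual structures.

```cuda
#include <cuda_runtime.h>

struct Ray  { float3 o, d; };
struct AABB { float3 mn, mx; };

// Slab test: returns true if the ray hits the box within [0, t_max).
__device__ __forceinline__ bool hit_aabb(const Ray& r, const AABB& box, float t_max) {
    float3 inv_d = make_float3(1.f / r.d.x, 1.f / r.d.y, 1.f / r.d.z);
    float tmin = 0.f, tmax = t_max, lo, hi;
    lo = (box.mn.x - r.o.x) * inv_d.x; hi = (box.mx.x - r.o.x) * inv_d.x;
    tmin = fmaxf(tmin, fminf(lo, hi)); tmax = fminf(tmax, fmaxf(lo, hi));
    lo = (box.mn.y - r.o.y) * inv_d.y; hi = (box.mx.y - r.o.y) * inv_d.y;
    tmin = fmaxf(tmin, fminf(lo, hi)); tmax = fminf(tmax, fmaxf(lo, hi));
    lo = (box.mn.z - r.o.z) * inv_d.z; hi = (box.mx.z - r.o.z) * inv_d.z;
    tmin = fmaxf(tmin, fminf(lo, hi)); tmax = fminf(tmax, fmaxf(lo, hi));
    // Only run the full per-primitive test when the box is actually hit.
    return tmin <= tmax;
}
```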
  • CUDA texture bindings (with normal or UV maps)
  • GPU-side BVH implementation. This will be the most difficult part, since "it is always easy to write your program with parallelism, but difficult to make it fast".
shared / constant / texture memory acceleration
  • For the naive full-traversal implementation, threads in a block can batch geometries (together with UV coordinates, normals, etc.) and copy them to shared memory. Since shared memory is limited (49152 bytes on my device) and we need enough blocks to keep the streaming multiprocessors occupied, the batch size should be tuned experimentally (see the sketch after this list).
  • Constant memory can be used to store object information (color, emission), since it does not take much space (the 65536 bytes of constant memory on my device are clearly sufficient).
  • Texture memory: I have never tried this before. Excited! CUDA texture bindings offer hardware BILERP, which is amazing. A host-side texture-binding sketch follows the batching sketch below.
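
As a rough sketch of the batching idea above, the threads of a block cooperatively stage a batch of triangles in shared memory while per-object attributes live in constant memory. `BATCH_SIZE`, `Triangle` and `ObjectInfo` are placeholders to be tuned or replaced by the real layouts.

```cuda
#include <cuda_runtime.h>

constexpr int BATCH_SIZE = 256;   // must fit the 48 KB shared-memory budget

struct Triangle   { float3 v0, v1, v2; };
struct ObjectInfo { float3 color, emission; };

// Per-object attributes are tiny, so they fit comfortably in the 64 KB of
// constant memory and are broadcast cheaply to all threads.
__constant__ ObjectInfo c_objects[256];

__global__ void full_traversal(const Triangle* __restrict__ tris, int num_tris,
                               float* __restrict__ t_out) {
    __shared__ Triangle s_tris[BATCH_SIZE];
    int tid = threadIdx.x;
    // derive this thread's ray from its pixel index here (omitted)

    for (int base = 0; base < num_tris; base += BATCH_SIZE) {
        // Cooperative, coalesced copy of one batch into shared memory.
        for (int i = tid; i < BATCH_SIZE && base + i < num_tris; i += blockDim.x)
            s_tris[i] = tris[base + i];
        __syncthreads();

        int batch = min(BATCH_SIZE, num_tris - base);
        for (int i = 0; i < batch; ++i) {
            // ray-triangle test against s_tris[i] goes here,
            // keeping track of the closest hit per thread
        }
        __syncthreads();   // don't overwrite the batch while others still read it
    }
    // write this thread's closest hit distance into t_out
}
```

The trade-off mentioned above shows up directly in `BATCH_SIZE`: a larger batch amortizes more global loads, but eats into the shared memory that limits how many blocks can reside on each SM.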
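
And a minimal host-side sketch of binding an image (e.g. a UV or normal map) through a CUDA texture object so that sampling gets hardware bilinear filtering; the `float4` pixel layout and the helper name are assumptions.

```cuda
#include <cuda_runtime.h>

cudaTextureObject_t create_texture(const float4* host_pixels, int width, int height) {
    // Copy the image into a CUDA array with a float4 channel layout.
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float4>();
    cudaArray_t array;
    cudaMallocArray(&array, &desc, width, height);
    cudaMemcpy2DToArray(array, 0, 0, host_pixels, width * sizeof(float4),
                        width * sizeof(float4), height, cudaMemcpyHostToDevice);

    cudaResourceDesc res_desc{};
    res_desc.resType = cudaResourceTypeArray;
    res_desc.res.array.array = array;

    cudaTextureDesc tex_desc{};
    tex_desc.addressMode[0]   = cudaAddressModeWrap;
    tex_desc.addressMode[1]   = cudaAddressModeWrap;
    tex_desc.filterMode       = cudaFilterModeLinear;   // hardware BILERP
    tex_desc.readMode         = cudaReadModeElementType;
    tex_desc.normalizedCoords = 1;                      // sample with UV in [0, 1)

    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &res_desc, &tex_desc, nullptr);
    return tex;
}

// Device-side sampling: tex2D<float4>(tex, u, v) interpolates in hardware.
__device__ float4 sample_uv(cudaTextureObject_t tex, float u, float v) {
    return tex2D<float4>(tex, u, v);
}
```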
warp-level operations & stream multi-processing

For now, these do not seem applicable.

Variant-based polymorphism

Polymorphism can easily be achieved with virtual functions/classes, yet I don't think this is a good choice for GPU programming: the extra vptr will

  • Add another global memory access, which can be slow (with no chance to coalesce accesses into fewer memory transactions)
  • Prevent the compiler from inlining the function, and the stack setup for calling a non-inlined function introduces overhead.

Polymorphism based on a variant (a tagged-union-like type) avoids the above overhead.
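
A minimal sketch of what such a variant could look like on the device side, assuming two hypothetical BSDF types (the names are illustrative only): a manual tagged union keeps member access in plain loads and lets the dispatch be fully inlined.

```cuda
#include <cuda_runtime.h>

struct Lambertian { float albedo; };
struct Mirror     { float reflectance; };

// Manual tagged union: no vtable, so there is no extra pointer chase and
// the data can be read with regular (coalescable) loads.
struct BSDFVariant {
    enum class Tag : int { LAMBERTIAN, MIRROR } tag;
    union {
        Lambertian lambertian;
        Mirror     mirror;
    };
};

__device__ __forceinline__ float eval_bsdf(const BSDFVariant& b, float cos_theta) {
    switch (b.tag) {
        case BSDFVariant::Tag::LAMBERTIAN: return b.lambertian.albedo * cos_theta;
        case BSDFVariant::Tag::MIRROR:     return b.mirror.reflectance;
    }
    return 0.f;
}

__global__ void shade(const BSDFVariant* bsdfs, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = eval_bsdf(bsdfs[i], 0.5f);
}
```

The switch still costs a branch per tag, but since everything lives in a flat struct, `eval_bsdf` can be inlined into the megakernel without the vptr access discussed above.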

Spatial partition

For a scene with complex geometry, a BVH (or KD-tree) should be implemented to accelerate ray intersection. On CPUs these acceleration structures are easy to implement and naturally fast, while on GPUs, branching efficiency and memory access patterns must be considered carefully to make traversal run as fast as possible.
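
For reference, a sketch of what iterative GPU BVH traversal with an explicit per-thread stack might look like, assuming a flattened node array where the right child is stored directly after the left one; the node and primitive layouts are assumptions, not the repo's actual structures.

```cuda
#include <cuda_runtime.h>

struct BVHNode {
    float3 mn, mx;   // node bounds
    int left;        // internal node: index of left child; leaf: -1
    int prim_begin;  // leaf: first primitive index
    int prim_count;  // leaf: number of primitives
};

__device__ __forceinline__ bool hit_aabb(const float3& o, const float3& inv_d,
                                         const float3& mn, const float3& mx,
                                         float t_max) {
    float3 t0 = make_float3((mn.x - o.x) * inv_d.x, (mn.y - o.y) * inv_d.y, (mn.z - o.z) * inv_d.z);
    float3 t1 = make_float3((mx.x - o.x) * inv_d.x, (mx.y - o.y) * inv_d.y, (mx.z - o.z) * inv_d.z);
    float tmin = fmaxf(fmaxf(fminf(t0.x, t1.x), fminf(t0.y, t1.y)), fminf(t0.z, t1.z));
    float tmax = fminf(fminf(fmaxf(t0.x, t1.x), fmaxf(t0.y, t1.y)), fmaxf(t0.z, t1.z));
    return tmax >= fmaxf(tmin, 0.f) && tmin < t_max;
}

__device__ int traverse(const BVHNode* __restrict__ nodes,
                        float3 o, float3 d, float& t_hit) {
    float3 inv_d = make_float3(1.f / d.x, 1.f / d.y, 1.f / d.z);
    int stack[64], sp = 0;
    stack[sp++] = 0;                       // push root
    int hit_prim = -1;
    t_hit = 1e30f;
    while (sp > 0) {
        const BVHNode node = nodes[stack[--sp]];
        if (!hit_aabb(o, inv_d, node.mn, node.mx, t_hit)) continue;
        if (node.left < 0) {
            // Leaf: test primitives [prim_begin, prim_begin + prim_count)
            // against the ray, updating t_hit and hit_prim on closer hits.
        } else {
            stack[sp++] = node.left;       // children visited depth-first;
            stack[sp++] = node.left + 1;   // a front-to-back order would prune more
        }
    }
    return hit_prim;
}
```

Keeping the node layout compact and visiting the nearer child first are exactly the memory-access and branching considerations mentioned above.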


Current State

This repo originated from w3ntao/smallpt-megakernel. I answered the author's question on the Computer Graphics Stack Exchange and tweaked his code, so I thought to myself... why not base my work on this repo and try to make it better (though I won't call it small-pt, since it definitely won't be small after I heavily optimize the code). After solving the problems in his code, I am able to render around 20x faster than the CPU (I don't remember how many threads I used; the GPU was an RTX TITAN, though):

For a detailed analysis, please refer to my answer post linked above.