
Plans on releasing builds? #1

Open
32bitx64bit opened this issue Mar 11, 2024 · 1 comment

Comments

@32bitx64bit

This looks cool. A GPU-based Powder Toy would run so much faster.

@tugrul512bit (Owner)

It will take a lot of time to get the sand behavior right in parallel, because in the original Powder Toy sand particles depend directly on each other, which makes the conversion hard. Perhaps an approximating equation could be found and then tuned by artificial intelligence, maybe.

A GPU has so much more performance that the simulation could even become 3D, and since screens are 2D not everything is visible at once, so a lot of compute budget is available per frame. Communication bandwidth, however, is very limited (PCIe, 16-32 GB/s). Kernels should move a minimal amount of data per frame and re-compute things instead where necessary.
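
For a sense of scale, here is a rough sketch of that per-frame transfer ceiling, assuming ~16 GB/s effective PCIe bandwidth and a 60 FPS target (both numbers are illustrative):

```cpp
// Per-frame PCIe transfer budget under the assumed rates above.
#include <cstdio>

int main() {
    const double pcieBandwidth = 16e9; // bytes per second (assumed)
    const double targetFps     = 60.0;
    const double bytesPerFrame = pcieBandwidth / targetFps;
    // ~267 MB per frame is the absolute ceiling; real kernels should
    // stay far below it so compute, not transfer, dominates the frame.
    std::printf("transfer budget: %.0f MB/frame\n", bytesPerFrame / 1e6);
    return 0;
}
```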

We need some kind of interaction kernel that is fully parallel, can host any particle logic without depending on serial operations, and remains upgradable with new particle types and effects.
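
One common shape for such a kernel is a gather-style, double-buffered update: every cell reads only the previous frame and writes only its own slot in the next frame, so all work items are independent and new particle types only add dispatch cases. A minimal sketch, with hypothetical types and fields:

```cpp
#include <cstdint>
#include <vector>

enum class Type : std::uint8_t { Empty, Sand, Water };

struct Cell {
    Type  type;
    float heat;
};

// Decide the next state of one cell by reading the previous frame only.
Cell updateCell(const std::vector<Cell>& prev, int x, int y, int w) {
    Cell next = prev[y * w + x];
    switch (next.type) {               // per-type rules plug in here
        case Type::Sand:  /* inspect neighbors in prev, fall if possible */ break;
        case Type::Water: /* flow rules go here */                          break;
        default:                                                            break;
    }
    return next; // placeholder: state unchanged
}

// Every (x, y) is independent, so this loop maps directly onto GPU work-items.
void step(const std::vector<Cell>& prev, std::vector<Cell>& next, int w, int h) {
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x)
            next[y * w + x] = updateCell(prev, x, y, w);
}
```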

Possibly, instead of using the cells of a matrix to represent the universe, a direct particle-based approach (similar to physics simulations) would give a big speedup, but it could diverge from the original behavior of sand. The simplest approach would be an exclusion principle: two particles can't occupy the same volume and push each other outwards. So instead of pushing themselves into other cells, particles would simply move each other freely in space with non-integer coordinates.
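
A minimal sketch of that exclusion principle, assuming circular particles of equal radius (all names here are illustrative):

```cpp
// If two particles overlap, push each half of the overlap apart along
// their separation vector. Coordinates are floats, not grid cells.
#include <cmath>

struct Particle { float x, y; };

void separate(Particle& a, Particle& b, float radius) {
    float dx = b.x - a.x, dy = b.y - a.y;
    float dist = std::sqrt(dx * dx + dy * dy);
    float minDist = 2.0f * radius;
    if (dist < minDist && dist > 1e-6f) {
        float push = 0.5f * (minDist - dist) / dist; // half the overlap each
        a.x -= dx * push; a.y -= dy * push;
        b.x += dx * push; b.y += dy * push;
    }
}
```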

On the other hand, the cell-based integer-coordinate approach requires a lot of data copying, because many patterns of sand (etc.) must be checked to implement all the rules of the automaton. Cellular automata are also easier to use for chemical reactions, while the particle version would require explicit atomic interactions such as bonding between atoms, which need extra care to keep the simulation from exploding (it is easy for a simulation to gain energy exponentially with a non-optimal time step). The cell-based (integer-coordinate) approach does not have this explosion problem (unless it is explicitly added).
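
That explosion failure mode is easy to demonstrate: explicit Euler integration of a single bond (modeled here as a unit harmonic oscillator) gains energy by a factor of (1 + dt²) every step, so a naive particle integrator blows up regardless of the time step. Symplectic integrators such as velocity Verlet avoid this. A sketch:

```cpp
// Explicit Euler on a unit harmonic oscillator, a stand-in for a
// bonded pair of atoms. Energy grows without bound: the "exploding
// simulation" described above.
#include <cstdio>

int main() {
    double x = 1.0, v = 0.0, dt = 0.1;
    for (int i = 0; i < 1000; ++i) {
        double a = -x;   // spring force, k = m = 1
        x += dt * v;     // explicit Euler: uses old v
        v += dt * a;     //                 and old x
    }
    double energy = 0.5 * v * v + 0.5 * x * x; // started at 0.5
    std::printf("energy after 1000 steps: %g\n", energy);
    return 0;
}
```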

When using a single GPU, there is abundant bandwidth: around 300-500 GB/s on mainstream cards and 1-2 TB/s on high-end cards. With multiple GPUs, communication goes through PCIe (OpenCL is used to avoid a proprietary dependency) and is slow, while compute power is doubled or tripled compared to a single card. Not every algorithm can be offloaded efficiently onto two cards. For example, atomic functions work fast within the same GPU, but atomically accessing memory on another GPU is orders of magnitude slower (if possible at all) and requires extra driver/hardware support. So algorithms should be minimally bandwidth-dependent while carrying maximum computational load; sometimes re-computing the same thing on both GPUs is faster than copying between them. This will need a lot of benchmarking for each kernel developed.
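
A back-of-the-envelope helper for that transfer-versus-recompute decision (both rates below are assumed, illustrative values, not measurements):

```cpp
// Copying a buffer over PCIe competes with simply recomputing it on
// the other GPU; pick whichever is cheaper for the kernel at hand.
bool preferRecompute(double bytes, double flops,
                     double pcieBytesPerSec = 16e9,   // assumed PCIe rate
                     double gpuFlopsPerSec  = 10e12)  // assumed GPU rate
{
    double copyTime      = bytes / pcieBytesPerSec;
    double recomputeTime = flops / gpuFlopsPerSec;
    return recomputeTime < copyTime;
}
```

With these numbers, copying a 100 MB buffer costs roughly 6 ms while 10 GFLOP of recomputation costs roughly 1 ms, which is why recomputing on both GPUs can beat copying. Real decisions would use benchmarked rates per kernel.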
