Compute Shader based particle 4k demo

Psycho


download: http://loonies.dk/demos/bin/lns-michigan.zip (sry, no source this time)
http://www.youtube.com/watch?v=nUwFYu8ANLQ

I guess this is more interesting in here than the normal ones...

Everything is done in compute shaders, there is no graphics pipeline in use.
There are 1 mill. threads running, each simulating a particle. When a particle has been updated (or maybe respawned) it is rasterized into a 32 bit combined z and color buffer (11+7+7+7 bits). There is no hard sync on the zbuffer (ie it's just if (z<zbuffer) zbuffer=z), so it may be incorrect in some places, but that's not very likely.
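In sketch form, the write looks something like this (made-up names, not the actual 4k code; depth goes in the top 11 bits, so a plain uint compare doubles as the depth test):

    // One packed uint per screen pixel; depth in the high bits.
    RWStructuredBuffer<uint> zColorBuffer;

    uint PackZColor(float z, float3 color) // z and color in [0,1]
    {
        uint zi = (uint)(saturate(z) * 2047.0);      // 11 bits of depth
        uint3 ci = (uint3)(saturate(color) * 127.0); // 7 bits per channel
        return (zi << 21) | (ci.r << 14) | (ci.g << 7) | ci.b;
    }

    void WritePixel(uint pixel, float z, float3 color)
    {
        uint packed = PackZColor(z, color);
        // No atomics: another thread can sneak in between the read and the
        // write, so the result is occasionally wrong - but rarely visibly so.
        if (packed < zColorBuffer[pixel])
            zColorBuffer[pixel] = packed;
    }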
After this there is another CS pass (actually it's the same shader, just branching on different parameter values to save space) that does simple blur/glow post processing, reading from the zbuffer and rendering to the screen.
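Roughly like this (the real parameter layout is different, and the names are made up):

    // One shader for both passes, selected by a constant buffer value.
    cbuffer Params : register(b0)
    {
        float4 script[7]; // the 25 scripted parameters, packed
        float  time;
        uint   mode;      // 0 = simulate + rasterize, 1 = blur/glow + present
    };

    [numthreads(256,1,1)]
    void Main(uint3 id : SV_DispatchThreadID)
    {
        if (mode == 0)
        {
            // one thread per particle: simulate, maybe respawn, then
            // rasterize into the z/color UAV as above
        }
        else
        {
            // one thread per pixel: gather neighbours from the z/color UAV,
            // apply blur/glow, blend in the background, write the screen UAV
        }
    }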
Performance is almost entirely bound by the number of particle-pixels, ie the part that rasterizes into the structured buffer. So it's good to have many but small particles.

The scenes are defined as analytical distance functions (just like the slew of current ray marching 4kbs) - particles may be spawned at intersection points found by ray marching from the camera, and they can easily be attracted to, get color from, or be reflected off the surface. Particles don't interact with each other, only with the scene and some global affectors (turbulence, gravity, damping). Particles are respawned in a generational fashion, with 512 generations and a constant spawn rate per effect, and the spawn ratio between "screenspace" and "free" particles can be set.
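The spawning itself is just standard sphere tracing, something like this (with a placeholder sphere standing in for a real scene function):

    // Spawn a particle where a camera ray hits the distance field.
    float Scene(float3 p)
    {
        return length(p) - 1.0; // signed distance to a unit sphere
    }

    float3 SceneNormal(float3 p) // gradient by central differences - handy
    {                            // for attracting/reflecting particles
        const float e = 0.001;
        return normalize(float3(
            Scene(p + float3(e,0,0)) - Scene(p - float3(e,0,0)),
            Scene(p + float3(0,e,0)) - Scene(p - float3(0,e,0)),
            Scene(p + float3(0,0,e)) - Scene(p - float3(0,0,e))));
    }

    float3 SpawnOnSurface(float3 camPos, float3 rayDir)
    {
        float t = 0.0;
        for (int i = 0; i < 64; i++)          // fixed iteration budget
        {
            float d = Scene(camPos + rayDir * t);
            if (d < 0.001) break;             // close enough: call it a hit
            t += d;                           // sphere-trace step
        }
        return camPos + rayDir * t;           // spawn on (or near) the surface
    }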
Unfortunately it's necessary to ping-pong the particle buffer, as it's not possible (I really can't see the hw reason for this) to loop dynamically (ie while rasterizing) based on values (the radius) read from a UAV, so the read buffer has to be an SRV.
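So the simulation pass ends up shaped like this (simplified, made-up names):

    struct Particle { float3 pos; float radius; float3 vel; float3 color; };

    StructuredBuffer<Particle>   particlesIn  : register(t0); // SRV: read side
    RWStructuredBuffer<Particle> particlesOut : register(u1); // UAV: write side

    [numthreads(256,1,1)]
    void Simulate(uint3 id : SV_DispatchThreadID)
    {
        Particle p = particlesIn[id.x];
        // ... update or respawn p ...
        int r = (int)p.radius;
        for (int y = -r; y <= r; y++)     // dynamic bounds read from the SRV
            for (int x = -r; x <= r; x++)
            {
                // rasterize this particle pixel into the z/color buffer
            }
        particlesOut[id.x] = p;           // next frame the buffers swap roles
    }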

The CPU is doing nothing but choosing the right line of 25 parameters from the script, updating the constant buffer with these (and the current time), and dispatching - it consumes something like 5 seconds of cpu time in total after the music has been generated (our gpu based synth is not ready yet).
 
Brilliant demo.

I particularly liked the idea of using a UAV to render particles directly, with non-atomic writes.
For example, the NBodyGravityCS11 demo from the DX SDK can only render 230M stars/sec (without gravity interaction) on a GTX 480,
which is consistent with other observations suggesting 1 triangle/clock for non-tessellated geometry (1 star = 2 triangles).
I'm sure one can render billions without the pipeline!

Spawning particles at the zero-distance intersection of a ray is also a very interesting idea!
One can render e.g. fire and water with far fewer particles than usual, not wasting time calculating and rendering
invisible/occluded/distant particles. O(n^2) particles instead of O(n^3)!

Also, there is the possibility of arranging multiple layers of a low-bit-res Z-buffer, something not supported by the regular pipeline.
 
Looks pretty, wish I had a clue what it was doing :) Averaged 158fps on a GTX 580, using fraps from the start to the end of the music.
 
Fantastic little intro!
On my HD6970 it's averaging 95FPS at stock and 101FPS at 950/1450.

This algorithm is more sensitive to memory clock adjustments than to core clock on my card. I wonder if that's the reason why the GTX580 is so much faster here.

PS. 720p result @950/1450 = 112FPS
 
It's very much pixel limited (ie reads and writes from the zbuffer while rasterizing), so I guess gf100's better caches are helping, just like the smaller wavefronts (particles in a wavefront end up having different sizes due to randomness).
It's size-optimized, not performance-optimized, after all.. ;) For instance, it would most probably be faster to draw a single pixel (or a few) per particle and then fill out the rest by post filtering.

My 6870 is doing 84 fps avg at 720p. The SM4 version (now doing the same 1 mill. particles, just updated the archive) is doing 130 fps. That's of course mostly limited by the geometry shader (compute shader for simulation, geometry shader producing quads from the particle buffer and then pixel shaders for drawing). On the GT550M both versions run at 26 fps.
 
I'm having lots of problems with it.
First, AVG tells me it's a trojan.
I tried to take a screenshot of the error msg but my printscrn key wouldn't work, so I ran fraps to grab a screenshot and it crashed with a "couldn't initialise dx9" error. Then I ran the 1280x720 version, but it's running at 5292x1050.
 
Yeah, the demo also set off AVG on my work computer...gave our IT guy a bit of a scare! I'm guessing that whatever voodoo you guys are doing to keep the executable size down is tripping up the scanners.
 
AVG's false positives are essentially the result of making your 4k executable packer (Crinkler) public. Some people COULD (and maybe have, since AVG triggers) write malicious code and hide it in a packed executable, making it hard for the AV to scan the content. So now AVG just triggers on all Crinkler-packed files :(

So when running it outside the normal (unprotected) location on my laptop, I first have to run it just to ctrl-alt-delete and kill it, press the allow button in AVG, and THEN I can start it, right-clicking and selecting 'run on nvidia gpu' (damn Optimus - I'm opening a hw feature level 11 device, guess which gpu I would like to use then? ;) ).
 
Hi Psycho,

That demo is really awesome in its class, even though F-Secure tried to scare me away from it just like AVG. :)

Using the updated version in 1280 x 800, I am getting the following results with Fraps set to stop after 206 seconds on my GTX 480 (750 MHz):
DX11: 140 Fps
DX10: 167 Fps
DX10/512k: 277 Fps

Two questions:
1) Do you have an explanation for the DX10 version being quite a bit faster? DX11-CS are supposed to be more efficient after all.
2) Is there any way (or maybe could you integrate one) to use different screen resolutions?

Thanks for the demo!
 
The dx11 version is rasterizing in the compute shader. That's obviously quite a bit slower than sending pixels through the ROPs. So while the dx11 version is limited by the number of pixels to draw (there has to be quite a bit of overdraw to fill the screen with "random" particles), the dx10 version is mostly limited by the geometry shader (generating 1 mill. quads/frame) and definitely not by the number of pixels.

dx11 version:
a) compute shader simulating particles and drawing into the z/color-buffer UAV
b) compute shader filtering the z/color-buffer UAV, blending in the background and writing to the screen UAV

dx10 version:
a) compute shader simulating particles
b) geometry shader (with dummy input) reading particles from the buffer and generating quads on screen (sketched below)
c) (the first quad is fullscreen and draws the background)
d) pixel shader rendering particles to a texture
e) separate fullscreen pixel shader pass filtering and drawing to the screen

(I guess it's not hard to see why only the dx11 version is below 4k :) )
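
The quad expansion in (b) is roughly this (simplified, made-up names):

    struct Particle { float3 clipPos; float radius; float3 color; };
    StructuredBuffer<Particle> particles : register(t0);

    struct VSOut { uint id : PARTICLEID; }; // dummy input: just a particle index

    struct GSOut
    {
        float4 pos   : SV_Position;
        float3 color : COLOR0;
    };

    [maxvertexcount(4)]
    void ExpandToQuad(point VSOut input[1], inout TriangleStream<GSOut> stream)
    {
        Particle p = particles[input[0].id];
        const float2 corners[4] = { float2(-1,-1), float2(1,-1),
                                    float2(-1,1),  float2(1,1) };
        for (int i = 0; i < 4; i++) // emit a 2-triangle strip covering the quad
        {
            GSOut o;
            o.pos   = float4(p.clipPos.xy + corners[i] * p.radius, p.clipPos.z, 1.0);
            o.color = p.color;
            stream.Append(o);
        }
    }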

At least the dx11 version is not straightforward to make completely resolution independent.
 
Does the compute shader based version exercise any feature not available in pixel shaders? If that's not the case it probably makes more sense to use a full screen pass instead in order to get optimal scheduling on all HW out there. With compute shaders you never know what's going to work well on GPU A and GPU B and.. you get the idea.
 
I guess the fastest way would be to render a single pixel (z+color+radius) per particle, and then use 2 separate fullscreen filter passes (horizontal/vertical) to expand the particles (while taking z values into account). I just scrapped that idea early on for size reasons (but I don't think it would be that bad with the current setup).
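The horizontal pass of that could look something like this (a sketch only - hypothetical buffer layout, never actually implemented):

    #define MAX_RADIUS 8 // assumed upper bound on particle radius in pixels

    // x = packed z+color (empty pixels cleared to 0xFFFFFFFF), y = radius
    Texture2D<uint2>   seeds    : register(t0);
    RWTexture2D<uint2> expanded : register(u0);

    [numthreads(8,8,1)]
    void ExpandH(uint3 id : SV_DispatchThreadID)
    {
        uint2 best = uint2(0xFFFFFFFF, 0); // "empty" sentinel: max depth
        for (int dx = -MAX_RADIUS; dx <= MAX_RADIUS; dx++)
        {
            int2 coord = int2(id.xy) + int2(dx, 0);
            uint2 s = seeds[coord]; // out-of-bounds reads return 0: harmless
            if ((uint)abs(dx) <= s.y && s.x < best.x) // covers us and nearer
                best = s;
        }
        expanded[id.xy] = best; // a vertical pass then does the same on columns
    }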

And yes, the compute shader is doing a lot more than pixel shaders can, and dx11 pixel shaders are barely an option for serious 4k (big setup for the graphics pipeline).
 