Fluid Demo Movie (RapidMind)

Anyway, as Andy said, the most resource-consuming part of this demo is polygon rendering; not sure Cell would help.
 

'course it would. You can do all the vertex data generation/transformation on Cell, do the culling, and then stream the results to the RSX, which should be up to the task from there on.
 
Indeed the SPU->vertex fast path on PS3 could be used to great effect in this demo. However to be fair it could easily be made to run faster on the GPU as well via some simple LOD, so it's not as if there's a performance problem there ;)
 
Looks very nice, although the waves etc. are far from fully realistic for a river, but hey, we're in 2007 and not 2070, so I won't complain too much about that! ;)
Yup, for sure it's a *very simple* water model. The entire advection term is dropped from the NS equations, and there's no way you'll get complex effects like turbulence, etc. Basically it just models waves, which works well for ponds and stuff and looks decent for lakes, etc. but certainly doesn't realistically model turbulent flow such as rivers. Still, I think it looks kind of cool ;)
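
To put it concretely (this is just the generic incompressible Navier-Stokes momentum equation, not necessarily the exact formulation the demo uses):

Code:
\[
  \underbrace{\frac{\partial \mathbf{u}}{\partial t}}_{\text{kept}}
  \;+\;
  \underbrace{(\mathbf{u}\cdot\nabla)\,\mathbf{u}}_{\text{advection, dropped}}
  \;=\;
  -\frac{1}{\rho}\nabla p \;+\; \nu\,\nabla^{2}\mathbf{u} \;+\; \mathbf{f}
\]
% Without the advection term the system is linear; a height-field model h
% built on it behaves essentially like the 2D wave equation,
\[
  \frac{\partial^{2} h}{\partial t^{2}} \;=\; c^{2}\,\nabla^{2} h ,
\]
% which gives you propagating surface waves but no turbulence or swirling flow.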

BTW, Andy, did you read this paper?
Yup! That's the paper I was referring to when I said that the hybrid techniques (2D+3D) look pretty promising. I also think SPH is interesting since it's pretty straightforward to parallelize.
 
Nice demo! I agree with you that particle-based methods like SPH aren't well-suited to large volumes of water, but you can still get some good results on today's GPUs.

Here's a video of the latest particles demo from the CUDA SDK:
http://www.youtube.com/watch?v=RqduA7myZok

It's not quite as pretty as yours, but it gives you an idea of the performance possible.

I think the best fluid simulation for games is going to be achieved by combining the strengths of these and other techniques.

Is the performance of SPH and other particle-based codes limited by poor texture locality?

Seems as if most codes (regardless of whether they employ a uniform grid, spatial hash, or something else for the nearest particle search) tend to end up in the following form:

Code:
for each particle {
  for each neighbor of that particle {
    gather the ID of the neighboring particle
    look up that particle's properties in a table by ID   <-- bad texture locality???
    do math
  }
}

As the simulation runs, lookups of particle properties would be fetching with a random (or semi-random) access pattern.

Is this something which your new CUDA SPH code has found a workaround for?

Or perhaps something else is the bottleneck for particle methods like SPH.
 
Nice demo. The surface waves look pretty good.

I spent some time working on an implementation of SPH for incompressible flow. The code was MPI-parallel, running on a large cluster (up to 128 CPUs). I just couldn't get performance to scale; the communication overhead just killed it.

At the end of the project I dumped that and rewrote the code using CUDA. The end result was much more promising, though I didn't get the time to finish it before funding ran out, which was frustrating.

TimothyFarrar said:
Seems as if most codes (regardless of whether they employ a uniform grid, spatial hash, or something else for the nearest particle search) tend to end up in the following form,

Whatever structure you use to accelerate the neighbour search has to be rebuilt periodically. That can be a challenge if you want to do it in parallel, particularly if you want to migrate particles around in memory to help ease the locality issues.
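
For reference, the rebuild ends up looking roughly like this (a sketch, not my actual code; it's essentially the approach the CUDA "particles" sample takes, written with Thrust just to keep it short): compute a cell index per particle, sort by it, then physically reorder the particle arrays so that spatial neighbours are also contiguous in memory.

Code:
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/sequence.h>
#include <thrust/sort.h>
#include <thrust/gather.h>

struct CellIndex
{
    float3 origin;   // grid origin
    float  cellSize; // edge length of one cell (~ smoothing radius)
    int3   dims;     // number of cells per axis

    __host__ __device__ unsigned int operator()(float4 p) const
    {
        int x = (int)((p.x - origin.x) / cellSize);
        int y = (int)((p.y - origin.y) / cellSize);
        int z = (int)((p.z - origin.z) / cellSize);
        // clamp to the grid
        x = x < 0 ? 0 : (x >= dims.x ? dims.x - 1 : x);
        y = y < 0 ? 0 : (y >= dims.y ? dims.y - 1 : y);
        z = z < 0 ? 0 : (z >= dims.z ? dims.z - 1 : z);
        return (unsigned int)((z * dims.y + y) * dims.x + x); // linear cell index
    }
};

void rebuildGrid(thrust::device_vector<float4>&       pos,   // positions
                 thrust::device_vector<float4>&       vel,   // velocities
                 thrust::device_vector<unsigned int>& cell,  // cell index per particle
                 const CellIndex&                     toCell)
{
    const size_t n = pos.size();
    thrust::device_vector<unsigned int> order(n);
    thrust::sequence(order.begin(), order.end());             // 0, 1, 2, ...

    // 1. Compute each particle's cell index.
    thrust::transform(pos.begin(), pos.end(), cell.begin(), toCell);

    // 2. Sort particle indices by cell index (radix sort under the hood).
    thrust::sort_by_key(cell.begin(), cell.end(), order.begin());

    // 3. Migrate the particle data into the sorted order: neighbours in space
    //    are now also neighbours in memory, which is the locality win.
    thrust::device_vector<float4> posSorted(n), velSorted(n);
    thrust::gather(order.begin(), order.end(), pos.begin(), posSorted.begin());
    thrust::gather(order.begin(), order.end(), vel.begin(), velSorted.begin());
    pos.swap(posSorted);
    vel.swap(velSorted);
}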
 

Often neighbor search on the GPU is a simple 3D texture (uniform grid) with depth peeling to pack multiple indices into the color channels of a pixel, with the grid getting rebuilt each frame. One option for smaller grid sizes would be to use multiple MRTs and write out the particle properties instead of the particle ID; then particle-property read-back would have better locality.

As for using CUDA, is there an advantage in going that route when you lose access to the ROP and other graphics functionality, which can be an advantage for SPH and other fluid simulation algorithms?
 
Do you mean specifically for the fluid example? I expect it to scale very well, but I'll certainly post results when I get to messing around with it :)

If you mean the x86 backend in general, the results so far are often very good... there are a few benchmarks posted on the site IIRC, but it's not uncommon to see something like a 2x speedup on one core over typical C++ code, and 10x+ speedups on 8 cores.

How can RapidMind automagically scale code (efficiently) among multiple cores? You'd always have to design your code/algorithms to support multiple cores somehow?
 
It doesn't do it "automatically" per se: it provides an embedded programming model that lets you express computations in a way that can be efficiently parallelized. In the simplest case it's just data parallelism (SIMD), although once you start throwing in gather/scatter, control flow, and collective operations (scan, reduce, etc.) it gets significantly more expressive. Under the hood RapidMind does a ton of optimizations, some auto-vectorizing where it can, lots of fancy memory management, etc., but the general model is to help the programmer write good parallel code easily rather than try to infer parallelism from serial code (which is a dead end IMHO).
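
To give a flavour of the simplest case (this is CUDA syntax, not RapidMind's API, just to illustrate the basic data-parallel pattern): you write the per-element computation once and it is applied independently across the whole array.

Code:
#include <cstdio>
#include <cuda_runtime.h>

// Per-element computation: each thread handles one array element.
__global__ void saxpy(int n, float a, const float* x, float* y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];   // same operation applied to every element
}

int main()
{
    const int n = 1 << 20;
    float *x, *y;
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    // Launch one thread per element, 256 threads per block.
    saxpy<<<(n + 255) / 256, 256>>>(n, 3.0f, x, y);
    cudaDeviceSynchronize();

    printf("y[0] = %f\n", y[0]);   // expect 5.0
    cudaFree(x); cudaFree(y);
    return 0;
}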

Certainly operations such as blocking inter-process communication are intentionally restricted or unsupported to force people to write efficient parallel code, but many applications can be expressed efficiently in such a form, sometimes using a different "more parallel" algorithm. Sure there are algorithms that seem to "resist parallelizing" but I find that for the most part once you get used to parallel programming models it becomes almost as natural as writing serial code... then again maybe I'm just too deep in it now ;)

If you're interested in RapidMind and how it works, I'd definitely encourage you to check out the web site and sample code. You can also request an evaluation copy to play with.
 

Very interesting subject! I recently became interested in something similar: http://www.llvm.org/

Which can be used to dynamically compile DSLs to different backends too, see for example:
http://zrusin.blogspot.com/2007/05/mesa-and-llvm.html

I haven't gotten around to comparing the two, but they seem like similar techniques. What differentiates RapidMind from LLVM?
 
Very interesting subject! I recently became interested in something similar: http://www.llvm.org/
Oh neat, I hadn't seen that myself. Don't know how they compare, but I'll certainly check out LLVM in detail when I get the chance: thanks for the link.

Which can be used to dynamically compile DSLs to different backends too, see for example:
That's funny, because at a Terrasoft "hack-a-thon" early this year one of the RapidMind guys worked with one of the Mesa guys to accelerate the programmable shading parts of Mesa too. It was just a few days of hacking, but they managed to get about an 80x speedup on the Cell (PS3, using the SPUs) over the CPU implementation. Apparently it was quite easy to write too, basically just changing the "interpreter" portion to use RapidMind types, which is a really simple way to turn an interpreter into a compiler :)
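
If you're curious what that "change the interpreter's value type" trick looks like in general, here's a toy C++ sketch (nothing to do with the actual Mesa or RapidMind code): the interpreter's math is templated on the value type, and instantiating it with a recording type makes a run of the interpreter emit a program description instead of numbers.

Code:
#include <iostream>
#include <sstream>
#include <string>

// A "recording" value: arithmetic on it builds up an expression string
// instead of producing a number.
struct Rec {
    std::string expr;
    explicit Rec(const std::string& e) : expr(e) {}
    Rec(float c) { std::ostringstream os; os << c; expr = os.str(); }
};
Rec operator+(const Rec& a, const Rec& b) { return Rec("(" + a.expr + " + " + b.expr + ")"); }
Rec operator*(const Rec& a, const Rec& b) { return Rec("(" + a.expr + " * " + b.expr + ")"); }

// The "interpreter" body, templated on the value type. With T = float it
// evaluates the shader; with T = Rec it records the computation instead.
template <typename T>
T shade(T base, T light)
{
    T ambient = T(0.1f);
    return base * light + ambient;
}

int main()
{
    std::cout << shade(0.5f, 0.8f) << "\n";                   // interpret: prints 0.5
    std::cout << shade(Rec("base"), Rec("light")).expr << "\n"; // record: prints ((base * light) + 0.1)
}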

Anyways, just thought that was interesting. GLSL is a very simple problem compared to many in HPC, but it's cool that other people are looking at similar problems (PeakStream was the most direct competition to RapidMind, but they got bought by Google).
 