What about next-gen particles?

I was thinking more: cast a ray and reduce its intensity (or increase it for light particles) for each particle it passes through. You would only need to traverse until a percentage occlusion is reached, so you wouldn't need to draw every single particle back to front. So in, say, a cloud of smog 1000 particles deep, by the time the ray's passed through the 10th particle's volume it's already black, so stop there.

Actually that wouldn't work. If you've got a super bright light source at the end, or in the middle, you'd need to keep tracing until you reached the light source (or rather, bright object) to add its intensity to the traced pixel. So you'd have to keep tracing until you hit an opaque surface. I s'pose if particle data could be described efficiently enough it wouldn't be too hard on the memory to do this, using the z-buffer from RSX to determine the length of the ray, creating a half-ray-traced renderer.
Ah... that makes more sense. Either way, you'd probably cut out some unnecessary sampling, and you can probably get a better model of light transmission through the medium. The hard part with light sources is you'd have to step through and see how much direct and indirect scattering you receive at every point. You can probably get away with assuming that all indirect scattering is only along the view direction, but direct scattering would be straight from the light source to each step point along the ray. Assuming isotropic scattering, the direct light paths would scatter a 1/4π fraction (per steradian) to the view direction from that point.

Well, you can stop raytracing once you've reached total opacity, probably just because you can say that there wouldn't be any indirect scattering along the view direction beyond that point, but you'd still take direct lighting samples... okay, I'm not even making sense to myself at this point. I need to get some coffee.
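For what it's worth, the loop we're circling is basically front-to-back volume ray marching with single scattering, and the two ideas reconcile: once transmittance is near zero, whatever is behind the medium (bright or not) contributes almost nothing, so the early-out is safe. A minimal C++ sketch under assumptions - sampleDensity and friends are made-up stand-ins for scene lookups, not anyone's actual API:

Code:
#include <cmath>

struct Vec3 {
    float x, y, z;
    Vec3  operator+(Vec3 b) const { return {x + b.x, y + b.y, z + b.z}; }
    Vec3  operator*(float s) const { return {x * s, y * s, z * s}; }
    Vec3& operator+=(Vec3 b)      { x += b.x; y += b.y; z += b.z; return *this; }
};

// Dummy stand-ins so the sketch compiles; a real renderer would sample
// particle volumes here and shadow the light through the medium.
float sampleDensity(Vec3)       { return 0.5f; }               // smog density at p
Vec3  lightColour(Vec3)         { return {1, 1, 1}; }          // direct light at p
float shadowTransmittance(Vec3) { return 0.8f; }               // light surviving the medium
Vec3  surfaceColour(Vec3)       { return {0.1f, 0.1f, 0.1f}; } // opaque surface behind

// March front to back; maxT comes from RSX's z-buffer ("half-ray-traced").
Vec3 marchRay(Vec3 origin, Vec3 dir, float maxT)
{
    const float kStep  = 0.1f;
    const float kPhase = 1.0f / (4.0f * 3.14159265f);  // isotropic phase function

    Vec3  colour        = {0, 0, 0};
    float transmittance = 1.0f;   // fraction of background light still reaching the eye

    for (float t = 0.0f; t < maxT; t += kStep) {
        Vec3  p       = origin + dir * t;
        float density = sampleDensity(p);
        float alpha   = 1.0f - std::exp(-density * kStep);

        // Direct scattering: light arriving at p, redirected toward the eye.
        Vec3 direct = lightColour(p) * (shadowTransmittance(p) * kPhase);

        colour        += direct * (alpha * transmittance);
        transmittance *= 1.0f - alpha;

        if (transmittance < 0.01f)   // the "10th particle's volume" early-out
            break;
    }
    // Whatever opaque surface the z-buffer said was behind the medium.
    colour += surfaceColour(origin + dir * maxT) * transmittance;
    return colour;
}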

Ummmmmmm... tell me you wouldn't *entirely* mind just sticking with sprites for a while...
 
Sissy.

Regarding RSX's bandwidth: it's got lots, but it could have more. You can always have more! And certainly in comparison to Xenos, on such things as alpha blending RSX has considerably less BW available.
 
Shifty Geezer said:
Sissy.

Regarding RSX's bandwidth: it's got lots, but it could have more. You can always have more! And certainly in comparison to Xenos, on such things as alpha blending RSX has considerably less BW available.

I am assuming this is due to eDRAM saving BW on Xenos for HDR, alpha blending, culling, FSAA?
 
dukmahsik said:
I am assuming this is due to eDRAM saving BW on Xenos for HDR, alpha blending, culling, FSAA?

The alpha blending can be done using the eDRAM logic, iirc, yes.

There are potential solutions for PS3, if bw becomes an issue, as outlined earlier.
 
Not eDRAM saving, but eDRAM bandwidth. On RSX the alpha blending takes place between the logic on the RSX and the data in the backbuffer in main RAM. That's 22.5 GB/s max (ignoring other BWs that complicate things). Whereas on XB360 the blending takes place between the logic and backbuffer on the daughter die, so this extra processor has a shed load of BW to play with. ATi were smart and moved the BW-intensive tasks to an area where they could stretch their legs.
 
dukmahsik said:
Shifty Geezer said:
Sissy.

Regarding RSX's bandwidth: it's got lots, but it could have more. You can always have more! And certainly in comparison to Xenos, on such things as alpha blending RSX has considerably less BW available.

I am assuming this is due to eDRAM saving BW on Xenos for HDR, alpha blending, culling, FSAA?

Xenos: 32 GB/s to framebuffer (eDRAM); alpha blending is done for free (not done over the 32GB/s bus).

RSX: 25 GB/s to VRAM, 35 GB/s to main RAM.
You need to place the framebuffer in one of these pools, and VRAM is the natural choice. So to the framebuffer you have 25GB/s, but you'll additionally have to access textures/vertices over the same bus, making the effective bandwidth considerably lower.
For alpha blending you will need to both read and write the FB, so it's effectively double the expense.
So you have 25 GB/s of access for everything stored in the 256MB VRAM, including the framebuffer, with operations that read from the FB being more expensive than those that don't.
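To put rough numbers on that double expense (back-of-envelope only; the 720p, 60fps and four-layers-of-overdraw figures are assumptions):

Code:
#include <cstdio>

int main()
{
    const double pixels   = 1280.0 * 720.0; // 720p framebuffer
    const double fps      = 60.0;
    const double bytes    = 4.0;            // 32-bit colour
    const double overdraw = 4.0;            // assumed layers of transparency

    // Opaque pixels are write-only; blended pixels read the FB then write it back.
    double writeOnly = pixels * fps * bytes * overdraw;
    double blended   = 2.0 * writeOnly;

    std::printf("write-only: %.2f GB/s, blended: %.2f GB/s\n",
                writeOnly / 1e9, blended / 1e9); // ~0.88 vs ~1.77 GB/s
}

The sustained averages look small next to 25 GB/s, but the squeeze happens at peak fill, when blended pixels are contending with texture and vertex fetches on the same bus.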
 
Hey, thanks for the extensive replies guys. Does it sound reasonable that Xenos will be able to display more effects at a higher frame rate, then?
 
Npl said:
Xenos: 32 GB/s to framebuffer (eDRAM); alpha blending is done for free (not done over the 32GB/s bus).

RSX: 25 GB/s to VRAM, 35 GB/s to main RAM.
You need to place the framebuffer in one of these pools, and VRAM is the natural choice. So to the framebuffer you have 25GB/s, but you'll additionally have to access textures/vertices over the same bus, making the effective bandwidth considerably lower.

This is true, though technically you could write off the VRAM bandwidth and still have pretty much the same inward bandwidth to RSX as Xenos has. Of course, that won't ever happen - you can't just use the 256MB purely for the framebuffer! - but it's worth remembering that as you limit bandwidth for framebuffer usage, you only increase that available for other incoming data, beyond what Xenos can accept.

Npl said:
Hey, thanks for the extensive replies guys. Does it sound reasonable that Xenos will be able to display more effects at a higher frame rate, then?

Depends what effects you're talking about, and if you're treating the GPUs in isolation.
 
Regarding true 3D particle effects and voxels and such, the movie VFX industry is the place to look. For example, the tornadoes and water effects in 'The Day After Tomorrow' used Digital Domain's custom voxel renderer.

However, most of the time, tricks are preferred to brute-force rendering. Weta, for example, threw out their fire simulation software and instead decided to use hardware-rendered(!) particles (textured quads) for both the Balrog and the flowing water in the flooding of Isengard. Their advantage compared to game engines was that they used filmed footage of real-life smoke, fire and water as animated textures, which probably meant hundreds, if not thousands, of megabytes of texture data. Nevertheless, both examples were quite impressive, although Weta went on to write their own custom software renderer to replace the hardware part.
 
Npl said:
Xenos: 32 GB/s to framebuffer (eDRAM); alpha blending is done for free (not done over the 32GB/s bus).

RSX: 25 GB/s to VRAM, 35 GB/s to main RAM.

You need to place the framebuffer in one of these pools, and VRAM is the natural choice. So to the framebuffer you have 25GB/s, but you'll additionally have to access textures/vertices over the same bus, making the effective bandwidth considerably lower.

These numbers are a bit off. Xenos has 32GB/s write and 16GB/s read to the eDRAM module, for a total of 48GB/s. The PS3 has 22GB/s to VRAM and 25GB/s to main RAM.

The placement of the frame buffer is the key to fully utilizing the bandwidth available on the PS3. It seems that the best way to handle that is to place the color portion of the buffer in VRAM and the Z buffer in main RAM. Without splitting them there is not much to be gained.
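For example (illustrative figures only): with 4 bytes of color and 4 bytes of Z per pixel, a blended, Z-tested pixel generates roughly 8 bytes of color traffic (read + write) and up to 8 bytes of Z traffic (read, plus a write on pass). Unified, that's ~16 bytes per pixel down one bus; split, it's ~8 bytes on the GDDR3 side and ~8 on the XDR side, roughly halving the pressure on each.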
 
I'm kind of curious - what is NVIDIA's colour compression like now? I recall reading during the GeForce FX era that they had up to 4:1 lossless compression, but only for antialiased images.

Anyway, there seem to be two paths with regard to particles - one that requires lots of transparency (billboarding) and one that doesn't (modelling the particles opaquely with geometry). For some types of system, transparency is hard to avoid (think rain or smoke, for example). For others (destruction on a less granular level - the big debris, for example) it can be avoided.

For PS3, there's a few options. The brute-force, BW-eating method is one, and may be fine in a lot of situations. But I'm liking the sound of Npl and Faf's idea - rendering the particles/transparencies with SPE(s). As I understand it, you can keep separate buffers for opaque and transparent objects? Z-buffer comparisons would still require external main memory BW, but everything else could chew on the SPEs' bandwidth to LS, and then you finally blend the buffers. Relatively speaking, RSX could virtually forget about the transparencies. I don't see much problem with tiling the "transparent" buffers. I guess then it's a question of performance, and whether an SPE (or two?) could get through that in the duration of one frame.

I think that sort of solution would only be used in VERY heavy situations, though..
 
richardpfeil said:
...
Xenos has 32GB/s write and 16GB/s read to the eDRAM module, for a total of 48GB/s...

That's what I used to think, but looking at Dave's article and the block diagram, it's 32 GB/sec aggregate.
 
The placement of the frame buffer is the key to fully utilizing the bandwidth available on the PS3. It seems that the best way to handle that is to place the color portion of the buffer in VRAM and the Z buffer in main RAM. Without splitting them there is not much to be gained.

You're making the assumption that this configuration is even possible....
 
Shifty, did you see the PS3 particle demo?

That demo had hundreds upon hundreds of thousands of particles moving in real time - think wind physics too (though my memory's a bit fuzzy) - and they said they could even simulate 3D sound for each one, or something like that, with the processing power left.
 
zidane1strife said:
Shifty, did you see the PS3 particle demo?

That demo had hundreds upon hundreds of thousands of particles moving in real time - think wind physics too (though my memory's a bit fuzzy) - and they said they could even simulate 3D sound for each one, or something like that, with the processing power left.

Yeah, it'd be interesting to hear more on that demo aside from the high-level "hundreds of thousands" of leaves - iirc, they were modelling a few vortices. The claim was that they could have a channel of sound for each leaf if you so wanted (but presumably there are only so many leaves that would be within "listening" distance at any one time..). More interestingly, though, as far as I can tell the leaves weren't just textured billboards.. some were curved etc. Did they model every one with geometry?

Lair also seems to have really nice particle effects. The rain looks really great; it looks to be very dense.
 
PS2's particle prowess was a combination of two very important components - the GS's eDRAM (= massive fillrate), and VU1 being able to 'generate' quads on the fly from raw particle data or even completely procedurally (emitters + time equations).
On the other hand, Xbox had to render to main RAM and fight over BW with the CPU.
The result was that on every game I worked on, we had to cut the particle count on the respective Xbox version by more than half!
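The 'generate quads on the fly' part amounts to something like this per particle (a C++ sketch of the idea, not VU code, reusing the little Vec3 from the ray-marching snippet above plus a subtraction operator; the point is the quad geometry never has to exist in main memory):

Code:
// One particle in, one camera-facing quad out -- what VU1 did on-chip.
Vec3 operator-(Vec3 a, Vec3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }

struct Particle { Vec3 pos; float size; };

void emitQuad(const Particle& p, Vec3 camRight, Vec3 camUp, Vec3 out[4])
{
    Vec3 r = camRight * (p.size * 0.5f);
    Vec3 u = camUp    * (p.size * 0.5f);
    out[0] = p.pos - r - u;   // bottom-left
    out[1] = p.pos + r - u;   // bottom-right
    out[2] = p.pos + r + u;   // top-right
    out[3] = p.pos - r + u;   // top-left
}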

Next gen, it seems the situation has reversed. I have no doubt Xenos will decimate RSX in particle performance, no matter whether you help out with Cell or not (which I think is a waste of time anyway).

My biggest problem with next gen is that AGAIN there is no HW support for particle sorting of any kind! Dreamcast was the only system that provided HW transparency sorting, and their solution was kind of slow.

To get that rich, deep smoke/explosion effect, you need to sort; there are no two ways about it. If you fake it and superimpose smoke on top of fire or vice versa, it looks cheap and fake.
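The sort itself is the easy part, of course (sketch below, reusing Particle and Vec3 from the snippet above); the objection is doing it for hundreds of thousands of particles every frame with no HW help.

Code:
#include <algorithm>
#include <vector>

float dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

// Painter's ordering: farthest first, so over-blending composites correctly.
void sortBackToFront(std::vector<Particle>& ps, Vec3 camPos, Vec3 camDir)
{
    std::sort(ps.begin(), ps.end(),
              [&](const Particle& a, const Particle& b) {
                  return dot(a.pos - camPos, camDir) > dot(b.pos - camPos, camDir);
              });
}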
 
Barbarian said:
Next gen, it seems the situation has reversed. I have no doubt Xenos will decimate RSX in particle performance, no matter whether you help out with Cell or not (which I think is a waste of time anyway).

I think "decimate" is a little strong a word..

Care to elaborate? I think if you got it working on Cell, certainly that would make a big difference.

I'm trying to flesh out the idea of transparencies on Cell vs the rest on RSX. The biggest obstacle I see is perhaps the size of tile used, but my knowledge isn't exhaustive.

I guess Faf or Npl - can you elaborate a little on how this might work?

My thinking is you have a separate transparency buffer, split into tiles. For each tile you need the corresponding portion of the rest of the scene's z-buffer (this could be populated early if you wanted Cell to draw transparencies in parallel with RSX shading?). For each tile you'd need to figure out which particles belong to it (some particles may belong to multiple tiles), and perhaps sort back to front to make things easier (is this hard/easy to do?). Then, basically, for each tile Cell would rasterise the particles (in order, if sorted), checking against the z-buffer to see if they're visible or not. There may be potential to stream the particles in? Do that for each tile, then finally blend the "transparency" buffer with the rest of the scene?

LS space is obviously limited (it must accommodate the output buffer, z-buffer, some geometry - stream that in? - a little texture data, program code..?), and that's why I wonder about the tile size..

These are just my own, likely ill-informed, thoughts on the matter.. there may be much better ways of doing this, or there may be stuff I'm missing. I think it's very much worth discussing, though, given its apparent potential. If it worked, it would save RSX bandwidth and free it from some processing to work on other things. I'm guessing a little here, but I think Cell could also generate masses of particles internally, then transform and draw them without that geometry ever having to touch another piece of bandwidth? Obviously the bandwidth between the SPEs and LS would accommodate a lot more drawing too.
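To make the question concrete, here's the per-tile inner loop I'm imagining (a pure sketch in plain C++, everything hypothetical: point-sprite particles already projected to screen space and binned per tile):

Code:
#include <algorithm>
#include <vector>

struct Pixel { float r, g, b, a; };

// A particle already projected to screen space and binned to this tile.
struct ScreenParticle { float x, y, z, radius, r, g, b, alpha; };

// One SPE, one WxH tile: zTile is the opaque scene's depth (streamed in),
// outTile accumulates blended colour to composite over RSX's image later.
void renderTile(int tileX0, int tileY0, int W, int H,
                const float* zTile, Pixel* outTile,
                std::vector<ScreenParticle>& parts)
{
    // Per-tile back-to-front sort (the "sort batches" compromise).
    std::sort(parts.begin(), parts.end(),
              [](const ScreenParticle& a, const ScreenParticle& b) {
                  return a.z > b.z;
              });

    for (const ScreenParticle& p : parts) {
        int x0 = std::max(int(p.x - p.radius) - tileX0, 0);
        int x1 = std::min(int(p.x + p.radius) - tileX0, W - 1);
        int y0 = std::max(int(p.y - p.radius) - tileY0, 0);
        int y1 = std::min(int(p.y + p.radius) - tileY0, H - 1);

        for (int y = y0; y <= y1; ++y)
            for (int x = x0; x <= x1; ++x) {
                if (p.z > zTile[y * W + x]) continue;  // hidden by opaque scene
                Pixel& d = outTile[y * W + x];
                float  a = p.alpha;                    // flat-alpha square sprite
                d.r = p.r * a + d.r * (1 - a);         // classic "over" blend
                d.g = p.g * a + d.g * (1 - a);
                d.b = p.b * a + d.b * (1 - a);
                d.a = a + d.a * (1 - a);
            }
    }
}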

Faf/Npl - care to throw a little light?
 
It can work if you do an early Z pass, copy the z-buffer over to main RAM and start streaming portions of it together with particles to one or more SPUs.
Particles usually don't need to modify Z, so you don't need to integrate it back with RSX. You'll need to blend the frame buffer, obviously.
It will be tricky with other transparencies that RSX handles, unless you pull all transparent work over to Cell.
And you can't sort them (completely, that is) since you can only access a batch at a time.
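The streaming would presumably follow the usual SPU double-buffering pattern - fetch tile N+1 while rasterising tile N. A sketch under assumptions: dma_get/dma_put/dma_wait are hypothetical wrappers standing in for the real MFC transfer calls, and TileData/tileAddress are invented for illustration.

Code:
#include <cstdint>
#include <cstddef>
using std::size_t;

struct TileData { /* z slice, colour slice, particle batch for one tile */ };

// No-op stubs so the sketch compiles; real code would issue MFC DMAs.
void     dma_get (void*, uint64_t, size_t, int) { /* mfc_get on real HW  */ }
void     dma_put (void*, uint64_t, size_t, int) { /* mfc_put on real HW  */ }
void     dma_wait(int)                          { /* wait on a tag group */ }
uint64_t tileAddress(int i) { return 0x100000ull + 0x4000ull * i; } // made up
void     renderTile(TileData&) { /* per-tile loop from the earlier sketch */ }

void streamTiles(int numTiles)
{
    TileData buf[2];                        // two tiles resident in local store

    dma_get(&buf[0], tileAddress(0), sizeof(TileData), 0);
    for (int i = 0; i < numTiles; ++i) {
        int cur = i & 1, nxt = cur ^ 1;
        if (i + 1 < numTiles) {
            dma_wait(nxt);                  // buffer free? (old write-back done)
            dma_get(&buf[nxt], tileAddress(i + 1), sizeof(TileData), nxt);
        }
        dma_wait(cur);                      // tile i has fully arrived
        renderTile(buf[cur]);
        dma_put(&buf[cur], tileAddress(i), sizeof(TileData), cur);
    }
    dma_wait(0);
    dma_wait(1);                            // drain outstanding write-backs
}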

The biggest problem I see is that particles are inherently very small and very translucent. That means loads and loads of very tiny colour outputs. We are talking a million particles here. Can you compute one every cycle (like Xenos and RSX can)?
And besides, if you tie up your SPU with rasterization, what about animations, physics, AI?
I mean, don't get me wrong - I so wanted to do graphics work on Cell. The patent, the initial design with four Cells + a Cell rasterizer... it was all begging for SW rasterization. Unfortunately, with one Cell + RSX, I don't think it will be profitable to go that route.
 
Barbarian said:
You'll need to blend the frame buffer, obviously.

Can this be one final blend at the end? Draw all the transparencies and then blend with RSX's buffer?

Barbarian said:
It will be tricky with other transparencies that RSX handles, unless you pull all transparent work over to Cell.

This is what I was thinking to do.

Barbarian said:
And you can't sort them (completely, that is) since you can only access a batch at a time.

It may not cover all cases, but sorting batches could work OK. If you have a bucket of all the particles in a particular portion of the frame, you could sort them independently of the others? You might have problems with shared particles, though... but I'm not sure you'd notice in dense particle systems.

Barbarian said:
The biggest problem I see is that particles are inherently very small and very translucent. That means loads and loads of very tiny colour outputs. We are talking a million particles here.

Not that it can't be done, but are we? At least with a traditional "generate on the CPU, render on the GPU" approach, that could be very difficult given the BW between the two. I think doing millions would require avoiding the need to use CPU<->GPU bandwidth, which would necessitate an approach like the one we're talking about now, or similarly doing it all internally on the GPU (can either RSX or Xenos create vertices?). Millions seems like a lot to be drawing in one frame, even with overdraw.

edit - Whoops! I had a stray couple of zeros in my calcs for this. You could actually push 1M particles from CPU to GPU OK, probably; it would take a few GB/s depending on your framerate.
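(For the record: 1M particles × 32 bytes each - position, colour, size and a UV, say - × 60fps is about 1.9 GB/s; a leaner 16-byte particle at 30fps comes in under 0.5 GB/s. The per-particle size is an assumption, obviously.)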

Barbarian said:
Can you compute one every cycle (like Xenos and RSX can)?
And besides, if you tie up your SPU with rasterization, what about animations, physics, AI?

The amount of time required is a good question; I'm not sure at all. The alpha blending itself doesn't seem particularly complicated, though (I'd say the sorting and splitting into tiles might take as long, if not longer). I'm not really qualified to say, though - I'd leave it to Faf or Npl or other devs to comment on that.
 
ERP said:
The placement of the frame buffer is the key to fully utilizing the bandwidth available on the PS3. It seems that the best way to handle that is to place the color portion of the buffer in VRAM and the Z buffer in main RAM. Without splitting them there is not much to be gained.

You're making the assumption that this configuration is even possible....

Not an assumption, just a statement of opinion. If they can't be split, then all this talk about leveraging Cell's memory bandwidth is for naught. The color and Z buffers are far and away the largest bandwidth consumers. Stuffing them both down the same bus will cause congestion.

If they can't be split then the RSX will be severely hamstrung by its 128-bit bus to the VRAM. Frame buffer read and write bandwidth would be capped at 22GB/s (VRAM), or 25GB/s if both can be placed together in the Rambus memory. The actual numbers will be lower after taking other bandwidth consumers into account. The GeForce 7800 GTX has 46GB/s minus the texture read bandwidth, and the RSX is supposed to run faster (550 vs 430)? I just don't see it.

Let's look at a comparison with the 360. The RSX needs to consume 16GB/s of write bandwidth apiece for the color and Z buffers to keep up with Xenos. It also needs quite a bit of read bandwidth for Z compare and blending, which Xenos does inside its eDRAM. That 32GB/s (+ read) can't all come from one side.
 