!eVo!-X Ant UK said: All that can be done on PS3 as well via SPE.
You know from experience?
!eVo!-X Ant UK said: All that can be done on PS3 as well via SPE.
scooby_dooby said: You know from experience?
It can, but the logistics are a bit of a pain. With only 256K of local store you can't fit much of the frame buffer into memory at a time, so you have to stream small tiles through. For a lot of post-processing effects (motion blur, depth of field, bloom, anything with a filter kernel that spans multiple pixels of the input) you can't process tiles entirely in isolation, because pixels in one tile are affected by values in neighbouring tiles. That means you have to make sure the right neighbouring tiles are available at all times.
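For illustration, here is a minimal sketch of the tiling scheme being described, in plain C rather than real SPE code: a small array stands in for the 256K local store, clamped reads stand in for the DMA transfers, and the frame, tile and border sizes are made-up values, not anything from this thread.

```c
/* Sketch only: plain C standing in for SPE code. Real SPE code would
 * DMA each region into the 256K local store and double-buffer the
 * transfers; the sizes below are illustrative, not from the thread. */
#include <stdint.h>

#define FRAME_W 1280            /* assumed frame dimensions */
#define FRAME_H 720
#define TILE    64              /* illustrative tile size */
#define RADIUS  2               /* filter kernel radius -> halo width */
#define LOCAL_W (TILE + 2 * RADIUS)

/* Fetch one tile plus a RADIUS-wide halo of neighbouring pixels into
 * a local buffer, clamping at the frame edges. The halo is exactly the
 * "right neighbouring tiles" data a multi-pixel kernel needs. */
static void fetch_tile(const uint32_t *frame, int tx, int ty,
                       uint32_t local[LOCAL_W][LOCAL_W])
{
    for (int y = 0; y < LOCAL_W; y++) {
        for (int x = 0; x < LOCAL_W; x++) {
            int sx = tx * TILE + x - RADIUS;
            int sy = ty * TILE + y - RADIUS;
            if (sx < 0) sx = 0; else if (sx >= FRAME_W) sx = FRAME_W - 1;
            if (sy < 0) sy = 0; else if (sy >= FRAME_H) sy = FRAME_H - 1;
            local[y][x] = frame[sy * FRAME_W + sx];
        }
    }
}
```

A streaming pass would then just walk tx/ty over all the tiles, filter the interior TILE x TILE region of each local buffer, and write the results back out.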
!eVo!-X Ant UK said: All that can be done on PS3 as well via SPE.
heliosphere said: It can, but the logistics are a bit of a pain. With only 256K of local store you can't fit much of the frame buffer into memory at a time, so you have to stream small tiles through. For a lot of post-processing effects (motion blur, depth of field, bloom, anything with a filter kernel that spans multiple pixels of the input) you can't process tiles entirely in isolation, because pixels in one tile are affected by values in neighbouring tiles. That means you have to make sure the right neighbouring tiles are available at all times.
Then you have the fact that for this kind of work the SPEs are noticeably slower than a GPU. The Xenos, for example, can execute 48 vector and scalar ALU ops and perform 16 bilinear filtered texture fetches per clock. A bilinear texture fetch is 4 memory accesses and two lerps, each of which is at least two vector ops. Many post-processing filters can take advantage of bilinear filtering to gain significant speedups. The SPEs are clocked a little over 6x faster than the Xenos, but you're still probably looking at needing all 6 available SPEs working flat out to get anywhere near the performance of a GPU for this kind of thing.
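To make that arithmetic concrete, here is roughly what a single bilinear fetch costs when the hardware won't do it for you - a plain scalar C sketch (the function name and single-channel format are just for illustration):

```c
/* A software bilinear fetch: four memory accesses plus a chain of
 * lerps. Scalar code needs three lerps; on vector hardware the two
 * horizontal ones fold into one vector lerp, hence the "two lerps"
 * figure above. A GPU does 16 of these per clock in fixed-function
 * hardware, essentially for free. */
static inline float lerp(float a, float b, float t)
{
    return a + (b - a) * t;
}

/* tex is a w x h single-channel image; u, v are texel coordinates. */
float bilinear_fetch(const float *tex, int w, int h, float u, float v)
{
    int   x0 = (int)u,        y0 = (int)v;
    int   x1 = x0 + 1 < w ? x0 + 1 : x0;
    int   y1 = y0 + 1 < h ? y0 + 1 : y0;
    float fx = u - (float)x0, fy = v - (float)y0;

    float t00 = tex[y0 * w + x0];       /* 4 memory accesses */
    float t10 = tex[y0 * w + x1];
    float t01 = tex[y1 * w + x0];
    float t11 = tex[y1 * w + x1];

    return lerp(lerp(t00, t10, fx),     /* 2 horizontal lerps */
                lerp(t01, t11, fx),
                fy);                    /* 1 vertical lerp */
}
```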
David Kirk: SPE and RSX can work together. SPE can preprocess graphics data in the main memory or postprocess rendering results sent from RSX.
Nishikawa's speculation: for example, when you have to create a lake scene by multi-pass rendering with multiple render targets, an SPE can render a reflection map while RSX does other things. Since a reflection map requires less precision, it's not much of an overhead even though you have to load the related data into both the main RAM and VRAM. It works like SLI between SPE and RSX.
David Kirk: Post-effects such as motion blur, simulation for depth of field, bloom effect in HDR rendering, can be done by SPE processing RSX-rendered results.
Nishikawa's speculation: RSX renders a scene into the main RAM, then the SPEs add effects to the frames in it. Or you can composite SPE-created frames with an RSX-rendered frame.
David Kirk: Let the SPEs do vertex processing, then let RSX render it.
Nishikawa's speculation: You can implement a collision-aware tessellator and dynamic LOD on the SPEs.
David Kirk: SPE and GPU work together, which allows physics simulation to interact with graphics.
Nishikawa's speculation: To express water wavelets, a normal map can be generated by running a wave physics simulation on a height map texture. This job is done on the SPEs and RSX in parallel.
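The height-map-to-normal-map step mentioned here is conventionally done with finite differences. A minimal C sketch of that construction, with an illustrative bump_scale parameter (not something from the article):

```c
/* Derive a surface normal from a height map by central differences -
 * the standard construction behind the water-wavelet example above.
 * bump_scale is an illustrative tuning constant. */
#include <math.h>

typedef struct { float x, y, z; } vec3;

static vec3 height_to_normal(const float *height, int w, int h,
                             int x, int y, float bump_scale)
{
    /* Clamp neighbour lookups at the map edges. */
    int xl = x > 0     ? x - 1 : x;
    int xr = x < w - 1 ? x + 1 : x;
    int yu = y > 0     ? y - 1 : y;
    int yd = y < h - 1 ? y + 1 : y;

    /* Central differences approximate the height gradient. */
    float dx = (height[y * w + xr] - height[y * w + xl]) * 0.5f;
    float dy = (height[yd * w + x] - height[yu * w + x]) * 0.5f;

    /* The normal tilts against the slope; normalise it. */
    vec3  n   = { -dx * bump_scale, -dy * bump_scale, 1.0f };
    float inv = 1.0f / sqrtf(n.x * n.x + n.y * n.y + n.z * n.z);
    n.x *= inv; n.y *= inv; n.z *= inv;
    return n;
}
```

This is exactly the sort of regular, streamable per-pixel loop the SPEs are good at, which is presumably why it gets singled out.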
!eVo!-X Ant UK said: So you're saying that Cell can post-process, but just not at a decent enough speed to make it worthwhile?
No, I'm just saying it can't really compete with GPU performance when it comes to post-processing effects. It might still be worthwhile in some situations (if your GPU is maxed out doing other stuff and the SPUs are mostly sitting idle).
EDIT:
scooby_dooby said: Isn't it obvious? If the PS3 is going to be a location-free server for the PSP it's going to need to re-encode video on the fly, while people are potentially playing games on that same PS3. They can't saddle that on the PPE because it would probably be too hard to manage the performance drops in-game; by reserving an entire SPE they make it easy for devs to target the full power of the system without having to worry about how it will perform in various situations.
Inane_Dork said: I also thought that some uses of the GPU are naturally weighted, say shadow map creation, in such a way that they are accelerated with a unified approach.
That's just something people who don't understand how shadow generation works say.
Megadrive1988 said: what about this possibility: 12 pixel shader pipes, with 2 textures each, making 24 textures, 8 ROPs, and 6 vertex shaders. just a wild guess. probably wrong. but something has to give, given the 128-bit memory interface.
You had already 'given' something by cutting ROPs to 8, yet you still think they have to cut every other single thing in the chip nearly in half?
heliosphere said: It can, but the logistics are a bit of a pain. With only 256K of local store you can't fit much of the frame buffer into memory at a time, so you have to stream small tiles through.
That's 32x more than what we sometimes worked with on PS2 (for maximum efficiency, postprocessing was often optimized for DRAM page sizes). And btw, box filters don't need entire bloody neighbouring tiles to work - you only need a small border (boxwidth/2) on each side of the tile.
The only "significant" speedups you get are when you downscale your buffers first, which requires writebacks to memory anyhow, so no reason to use SPEs for it.Many post processing filters can take advantage of bilinear filtering to gain significant speedups.
Fafalada said: That's 32x more than what we sometimes worked with on PS2 (for maximum efficiency, postprocessing was often optimized for DRAM page sizes). And btw, box filters don't need entire bloody neighbouring tiles to work - you only need a small border (boxwidth/2) on each side of the tile.
Sure, you can bring in less than a full tile, but that's got its own problems as well (fetching non-contiguous chunks of memory). Anyway, the point was really that there's engineering effort and performance overhead involved in getting this to work on the SPEs, whereas it's trivial and efficient to do on the GPU. I don't know where you're getting the box filters thing from - any kind of filter with a kernel that spans more than one pixel is going to require information from neighbouring tiles.
The other thing is that "postprocessing = boxfilters" is the mentality of the PS2 generation; I would like to think new hardware moves us way ahead of that point.
Fafalada said: The only "significant" speedups you get are when you downscale your buffers first, which requires writebacks to memory anyhow, so there's no reason to use the SPEs for it.
Not true - see http://www.ati.com/developer/gdc/GDC2003_ScenePostprocessing.pdf for example. It's a pretty standard trick to use one bilinear filtered texture fetch to simultaneously sample and weight multiple pixels from the input.
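For reference, the trick from those slides reduces to something like the sketch below (reusing the software bilinear_fetch stand-in from earlier; the function name and weight parameters are illustrative): place one bilinear sample between two adjacent texels so the hardware lerp applies both filter weights, halving the number of fetches.

```c
/* Two filter taps for the price of one fetch: for adjacent texels with
 * weights w0 and w1, sample once at offset w1/(w0+w1) between them and
 * scale by (w0+w1). The bilinear lerp does the weighting for free.
 * bilinear_fetch() is the software stand-in sketched earlier. */
float bilinear_fetch(const float *tex, int w, int h, float u, float v);

float two_taps_in_one(const float *tex, int w, int h,
                      float x, float y, float w0, float w1)
{
    float offset = w1 / (w0 + w1);  /* where to land between the texels */
    return (w0 + w1) * bilinear_fetch(tex, w, h, x + offset, y);
}
```

Check: (w0+w1) * ((1-t)*tex[x] + t*tex[x+1]) with t = w1/(w0+w1) gives back w0*tex[x] + w1*tex[x+1], so e.g. a 7-tap blur needs only 4 fetches.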
Fafalada said: Anyway, with the right scheduling the SPEs could be a win for postprocessing even if they are considerably slower than the GPU at the said task. The whole point is that the GPU can go off to do something else in the meantime (render the new frame, or whatever).
Yeah, I already said this in my reply to eVo. The only reason to do this on the SPUs would be if you were GPU bound and had SPU cycles to burn (which will probably be a common scenario, admittedly). SPUs can do this kind of thing better than a standard processor, but they suck compared to a GPU.
The idea to do this only cuz the SPEs might finish a few % faster than the GPU would - and have your GPU sit idle in the meantime - is just kind of self-defeating.
Fafalada said: That's just something people who don't understand how shadow generation works say.
Interesting. Would you mind explaining why that is not the case? I'm not aware of solid proof either way.
Inane_Dork said: Interesting. Would you mind explaining why that is not the case? I'm not aware of solid proof either way.
The original poster was actually correct - shadow map generation should be more efficient on a unified architecture. A non-unified architecture will be balanced for 'typical' workloads where there is a fair amount of vertex processing work to be done (skinning, vertex animation, transform and projection, etc.) and a fair amount of pixel processing to be done (per-pixel lighting calculations, texturing, fancy shading techniques). When you're rendering shadow maps you still have to do most of the vertex processing work, but there's no pixel processing to be done - you're just writing depth to the shadow map. In a non-unified architecture that means the vertex shader units are working at maximum capacity and the pixel shader units are sitting idle. On a unified architecture all the ALUs will be doing vertex shader work, so you're getting more use from the available resources and therefore greater efficiency.

The caveats mentioned before about unified architectures being less efficient in general at vertex and pixel shading because they are less specialised still apply, but at least in principle there should be an advantage to a unified architecture for shadow map generation.
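A back-of-envelope illustration of that utilisation argument, with deliberately made-up unit counts (these are not actual Xenos or RSX figures):

```c
/* Shadow-map pass utilisation: a fixed split versus a unified pool.
 * The unit counts are invented purely for illustration. */
#include <stdio.h>

int main(void)
{
    const int vs_units = 8;                  /* assumed vertex ALUs */
    const int ps_units = 24;                 /* assumed pixel ALUs  */
    const int total    = vs_units + ps_units;

    /* Shadow-map rendering is all vertex work, no pixel shading. */
    printf("split:   %d of %d ALUs busy (%.0f%%)\n",
           vs_units, total, 100.0 * vs_units / total);
    printf("unified: %d of %d ALUs busy (100%%)\n", total, total);
    return 0;
}
```

With these invented numbers the split design leaves three quarters of its ALUs idle during the shadow pass, while the unified one can throw everything at vertex work - the effect described above.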
heliosphere said: Not true - see http://www.ati.com/developer/gdc/GDC2003_ScenePostprocessing.pdf for example. It's a pretty standard trick to use one bilinear filtered texture fetch to simultaneously sample and weight multiple pixels from the input.
It all depends on what you consider to be a "significant" improvement. The increased number of lerps (if the hw can only point sample) may or may not account for a significant portion of the shader in question.
heliosphere said: any kind of filter with a kernel that spans more than one pixel is going to require information from neighbouring tiles.
Right - it was a poor choice of wording on my part.
heliosphere said: Sure, you can bring in less than a full tile, but that's got its own problems as well (fetching non-contiguous chunks of memory).
Actually I'm not sure now - were you referring to physical GPU tiles? In that case we're skirting the edge of talking hw-specific memory layouts, so I won't go there.
Fafalada said: It all depends on what you consider to be a "significant" improvement. The increased number of lerps (if the hw can only point sample) may or may not account for a significant portion of the shader in question.
Let's just say it can be significant, for certain kinds of postprocessing and certain definitions of significant.
And not all postprocessing involves averaging between input samples; it also comes down to picking out which workloads make sense where.
Fafalada said: It's at least worth considering splitting these workloads between the GPU and CPU; any postprocessing that is ROP- rather than shader-bound is potentially wasting the GPU's strongest resource (this applies to Xenos too).
No argument that it can be worth considering splitting the workloads. If you're bottlenecked in one place and you've got spare cycles somewhere else, it makes sense to rebalance by shifting some work around. With the EDRAM it's pretty hard to be ROP bound on the Xenos unless your shader is very short, but you do have to pay for the resolve, so there could be situations where you might want to move some work to the CPUs.
Fafalada said: Actually I'm not sure now - were you referring to physical GPU tiles? In that case we're skirting the edge of talking hw-specific memory layouts, so I won't go there.
I was thinking of physical GPU tiles, since on a fixed platform you would probably know the layout, but yeah, it's not something for public discussion.
Although for what it's worth, most GPUs I'm familiar with tile memory in pretty tiny sizes, so you could easily keep many of those in local memory at a time.
heliosphere said: Sure, you can bring in less than a full tile, but that's got its own problems as well (fetching non-contiguous chunks of memory). Anyway, the point was really that there's engineering effort and performance overhead involved in getting this to work on the SPEs, whereas it's trivial and efficient to do on the GPU. I don't know where you're getting the box filters thing from - any kind of filter with a kernel that spans more than one pixel is going to require information from neighbouring tiles.
For the most common filters, blurs, a fast pseudo-Gaussian will work on horizontal then vertical components, only needing a one-pixel lookahead, same as a fast box blur. This could be streamed ideally well for SPEs. Other traditional kernel filters like median would need the full kernel size, but I don't know how many of those you'd use in a game situation, and even then they're only a 3x3 kernel.

As I look at it, with some experience of Photoshop plugin development, I can't immediately see many occasions in a game post-processing situation where you'd need multiple image tiles stored at a time. Most would make do with a border of a pixel or a few. Well, I say most, but that's really those that can't be rendered as a two-pass X,Y linear one-pixel-lookahead filter. That'll give you most of that 256KB to fit the image tile to be processed.

Not sure how large a quad you could fit, though. Some filters would want z data, for example. I guess 32 bits for RGBA + 16 for z, maybe 6 bytes per pixel, so a 128x128 tile would fit nicely. All it would really need is a nice library to handle the tiling and fetching of data, and adding post-processing (or any image processing) on SPEs should be relatively easy.
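A sketch of that two-pass X,Y scheme, assuming clamp-at-edge sampling (sizes and names illustrative): a running-sum box blur over one line only ever needs a radius-wide lookahead, and running it over rows and then columns (repeatedly, if you want a pseudo-Gaussian) gives the separable blur described above.

```c
/* One line of a separable box blur as a running sum: add the incoming
 * pixel, drop the outgoing one. Apply to every row, then to every
 * column of the result; repeated passes approximate a Gaussian. */
static void box_blur_line(const float *src, float *dst, int n, int radius)
{
    int   span = 2 * radius + 1;
    float sum  = 0.0f;

    /* Prime the window around pixel 0, clamping at the line start. */
    for (int i = -radius; i <= radius; i++) {
        int j = i < 0 ? 0 : (i >= n ? n - 1 : i);
        sum += src[j];
    }
    for (int x = 0; x < n; x++) {
        dst[x] = sum / (float)span;
        /* Slide the window one pixel to the right. */
        int in  = x + radius + 1; if (in >= n) in = n - 1;
        int out = x - radius;     if (out < 0) out = 0;
        sum += src[in] - src[out];
    }
}
```

The lookahead is just the radius pixels past the current output position, which is why a tile plus a thin border is all the local store ever needs to hold.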
Titanio said: I figured it was worth its own thread
Oh, no problem. I found some interesting parts were left out of the above summary, so I'll put a more complete one in the other thread...
http://www.beyond3d.com/forum/showthread.php?t=29622
There seems to be more new info in this one - also the particle simulation and RSX/Cell synchronisation parts are new.