!eVo!-X Ant UK said: All that can be done on PS3 as well via SPE.
You know from experience?
!eVo!-X Ant UK said: All that can be done on PS3 as well via SPE.
scooby_dooby said: You know from experience?
It can, but the logistics are a bit of a pain. With only 256K of local store you can't fit much of the frame buffer into memory at a time, so you have to stream small tiles through. For a lot of post-processing effects (motion blur, depth of field, bloom, anything with a filter kernel that spans multiple pixels of the input) you can't process tiles entirely in isolation, because pixels in one tile are affected by values in neighbouring tiles. That means you have to make sure the right neighbouring tiles are available at all times.
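For illustration, here is a minimal sketch of the tiling scheme being described, in plain C rather than real SPE code: a small array stands in for the 256K local store, clamped reads stand in for the DMA transfers, and the frame, tile and border sizes are made-up values, not anything from this thread.

```c
/* Sketch only: plain C standing in for SPE code. Real SPE code would
 * DMA each region into the 256K local store and double-buffer the
 * transfers; the sizes below are illustrative, not from the thread. */
#include <stdint.h>

#define FRAME_W 1280            /* assumed frame dimensions */
#define FRAME_H 720
#define TILE    64              /* illustrative tile size */
#define RADIUS  2               /* filter kernel radius -> halo width */
#define LOCAL_W (TILE + 2 * RADIUS)

/* Fetch one tile plus a RADIUS-wide halo of neighbouring pixels into
 * a local buffer, clamping at the frame edges. The halo is exactly the
 * "right neighbouring tiles" data a multi-pixel kernel needs. */
static void fetch_tile(const uint32_t *frame, int tx, int ty,
                       uint32_t local[LOCAL_W][LOCAL_W])
{
    for (int y = 0; y < LOCAL_W; y++) {
        for (int x = 0; x < LOCAL_W; x++) {
            int sx = tx * TILE + x - RADIUS;
            int sy = ty * TILE + y - RADIUS;
            if (sx < 0) sx = 0; else if (sx >= FRAME_W) sx = FRAME_W - 1;
            if (sy < 0) sy = 0; else if (sy >= FRAME_H) sy = FRAME_H - 1;
            local[y][x] = frame[sy * FRAME_W + sx];
        }
    }
}
```

A streaming pass would then just walk tx/ty over all the tiles, filter the interior TILE x TILE region of each local buffer, and write the results back out.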
!eVo!-X Ant UK said: All that can be done on PS3 as well via SPE.
heliosphere said: It can, but the logistics are a bit of a pain. With only 256K of local store you can't fit much of the frame buffer into memory at a time, so you have to stream small tiles through. For a lot of post-processing effects (motion blur, depth of field, bloom, anything with a filter kernel that spans multiple pixels of the input) you can't process tiles entirely in isolation, because pixels in one tile are affected by values in neighbouring tiles. That means you have to make sure the right neighbouring tiles are available at all times.
Then you have the fact that for this kind of work the SPEs are noticeably slower than a GPU. The Xenos, for example, can execute 48 vector and scalar ALU ops and perform 16 bilinear filtered texture fetches per clock. A bilinear texture fetch is 4 memory accesses and two lerps, each of which is at least two vector ops. Many post-processing filters can take advantage of bilinear filtering to gain significant speedups. The SPEs are clocked a little over 6x faster than the Xenos, but you're still probably looking at needing all 6 available SPEs working flat out to get anywhere near the performance of a GPU for this kind of thing.
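To make that arithmetic concrete, here is roughly what a single bilinear fetch costs when the hardware won't do it for you - a plain scalar C sketch (the function name and single-channel format are just for illustration):

```c
/* A software bilinear fetch: four memory accesses plus a chain of
 * lerps. Scalar code needs three lerps; on vector hardware the two
 * horizontal ones fold into one vector lerp, hence the "two lerps"
 * figure above. A GPU does 16 of these per clock in fixed-function
 * hardware, essentially for free. */
static inline float lerp(float a, float b, float t)
{
    return a + (b - a) * t;
}

/* tex is a w x h single-channel image; u, v are texel coordinates. */
float bilinear_fetch(const float *tex, int w, int h, float u, float v)
{
    int   x0 = (int)u,        y0 = (int)v;
    int   x1 = x0 + 1 < w ? x0 + 1 : x0;
    int   y1 = y0 + 1 < h ? y0 + 1 : y0;
    float fx = u - (float)x0, fy = v - (float)y0;

    float t00 = tex[y0 * w + x0];       /* 4 memory accesses */
    float t10 = tex[y0 * w + x1];
    float t01 = tex[y1 * w + x0];
    float t11 = tex[y1 * w + x1];

    return lerp(lerp(t00, t10, fx),     /* 2 horizontal lerps */
                lerp(t01, t11, fx),
                fy);                    /* 1 vertical lerp */
}
```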
David Kirk: SPE and RSX can work together. SPE can preprocess graphics data in the main memory or postprocess rendering results sent from RSX.
Nishikawa's speculation: for example, when you have to create a lake scene by multi-pass rendering with multiple render targets, an SPE can render a reflection map while RSX does other things. Since a reflection map requires less precision, it's not much of an overhead even though you have to load the related data into both the main RAM and VRAM. It works like SLI between SPE and RSX.
David Kirk: Post-effects such as motion blur, simulation for depth of field, bloom effect in HDR rendering, can be done by SPE processing RSX-rendered results.
Nishikawa's speculation: RSX renders a scene into the main RAM, then the SPEs add effects to the frames in it. Or you can composite SPE-created frames with an RSX-rendered frame.
David Kirk: Let the SPEs do vertex processing, then let RSX render it.
Nishikawa's speculation: You can implement a collision-aware tessellator and dynamic LOD on the SPEs.
David Kirk: SPE and GPU work together, which allows physics simulation to interact with graphics.
Nishikawa's speculation: To express water wavelets, a normal map can be generated by running a wave physics simulation on a height map texture. This job is done on the SPEs and RSX in parallel.
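The height-map-to-normal-map step mentioned here is conventionally done with finite differences. A minimal C sketch of that construction, with an illustrative bump_scale parameter (not something from the article):

```c
/* Derive a surface normal from a height map by central differences -
 * the standard construction behind the water-wavelet example above.
 * bump_scale is an illustrative tuning constant. */
#include <math.h>

typedef struct { float x, y, z; } vec3;

static vec3 height_to_normal(const float *height, int w, int h,
                             int x, int y, float bump_scale)
{
    /* Clamp neighbour lookups at the map edges. */
    int xl = x > 0     ? x - 1 : x;
    int xr = x < w - 1 ? x + 1 : x;
    int yu = y > 0     ? y - 1 : y;
    int yd = y < h - 1 ? y + 1 : y;

    /* Central differences approximate the height gradient. */
    float dx = (height[y * w + xr] - height[y * w + xl]) * 0.5f;
    float dy = (height[yd * w + x] - height[yu * w + x]) * 0.5f;

    /* The normal tilts against the slope; normalise it. */
    vec3  n   = { -dx * bump_scale, -dy * bump_scale, 1.0f };
    float inv = 1.0f / sqrtf(n.x * n.x + n.y * n.y + n.z * n.z);
    n.x *= inv; n.y *= inv; n.z *= inv;
    return n;
}
```

This is exactly the sort of regular, streamable per-pixel loop the SPEs are good at, which is presumably why it gets singled out.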
!eVo!-X Ant UK said: So you're saying that Cell can post-process, but just not at a decent enough speed to make it worthwhile?
No, I'm just saying it can't really compete with GPU performance when it comes to post-processing effects. It might still be worthwhile in some situations (if your GPU is maxed out doing other stuff and the SPUs are mostly sitting idle).
EDIT:
scooby_dooby said: Isn't it obvious? If the PS3 is going to be a location-free server for the PSP it's going to need to re-encode video on the fly, while people are potentially playing games on that same PS3. They can't saddle that on the PPE because it would probably be too hard to manage the performance drops in-game; by reserving an entire SPE they make it easy for devs to target the full power of the system without having to worry about how it will perform in various situations.
Inane_Dork said: I also thought that some uses of the GPU are naturally weighted, say shadow map creation, in such a way that they are accelerated with a unified approach.
That's just something people who don't understand how shadow generation works say.
Megadrive1988 said: what about this possibility: 12 pixel shader pipes, with 2 textures each, making 24 textures, 8 ROPs, and 6 vertex shaders. just a wild guess. probably wrong. but something has to give, given the 128-bit memory interface.
You had already 'given' something by cutting ROPs to 8, yet you still think they have to cut every other single thing in the chip nearly in half?
heliosphere said: It can, but the logistics are a bit of a pain. With only 256K of local store you can't fit much of the frame buffer into memory at a time, so you have to stream small tiles through.
That's 32x more than what we sometimes worked with on PS2 (for maximum efficiency, postprocessing was often optimized for DRAM page sizes). And btw, box filters don't need entire bloody neighbouring tiles to work - you only need a small border (boxwidth/2) on each side of the tile.
The only "significant" speedups you get are when you downscale your buffers first, which requires writebacks to memory anyhow, so no reason to use SPEs for it.Many post processing filters can take advantage of bilinear filtering to gain significant speedups.
Fafalada said: That's 32x more than what we sometimes worked with on PS2 (for maximum efficiency, postprocessing was often optimized for DRAM page sizes). And btw, box filters don't need entire bloody neighbouring tiles to work - you only need a small border (boxwidth/2) on each side of the tile.
Sure, you can bring in less than a full tile, but that's got its own problems as well (fetching non-contiguous chunks of memory). Anyway, the point was really that there's engineering effort and performance overhead involved in getting this to work on the SPEs, whereas it's trivial and efficient to do on the GPU. I don't know where you're getting the box filters thing from - any kind of filter with a kernel that spans more than one pixel is going to require information from neighbouring tiles.
The other thing is that "postprocessing = boxfilters" is the mentality of the PS2 generation; I would like to think new hardware moves us way ahead of that point.
Fafalada said: The only "significant" speedups you get are when you downscale your buffers first, which requires writebacks to memory anyhow, so there's no reason to use the SPEs for it.
Not true - see http://www.ati.com/developer/gdc/GDC2003_ScenePostprocessing.pdf for example. It's a pretty standard trick to use one bilinear filtered texture fetch to simultaneously sample and weight multiple pixels from the input.
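For reference, the trick from those slides reduces to something like the sketch below (reusing the software bilinear_fetch stand-in from earlier; the function name and weight parameters are illustrative): place one bilinear sample between two adjacent texels so the hardware lerp applies both filter weights, halving the number of fetches.

```c
/* Two filter taps for the price of one fetch: for adjacent texels with
 * weights w0 and w1, sample once at offset w1/(w0+w1) between them and
 * scale by (w0+w1). The bilinear lerp does the weighting for free.
 * bilinear_fetch() is the software stand-in sketched earlier. */
float bilinear_fetch(const float *tex, int w, int h, float u, float v);

float two_taps_in_one(const float *tex, int w, int h,
                      float x, float y, float w0, float w1)
{
    float offset = w1 / (w0 + w1);  /* where to land between the texels */
    return (w0 + w1) * bilinear_fetch(tex, w, h, x + offset, y);
}
```

Check: (w0+w1) * ((1-t)*tex[x] + t*tex[x+1]) with t = w1/(w0+w1) gives back w0*tex[x] + w1*tex[x+1], so e.g. a 7-tap blur needs only 4 fetches.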
Fafalada said: Anyway, with the right scheduling the SPEs could be a win for postprocessing even if they are considerably slower than the GPU at the said task. The whole point is that the GPU can go off to do something else in the meantime (render the new frame, or whatever).
Yeah, I already said this in my reply to eVo. The only reason to do this on the SPUs would be if you were GPU bound and had SPU cycles to burn (which will probably be a common scenario, admittedly). SPUs can do this kind of thing better than a standard processor, but they suck compared to a GPU.
The idea to do this only cuz the SPEs might finish a few % faster than the GPU would - and have your GPU sit idle in the meantime - is just kind of self-defeating.
Fafalada said: That's just something people who don't understand how shadow generation works say.
Interesting. Would you mind explaining why that is not the case? I'm not aware of solid proof either way.
Inane_Dork said: Interesting. Would you mind explaining why that is not the case? I'm not aware of solid proof either way.
The original poster was actually correct - shadow map generation should be more efficient on a unified architecture. A non-unified architecture will be balanced for 'typical' workloads where there is a fair amount of vertex processing work to be done (skinning, vertex animation, transform and projection, etc.) and a fair amount of pixel processing to be done (per-pixel lighting calculations, texturing, fancy shading techniques). When you're rendering shadow maps you still have to do most of the vertex processing work, but there's no pixel processing to be done - you're just writing depth to the shadow map. In a non-unified architecture that means the vertex shader units are working at maximum capacity and the pixel shader units are sitting idle. On a unified architecture all the ALUs will be doing vertex shader work, so you're getting more use from the available resources and therefore greater efficiency.

The caveats mentioned before about unified architectures being less efficient in general at vertex and pixel shading because they are less specialised still apply, but at least in principle there should be an advantage to a unified architecture for shadow map generation.
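A back-of-envelope illustration of that utilisation argument, with deliberately made-up unit counts (these are not actual Xenos or RSX figures):

```c
/* Shadow-map pass utilisation: a fixed split versus a unified pool.
 * The unit counts are invented purely for illustration. */
#include <stdio.h>

int main(void)
{
    const int vs_units = 8;                  /* assumed vertex ALUs */
    const int ps_units = 24;                 /* assumed pixel ALUs  */
    const int total    = vs_units + ps_units;

    /* Shadow-map rendering is all vertex work, no pixel shading. */
    printf("split:   %d of %d ALUs busy (%.0f%%)\n",
           vs_units, total, 100.0 * vs_units / total);
    printf("unified: %d of %d ALUs busy (100%%)\n", total, total);
    return 0;
}
```

With these invented numbers the split design leaves three quarters of its ALUs idle during the shadow pass, while the unified one can throw everything at vertex work - the effect described above.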
heliosphere said: Not true - see http://www.ati.com/developer/gdc/GDC2003_ScenePostprocessing.pdf for example. It's a pretty standard trick to use one bilinear filtered texture fetch to simultaneously sample and weight multiple pixels from the input.
It all depends on what you consider to be a "significant" improvement. The increased number of lerps (if the hw can only point sample) may or may not account for a significant portion of the shader in question.
heliosphere said: any kind of filter with a kernel that spans more than one pixel is going to require information from neighbouring tiles.
Right - it was a poor choice of wording on my part.
heliosphere said: Sure, you can bring in less than a full tile, but that's got its own problems as well (fetching non-contiguous chunks of memory).
Actually I'm not sure now - were you referring to physical GPU tiles? In that case we're skirting the edge of talking hw-specific memory layouts, so I won't go there.
Fafalada said: It all depends on what you consider to be a "significant" improvement. The increased number of lerps (if the hw can only point sample) may or may not account for a significant portion of the shader in question.
Let's just say it can be significant, for certain kinds of postprocessing and certain definitions of significant.
And not all postprocessing involves averaging between input samples; it also comes down to picking out which workloads make sense where.
Fafalada said: It's at least worth considering splitting these workloads between the GPU and CPU; any postprocessing that is ROP- rather than shader-bound is potentially wasting the GPU's strongest resource (this applies to Xenos too).
No argument that it can be worth considering splitting the workloads. If you're bottlenecked in one place and you've got spare cycles somewhere else, it makes sense to rebalance by shifting some work around. With the EDRAM it's pretty hard to be ROP bound on the Xenos unless your shader is very short, but you do have to pay for the resolve, so there could be situations where you might want to move some work to the CPUs.
Fafalada said: Actually I'm not sure now - were you referring to physical GPU tiles? In that case we're skirting the edge of talking hw-specific memory layouts, so I won't go there.
I was thinking of physical GPU tiles, since on a fixed platform you would probably know the layout, but yeah, it's not something for public discussion.
Although for what it's worth, most GPUs I'm familiar with tile memory in pretty tiny sizes, so you could easily keep many of those in local memory at a time.
heliosphere said: Sure, you can bring in less than a full tile, but that's got its own problems as well (fetching non-contiguous chunks of memory). Anyway, the point was really that there's engineering effort and performance overhead involved in getting this to work on the SPEs, whereas it's trivial and efficient to do on the GPU. I don't know where you're getting the box filters thing from - any kind of filter with a kernel that spans more than one pixel is going to require information from neighbouring tiles.
For the most common filters, blurs, a fast pseudo-Gaussian will work on horizontal then vertical components, only needing a one-pixel lookahead, same as a fast box blur. This could be streamed ideally well for SPEs. Other traditional kernel filters like median would need the full kernel size, but I don't know how many of those you'd use in a game situation, and even then they're only a 3x3 kernel.

As I look at it, with some experience of Photoshop plugin development, I can't immediately see many occasions in a game post-processing situation where you'd need multiple image tiles stored at a time. Most would make do with a border of a pixel or a few. Well, I say most, but that's really those that can't be rendered as a two-pass X,Y linear one-pixel-lookahead filter. That'll give you most of that 256KB to fit the image tile to be processed.

Not sure how large a quad you could fit, though. Some filters would want z data, for example. I guess 32 bits for RGBA + 16 for z, maybe 6 bytes per pixel, so a 128x128 tile would fit nicely. All it would really need is a nice library to handle the tiling and fetching of data, and adding post-processing (or any image processing) on SPEs should be relatively easy.
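A sketch of that two-pass X,Y scheme, assuming clamp-at-edge sampling (sizes and names illustrative): a running-sum box blur over one line only ever needs a radius-wide lookahead, and running it over rows and then columns (repeatedly, if you want a pseudo-Gaussian) gives the separable blur described above.

```c
/* One line of a separable box blur as a running sum: add the incoming
 * pixel, drop the outgoing one. Apply to every row, then to every
 * column of the result; repeated passes approximate a Gaussian. */
static void box_blur_line(const float *src, float *dst, int n, int radius)
{
    int   span = 2 * radius + 1;
    float sum  = 0.0f;

    /* Prime the window around pixel 0, clamping at the line start. */
    for (int i = -radius; i <= radius; i++) {
        int j = i < 0 ? 0 : (i >= n ? n - 1 : i);
        sum += src[j];
    }
    for (int x = 0; x < n; x++) {
        dst[x] = sum / (float)span;
        /* Slide the window one pixel to the right. */
        int in  = x + radius + 1; if (in >= n) in = n - 1;
        int out = x - radius;     if (out < 0) out = 0;
        sum += src[in] - src[out];
    }
}
```

The lookahead is just the radius pixels past the current output position, which is why a tile plus a thin border is all the local store ever needs to hold.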
Titanio said: I figured it was worth its own thread
Oh, no problem. I found some interesting parts were left out of the above summary, so I'll put a more complete one in the other thread...
http://www.beyond3d.com/forum/showthread.php?t=29622
There seems to be more new info in this one - also the particle simulation and RSX/Cell synchronisation parts are new.