When you only have ~1M pixels to texture (a 1280x720 framebuffer, with no overdraw) and your data source provides a precise list of how to texture every single pixel in a single pass (though texture filtering itself implies dynamic loops), it seems to me that texturing is fairly easy to do very efficiently.
You don't have to build a software cache in LS; you simply need to create a texel-fetch queue with enough lead time that the entire final phase of rendering is pipelined behind a start-up lag of, say, ~5000 cycles (at 3.2GHz). Obviously I'm just guessing, and glossing over the fact that texel coordinates need to be calculated before texels can be fetched.
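To make the idea concrete, here's a minimal sketch of such a fetch queue in C. Everything here is hypothetical: the queue depth, the 2x2 footprint, and the plain memcpy standing in for what would really be an mfc_get DMA into LS on an SPE.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch: a ring of in-flight texel fetches, issued far enough
 * ahead that by the time the shader loop consumes slot (p % QUEUE_DEPTH),
 * pixel p's texels have already "arrived". On Cell the fetch would be an
 * mfc_get DMA into local store; here memcpy stands in so it runs anywhere. */

#define QUEUE_DEPTH      16  /* lead time: fetches in flight ahead of use */
#define TEXELS_PER_FETCH 4   /* e.g. a 2x2 bilinear footprint */

typedef struct {
    uint32_t texels[TEXELS_PER_FETCH]; /* landed texel data */
} fetch_slot;

static fetch_slot queue[QUEUE_DEPTH];

/* Start the fetch for pixel p into slot (p % QUEUE_DEPTH). */
static void issue_fetch(const uint32_t *texture, size_t p)
{
    memcpy(queue[p % QUEUE_DEPTH].texels,
           &texture[p * TEXELS_PER_FETCH],
           sizeof(queue[0].texels));
}

/* Prime the queue (this is the start-up lag), then each iteration consumes
 * one landed slot and issues the fetch needed QUEUE_DEPTH pixels later, so
 * the "shader" never waits on a fetch it just asked for. */
uint64_t shade_all(const uint32_t *texture, size_t npixels)
{
    uint64_t checksum = 0;
    for (size_t p = 0; p < QUEUE_DEPTH && p < npixels; ++p)
        issue_fetch(texture, p);
    for (size_t p = 0; p < npixels; ++p) {
        const fetch_slot *s = &queue[p % QUEUE_DEPTH];
        for (int i = 0; i < TEXELS_PER_FETCH; ++i)
            checksum += s->texels[i];          /* filtering stand-in */
        if (p + QUEUE_DEPTH < npixels)
            issue_fetch(texture, p + QUEUE_DEPTH);
    }
    return checksum;
}
```

The point of the structure is that the only stall is the initial priming pass; after that, every fetch was requested QUEUE_DEPTH pixels ago.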
I'm sure Mintmaster could run the numbers for us, but 16x anisotropy for 2 blended textures per pixel (1M total) at 60fps would require xGB/s of bandwidth. Erm... I dunno, I'm useless at working that stuff out. Fingers crossed someone will work it out.
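For what it's worth, here's a worst-case back-of-envelope version of that sum, under assumptions I'm inventing (16 bilinear taps per 16x aniso sample, 4 texels per tap, 32-bit RGBA texels, and zero texel reuse between footprints):

```c
/* Worst-case (zero-reuse) texturing bandwidth for the scenario in the
 * post: 1280x720 pixels, 60fps, 2 blended textures, 16x anisotropy.
 * The per-tap numbers are assumptions, not from the original post. */
double worst_case_bytes_per_sec(void)
{
    const double pixels          = 1280.0 * 720.0; /* ~0.92M, the "1M" */
    const double fps             = 60.0;
    const double textures        = 2.0;   /* 2 blended textures */
    const double taps            = 16.0;  /* 16x anisotropy */
    const double texels_per_tap  = 4.0;   /* bilinear 2x2 footprint */
    const double bytes_per_texel = 4.0;   /* RGBA8 */

    return pixels * fps * textures * taps * texels_per_tap * bytes_per_texel;
}
```

That multiplies out to roughly 28 GB/s with no reuse at all; in practice neighbouring footprints share most of their texels, so the real figure should be a fraction of that.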
Anyway, the lack of overdraw is one of the key things that always helps with deferred lighting, and it'll definitely cut texturing bandwidth. The latency problem can be attacked simply by constructing a long shader execution pipeline: if the shader is arithmetic-intensive, there's a reasonable chance you've got some free latency hiding right there. After that, the predictable texel fetching should mean the shader execution pipeline can proceed without stalling.
Put simply: it's a streaming problem. Well, that's my guess, anyway.
Jawed