SM 3.0, yet again.

Xmas said:
DiGuru said:
Well, the last time we discussed this, we came to the conclusion that the dynamic branching of the NV6x00 is essentially your first method, with batches of about 1,000 pixels each. Only if all pixels in that area take the same path, and this can be determined by the driver, are the instructions skipped. Which doesn't sound very dynamic to me.
The driver couldn't do such a fine-grained decision in the shader pipeline. The hardware decides whether to run one or both branches.

Ok. But Tridam's results showed that the batches are expanded to cover the whole area that uses the shader if some pixels take the other branch, while there is still a penalty of 9 clocks for each branch instruction.

So, wouldn't it be better to use a single (linear) shader or multipassing if branches are actually taken? And if they aren't, why not use a different shader for that frame? That would save you at least 9 clocks per pixel.

I'm still trying to understand in what cases that would be useful.
 
Tridam said:
The size of the batch is huge, at least with current drivers (under 1024 quads, both branches are always computed). It seems to be at least 1024 quads, but appears to expand to the number of quads that take the same branch.
Batch size is only increased if more quads take the same branch.
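
To make that behaviour concrete, here is a toy C model of what Tridam's numbers describe. The 1024-quad batch size and the 9-clock branch overhead come from his tests; the cost function itself is just an illustration, not a claim about how the hardware is actually wired:

Code:
#include <stdio.h>

#define BATCH_QUADS 1024   /* minimum batch size per Tridam's tests  */
#define BRANCH_COST 9      /* clocks charged per branch instruction  */

/* Clocks a batch spends on one if/else in this toy model: a side is
   only skipped when every quad in the batch agrees on the condition;
   a divergent batch pays for both sides. */
int batch_branch_cost(const int takes_if[BATCH_QUADS],
                      int cost_if, int cost_else)
{
    int any_if = 0, any_else = 0;
    for (int i = 0; i < BATCH_QUADS; i++) {
        if (takes_if[i]) any_if = 1;
        else             any_else = 1;
    }
    if (any_if && any_else)           /* divergent: run both sides   */
        return BRANCH_COST + cost_if + cost_else;
    return BRANCH_COST + (any_if ? cost_if : cost_else);
}

int main(void)
{
    int coherent[BATCH_QUADS]  = {0};  /* every quad takes 'else'    */
    int divergent[BATCH_QUADS] = {0};
    divergent[0] = 1;                  /* a single quad disagrees    */
    printf("coherent batch:  %d clocks\n",
           batch_branch_cost(coherent, 20, 10));   /* 9 + 10 = 19   */
    printf("divergent batch: %d clocks\n",
           batch_branch_cost(divergent, 20, 10));  /* 9+20+10 = 39  */
    return 0;
}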
 
And then there's the demo that 99160 just posted that shows a dramatic performance improvement by implementing dynamic branching.

So it's clear that it is possible to extract a performance improvement through dynamic branching vs. the "compute all paths and choose the correct result" approach.

And it is further clear that any multipass technique one chooses to use will require passing the geometry to the video card multiple times, and thus will not run nearly as fast as either of the above solutions in a geometry-limited case.
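
For reference, here is roughly what the two shading approaches being compared look like in scalar C. shade_near and shade_far are hypothetical stand-ins for two expensive shading paths:

Code:
/* Hypothetical stand-ins for two expensive shading paths. */
static float shade_near(float x) { return x * 0.9f; }
static float shade_far(float x)  { return x * 0.1f; }

/* "Compute all paths and choose": both paths are always paid for,
   which is effectively what a divergent batch does. */
float flattened(float x, int is_near)
{
    float a = shade_near(x);   /* always evaluated */
    float b = shade_far(x);    /* always evaluated */
    return is_near ? a : b;
}

/* True dynamic branch: only the taken path executes, so coherent
   regions of the screen skip the other path's work entirely. */
float branched(float x, int is_near)
{
    return is_near ? shade_near(x) : shade_far(x);
}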
 
Given how often things are geometry limited in the first place (not particularly frequently!), it doesn't seem like there are that many occasions that are going to be geometry limited if the rendering requirements are asking for an extra pass.
 
DaveBaumann said:
Given how often things are geometry limited in the first place (not particularly frequently!), it doesn't seem like there are that many occasions that are going to be geometry limited if the rendering requirements are asking for an extra pass.

And Dave, the perennial tease, has changed his sig again. :p
 
DaveBaumann said:
Given how often things are geometry limited in the first place (not particularly frequently!), it doesn't seem like there are that many occasions that are going to be geometry limited if the rendering requirements are asking for an extra pass.
Except that extra passes increase the geometry limitations, since each pass is inherently shorter than one single pass.
 
Yes, but you must have reached some other limitations to require that extra pass, and in many circumstances those are likely to be more limiting than a geometry pass.
 
Like what? And why?

After all, here's a specific instance where you could get geometry-limited through multipassing very quickly:

Imagine Humus' stencil-based dynamic branching multipass demo that he did some time ago. The basic idea is that you don't render a pixel if it is more than some distance away from the light source. This rendering is done in two passes per light source (1. check for visibility; 2. render visible pixels).

Now, in this situation, if you are even getting close to geometry-limited in, say, a 4-light scene, this technique multiplies the required geometry rendered by a factor of eight, for approximately the same total pixel processing (as true dynamic branching), and therefore is rather likely to reduce performance.
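
(For concreteness, here is a rough OpenGL sketch of those two passes; bind_range_shader, bind_lighting_shader, and draw_scene are hypothetical stand-ins for the application's own code.)

Code:
#include <GL/gl.h>

/* Hypothetical application helpers. */
void bind_range_shader(int light);     /* discards out-of-range pixels */
void bind_lighting_shader(int light);  /* the expensive lighting path  */
void draw_scene(void);

/* Two passes per light, so a 4-light scene submits the geometry
   eight times instead of once. */
void light_pass(int light)
{
    /* Pass 1: visibility. A cheap shader kills fragments beyond the
       light's range; survivors write 1 into the stencil buffer. */
    glClear(GL_STENCIL_BUFFER_BIT);
    glEnable(GL_STENCIL_TEST);
    glStencilFunc(GL_ALWAYS, 1, 0xFF);
    glStencilOp(GL_KEEP, GL_KEEP, GL_REPLACE);
    glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
    bind_range_shader(light);
    draw_scene();

    /* Pass 2: lighting. The expensive shader runs only where the
       stencil was set; early stencil rejection skips the rest. */
    glStencilFunc(GL_EQUAL, 1, 0xFF);
    glStencilOp(GL_KEEP, GL_KEEP, GL_KEEP);
    glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
    glEnable(GL_BLEND);
    glBlendFunc(GL_ONE, GL_ONE);       /* accumulate lights additively */
    bind_lighting_shader(light);
    draw_scene();
    glDisable(GL_BLEND);
    glDisable(GL_STENCIL_TEST);
}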
 
It's not just extra geometry; it's bandwidth as well in many cases, since the later passes may have to use stencil/Z-test or blending to combine results.
 
Has anyone done any more testing in the last few months? I would really like to see the results of those tests. That might be better than speculating.
 
Yep, full reverse spin, I bet. Once ATI has SM3.0, I think all the SM3.0 naysaying will disappear, and all of a sudden a whole crop of "only possible with SM3.0" scenarios will appear. People who in the past were exclaiming that there was no big deal between SM2.0b and SM3.0 will suddenly be at the head of the bandwagon, especially if ATI's implementation performs better. For example, if ATI's dynamic branching performs better, then dynamic branching support will suddenly be an Achilles' heel, despite the fact that previously it wasn't, and the real-life scenarios where it was used were few and far between. Now, such support will be seen as *crucial*.

:popcorn mode engaged:

(who remembers how horrible it was to waste *two* MB slots and how terrible it was not to support small form factor pcs, until.....)
 
I know I won't, at least not unless ATI's solution manages to show something actually worthwhile pertaining to SM3's use, which nvidia certainly hasn't done yet.
 
ANova said:
I know I won't, at least not unless ATI's solution manages to show something actually worthwhile pertaining to SM3's use, which nvidia certainly hasn't done yet.

That is exactly what DemoCoder was talking about.

:)
 
Chalnoth said:
Like what? And why?

After all, here's a specific instance where you could get geometry-limited through multipassing very quickly:

Imagine Humus' stencil-based dynamic branching multipass demo that he did some time ago. The basic idea is that you don't render a pixel if it is more than some distance away from the light source. This rendering is done in two passes per light source (1. check for visibility; 2. render visible pixels).

Now, in this situation, if you are even getting close to geometry-limited in, say, a 4-light scene, this technique multiplies the required geometry rendered by a factor of eight, for approximately the same total pixel processing (as true dynamic branching), and therefore is rather likely to reduce performance.

A factor of 8 is quite a stretch. If you're limited by vertex fetch, then you can in some cases get near that. In normal situations, where the shader is of decent length and you're more limited by the vertex shader, you won't get anywhere near a factor of 8.

First of all, the visibility pass is very cheap; it's more or less just a transform. Compared to the lighting shader it's short. In my demo it's maybe half the instructions of the lighting shader, and for more advanced lighting the relative cost of the visibility shader goes down even further.

Also, it's not like you can pop my lighting vertex shader right into a ps3.0 dynamic branching case. The vertex shader needed for that case of course gets larger, since you still have to do the computations for all lights and pass them to the fragment shader. So it's not cutting the workload by 75%, but rather closer to 30%-50% or so. Realistically, we're not talking about a factor of 8 but something in the range of 2-3.
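
Putting rough numbers on that argument: the half-price visibility pass and the four lights are from the demo above; the 2x-3x size of the single-pass vertex shader is an assumption based on it having to compute interpolants for all lights.

Code:
#include <stdio.h>

int main(void)
{
    /* Costs in units of one single-light lighting-pass vertex shader. */
    float lighting_vs   = 1.0f;
    float visibility_vs = 0.5f;  /* cheap: more or less just transform */
    int   lights        = 4;

    /* Stencil multipass: 2 passes per light = 8 geometry submissions,
       but the visibility passes are half-price.  Total = 6.0. */
    float multipass = lights * (visibility_vs + lighting_vs);

    /* Assumed: a single ps3.0 pass needs a vertex shader 2x-3x the
       size of a one-light vertex shader (all lights in one pass). */
    printf("vertex work ratio: %.1fx to %.1fx (not 8x)\n",
           multipass / 3.0f, multipass / 2.0f);
    return 0;
}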
 
Dave Orton in May of 2004 said:
I think the main feature that people are looking at is the 3.0 shader model, and I think that's a valid question. What we felt was that, in order to really appeal to the developers who are shipping volume games in '04, Shader 2.0 would be the volume shader model of use. We do think it will be important down the road.

I think ATI has been pretty consistent in their message... it's the hallelujah chorus that at times has been off message.
 