Dany - Shader 3.0 and Branching

That's what ATI wants you to think, for sure. We'll have to wait for the NV40 to see how well nVidia has handled it, though.
 
I'm not going to comment on anything but this thread.

What I stated above is a generalized description of shader branching issues, which is applicable to most/all VPUs/GPUs that implement DX9 level of gfx processing.
 
Chalnoth said:
That's what ATI wants you to think, for sure. We'll have to wait for the NV40 to see how well nVidia has handled it, though.

Oops, I'm in the trap. :oops:

But I think what sireric said is very reasonable. There's no doubt that nVIDIA has a very good engineering team; even after the NV30 fiasco I still think so. But they can't just do magic. I still have no idea how dynamic branching is implemented in the NV3X. Is there a test for it?
 
991060 said:
In a recent leaked pdf, ATi suggested not using dynamic branching before R5XX, does that mean all of the problems you mentioned will be resolved in R5XX?

Well, hell if I know, of course, ;) but logic dictates the following to me:

1) The problem will never be "completely solved" (ie: "free"). However, there will come a point where it starts to be "solved enough" to be useful.

2) With that in mind, I suspect ATI intends to target a base level of performance when branching, and won't release a part (R5xx) until they can do that.

3) ATI doubts that nVidia has "solved it enough" with the NV4x part.

If history is a guide, ATI seems to take a similar approach to designing features into chips that 3dfx did. (Simplifying, and in the general case:) "Implement a feature when it can be implemented in such a way that it's useful, performance-wise."

nVidia takes a different approach: implement the feature for the feature's sake, irrespective of whether or not it's useful in practice from a performance perspective.

I'm not passing judgement on either approach...as they each have pros and cons for consumers and developers alike. Just highlighting the historical difference, which may give us an indication of what NV4x, R4xx and R5xx have in store...
 
I don't really think you can sum the developments up like that, Joe, otherwise ATI would have just made a very fast DX8 processor now. IMO, ATI are very driven by what they consider to be inflection points at the moment, and the other factor is that their development cycle just isn't in step with NVIDIA's right now.
 
sireric said:
I'm not going to comment on anything but this thread.

What I stated above is a generalized description of shader branching issues, which is applicable to most/all VPUs/GPUs that implement DX9 level of gfx processing.

So I take that to mean that Longhorn (DX10) will/may solve some of those issues? Or is that still an unknown?
 
As far as branching/SM 3.0 is concerned, for me static conditionals are something I see being used a lot by game developers. The IHVs should make this fast and scale well (e.g. every 20th instruction is a static conditional, a new set of conditionals is sent down per object with, say, 200-pixel objects being the common case, and these state changes don't become horribly expensive, i.e. don't require a JIT recompile). In the overall scheme of things, I agree with almost everything sireric says specifically about branching.
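To illustrate the sort of thing I mean (just a sketch in HLSL-style syntax, with made-up names): the condition is a boolean constant the application sets once per object, so every pixel in the draw call takes the same path.

Code:
// Static conditional sketch: 'useDetailMap' is a shader constant set per
// object by the application, not something computed per pixel.
sampler baseMap;
sampler detailMap;
bool useDetailMap;

float4 main(float2 uv : TEXCOORD0) : COLOR
{
    float4 color = tex2D(baseMap, uv);
    if (useDetailMap)               // same direction for the whole object
        color.rgb *= tex2D(detailMap, uv * 8.0).rgb * 2.0;
    return color;
}

The ideal case would be that flipping that constant between objects costs next to nothing on the hardware side.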
 
DaveBaumann said:
I don't really think you can sum the developments up like that, Joe, otherwise ATI would have just made a very fast DX8 processor now.

Why? Are PS 2.0 shaders impractical, performance-wise, in the R300?

IMO, ATI are very driven by what they consider to be inflection points at the moment, and the other factor is that their development cycle just isn't in step with NVIDIA's right now.

That's certainly part of the consideration, no argument. (Two inflection points, for example, being DX9 base-level support and PCI-E.)

We could argue that another "inflection point" is the release of Doom3 and Half-Life2.

And likely, the next software inflection point will be Longhorn, and/or the "next" id or Epic engine...
 
Great thread! Now someone please explain why a texture fetch has such a high latency. Is it because the texture cache reads are costly or coz the interpolation and other calculations are costly or both?

How will this change when we have programmable texture filtering? Can the latency be more easily hidden in that case?
 
krychek said:
Great thread! Now someone please explain why a texture fetch has such a high latency. Is it because the texture cache reads are costly or coz the interpolation and other calculations are costly or both?

Interpolation is fast, probably a handful of cycles. But the memory reads are that costly. No amount of cache can cover the fact that you need to read in texture data from video memory frequently. And instead of doing speculative reads to have the data ready beforehand, the GPU designers prefer to have a long quad FIFO where quads are parked until the texture data requested is available (almost) for certain, whether it's in the cache or not ("design for cache miss"). The texture cache is there to save bandwidth, not latency.
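To put rough, made-up numbers on that: if a miss to video memory costs somewhere around 200 cycles and the texture unit wants to accept a new quad every cycle, the FIFO has to be able to park roughly

200 cycles / (1 cycle per quad) = ~200 quads

before the first result comes back, or the unit stalls. The exact figures are only illustrative, but that's why the buffering is so deep and why the pipeline is designed around the miss case rather than the hit case.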
 
Joe DeFuria said:
DaveBaumann said:
I don't really think you can sum the developments up like that, Joe, otherwise ATI would have just made a very fast DX8 processor now.

Why? Are PS 2.0 shaders impractical, performance-wise, in the R300?

Some are, and there are some features which are not practical to use everywhere. PS2.0 has resource limits that, if you even get close to them, will quickly slow you down a lot. ATI supports much longer shaders with the F-Buffer and PS_2_b. Do you think 100+ instruction shaders are "practical"? Do you think procedural wood, marble, et al. shaders are practical in a game engine?

On the other hand, do you think FP16 blending in PS3.0 is totally impractical? Do you think there will be no circumstances in which dynamic branches will be a big win?

I love this 3dfx/ATI spin doctoring. They don't release any new features until they can run them very fast. I guess that explains the lack of multisampling on the 8500, right? ATI doesn't support SM3.0 purely because it isn't practical, not because of development schedules?

I think sireric makes some good points, but I think some of the issues he has brought up are strawmen, such as texture fetches in branches. I looked at a large library of RenderMan shaders, and I could not find any shaders which had branch-dependent texture lookups. A guy walks into a doctor's office and says "Doctor, it hurts when I do that." The doctor says "Well, don't do that." The same goes for gradient instructions and texture fetches inside branches. I have not seen it explained why a dynamic branch with 20 ALU ops in it won't be a performance win vs. executing 40 ALU ops.
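E.g. something like this (just a sketch, made-up names): if the light can't reach the pixel, the dynamic branch simply skips the lighting math instead of computing it and throwing it away.

Code:
// Dynamic branch sketch: the condition varies per pixel, but for most
// surfaces whole regions take the same path, so the skipped ALU work
// ought to be a win rather than a penalty.
sampler baseMap;
float3 lightPos;
float3 lightColor;
float  lightRadius;

float4 main(float2 uv       : TEXCOORD0,
            float3 worldPos : TEXCOORD1,
            float3 normal   : TEXCOORD2) : COLOR
{
    float4 color   = tex2D(baseMap, uv);
    float3 toLight = lightPos - worldPos;

    if (dot(toLight, toLight) < lightRadius * lightRadius)
    {
        // the "20 ALU ops" live here and only run where they matter
        float3 L = normalize(toLight);
        float  d = saturate(dot(normalize(normal), L));
        color.rgb += color.rgb * d * lightColor;
    }
    return color;
}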
 
I think the performance scenario is almost (if not totally) implementation dependent. Since the two branches are independent, you have to allocate 2X the resources. But if you write it without branching, there's a chance to organize your code so it consumes fewer resources (e.g. 1.5X). So the result is that if you have a big enough register file, dynamic branching wins. Otherwise, non-branched code may have a chance.

And it's not necessarily the texture reads in branches that slow down the pipeline. The size of the register file is fixed; more resources allocated to each thread means there are fewer active threads in the pipeline, and this COULD translate to a slowdown. I think it's all about how to distribute processing power AND resources efficiently. There's no golden rule to follow; you'd have to test different approaches.
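As a made-up example: say a pipe's register file can hold 256 temporary values and a shader needs 4 per pixel, then

256 / 4 = 64 pixels in flight, versus 256 / 8 = 32 if the branchy version needs 8,

so doubling the per-pixel register usage halves the amount of texture latency you can cover before the pipeline starts stalling. The numbers are invented, but that's the trade-off I mean.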
 
991060 said:
In a recent leaked pdf, ATi suggested not using dynamic branching before R5XX, does that mean all of the problems you mentioned will be resolved in R5XX? Since I don't see there's any fundamental solution for the fighting between long shaders and limited resources, if possible, I'd like to hear how you guys at ATi find a way around. ;)

My (perfectly uneducated) guess would be that the R500 architecture will have significantly better resource management (resulting from the unified shaders), which would help it avoid suffering such drastic performance wastage as sireric talked about.
 
DemoCoder said:
If anything, unified shaders and more complex resource management could amplify problems, not solve them.

Could. Unless you effectively analyse the issues and build accordingly.
 
Well, the idea that ATI is waiting until the R500 for branching support because only with that chip will the performance be high enough is just FUD. Wait until the NV40 gets here. See if the branching support in the NV40 can be used for a performance improvement, or at least for making programming of certain things easier with no performance hit (e.g. multiple lights; a sketch of what I mean is at the end of this post). If so, then ATI's statements in that leaked pdf are misleading.

We've heard this, "we'll have it when the performance is there," or, "we'll have it when the games are there," stuff before. It's attitudes like that that hold technology back.
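For the multiple-lights case I mean something like this (a rough sketch, names made up): one ps_3_0 shader with a dynamic loop count, instead of compiling and managing a separate shader for every light count.

Code:
// Dynamic loop sketch: 'numLights' is an integer constant set per draw call,
// so one shader covers any number of lights up to the array size.
sampler baseMap;
int    numLights;
float3 lightPos[8];
float3 lightColor[8];

float4 main(float2 uv       : TEXCOORD0,
            float3 worldPos : TEXCOORD1,
            float3 normal   : TEXCOORD2) : COLOR
{
    float3 N       = normalize(normal);
    float3 diffuse = 0;

    for (int i = 0; i < numLights; i++)
    {
        float3 L = normalize(lightPos[i] - worldPos);
        diffuse += saturate(dot(N, L)) * lightColor[i];
    }
    return tex2D(baseMap, uv) * float4(diffuse, 1.0);
}

Even if that runs no faster than the unrolled versions, not having to juggle N shader permutations is already worth something.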
 
DaveBaumann said:
DemoCoder said:
If anything, unified shaders and more complex resource management could amplify problems, not solve them.

Could. Unless you effectively analyse the issues and build accordingly.

How is that different from any other issue? I could say "PS3.0 branching could amplify performance issues, unless you effectively analyse the issues and build accordingly." So why am I to believe that the R500 will do it correctly, but the NV40 won't?

As sireric said, these are general problems that apply to any GPU or CPU architecture. The idea that DX-Next enables you to solve the problem, which, apparently, is supposed to be "hard" under PS3.0 sounds bass-ackwards. The unification of the shaders into a pool of units to be used per-vertex or per-fragment seems only to amplify the problem with non-determinism of branches. Having more complex resource management is bound to increase latency, and eat into gate budgets which could be used for other things.

Furthermore, DX9 does not stop you from implementing the "pool of unified units" strategy that is predicted for DX-Next. Similar solutions could be done for DX9 today, and 3dLabs has already demonstrated this for integer units.
 
DemoCoder said:
How is that different from any other issue? I could say "PS3.0 branching could amplify performance issues, unless you effectively analyse the issues and build accordingly." So why am I to believe that the R500 will do it correctly, but the NV40 won't?

I'm not saying that it will be better, but your assertion is just as baseless - we don't know how they are implementing them yet.

However, I'm fairly sure that NV40's VS branching will be better than the PS branching - at least a unified shader allows branch (prediction) logic to be equally good (or bad) for all operations, and potentially enables you to dedicate more die to a single unit as opposed to separate logic for both pixel and vertex shaders.

[Edit] I don't think anyone is suggesting that DX Next will be any kind of panacea for these things either; it just represents a point where it begins to look very beneficial to unify the shaders, as the models are the same. DX Next parts are still going to be dealing with all previous shader models, and reading between the lines of the Huddy presentation it seems like the R500 will fill the role of ATI's shader 3.0 part and be unified.
 
DaveBaumann said:
However, I'm fairly sure that NV40's VS branching will be better than the PS branching - at least a unified shader allows branch (prediction) logic to be equally good (or bad) for all operations, and potentially enables you to dedicate more die to a single unit as opposed to separate logic for both pixel and vertex shaders.

The primary difference here, of course, being that pixel shaders are optimized for texture accesses, whereas vertex shaders are not. In a unified shader you will still have this problem, as texture addressing will always be hard on branching, and the deeper pipelining that texture addressing requires will hurt vertex shader branching. Of course, I'm sure that by that time architectures will have more rigorous optimizations for branching, but I don't think unified pipes will help pixel shader branching at all.
 