DemoCoder said:
zeckensack said:
Speed isn't a question of SM2 or SM3. It doesn't matter, throughput-wise, whether you can do a "real" loop or have to unroll it umpteen times.
I beg to differ.
I admit that wasn't very accurate.
DemoCoder said:
First, with respect to loops, if flow control is used inside the loop, there could be a big difference throughput-wise.
I know dynamic branching has been regarded by some as an optimization technique. I don't take that for granted right now. hardware.fr's results weren't encouraging IMO. That may be driver-related, it may be because they used a branch condition that varied at too high a spatial frequency (they didn't disclose their actual shader code), but it may also be natural behaviour for a SIMD architecture. I'd rather wait until I can test it myself.
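To put a concrete face on the concern, this is the kind of shader pair I have in mind — just an HLSL sketch with made-up names (specMap, shadowMap and friends), not something I've compiled or benchmarked. Whether the ps_3_0 version actually saves anything should depend entirely on how coherent the NdotL > 0 condition is across neighbouring fragments:

    // Constants and samplers assumed to be set by the application.
    sampler2D specMap;
    sampler2D shadowMap;
    float4 ambient;

    // Stand-in for the "expensive" part of the shader.
    float4 expensiveTerm(float2 uv)
    {
        return tex2D(specMap, uv) * tex2D(shadowMap, uv).r;
    }

    // ps_3_0: a dynamic branch that can (in theory) skip the fetches
    // wherever the surface faces away from the light.
    float4 psLight_sm3(float2 uv : TEXCOORD0,
                       float3 N  : TEXCOORD1,
                       float3 L  : TEXCOORD2) : COLOR
    {
        float NdotL = dot(normalize(N), normalize(L));
        float4 c = ambient;
        if (NdotL > 0)                          // dynamic flow control
            c += NdotL * expensiveTerm(uv);
        return c;
    }

    // ps_2_0: no dynamic flow control, so the compiler flattens this;
    // the fetches and the multiply are paid for on every fragment.
    float4 psLight_sm2(float2 uv : TEXCOORD0,
                       float3 N  : TEXCOORD1,
                       float3 L  : TEXCOORD2) : COLOR
    {
        float NdotL = dot(normalize(N), normalize(L));
        return ambient + max(NdotL, 0) * expensiveTerm(uv);
    }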
Secondly, the non-unrolled loop may fit in an instruction cache whereas the unrolled one might not, and this could have throughput implications.
Right. But then, I don't know how much an instruction cache really matters on GPUs. These things have enormous local memory bandwidths at their disposal. If you go compute-heavy instead of texture-fetch-heavy, some of that should be available for instruction fetch.
*shrugs*
Really, I don't know.
Third, if you don't have enough slots to fully unroll, what then?
Multipass.
512 instructions is a lot of code for a single shader, though. Executing them all would push the NV40 "ultra"'s fillrate right back into S3 Virge territory.
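For the loop/unroll side of it, roughly what I mean, again as a hedged HLSL sketch with invented names: in ps_3_0 the iteration count can sit in a constant register, while a ps_2_0 profile has to unroll a fixed count, and whatever doesn't fit into the slot limit becomes another pass with additive blending:

    // Shared constants (values picked arbitrarily for the sketch).
    int       numTaps;      // set by the application at runtime
    float     texelSize;
    sampler2D baseMap;

    // ps_3_0: a "real" loop; the tap count lives in an integer constant register.
    float4 psBlur_sm3(float2 uv : TEXCOORD0) : COLOR
    {
        float4 sum = 0;
        for (int i = 0; i < numTaps; i++)
            sum += tex2D(baseMap, uv + float2(i * texelSize, 0));
        return sum / numTaps;
    }

    // ps_2_0: the tap count has to be fixed at compile time so the loop can be
    // fully unrolled; if the unrolled code blows the slot limit, the app draws
    // the geometry again for the remaining taps and accumulates via blending.
    #define TAPS 8
    float4 psBlur_sm2(float2 uv : TEXCOORD0) : COLOR
    {
        float4 sum = 0;
        for (int i = 0; i < TAPS; i++)          // unrolled by the compiler
            sum += tex2D(baseMap, uv + float2(i * texelSize, 0));
        return sum / TAPS;
    }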
Fourth, some features, like the aL register, will eat up throughput when they are emulated.
What's the aL register?
Finally, the other features of SM3.0 impose non-trivial throughput issues. How do you emulate vertex textures? Gradients? FP filtering and blending? Indexable constant registers? Predicates? (CMP eats up more slots.)
Point-sampled VTs aren't all that different from another vertex attribute stream. I mean, they're not the same, and they will require different code, but it's not impractical. Linearly filtered VTs would be a different story.
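Roughly like this — an HLSL sketch with hypothetical names, where the SM2 path assumes the application resamples the very same texture into an extra per-vertex float whenever it changes:

    sampler2D heightMap;
    float     displaceScale;
    float4x4  worldViewProj;

    // vs_3_0: point-sampled vertex texture fetch.
    float4 vsDisplace_sm3(float4 pos : POSITION, float2 uv : TEXCOORD0) : POSITION
    {
        float height = tex2Dlod(heightMap, float4(uv, 0, 0)).r;
        pos.y += height * displaceScale;
        return mul(pos, worldViewProj);
    }

    // vs_2_0: the very same value arrives as one more vertex attribute,
    // filled in on the CPU from the same texture.
    float4 vsDisplace_sm2(float4 pos : POSITION, float height : TEXCOORD1) : POSITION
    {
        pos.y += height * displaceScale;
        return mul(pos, worldViewProj);
    }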
Gradient: prepare an adequately sized luminance-only mipmapped texture where each mip level encodes a step between 0 and 1. Use a trilinear filter. Do a dependent fetch from that texture with whatever quantity you need a gradient for; the value you read back tells you which mip level the hardware picked, and hence how fast that quantity changes across the screen. You can do two fetches from a 1D luminance mipmap if you need separate x/y gradients.
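Spelled out as a sketch (HLSL, made-up names, sizes picked arbitrarily — back-of-the-envelope, not tested code):

    // "Gradient probe": a luminance texture whose mip level i is filled with
    // the constant value i / (LEVELS - 1). With trilinear filtering, a fetch
    // returns the LOD the hardware selected, scaled to [0,1], and that LOD is
    // roughly log2 of the texel footprint of whatever you used as the texcoord.
    #define LEVELS 9                     // 256x256 base level
    sampler2D gradProbe;                 // trilinear, mip levels filled as above

    float estimateDeriv(float2 q)        // q: the quantity you want a gradient for
    {
        float lod = tex2D(gradProbe, q).r * (LEVELS - 1);
        return exp2(lod) / 256.0;        // approx. magnitude of dq per pixel
    }
    // For separate per-component estimates, use a 1D probe and fetch twice:
    // once with q.x, once with q.y.

Note it only gives you a magnitude, not a signed derivative, which is all I'd expect out of a trick like this anyway.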
FP filter/blend: no way
Constant indexing: that's available in SM2 vertex shaders, no? Pretty near impossible in SM2 fragment shaders AFAICS. But what's wrong with using point-sampled 1D textures?
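Something along these lines, sketched in HLSL with hypothetical names:

    // SM2 stand-in for an indexable constant array: bake the table into a
    // point-sampled 1D texture, one texel per entry, and fetch by index.
    sampler1D tableTex;        // N entries, POINT filtering, CLAMP addressing
    float     numEntries;      // N, uploaded as an ordinary constant

    float4 tableLookup(float index)      // index computed per fragment, 0 .. N-1
    {
        float u = (index + 0.5) / numEntries;   // aim at texel centres
        return tex1D(tableTex, u);
    }

The obvious catch is precision: you only get whatever the texture format stores, so for general-purpose constants you'd probably want a floating-point texture format.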
Predicates: you got it.
I would say that speed is the major question of SM2 vs SM3. Contrary to the PR, it is not a question of what's visually possible to do. There was a SIGGRAPH paper that proved even OpenGL 1.0 was universal (i.e. it can compute anything, given enough passes). Anything you can render with SM3, you can render with SM2; the question is how fast it will run.
So the crucial question boils down to: how large (and important) is the class of algorithms that run more efficiently on SM3 than on SM2? We don't yet have a lot of information in this area.
I can agree with that. What I'm concerned about is that very often, "advanced" technology, as indicated by a higher version number, is understood as "faster" without much reflection. I was browsing the Rage3D boards a few hours ago, and it always takes a while to snap out of it.
With the no-pixel-shaders vs PS1.x debate, one could make a valid argument that you could collapse rendering passes and at least save bandwidth and geometry load, even if nothing else changed. That advantage has diminished with PS1.x vs PS2, and isn't there at all with PS2 vs PS3. If you spend dozens of cycles per fragment (which you can do with SM2), it plain doesn't matter anymore.
So we're really left with structural features. I don't know enough yet to make a blanket statement about what's going to happen. Branching, while offering a great opportunity to skip unneeded computation, also comes with a penalty. As for the other features you noted, however interesting and useful they may be, do you expect a shader to really be bound by their performance? Isn't the bulk of any shader still DOT3s and MADs? Serious questions, in case you wondered.
I'd really like to see performance figures taken with real-world models and real-world high-level shaders, run through the two compiler profiles, preferably on the same architecture (NV40). It may turn out that there's a 200% speedup in some cases. It may just as well be nothing, or worse.
Note that none of the above makes me want an NV40 any less. I guess it's just the sort of reservation against the latest fashion that naturally comes with old age 8)
edited bits: spelling, propose textures as a replacement for indexed constant storage.