Vertex Shaders, Unified Shaders, workload...

swaaye

One thing we don't seem to hear asked much is how important vertex shaders are. With unified shader architectures upon us, I've been wondering how variable load balancing will help in current and future games.

I've seen charts from ATI and NV showing the typical workloads in games, and they show VS having much lower usage than PS. All the way back to the X800 series, chips were being equipped with 6 VS @ 500 MHz or so. Even RV410 had 6 VS. Were these units really being fully utilized? Or was the separation of duties in older GPUs really very inefficient, leaving these units somewhat idle while the pixel shaders were screaming away in the profiled games?

Now, with these unified designs, vertex shading can be as much or as little of the chip's task as necessary.

I'm particularly fascinated by G80's performance in Oblivion. I had the impression that the game engine was just inefficient and that faster cards wouldn't help much (à la EQ2). We certainly saw that with the X1900 vs. the X800 and 7800/7900 cards. But G80 really thrives with the game, and I wonder why exactly it eclipses older generations so much. Is it the extra geometry processing power of G80 being used, or the incredible shading capability of the chip, that helps this demanding game? After all, G80 potentially has far more "VS" resources than any previous GPU, assuming the chip gets "partitioned" that way (by the driver?). Of course, maybe it's just G80's vastly higher pixel shader capability; Oblivion runs a PS on just about everything in the game, after all.
 
It's a bit difficult to judge the loading of individual units of a GPU without some serious instrumentation. Not only that, but GPUs (as well as their more conventional CPU-based driver/runtime architectures) are complex beasts - one seemingly unrelated stall or inefficiency might screw up timing elsewhere and thus lead to a false positive as to where the bottleneck really is.

As for the original comment about VS's - think about it: even in a scene defined by millions of vertices there are still going to be an order of magnitude more pixels to rasterize (don't forget overdraw ;)). Consequently it makes perfect sense in a non-unified architecture to bias the number of fixed VS/PS units according to the ratio of expected load.

Except for complex animation, the VS is often relegated to setting up inputs (e.g. transforming vectors to tangent space) for the PS - which is where the real work is done. That doubles up the ratio effectively - not only are there more pixels, but there is a higher percentage of work being done for each of those pixels...
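
To make that concrete, here's a minimal sketch of the kind of setup work I mean (plain Python/NumPy rather than real shader code, with made-up per-vertex values, purely illustrative): the vertex stage rotates a world-space light direction into tangent space so the pixel stage is left with only cheap per-pixel math against the normal map.

```python
import numpy as np

def to_tangent_space(vec_world, tangent, bitangent, normal):
    """Project a world-space vector onto the vertex's tangent-space basis."""
    # Rows are the (unit, orthogonal) basis vectors expressed in world space,
    # so multiplying by this matrix gives the vector's tangent-space components.
    tbn = np.array([tangent, bitangent, normal])
    return tbn @ vec_world

# Hypothetical per-vertex attributes and a light direction.
tangent   = np.array([1.0, 0.0, 0.0])
bitangent = np.array([0.0, 1.0, 0.0])
normal    = np.array([0.0, 0.0, 1.0])
light_dir = np.array([0.0, 0.7071, 0.7071])

print(to_tangent_space(light_dir, tangent, bitangent, normal))
```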

It will be interesting to see where unified architectures take things - much of my job as a graphics programmer is detecting bottlenecks (or, even better, predicting where they'll be and designing to avoid them from the outset!). On a GPU that can do decent runtime load-balancing without any interaction from the application programmer, it becomes much more difficult to guesstimate, measure or even solve bottlenecks...

Jack
 
I've seen charts from ATI and NV showing the typical workloads in games, and they show VS having much lower usage than PS. All the way back to the X800 series, chips were being equipped with 6 VS @ 500 MHz or so. Even RV410 had 6 VS. Were these units really being fully utilized? Or was the separation of duties in older GPUs really very inefficient, leaving these units somewhat idle while the pixel shaders were screaming away in the profiled games?
That's simply how the workload is. You either use the VS at near peak or the PS at near peak (provided there's no other issue like bandwidth or setup), but rarely both at the same time.

Think about the ratio of pixel cycles to vertex cycles for any triangle sent to the GPU. That ratio can easily span 4-5 orders of magnitude for visible triangles, and will be zero for culled/clipped triangles. Only one ratio will keep both the VS and PS at full throttle simultaneously. Even if you take a running average over 50 triangles, you'll still get a very wide distribution of ratios. Imagine a histogram of triangle counts for different ratios: only a sliver of triangles in that plot will have the ideal ratio.

That's why adding flops from the PS and VS is pretty useless for determining graphical power. 99% of the time, only one of the two is meaningfully utilized.
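
If you want a feel for how lopsided that gets, here's a rough back-of-the-envelope simulation (the unit counts, cycle costs and the log-uniform pixel spread are all made up, purely illustrative):

```python
# Draw pixels-per-triangle from a wide (log-uniform) spread and count how
# rarely a fixed VS:PS split keeps both unit types busy at the same time.
import random

random.seed(0)

VS_UNITS, PS_UNITS = 6, 16      # hypothetical non-unified split
VS_CYCLES_PER_VERTEX = 4        # e.g. a short transform shader
PS_CYCLES_PER_PIXEL = 8         # e.g. a modest pixel shader

trials, both_busy = 100_000, 0
for _ in range(trials):
    pixels = 10 ** random.uniform(0, 5)             # spans 5 orders of magnitude
    vs_time = 3 * VS_CYCLES_PER_VERTEX / VS_UNITS   # 3 vertices per triangle
    ps_time = pixels * PS_CYCLES_PER_PIXEL / PS_UNITS
    # Count the triangle only if neither side is idle more than ~10% of the time.
    if min(vs_time, ps_time) / max(vs_time, ps_time) > 0.9:
        both_busy += 1

print(f"Triangles keeping both VS and PS near full throttle: "
      f"{100 * both_busy / trials:.2f}%")
```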
 
Well considering that, we should be able to say that pre-unified GPUs were hardly optimally utilized. A unified GPU should therefore show even larger performance gains than one would perhaps expect because it should, in theory, be able to be fully utilized, right?
 
Well considering that, we should be able to say that pre-unified GPUs were hardly optimally utilized.
Yes, this sounds like a fair statement. As I hinted at before, much of the time and effort of graphics programmers (probably even more so for the driver/compiler writers) is spent trying to hide this sort of inefficient utilization. At the micro-level, the scheduling of arithmetic operations during texture fetches is an example - attempting to keep both kinds of units as busy as possible at all times.

A unified GPU should therefore show even larger performance gains than one would perhaps expect because it should, in theory, be able to be fully utilized, right?
Yes, but don't forget that a unified architecture potentially brings additional performance problems.

If you have 96 or 128 units (whatever the 8800s have) then you need a scheduler that can handle high-precision, low-latency load-balancing. It is, without a doubt, a non-trivial problem.

If you look into multi-programming in a more theoretical CompSci context you'll see there are lots of papers on how more processing units do not automatically equate to higher performance - I forget the name, but there is a law that covers the way a scheduler introduces overhead. I think (not sure) the same law also factors in process inter-dependency, which will be lower in GPUs, but the point still stands.

The more units you have, the more a scheduler potentially has to do; and the slower (or less efficient) it is at doing that job, the less utilized the actual processing units will be.

Therefore, given that D3D10 makes performance rather than features the fighting ground, I'm expecting the schedulers to become more and more important as the generation(s) progress.


Cheers,
Jack
 
I forget the name, but there is a law that covers the way that a scheduler introduces overhead.
It was Amdahl's Law I was thinking of:
Amdahl's law, named after computer architect Gene Amdahl, is used to find the maximum expected improvement to an overall system when only part of the system is improved. It is often used in parallel computing to predict the theoretical maximum speedup using multiple processors.
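
Plugging in some arbitrary numbers shows the shape of it - whatever fraction of the work stays serial (scheduling, setup, etc.) quickly caps what extra units can buy you. The 95% figure here is just an example, not a measurement of any real GPU:

```python
# Amdahl's Law: speedup = 1 / ((1 - P) + P / N), where P is the fraction of
# the work that parallelizes and N is the number of processing units.
def amdahl_speedup(parallel_fraction, n_units):
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n_units)

for n in (8, 16, 32, 64, 128):
    print(f"{n:3d} units -> {amdahl_speedup(0.95, n):4.1f}x speedup")
# Even with 95% of the work perfectly parallel, 128 units only get you ~17x.
```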

Jack
 
Well considering that, we should be able to say that pre-unified GPUs were hardly optimally utilized. A unified GPU should therefore show even larger performance gains than one would perhaps expect because it should, in theory, be able to be fully utilized, right?
Yeah, but the other thing to consider is that vertex shaders are cheap as long as you don't do vertex texturing (or, if you do, you don't bother hiding latency). Furthermore, although a few games are exceptions, having fewer than half as many VS units as PS units still didn't make the VS the limiting factor for more than, say, 20% of the time. Finally, if you didn't care too much about dynamic branching in your pixel shader, then designing a unified architecture would make your shader units more expensive, because they have to worry about switching so frequently and juggling different loads (i.e. the scheduler).

So even if the previous strategy was inefficient from a utilization point of view, it probably was most efficient in terms of perf/mm² (and thus perf/cost, the defining metric of competitiveness) given the workloads of the past.

Once you set the design goal that you want fast dynamic branching and fast vertex texturing (because you foresee workloads needing them during the lifetime of your product), the cost advantages of the previous strategy (smaller PS and much smaller VS) mostly disappear. On top of that, the efficiency gain of a unified architecture is more than enough to compensate for any additional routing cost.

It all makes sense. What didn't make sense was why ATI made such a huge DB investment in R5xx's die size without reaping the benefits of a unified architecture, even though they'd already designed it in Xenos.
 
What didn't make sense was why ATI made such a huge DB investment in R5xx's die size without reaping the benefits of a unified architecture, even though they'd already designed it in Xenos.

Hmm. Maybe because the impetus for unified was driven by Microsoft, making its way into nVidia's mindset right around the time of XBox, and bankrolled by Microsoft at ATI for XBox360? Is it mere coincidence that Microsoft walks away from nVidia, who they knew were building a unified design, hooks up with ATI, and *pays* ATI to basically build the same thing?

[btw, IF that's true, there must have been some pretty dramatic moments. Maybe even Eisner+Jobs dramatic....]

"Unified" strikes me as more a software design-driven concept, while "scalar" strikes me as more a hardware design-driven concept. Perhaps that's just nVidia pre-unified spin affecting me, though.

/me stops inhaling now ;-)
 
This question sort of pertains to this topic, so this is where I'm asking it.

The old way of calculating the theoretical number of vertices per second in millions was:

#VS x clspd x 0.25

So for the X800XT:

6 x 500 x 0.25 = 750

This formula worked on NVIDIA hardware as well, prior to the G80. It was dependent on the fact that both IHVs used vertex shaders that had the same level of OP performance, Vec4 + Scalar.

Actually, in the middle of typing this I might have found the answer to my own question of how to calculate theoretical OPs for the G80. For every Vec4 operation old GPUs used to do, it takes 4 of the G80's scalar ALUs. Add another scalar for the fifth MADD operation of previous GPUs, and you have five of the G80's 128 shaders equaling the vertex performance of one vertex shader in previous GPU architectures. 128 divided by 5 = 25.6.

25.6 x 1350 x 0.25 = 8,640

Sounds reasonable to me.
 
Add another scalar for the fifth MADD operation of previous GPUs, and you have five of the G80's 128 shaders equaling the vertex performance of one vertex shader in previous GPU architectures.
What? Why 5? Is that "5th" scalar MAD actually doing anything?

Typically, the performance numbers are quoted assuming one modelview-projection transform matrix applied per vertex, with no other operations. That's either 4 DP4s or 4 Vec4 MADs. Either way, that translates to 16 scalar ops.

That means that the max vertex transform rate, using the old measure, would be:

128 SPs * 1350 MHz / 16 ops == 10.8 G vertices/sec.

Of course, you start hitting other limits first, like the Setup reject rate of 1 triangle/clock.
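
Spelled out as a quick calculation, using the figures quoted in this thread (and ignoring setup and other limits, as noted):

```python
# Old-style fixed-VS estimate vs. the G80 scalar estimate, both in millions
# of vertices per second. Figures are the ones quoted in the posts above.
def fixed_vs_rate(vs_units, clock_mhz, verts_per_clock_per_unit=0.25):
    """#VS x clock x 0.25: one 4-instruction transform every 4 clocks per unit."""
    return vs_units * clock_mhz * verts_per_clock_per_unit

def scalar_vs_rate(scalar_alus, clock_mhz, scalar_instrs_per_vertex=16):
    """A 4x4 transform is 16 scalar instructions spread across the scalar ALUs."""
    return scalar_alus * clock_mhz / scalar_instrs_per_vertex

print(fixed_vs_rate(6, 500))       # X800 XT: 750.0 Mverts/s
print(scalar_vs_rate(128, 1350))   # G80:     10800.0 Mverts/s, i.e. 10.8 G/s
```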
 
What? Why 5?
As I said, the vertex shaders of yesteryear are Vec4 + Scalar.


Is that "5th" scalar MAD actually doing anything?

Typically, the performance numbers are quoted assuming one modelview-projection transform matrix applied per vertex, with no other operations. That's either 4 DP4s or 4 Vec4 MADs. Either way, that translates to 16 scalar ops.

That means that the max vertex transform rate, using the old measure, would be:

128 SPs * 1350 MHz / 16 ops == 10.8 G vertices/sec.

Of course, you start hitting other limits first, like the Setup reject rate of 1 triangle/clock.
Yeah, setup limits and all those other side issues are irrelevant when just calculating raw theoretical numbers - numbers which mean nothing to most people, but I like keeping track of them.

http://www.misfitisland.us/Rep/gcdb.htm

Anyway, MADD is two flops, correct? So if it's 4 vec4, it would be 32 ops, no?

Anyway, each vertex shader in previous architectures isn't capable of four Vec4s per cycle, as far as I'm aware.

http://www.beyond3d.com/reviews/ati/r580/index.php?p=03

They have a little chart on that page, and under VS ALU you've got 8 vertex units, 2 flops each (because it's in MADD ops) and five components (Vec4 + Scalar), to get the final per-cycle number.

All that means is that it takes five scalar ALUs in the G80 to make an even playing field between each of the vertex units in previous architectures.
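
For what it's worth, the per-clock FLOP counting that chart uses boils down to this (my restatement of it, so treat it as a sketch, not anything from the chart itself):

```python
# Peak VS flops per clock under the Vec4 + Scalar assumption: each unit issues
# 5 MADD lanes per clock, and a MADD counts as 2 flops.
def vs_flops_per_clock(vs_units, components=5, flops_per_madd=2):
    return vs_units * components * flops_per_madd

print(vs_flops_per_clock(8))   # R580-style, 8 vertex units -> 80 flops/clock
print(vs_flops_per_clock(6))   # X800 XT-style, 6 vertex units -> 60 flops/clock
```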
 
As I said, the vertex shaders of yesteryear are Vec4 + Scalar.
Yes, but the scalar part of the shaders of yesteryear is not used to compute basic matrix multiplies. Thus, it's not used (i.e. completely idle hardware) when computing the max vertex transform rate.

Either you count it everywhere, or you count it nowhere. But giving G80 a "penalty" for no apparent reason doesn't strike me as very fair.

MADD is two flops, correct? So if it's 4 vec4, it would be 32 ops, no?
I should have written "instruction" where I wrote "op" above. It's 16 instructions, some of which are MADs and some are MULs.

Anyway, each vertex shader in previous architectures isn't capable of four Vec4s per cycle, as far as I'm aware.
Indeed, that's where that "0.25" factor you used came from: 4 clocks for 4 instructions, thus 0.25 vertices/clock/unit.


All that means is that it takes five scalar ALUs in the G80 to make an even playing field between each of the vertex units in previous architectures.
Yeah, that SFU on G80 is just sitting there idle all the time I suppose?
 
Yes, but the scalar part of the shaders of yesteryear is not used to compute basic matrix multiplies. Thus, it's not used (i.e. completely idle hardware) when computing the max vertex transform rate.

Either you count it everywhere, or you count it nowhere. But giving G80 a "penalty" for no apparent reason doesn't strike me as very fair.
Eh, I'm just trying to compare this odd new architecture to previous ones in a way that's as even as possible. If that scalar's there, it's still there in older architectures whether it's ever used or not. But hey, if you say it's not used to figure the raw theoretical vertex rate, then I have no reason to doubt you.

I should have written "instruction" where I wrote "op" above. It's 16 instructions, some of which are MADs and some are MULs.

Indeed, that's where that "0.25" factor you used came from: 4 clocks for 4 instructions, thus 0.25 vertices/clock/unit.
Ah, OK, it all makes sense now. Yeah, I pulled that 0.25 out of my hide when I went through Beyond3D's tables and tried to reverse-engineer the number they got for vertices as it related to vertex shaders and clock speed.

Yeah, that SFU on G80 is just sitting there idle all the time I suppose?
No idea what you're referring to.
 
Reputator said:
No idea what you're referring to.
This:
All that means is that it takes five scalar ALUs in the G80 to make an even playing field between each of the vertex units in previous architectures.

Like I said: Either you count the extra scalar op on "older" architectures and count the SFU on G80, or you don't count either one. Mix-and-matching just doesn't work.
 