PCGH - Pixelshader-Shootout: RV770 vs. GT200

MAD They're bloody furious....

But on a more serious note: with Nvidia allegedly helping to get PhysX running on ATI hardware, if it did run, would ATI be kicking some serious ass?

No. It'd signify less urgency for a unified standard.

From a tactical point of view, ATI definitely doesn't want PhysX/CUDA. They'll just be... throwing knives (whatever they want) until OpenCL comes, and hopefully the playing field is levelled.

Once the API is the same, even with the so-called "CUDA"ness of OpenCL and DX11 Compute, I'm pretty sure that the post-RV770 chips can make it back in terms of brute compute density.
 
Eh?

Jawed

Even accounting for the inherent inefficiencies of VLIW and batch sizes, I still see a lot more ALU power squeezed inside an RV730, relative to a die-size-adjusted (scaled) G96C (also made on 55nm). You could do the same for G92b vs RV770, and the gap is still big (despite the non-ALU unit differences, which I'm aware of). I'm starting to doubt that nVidia sticking to "G80"+++, which is relatively quite light on ALU/compute density, is the right choice.

The other route they could go would be even higher clock domains, and the G92b 9800GTX+ does have rather impressive compute power, relatively speaking, at the expense of perf/watt.


You mean, like the "CUDA"ness of Brooks+? ;)

:LOL: That goes along with the "Barcelona"ness of Bloomfield, I swear!
Well, maybe API maturity also weighs in on this one. Brook+ still breaks with drivers IIRC, something that is definitely a no-no.
 
Even accounting for the inherent inefficiencies of VLIW
And not forgetting the massive efficiencies gained in simplicity of scheduling/sequencing in hardware...

and batch sizes,
The factor of 2 between them looks like it's set for at least a couple of years, I'd say. I'm still waiting to see any decent test of dynamic branching, i.e. some non-trivial nested branching, i.e. with incoherency at each level. Though perhaps there is an argument saying that no sane games developer would deploy such a shader, at least not any time soon...
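A rough back-of-envelope model of what such a test would measure (my own sketch, assuming independent per-lane branch decisions with take-probability p, and that each divergent nesting level roughly doubles the work):

```python
# Toy model of SIMD branch divergence cost. A batch diverges at a branch if
# both paths are taken by at least one lane; with per-lane take-probability p,
# that happens with probability 1 - p^n - (1-p)^n for a batch of n lanes.
# Each divergent nesting level roughly doubles the work executed.

def divergence_prob(batch_size: int, p: float) -> float:
    """Probability that a batch of `batch_size` lanes takes both paths."""
    return 1.0 - p**batch_size - (1.0 - p)**batch_size

def expected_cost_factor(batch_size: int, p: float, depth: int) -> float:
    """Approximate work multiplier for `depth` independent nested branches."""
    return (1.0 + divergence_prob(batch_size, p)) ** depth

# Fully incoherent branches (p = 0.5): both batch sizes diverge almost surely,
# so 32-wide and 64-wide batches pay nearly the same penalty.
print(divergence_prob(32, 0.5))   # ~1.0
print(divergence_prob(64, 0.5))   # ~1.0

# With mostly-coherent branches (p = 0.02), the smaller batch pulls ahead:
print(expected_cost_factor(32, 0.02, 3))
print(expected_cost_factor(64, 0.02, 3))
```

Which is roughly why the factor of 2 in batch size only matters when the branching is partially coherent - a shader with fully incoherent nested branches hammers both architectures about equally.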

I still see a lot more ALU power squeezed inside an RV730, relative to a die-size-adjusted (scaled) G96C (also made on 55nm). You could do the same for G92b vs RV770, and the gap is still big (despite the non-ALU unit differences, which I'm aware of).
G92(b) is bandwidth starved. In the majority of games (since they aren't significantly ALU-limited), with ~40% more bandwidth, I imagine it would come surprisingly close to HD4870.

I'm starting to doubt that nVidia sticking to "G80"+++, which is relatively quite light on ALU/compute density, is the right choice.
The "scalar efficiency" argument is a sham. NVidia has spent shedloads of die space on scheduling/sequencing seemingly with the aim of keeping the register file/number of batches in flight "small". Controlling the ALUs is costly because it's doing fine-grained dependency analysis per instruction, in real time. They've ended-up with an NV40-like "unbalanced" ALU configuration. The obvious suggestion is to go to MAD+MAD to simplify compilation and increase throughput, but this also demands significantly more register file bandwidth so may not happen...
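The register bandwidth point is easy to put in numbers. This is just my own back-of-envelope arithmetic using the standard operand counts (MAD = a*b + c reads 3 and writes 1; MUL reads 2 and writes 1), not anything NVidia has published:

```python
# Register file traffic per lane per clock for the two ALU configurations
# discussed above: the current MAD + MUL co-issue vs a hypothetical MAD + MAD.
OPERANDS = {"MAD": (3, 1), "MUL": (2, 1)}  # (reads, writes) per instruction

def rf_traffic(config):
    """Total register reads and writes per lane per clock for a co-issue pair."""
    reads = sum(OPERANDS[op][0] for op in config)
    writes = sum(OPERANDS[op][1] for op in config)
    return reads, writes

mad_mul = rf_traffic(["MAD", "MUL"])   # current-style configuration
mad_mad = rf_traffic(["MAD", "MAD"])   # hypothetical MAD + MAD

print(mad_mul)  # (5, 2)
print(mad_mad)  # (6, 2)
print(f"extra read bandwidth needed: {mad_mad[0] / mad_mul[0] - 1:.0%}")  # 20%
```

So a MAD+MAD configuration needs 20% more operand read bandwidth per lane per clock, before you even worry about bank conflicts.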

Intel's alternative appears to be a true scalar configuration because it has no ALU dedicated to transcendentals (no definitive statement on this, though). While it's more expensive to control than VLIW of the same throughput, it also has considerably less register bandwidth to supply, per object (pixel), than NVidia's or ATI's architectures. Obviously transcendental throughput per lane is relatively low, but the payback in terms of simplicity and also seemingly in the ability to maximise double-precision throughput seems like it'll be worth it (though perhaps not with version 1.0 whose absolute ALU performance won't be anything special...).

The other route they could go would be even higher clock domains, and the G92b 9800GTX+ does have rather impressive compute power, relatively speaking, at the expense of perf/watt.
After GT200b I imagine we'll get significantly higher ALU clocks, simply because those chips will be on 40nm. Arguably there's about 40% more performance lying in wait solely from having decent ALU clocks, before adding ALUs. To be honest I've lost track of what 40nm is likely to make viable, so perhaps I'm being too optimistic.

In performance per mm2, assuming 40% higher NVidia ALU clocks, 55nm (71% scaling of 65nm) and ignoring NVIO, I reckon NVidia is adrift to the tune of about 70%, based on ATI's current average of 45% greater performance in ALU-limited game shaders.
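For anyone who wants to check my arithmetic, here's roughly how I'd reconstruct it. The die areas are the commonly reported figures (GT200 ~576mm2 at 65nm, RV770 ~256mm2) - treat them as assumptions, as is the 0.71 area scaling from 65nm to 55nm. I land a little under 70% this way, so take the exact figure with a grain of salt:

```python
# Reconstructing the perf/mm^2 estimate: NVidia gets a hypothetical +40% ALU
# clock, GT200 is optically shrunk to 55nm (0.71x area, NVIO ignored), and
# ATI keeps its ~45% average lead in ALU-limited game shaders.
gt200_65nm_mm2 = 576.0          # assumed die area
rv770_mm2 = 256.0               # assumed die area
gt200_55nm_mm2 = gt200_65nm_mm2 * 0.71

ati_perf = 1.45                 # relative perf in ALU-limited game shaders
nv_perf = 1.40                  # NVidia after the hypothetical clock bump

ati_perf_per_mm2 = ati_perf / rv770_mm2
nv_perf_per_mm2 = nv_perf / gt200_55nm_mm2
print(f"ATI perf/mm^2 advantage: {ati_perf_per_mm2 / nv_perf_per_mm2 - 1:.0%}")
```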

Jawed
 
The "scalar efficiency" argument is a sham. NVidia has spent shedloads of die space on scheduling/sequencing...

I think there's some merit to this in general. Not sure if the proposed reasoning is accurate (fine-grained dependency scheduling etc) but it does seem like Nvidia is at a vast scheduling overhead disadvantage right now. I don't necessarily agree with the efficiency bit as it's obvious that RV770 doesn't perform anywhere near its theoretical flops advantage over GT200. Or G92 for that matter if you want to compare parts with similar die sizes.

On a side note, has anybody run gpubench against GT200 to see if a pure MUL shader gets any speedup from the supposedly improved scheduling vs G80? I know we spent a lot of time on this back when we weren't sure how this supposed co-issue works but now that we do I'd like to see if there's anything to it.
 
I think there's some merit to this in general. Not sure if the proposed reasoning is accurate (fine-grained dependency scheduling etc) but it does seem like Nvidia is at a vast scheduling overhead disadvantage right now.
What do you think it could be, instead?

For what it's worth an NVidia GPU without double-precision but based on GT2xx architecture should look better...

I don't necessarily agree with the efficiency bit as it's obvious that RV770 doesn't perform anywhere near its theoretical flops advantage over GT200.
How can it? Grid still seems to be the only game with a significant portion of ALU-bound rendering...

Or G92 for that matter if you want to compare parts with similar die sizes.
I agree - as long as we're talking about games in general and not games that are mostly ALU-bound.

We expect ATI's VLIW to average below 100% ALU utilisation - it's the nature of the beast, even with a perfect compiler. Typically it's in the range 60-80% in games.

On a side note, has anybody run gpubench against GT200 to see if a pure MUL shader gets any speedup from the supposedly improved scheduling vs G80? I know we spent a lot of time on this back when we weren't sure how this supposed co-issue works but now that we do I'd like to see if there's anything to it.
These results are from June, so subject to driver quality:

http://translate.google.com/transla...05004/20080627060/&sl=ja&tl=en&hl=en&ie=UTF-8

Quick glance shows MUL doing roughly what we expect and dot products also gaining (significantly). The latter's actually quite interesting...

I don't have a clue why RCP is faster than RSQ, except to suspect that the compiler has optimised-out some instructions. Wonder if something similar is happening with the dot product tests...

Jawed
 
What do you think it could be, instead?

No idea, but I would think many of the instruction dependencies will be ironed out by the compiler. I figured the hardware scheduling is more focused on thread management. In general though, I think it's simply because they just have more and smaller batches in flight, hence the need for additional instruction issue hardware.

How can it? Grid still seems to be the only game with a significant portion of ALU-bound rendering... I agree - as long as we're talking about games in general and not games that are mostly ALU-bound.

True but even in the best case scenario that is Vantage's perlin noise shader it doesn't come close to its theoretical flops advantage.


Cool, thanks!
 
No idea, but I would think many of the instruction dependencies will be ironed out by the compiler.
If there were no dependencies then the code would be doing nothing. The dependencies that remain are those that have to be evaluated in real-time. Every operand of every instruction has to be tracked as well as avoiding conflicts where resultants must be locked-out of writing to registers/memory-locations that still have a pending dependency.

I figured the hardware scheduling is more focused on thread management.
In a sense NVidia's architecture makes a pea soup of batches and instructions simultaneously. Instructions are windowed, in much the same way as instructions are windowed in x86 CPUs for out of order processing. It's the per-instruction dependency checking that's costly.

ATI's architecture doesn't do any per-instruction real-time dependency checking as the compiler has done all of that (and the ALU pipeline organisation doesn't do any super-scalar issue). It does do per-batch dependency checking, though (e.g. to identify that texture results for a batch have returned from the TUs).

Actually, I should add a caveat here, as fetches from constants (which are indexable) and fetches from indexed registers (i.e. register address is not known until the instruction using the address is executed) can both result in a run-time dependency that slows ALU execution. This is basically a gather problem, what's often referred to as waterfalling. Bad access patterns to constants and/or register file will cause a reduction in the bandwidth of operand fetches.
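A toy illustration of the waterfalling effect (my own sketch of the idea, not ATI's actual hardware behaviour): the fetch serialises over however many distinct addresses the batch touches.

```python
# Toy model of "waterfalling": when lanes of a batch index constants (or
# registers) with addresses only known at run time, the fetch serialises
# over the distinct addresses touched by the batch.

def fetch_cycles(addresses_per_lane):
    """One cycle per distinct address the whole batch needs."""
    return len(set(addresses_per_lane))

uniform = [7] * 64              # all 64 lanes read the same constant
scattered = list(range(64))     # every lane reads a different one

print(fetch_cycles(uniform))    # 1 cycle: broadcast
print(fetch_cycles(scattered))  # 64 cycles: fully serialised
```

So in the worst case a single indexed fetch for a 64-element batch costs 64x the bandwidth of a uniform one.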

In general though I think it's simply because they just have more and smaller batches in flight, hence the need for additional instruction issue hardware.
They have smaller batches per SIMD (32 instead of 64), but they also have fewer batches per SIMD, in general. I can't remember if NVidia stuck with a maximum of 24 batches per SIMD in GTX280 (G80 has 24). ATI supports a maximum of 128. Per SIMD, NVidia's register files are smaller than ATI's: 64KB (for GTX280, I think) versus 256KB.

True but even in the best case scenario that is Vantage's perlin noise shader it doesn't come close to its theoretical flops advantage.
Is Vantage's Perlin Noise test a pure ALU test? 3DMark06's appears to be. But only just. The ALU:TEX ratio of the assembly code on HD4870 is 1.03 - I have the shader code and there's a hell of a lot of texturing. I don't have Vantage's shader code though :cry:

HD4670's Vantage Perlin Noise performance is "wrong":

http://forum.beyond3d.com/showpost.php?p=1234793&postcount=251
http://forum.beyond3d.com/showpost.php?p=1235573&postcount=264

So, right now, I remain to be convinced that Vantage's Perlin Noise test is ALU-bound, on ATI's 4:1 GPUs. The performance on ATI's architecture appears to be all over the place :???:

3DMark06's Perlin Noise test, on RV770, runs at 93% ALU utilisation. Based upon these results:

http://www.xbitlabs.com/articles/video/display/geforce-gtx200-theory_15.html

A comparison of theoretical GFLOPs and the performance measured there (average of all results) as a percentage of HD4870's achieved performance:

Code:
                                   % of
         Theoretical  Apparent  Theoretical
HD4870  - 1200          1116        93
HD4850  - 1000           925        92
GTX280  -  933.12        880        94
GTX260  -  715.392       678        95
9800GTX -  648.192       540        83

So it appears that GT200 is gaining performance from MUL and/or increased register file size in this shader (hey, maybe the improved performance of texturing? - seems unlikely though).
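The table above is easy to regenerate. The theoretical figures come straight from lanes x flops-per-clock x shader clock (800 x 2 ops for the ATI parts, lanes x 3 for MAD+MUL on the NVidia parts); the apparent figures are the averaged Xbit results:

```python
# Recomputing the table: theoretical GFLOPS = ALU lanes * flops/clock
# * shader clock (GHz); apparent GFLOPS averaged from the Xbit results.
chips = {
    # name:     (lanes, flops/clk, GHz,   apparent GFLOPS)
    "HD4870":   (800,   2, 0.750,  1116),
    "HD4850":   (800,   2, 0.625,   925),
    "GTX280":   (240,   3, 1.296,   880),
    "GTX260":   (192,   3, 1.242,   678),
    "9800GTX":  (128,   3, 1.688,   540),
}
for name, (lanes, fpc, ghz, apparent) in chips.items():
    theoretical = lanes * fpc * ghz
    print(f"{name:8s} {theoretical:8.2f} {apparent:5d} {apparent / theoretical:5.0%}")
```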

Jawed
 
If there were no dependencies then the code would be doing nothing. The dependencies that remain are those that have to be evaluated in real-time. Every operand of every instruction has to be tracked as well as avoiding conflicts where resultants must be locked-out of writing to registers/memory-locations that still have a pending dependency.

But aren't those things common to all architectures? Why doesn't RV770 need to track those things as well? And what exactly are you referring to with GT200 when you say "fine-grained"?

In a sense NVidia's architecture makes a pea soup of batches and instructions simultaneously. Instructions are windowed, in much the same way as instructions are windowed in x86 CPUs for out of order processing. It's the per-instruction dependency checking that's costly.
Do you have any docs that point to this arrangement? Based on things I've read it seems that Nvidia's instruction scheduling is very simple - there are multiple batches per SM and there is only a single instruction from each batch eligible for issue the next go-around, either to the primary ALU or the SFU. Are you sure there's real-time instruction re-ordering on G8x hardware?

They have smaller batches per SIMD (32 instead of 64), but they also have less batches per SIMD, in general.
Yeah but they have more SIMDs. Isn't the number of SIMDs a more relevant indicator of scheduling/issue overhead than the number of batches per SIMD?

Is Vantage's Perlin Noise test a pure ALU test? 3DMark06's appears to be. But only just. The ALU:TEX ratio of the assembly code on HD4870 is 1.03 - I have the shader code and there's a hell of a lot of texturing. I don't have Vantage's shader code though :cry:
You're right, Vantage is not ALU bound on my GTS-640 but heavily so in 3dmark06. A 37% shader overclock keeping everything else constant yields a 36% higher score in 3dmark06 but only 20% more in Vantage.
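A crude first-order way to read that (my own framing: treating score-gain over clock-gain as the ALU-bound fraction, which ignores everything else that shifts in between):

```python
# Scaling efficiency of an isolated shader overclock: how much of the clock
# gain shows up as score gain. ~1.0 means ALU-bound, well below means not.
def alu_bound_fraction(clock_gain: float, score_gain: float) -> float:
    return score_gain / clock_gain

print(alu_bound_fraction(0.37, 0.36))  # 3DMark06 Perlin Noise on the GTS-640
print(alu_bound_fraction(0.37, 0.20))  # Vantage Perlin Noise
```

That works out to ~97% scaling in 3DMark06 versus only ~54% in Vantage.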

So it appears that GT200 is gaining performance from MUL and/or increased register file size in this shader (hey, maybe the improved performance of texturing? - seems unlikely though).
Yeah, my results on G80 point to the increased register file size as being responsible for the higher performance. If it was just pure ALU throughput without any register pressure it should have scaled linearly like 3dmark06 did.

It's going to be difficult to separate texturing and shading performance on RV770 but it should be easy enough for someone to figure out if the test is ALU bound on GT200.
 
But aren't those things common to all architectures? Why doesn't RV770 need to track those things as well? And what exactly are you referring to with GT200 when you say "fine-grained"?
Essentially ATI's compiler has fully evaluated the sequence of operations that will run in the ALU pipeline/registers.

This is from 3DMark06 Perlin Noise, and shows the first clause of RV770's assembly:

Code:
00 ALU: ADDR(32) CNT(115) 
      0  x: MUL         T0.x,  C1.z,  C0.x      
         y: ADD*2       T0.y,  R1.y, -0.5      
         z: MUL         ____,  C1.y,  C0.x      
         w: ADD*2       ____,  R1.x, -0.5      
         t: MULADD      T1.x,  R1.y,  C2.x, -1.0f      
      1  x: MUL         T2.x,  C0.x,  (0x3CCCCCCD, 0.02500000037f).x      
         w: ADD         R7.w,  PV0.w,  PV0.z      
      2  y: ADD         ____,  T0.y,  PV1.w      
      3  z: ADD         ____,  T0.x,  PV2.y      
      4  x: MULADD      T0.x,  PV3.z,  C1.w,  T2.x      
         y: MULADD      T0.y,  PV3.z,  C1.w,  R7.w      
         z: MULADD      T0.z,  PV3.z,  C1.w,  T1.x      VEC_021 
      5  y: FRACT       ____,  PV4.z      
         z: FRACT       ____,  PV4.x      
         w: FRACT       ____,  PV4.y      
      6  x: ADD         T0.x,  T0.y, -PV5.w      
         y: ADD         T0.y,  T0.x, -PV5.z      
         w: ADD         T0.w,  T0.z, -PV5.y      
      7  x: MUL         T3.x,  PV6.x,  C2.w      
         y: MUL         T1.y,  PV6.w,  C2.w      
         z: ADD         ____,  PV6.x,  PV6.w      
         w: MUL         T2.w,  PV6.y,  C2.w      
      8  x: ADD         ____,  T0.y,  PV7.z      
         y: ADD         R4.y,  PV7.y,  C4.x      
         z: ADD         R4.z,  PV7.x,  C4.x      
         t: ADD         R1.z,  PV7.w,  C4.x      
      9  x: MULADD      ____,  PV8.x, -C2.z,  T0.y      
         y: MULADD      ____,  PV8.x, -C2.z,  T0.w      
         w: MULADD      ____,  PV8.x, -C2.z,  T0.x      
     10  x: ADD         R8.x,  R7.w, -PV9.w      
         y: ADD         R7.y,  T1.x, -PV9.y      
         z: ADD         R3.z,  T2.x, -PV9.x      VEC_120 
     11  x: ADD         ____, -PV10.x,  PV10.z      
         y: ADD         ____, -PV10.y,  PV10.x      
         z: ADD         ____, -PV10.x,  PV10.y      
         w: ADD         ____, -PV10.y,  PV10.z      
         t: ADD         T0.z,  PV10.x, -PV10.z      
     12  x: ADD         ____,  R7.y, -R3.z      
         y: CNDGE       ____,  PV11.z,  0.0f,  1.0f      
         z: CNDGE       ____,  PV11.y,  0.0f,  1.0f      
         w: CNDGE       ____,  PV11.x,  0.0f,  1.0f      
         t: CNDGE       ____,  PV11.w,  0.0f,  1.0f      
     13  x: ADD         ____,  PV12.z,  PS12      
         y: CNDGE       ____,  T0.z,  0.0f,  1.0f      
         z: ADD         ____,  PV12.y,  PV12.w      
         w: CNDGE       ____,  PV12.x,  0.0f,  1.0f      
         t: MUL         T1.x,  R3.z,  R3.z      
     14  x: ADD         T2.x,  PV13.z, -0.5      
         y: ADD         ____,  PV13.y,  PV13.w      
         z: ADD         ____,  PV13.x,  C4.w      
         w: ADD         ____,  PV13.z,  C4.w      
         t: ADD         T1.w,  PV13.x, -0.5      
     15  x: ADD         ____,  PV14.y,  C4.w      
         y: CNDGE       ____,  PV14.w,  1.0f,  0.0f      
         z: ADD         T0.z,  PV14.y, -0.5      
         w: CNDGE       ____,  PV14.z,  1.0f,  0.0f      
         t: MULADD      T1.x,  R7.y,  R7.y,  T1.x      
     16  x: ADD         ____,  R8.x, -PV15.y      
         y: ADD         ____,  R7.y, -PV15.w      
         z: CNDGE       ____,  PV15.x,  1.0f,  0.0f      
         w: MUL         ____,  PV15.y,  C2.w      
         t: MUL         T0.w,  PV15.w,  C2.w      
     17  x: ADD         R7.x,  PV16.x,  C2.z      
         y: ADD         R6.y,  PV16.y,  C2.z      
         z: MUL         ____,  PV16.z,  C2.w      
         w: ADD         ____,  R3.z, -PV16.z      
         t: ADD         R2.x,  PV16.w,  R4.z      
     18  x: MULADD      ____,  R8.x,  R8.x,  T1.x      VEC_021 
         y: ADD         R2.y,  T0.w,  R4.y      
         z: ADD         R2.z,  PV17.w,  C2.z      
         w: ADD         R4.w,  PV17.z,  R1.z      
         t: CNDGE       T0.w,  T2.x,  1.0f,  0.0f      
     19  x: DOT4        ____,  R7.x,  R7.x      
         y: DOT4        ____,  R6.y,  R6.y      
         z: DOT4        ____,  PV18.z,  PV18.z      
         w: DOT4        ____,  (0x80000000, 0.0f).x,  0.0f      
         t: ADD         ____, -PV18.x,  C3.x      
     20  x: MAX         ____,  PS19,  0.0f      
         y: CNDGE       T0.y,  T1.w,  1.0f,  0.0f      
         z: ADD         ____, -PV19.x,  C3.x      
         w: CNDGE       T1.w,  T0.z,  1.0f,  0.0f      
         t: ADD         ____,  R8.x, -T0.w      
     21  x: ADD         ____,  R3.z, -PV20.w      
         y: MUL         ____,  PV20.x,  PV20.x      
         z: MAX         ____,  PV20.z,  0.0f      
         w: ADD         ____,  R7.y, -PV20.y      
         t: ADD         R10.x,  PS20,  C1.w      
     22  x: MUL         R9.x,  PV21.y,  PV21.y      
         y: ADD         R8.y,  PV21.w,  C1.w      
         z: ADD         R6.z,  PV21.x,  C1.w      
         w: MUL         ____,  PV21.z,  PV21.z      
         t: MUL         ____,  T0.w,  C2.w      
     23  x: MUL         ____,  T0.y,  C2.w      
         y: ADD         R9.y,  R8.x, -0.5      
         z: MUL         T0.z,  T1.w,  C2.w      
         w: MUL         R2.w,  PV22.w,  PV22.w      
         t: ADD         R3.x,  R4.z,  PS22      
     24  x: DOT4        ____,  R10.x,  R10.x      
         y: DOT4        ____,  R8.y,  R8.y      
         z: DOT4        ____,  R6.z,  R6.z      
         w: DOT4        ____,  (0x80000000, 0.0f).x,  0.0f      
         t: ADD         R3.y,  R4.y,  PV23.x      
     25  x: ADD         ____, -PV24.x,  C3.x      
         y: ADD         R5.y,  R1.z,  T0.z      
         z: ADD         R7.z,  R7.y, -0.5      
         w: ADD         R3.w,  R3.z, -0.5      VEC_201 
         t: ADD         R0.x,  T3.x,  C3.y      
     26  x: ADD         R14.x,  R1.y, -0.5      
         y: MAX         R1.y,  PV25.x,  0.0f      
         z: ADD         R0.z,  T1.y,  C3.y      VEC_120 
         w: ADD         R6.w,  T2.w,  C3.y      
         t: MOV         R8.w,  C0.x

27 VLIW instructions will be executed, comprising 112 scalar operations (115 in the clause header is a compiler bug :cry: or I'm misinterpreting it...). All the dependencies have been evaluated and it will complete in 27*4*2 clocks. *4 clocks because a batch of 64 elements runs in 4 phases per instruction. *2 clocks because two batches take it in turn to execute a given instruction. So that's 216 clocks in total, totally uninterrupted. Then some other pair of batches will come along and take their turn running some clause or other (from VS, GS or PS).

(In fact it's 110 scalar operations, because there are two DOT4 instructions there which only do the work of a DP3 instruction (3 way dot product). There are 40 DOT4s in this assembly that are all DP3s, making my earlier stated 93% utilisation wrong, sigh. Utilisation is actually 89%.)
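Sanity-checking the arithmetic for this clause (note the slot utilisation here is per-clause; the 89% figure above is for the whole shader):

```python
# Clock count for the clause above: 27 VLIW instructions, each occupying the
# SIMD for 4 phases (a 64-element batch runs as 4 waves of 16), with two
# batches interleaved taking alternate turns.
vliw_instructions = 27
phases_per_instruction = 4
interleaved_batches = 2
clocks = vliw_instructions * phases_per_instruction * interleaved_batches
print(clocks)                        # 216

# Slot utilisation for this clause: 112 scalar ops in 27 five-slot bundles.
scalar_ops = 112
slots = vliw_instructions * 5
print(f"{scalar_ops / slots:.0%}")   # ~83% for this clause
```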

Analysing the first instruction:

Code:
      0  x: MUL         T0.x,  C1.z,  C0.x      
         y: ADD*2       T0.y,  R1.y, -0.5      
         z: MUL         ____,  C1.y,  C0.x      
         w: ADD*2       ____,  R1.x, -0.5      
         t: MULADD      T1.x,  R1.y,  C2.x, -1.0f

The resultants T0 and T1 are temporary registers. There are a maximum of 4 (T0-T3) of these and their scope is the duration of the clause. The resultants called "____" are kept in the pipeline's registers (separate from the register file).

Here you can see how PV0.w and PV0.z are referenced:

Code:
      1  x: MUL         T2.x,  C0.x,  (0x3CCCCCCD, 0.02500000037f).x      
         w: ADD         R7.w,  PV0.w,  PV0.z

PV means "previous vector register" (i.e. .xyzw) and "0" means "written by instruction 0". So that's the two "____" resultants from instruction 0 being used in instruction 1.

Looking a bit further down we see:

Code:
      4  x: MULADD      T0.x,  PV3.z,  C1.w,  T2.x      
         y: MULADD      T0.y,  PV3.z,  C1.w,  R7.w      
         z: MULADD      T0.z,  PV3.z,  C1.w,  T1.x      VEC_021

Here the VEC_021 directive instructs the pipeline on the order in which to fetch operands from the register file. It's a nightmare I don't understand that well (can't be bothered, partly)...

In this texturing clause:

Code:
01 TEX: ADDR(992) CNT(4) 
     27  SAMPLE R4.w___, R2.xyxx, t0, s0
     28  SAMPLE R5.w___, R3.xyxx, t0, s0
     29  SAMPLE R6.w___, R0.xzxx, t0, s0
     30  SAMPLE R0.___w, R4.zyzz, t0, s0

four texture instructions will run to completion. None of the results of any of these texture instructions can be used by the ALUs until all four instructions have completed.

So you can see how the compiler is generating code for the precise latencies and mechanics of the ALU and TU pipelines. There's other stuff associated with dynamic branching and constant caches that the compiler also handles...

NVidia's architecture has 3 ALUs of varying latency and issue rate (MAD, MUL/transcendental/interpolation and double-precision-MAD) and they decided to make the instruction issuer track the readiness of each operand in order to account for varying latencies. Also texture results return their results individually to the register file. This means that unlike ATI, the ALUs can consume texture results when they are ready, rather than having to wait until a set of, for example 4, return. This has a big pay-back in terms of overall latency, meaning that less batches are needed to hide latency than in ATI's style of scheduling which then means less register file is needed.

I wrote about this here:

http://forum.beyond3d.com/showthread.php?t=42852

Essentially, each TEX resultant has variable latency and it is the scheduler's job to respond, by issuing instructions that depend on this texture result as soon as possible. Waiting any longer will, in general, require more batches to be in flight.
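A minimal sketch of what that operand-readiness scoreboarding buys you (my own illustration with made-up latencies, not NVidia's actual design): a dependent ALU op waits on a long-latency TEX result while independent work fills the bubble.

```python
# Minimal scoreboard: an instruction issues as soon as all its source
# operands are marked ready; independent instructions fill latency bubbles.

def schedule(instructions):
    """instructions: list of (name, srcs, dst, latency). Returns issue order."""
    ready_at = {}                       # register -> cycle its value is ready
    pending = list(instructions)
    order, cycle = [], 0
    while pending:
        for ins in pending:
            name, srcs, dst, latency = ins
            if all(ready_at.get(r, 0) <= cycle for r in srcs):
                order.append(name)      # issue it
                ready_at[dst] = cycle + latency
                pending.remove(ins)
                break
        else:
            cycle += 1                  # nothing ready: stall this cycle
    return order

program = [
    ("TEX r0", [],     "r0", 200),     # long-latency texture fetch
    ("MAD r1", ["r0"], "r1", 8),       # depends on the fetch
    ("MUL r2", [],     "r2", 8),       # independent: can fill the bubble
]
print(schedule(program))   # ['TEX r0', 'MUL r2', 'MAD r1']
```

The MUL slips ahead of the dependent MAD without any compiler help - which is exactly the kind of latency hiding that reduces how many batches need to be in flight.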

Do you have any docs that point to this arrangement? Based on things I've read it seems that Nvidia's instruction scheduling is very simple - there are multiple batches per SM and there is only a single instruction from each batch eligible for issue the next go-around, either to the primary ALU or the SFU. Are you sure there's real-time instruction re-ordering on G8x hardware?
http://forum.beyond3d.com/showpost.php?p=1157238&postcount=207

http://www.realworldtech.com/page.cfm?ArticleID=RWT090808195242&p=8
Instructions are scoreboarded to prevent various hazards from stalling execution. When all the operands for an instruction and the destination registers/shared memory are available, the instruction status changes to ‘ready’. Each cycle the issue logic selects and forwards the highest priority ‘ready to execute’ warp instruction from the buffer. Prioritization is determined with a round-robin algorithm between the 32 warps that also accounts for warp type, instruction type and other factors.

A warp which has multiple ready instructions can continue to issue until the scoreboarding blocks further progress or another warp is selected for issue. This means that the scoreboarding actually enables very simple out-of-order completion. A warp could issue a long latency memory instruction, followed by a computational instruction and in that case, the computation would end up writing back its results before the memory instruction. This is a very limited form of out-of-order execution, comparable to techniques used in Itanium and much less aggressive (and more power efficient) than a fully renamed and out-of-order issue processor such as the Core 2.

That was the article that convinced me it's not VLIW and that scheduling is way more complex than I thought reasonable...

Yeah but they have more SIMDs. Isn't the number of SIMDs a more relevant indicator of scheduling/issue overhead than the number of batches per SIMD?
Yes. I wasn't disagreeing, merely adding perspective on the per-SIMD configuration and noting that the number and size of the batches doesn't fully define the extent of the scheduling hardware. ATI supports 1280 batches in flight (10 SIMDs * 128), whilst NVidia supports 720 (30 SIMDs * 24).
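Putting those figures side by side, plus what they imply for register file per in-flight element (the per-SIMD register file sizes are the 256KB/64KB figures from earlier; the 24-batch limit on GTX280 is an assumption, as noted):

```python
# Batches in flight across the whole chip.
ati_batches = 10 * 128          # 10 SIMDs x 128 batches
nv_batches = 30 * 24            # 30 SIMDs x 24 batches (G80-style limit assumed)
print(ati_batches, nv_batches)  # 1280 720

# Register file per element at maximum occupancy: 256KB vs 64KB per SIMD,
# with 64-element vs 32-element batches.
ati_bytes_per_elem = 256 * 1024 / (128 * 64)
nv_bytes_per_elem = 64 * 1024 / (24 * 32)
print(ati_bytes_per_elem, nv_bytes_per_elem)   # 32.0 vs ~85.3
```

Interestingly, at maximum occupancy NVidia actually has more register file per in-flight element - consistent with needing fewer batches to hide latency.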

You're right, Vantage is not ALU bound on my GTS-640 but heavily so in 3dmark06. A 37% shader overclock keeping everything else constant yields a 36% higher score in 3dmark06 but only 20% more in Vantage.
Woah :oops: that's a useful datum!

Yeah, my results on G80 point to the increased register file size as being responsible for the higher performance. If it was just pure ALU throughput without any register pressure it should have scaled linearly like 3dmark06 did.
I suspect that dependent-texturing is the bottleneck here - where texture addressing is calculated in code. So there's a non-obvious interaction between texturing latencies, register file size and ALU throughput...

It's going to be difficult to separate texturing and shading performance on RV770 but it should be easy enough for someone to figure out if the test is ALU bound on GT200.
I don't see how Vantage Perlin Noise can be ALU-bound on GT200, since GT200 is inherently less ALU-bound than G80 due to having a higher ALU:TEX ratio.

Jawed
 
All the dependencies have been evaluated and it will complete in 27*4*2 clocks. *4 clocks because a batch of 64 elements runs in 4 phases per instruction. *2 clocks because two batches take it in turn to execute a given instruction. So that's 216 clocks in total, totally uninterrupted. Then some other pair of batches will come along and take their turn running some clause or other (from VS, GS or PS).

In this texturing clause:

Code:
01 TEX: ADDR(992) CNT(4) 
     27  SAMPLE R4.w___, R2.xyxx, t0, s0
     28  SAMPLE R5.w___, R3.xyxx, t0, s0
     29  SAMPLE R6.w___, R0.xzxx, t0, s0
     30  SAMPLE R0.___w, R4.zyzz, t0, s0
four texture instructions will run to completion. None of the results of any of these texture instructions can be used by the ALUs until all four instructions have completed.

So you can see how the compiler is generating code for the precise latencies and mechanics of the ALU and TU pipelines. There's other stuff associated with dynamic branching and constant caches that the compiler also handles...

Ah, that is some impressive assembly-fu. I know you've mentioned this clause based approach before but this example really fleshes it out, thanks! So this is what you mean by fine-grained: per-instruction on G8x vs per-clause on R6xx? I don't get one thing though - how is it possible to "know" the latency of a texturing clause? Or does that bit just depend on regular latency hiding mechanisms by scheduling in other batches until the TEX clause returns?
 
NVidia's architecture has 3 ALUs of varying latency and issue rate (MAD, MUL/transcendental/interpolation and double-precision-MAD) and they decided to make the instruction issuer track the readiness of each operand in order to account for varying latencies. Also texture results return their results individually to the register file. This means that unlike ATI, the ALUs can consume texture results when they are ready, rather than having to wait until a set of, for example 4, return. This has a big pay-back in terms of overall latency, meaning that less batches are needed to hide latency than in ATI's style of scheduling which then means less register file is needed.

Essentially, each TEX resultant has variable latency and it is the scheduler's job to respond, by issuing instructions that depend on this texture result as soon as possible. Waiting any longer will, in general, require more batches to be in flight.

http://forum.beyond3d.com/showpost.php?p=1157238&postcount=207

Yep, that's what I got from the patent too. And that's what originally led me to believe that the main ALU and SFU were issued to independently which turned out to be the case. But I thought this approach was the "normal" way of doing things. So when you said fine-grained I thought it was something extra that Nvidia was doing....

http://www.realworldtech.com/page.cfm?ArticleID=RWT090808195242&p=8

That was the article that convinced me it's not VLIW and that scheduling is way more complex than I thought reasonable...
Isn't that a simple side effect of instruction-level scoreboarding though? You should be able to issue any number of independent instructions from the same batch until you hit a dependency or run out of registers. Sort of intra-batch latency hiding, similar to running non-dependent computations while waiting on a texturing result. If I understand it correctly the scoreboarding allows this to happen for free without any real additional work or complexity in the instruction issue logic.

Yes. I wasn't disagreeing, merely adding perspective on the per-SIMD configuration and noting that the number and size of the batches doesn't fully define the extent of the scheduling hardware. ATI supports 1280 batches in flight (10 SIMDs * 128), whilst NVidia supports 720 (30 SIMDs * 24).
Gotcha, but isn't the number of batches more relevant to the size of the register file? Which is supposedly much larger on ATi's stuff?

I suspect that dependent-texturing is the bottleneck here - where texture addressing is calculated in code. So there's a non-obvious interaction between texturing latencies, register file size and ALU throughput...
Yeah, maybe it's time for you to upgrade to Vista so you can know for sure ;)

I don't see how Vantage Perlin Noise can be ALU-bound on GT200, since GT200 is inherently less ALU-bound than G80 due to having a higher ALU:TEX ratio.
True, but it may be a useful exercise as a comparison with G80 to see how they respond to isolated overclocking of the shaders.
 
how is it possible to "know" the latency of a texturing clause? Or does that bit just depend on regular latency hiding mechanisms by scheduling in other batches until the TEX clause returns?
Always optimize for the worst-case scenario (no texture cache hits). Memory subsystem latency is always known as an a priori quantity.
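To put a rough number on that worst-case planning: the scheduler needs enough batches resident that their combined ALU work covers a full cache miss. A back-of-the-envelope sketch (the latency and per-batch cycle counts here are illustrative assumptions, not vendor figures):

```python
import math

def batches_to_hide(mem_latency_cycles, alu_cycles_per_batch):
    """Minimum number of batches that must be in flight so that
    worst-case (cache-miss) memory latency is fully covered by
    independent ALU work from other batches."""
    return math.ceil(mem_latency_cycles / alu_cycles_per_batch)

# Illustrative assumption: ~500-cycle miss latency, 8 ALU cycles of
# independent work per batch before it stalls on the fetch result.
print(batches_to_hide(500, 8))  # -> 63
```

More ALU work per batch (longer clauses, more independent instructions) directly reduces how many batches, and hence how much register file, you need resident.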
 
Beware of Vantage's Perlin Noise results on the GTX (and maybe on other DX10 GeForces - dunno). I've seen performance jump up roughly 30% with the switch back to older drivers (177.40-ish, IIRC).
 
Ah, that is some impressive assembly-fu. I know you've mentioned this clause based approach before but this example really fleshes it out, thanks! So this is what you mean by fine-grained: per-instruction on G8x vs per-clause on R6xx?
Yes, ATI's scheduling is per-clause. It's performed by the Sequencer, with, as far as I can tell, one Sequencer per SIMD. Though I'm a bit unsure about that because I've seen architectural descriptions that imply a single Sequencer controlling all clusters (ALUs + TUs).

Here's a different shader's (Keenan Crane's QJuliaFragment.cg) Sequencer instructions (I've deleted the content of the clauses). Note there's no sampling (texture fetches), just ALU clauses and branching:

Code:
00 ALU: ADDR(64) CNT(38) 
01 LOOP_NO_AL i1 FAIL_JUMP_ADDR(18) VALID_PIX 
    02 ALU: ADDR(102) CNT(17) 
    03 LOOP_NO_AL i0 FAIL_JUMP_ADDR(11) VALID_PIX 
        04 ALU_PUSH_BEFORE: ADDR(119) CNT(40) 
        05 JUMP  POP_CNT(1) ADDR(9) VALID_PIX 
        06 ALU: ADDR(159) CNT(8) 
        07 BREAK ADDR(10) 
        08 POP (1) ADDR(9) 
        09 ALU: ADDR(167) CNT(8) 
    10 ENDLOOP i0 PASS_JUMP_ADDR(4) 
    11 ALU_PUSH_BEFORE: ADDR(175) CNT(32) 
    12 JUMP  POP_CNT(1) ADDR(16) VALID_PIX 
    13 ALU: ADDR(207) CNT(3) 
    14 BREAK ADDR(17) 
    15 POP (1) ADDR(16) 
    16 ALU: ADDR(210) CNT(3) 
17 ENDLOOP i1 PASS_JUMP_ADDR(2) 
18 ALU_PUSH_BEFORE: ADDR(213) CNT(3) 
19 JUMP  ADDR(40) VALID_PIX 
20 ALU: ADDR(216) CNT(24) 
21 LOOP_NO_AL i0 FAIL_JUMP_ADDR(24) VALID_PIX 
    22 ALU: ADDR(240) CNT(84) 
23 ENDLOOP i0 PASS_JUMP_ADDR(22) 
24 ALU_PUSH_BEFORE: ADDR(324) CNT(103) 
25 JUMP  POP_CNT(1) ADDR(40) VALID_PIX 
26 ALU: ADDR(427) CNT(7) 
27 LOOP_NO_AL i1 FAIL_JUMP_ADDR(39) VALID_PIX 
    28 ALU: ADDR(434) CNT(17) 
    29 LOOP_NO_AL i0 FAIL_JUMP_ADDR(37) VALID_PIX 
        30 ALU_PUSH_BEFORE: ADDR(451) CNT(40) 
        31 JUMP  POP_CNT(1) ADDR(35) VALID_PIX 
        32 ALU: ADDR(491) CNT(8) 
        33 BREAK ADDR(36) 
        34 POP (1) ADDR(35) 
        35 ALU: ADDR(499) CNT(8) 
    36 ENDLOOP i0 PASS_JUMP_ADDR(30) 
    37 ALU_BREAK: ADDR(507) CNT(32) 
38 ENDLOOP i1 PASS_JUMP_ADDR(28) 
39 ALU_POP_AFTER: ADDR(539) CNT(7) 
40 ELSE POP_CNT(1) ADDR(42) VALID_PIX 
41 ALU_POP_AFTER: ADDR(546) CNT(3) 
42 EXP_DONE: PIX0, R1

All control flow is predicate based and the Sequencer tests the predicate, as it will be set by each ALU clause, to determine how to handle branching. The Sequencer can also manipulate the predicate stack.
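A toy model of that predicate-stack behaviour (an assumption-level sketch, not the actual R6xx hardware semantics): each ALU clause can set a per-lane predicate, a PUSH-style clause saves the current active mask before narrowing it, and POP(n) restores on the way out of nested flow - mirroring the ALU_PUSH_BEFORE / POP pairs visible in the dump above.

```python
class PredicateStack:
    """Toy sequencer predicate stack. Lanes are pixel indices in a batch;
    'active' is the set of lanes executing the current path."""
    def __init__(self, active_mask):
        self.active = set(active_mask)
        self.stack = []

    def push_before(self, predicate):
        # cf. ALU_PUSH_BEFORE: save the mask, then narrow by the predicate
        self.stack.append(self.active)
        self.active = self.active & set(predicate)

    def pop(self, count=1):
        # cf. POP (n): restore n saved mask levels
        for _ in range(count):
            self.active = self.stack.pop()

mask = PredicateStack({0, 1, 2, 3})
mask.push_before({0, 2})   # only lanes 0 and 2 take the branch
print(mask.active)         # -> {0, 2}
mask.pop()
print(mask.active)         # -> {0, 1, 2, 3}
```

Lanes outside the active mask still occupy ALU slots; they just have their results squelched, which is where divergence cost comes from.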

The final instruction there is a shader export instruction, exporting one or more registers to the RBEs. There's a family of these instructions that manipulate caches, registers and memory locations.

In NVidia's PTX (hardware-agnostic "assembly") there's a similar mixture of ALU/TEX instructions and control flow instructions. So there's a similar control hierarchy, but the hardware also does fine-grained dependency evaluation per-instruction.

Apart from converting HLSL etc. into PTX, NVidia's compiler makes decisions such as whether to convert MAD into MUL+ADD instructions, how to unroll loops, and how to re-use intermediate-result registers. The compiler uses some kind of statistical model of the hardware to make these decisions. The model covers things like typical texturing latencies, the ratios of ALU and TMU clocks, and guesstimated divergence patterns for dynamic branching.

The hardware takes the dependencies of each instruction (as described in PTX) and decides in real time how to issue to MAD, MI, DP-MAD and TEX.

I don't get one thing though - how is it possible to "know" the latency of a texturing clause? Or does that bit just depend on regular latency hiding mechanisms by scheduling in other batches until the TEX clause returns?
It's just regular latency-hiding - fingers-crossed that the result will return before all available ALU instructions in all available batches have completed.

The programmer can only manipulate the efficiency of this by deciding on how many fetches per TEX clause to use (up to a limit of 8), how to schedule ALU and TEX clauses (manually manipulating dependencies and doing medium-grained scheduling, in effect) and how many registers are used. But graphics programmers don't get this access, normally. Only if you're programming in IL do you get this control.

So, graphics programmers are at the mercy of the compiler. One of the issues that compiler has is that the combination of VS, GS and PS loaded into the GPU at any given time can have a marked effect on available register file space. So, ahem, that's one kind of compiler tweaking that needs to be done per-game (though I don't know if AMD does this).
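The register-pressure trade-off being described boils down to simple division: batches resident per SIMD is the register file divided by per-batch register usage, capped by the hardware maximum. A sketch with purely illustrative parameter values:

```python
def resident_batches(regfile_regs, regs_per_thread, threads_per_batch, max_batches):
    """How many batches fit on a SIMD given register pressure - the knob
    the compiler (and, in IL, the programmer) trades against latency
    hiding. All numbers passed in here are illustrative assumptions."""
    fit = regfile_regs // (regs_per_thread * threads_per_batch)
    return min(fit, max_batches)

# E.g. a 16384-register file with 64-wide batches:
print(resident_batches(16384, 4, 64, 128))   # light shader -> 64 batches
print(resident_batches(16384, 16, 64, 128))  # register-hungry shader -> 16
```

This is why a bloated VS+GS+PS combination can quietly gut latency hiding: quadruple the per-thread registers and you quarter the batches available to cover texture fetches.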

Jawed
 
Isn't that a simple side effect of instruction level scoreboarding though? You should be able to issue any number of independent instructions from the same batch until you hit a dependency or run out of registers. Sort of intra-batch latency hiding, similar to running non-dependent computations while waiting on a texturing result. If I understand it correctly the scoreboarding allows this to happen for free without any real additional work or complexity in the instruction issue logic.
In fact it's able to issue instructions from any available batch - it doesn't have to "exhaust" the current batch. How the hardware evaluates this when multiple batches have available instructions is where the pixie dust comes in.
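A minimal sketch of that idea - the scheduler picks a ready instruction from any resident batch, not just the current one. The naive first-ready selection policy here is a stand-in for the undisclosed "pixie dust":

```python
def pick_ready(batches, pending_results):
    """batches: list of (batch_id, next_instr, deps) tuples, where deps
    is the set of registers the instruction reads.
    pending_results: registers still being written by in-flight ALU ops
    or outstanding texture fetches (the scoreboard's busy set).
    Returns the first batch whose next instruction has no unmet
    dependency, or None if every resident batch is stalled."""
    for batch_id, instr, deps in batches:
        if not (deps & pending_results):
            return batch_id, instr
    return None

batches = [
    (0, "mad r3, r1, r2", {"r1", "r2"}),  # r1 still waiting on a fetch
    (1, "mul r5, r4, r4", {"r4"}),        # no busy inputs -> ready
]
print(pick_ready(batches, pending_results={"r1"}))  # -> (1, 'mul r5, r4, r4')
```

The hard part the hardware actually solves - and this toy ignores - is doing that check for hundreds of batches every clock, and choosing *which* ready batch to favour.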

Gotcha, but isn't the number of batches more relevant to the size of the register file, which is supposedly much larger on ATi's stuff?
It is relevant but other factors come into play. e.g. NVidia's hardware clocks its ALUs at ~2x TMUs, etc. But yeah, ATI's total register file size is larger. NVidia's fragmentation, 30 SIMDs versus 10 SIMDs, clearly adds scheduling hardware overhead when comparing two architectures of similar performance.

I didn't notice earlier, but RWT says there are 32 batches per SIMD, so that's 960 total batches in GT200.
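The totals quoted in the thread are just SIMD count times batches per SIMD - worth writing out, since the GT200 figure changed once RWT's 32-batches-per-SIMD number came in:

```python
# Resident-batch totals from the figures quoted above.
ati_batches    = 10 * 128   # RV770: 10 SIMDs * 128 batches each
nv_batches_old = 30 * 24    # GT200, earlier assumption of 24 per SIMD
nv_batches_rwt = 30 * 32    # GT200 per RWT: 32 batches per SIMD
print(ati_batches, nv_batches_old, nv_batches_rwt)  # -> 1280 720 960
```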

Jawed
 
These results are from June, so subject to driver quality:

http://translate.google.com/transla...05004/20080627060/&sl=ja&tl=en&hl=en&ie=UTF-8

Quick glance shows MUL doing roughly what we expect and dot products also gaining (significantly). The latter's actually quite interesting...

I don't have a clue why RCP is faster than RSQ, except to suspect that the compiler has optimised-out some instructions. Wonder if something similar is happening with the dot product tests...

Jawed

Got my 285 so could run some tests myself. Not sure if something is screwy with Nvidia's drivers but my SFU numbers are coming in twice as high as expected:

gpubench instrissue -l 64 -a -m -c 4



On the 280 results you linked they were coming in at 1/4 the speed of the MAD ALU, as expected. But I'm seeing them running at 1/2 speed. This is with shaders running at 1585 MHz.
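For reference, the ratio under discussion, assuming the commonly cited 8 MAD lanes versus 2 SFU lanes per SM (illustrative unit counts, not a claim about what the driver is actually doing):

```python
# Expected vs observed SFU (transcendental) throughput relative to MAD.
# Assumption: 8 MAD lanes vs 2 SFU lanes per SM gives the expected 1/4.
mad_lanes, sfu_lanes = 8, 2
expected_ratio = sfu_lanes / mad_lanes   # 0.25 -> "1/4 speed"
observed_ratio = 0.5                     # what the GTX 285 run showed
print(expected_ratio, observed_ratio / expected_ratio)  # -> 0.25 2.0
```

A factor-of-2 surplus over the expected rate would fit either a driver/compiler optimisation replacing the tested instruction, or the benchmark not measuring what it thinks it is.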
 