Sir Eric Demers on AMD R600

If 384 threads are enough in some 16-SIMD refresh of G80, then why does G80 have 768?
Sorry, I wasn't too clear there. I also meant for you to spend more time thinking about the next paragraph.

I'm talking about the ALUs when working on ALU instructions. If you look at my example, you see that this is the only aspect of the ALUs that affects the latency hiding necessary. I need to make sure at least 6 warps have ALU instructions ready to run. If that parameter was, say, 16, then I'd start running into trouble and could no longer extend my example to the double-width, double warp size G80 without a loss in performance.
85 bytes per fragment is 5 vec4 registers.
I know (21 FP32 scalars sounds better :p ). That limitation isn't holding back G80 much. The only time this issue would affect the double-ALU G80 and not the current G80 is if a shader is both heavily ALU limited and also frequently hits the read-after-write limitations of the register file.

Such an odd shader is not the reason you think the RF needs to be doubled, so there's no need to discuss this any further.
I disagree, because shaders can have "hotspots", e.g. where a combination of register allocation per fragment and dependent texturing, say, causes you to chew through available threads, resulting in an ALU pipeline stall. As Mike Shebanow says, as you consider smaller and smaller windows of instructions, the shader average throughput is irrelevant - the small window has its own effective ALU:TEX ratio, which you have to balance against other resource constraints (e.g. registers per fragment). The program counters for threads will "bunch up" behind this hotspot.
You need a hotspot that's quite a few instructions in length to get all the warps to bunch up like that. In any case, when I said "statistics takes care of other scenarios", I meant it in terms of their applicability to my argument.

A bunch of dependent texture lookups is going to be texture throughput limited unless you have a huge swath of ALU instructions outside this portion of the shader, which makes "bunching up" less likely in the first place. That's not your typical game shader, and "statistics" also takes into account the variation in shaders out there. You don't double the register file solely to improve the performance of 0.01% of the fragments your hardware will ever see.

Also, keep in mind the scope of our debate. We are comparing two methods of doubling the ALU:TEX ratio of G80. Your way is to double the number of multiprocessors to 4 per cluster. My way is to double the SIMD width and warp size and nothing else - not even the register file. Clearly your way has some advantages in corner cases. However, it's also far more costly. My contention is that overall the difference in performance will be minimal. Even in extreme cases of a 100% serially dependent instruction stream, as in my example, there will be no performance hit. Clearly texture latency hiding is not affected, and there is no pressing need to double the register file.

What NVidia does decide to do is a different story. You had some good points about why NVidia would like to keep 32 pixel warps, and you're probably right.

-----------------------------

I thought of another way to summarize my argument against your claim:
Also, you have to double the size of the RF (can't keep the size constant), because you want to double the number of objects in flight, since your ALU pipe is now chewing through them twice as fast. Otherwise you've just lost half your latency-hiding.
Consider 3 different shaders:
-X: 200 scalar ALU, 10 TEX
-Y: 100 scalar ALU, 10 TEX
-Z: 100 scalar ALU, 5 TEX

I'll refer to my double-width, double warp size, equal RF modification as G80**. I claim:
A) G80** will run X as fast as G80 runs Y. With double the ALU instructions, we can now feed the double-width SIMD, hiding the same 10 texture instructions almost identically.
B) G80 will run Y at the same speed as Z (since we're ALU limited).
C) Both G80 and G80** will run Z twice as fast as X.
D) The purpose of doubling the ALU:TEX ratio can be summarized as having equal performance with double the ALU load. A shows we've done that.
E) From A, B, and C, we see that G80** is twice as fast as G80 in X and Z.

So whether you look at D or E, we've accomplished our goal without doubling the register file. If you dig further, you can see the fundamentals of this argument are the same as in my other example. Equal TMU throughput between G80** and G80 is the reason A is possible, and is the primary reason the thread count can remain equal.
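Those claims can be checked against a toy throughput model (my own sketch, assuming perfect latency hiding and an illustrative TEX rate of 1 fetch per 8 ALU-issue slots; the exact rate doesn't matter as long as Y is ALU limited on G80):

```python
# Toy throughput model (a sketch, not from the thread): shader time is
# max(ALU time, TEX time), assuming perfect latency hiding. The TEX rate of
# 1 fetch per 8 ALU-issue slots is an assumed, illustrative figure.
TEX_RATE = 1 / 8.0  # fetches retired per ALU-issue slot

def run_time(alu, tex, alu_rate):
    """Clocks to drain a shader with `alu` ALU and `tex` TEX instructions."""
    return max(alu / alu_rate, tex / TEX_RATE)

shaders = {"X": (200, 10), "Y": (100, 10), "Z": (100, 5)}
g80   = {k: run_time(a, t, alu_rate=1.0) for k, (a, t) in shaders.items()}
g80ss = {k: run_time(a, t, alu_rate=2.0) for k, (a, t) in shaders.items()}

assert g80ss["X"] == g80["Y"]        # claim A: G80** runs X as fast as G80 runs Y
assert g80["Y"] == g80["Z"]          # claim B: both ALU limited on G80
assert g80["Z"] * 2 == g80["X"]      # claim C, on G80
assert g80ss["Z"] * 2 == g80ss["X"]  # claim C, on G80**
assert g80ss["X"] * 2 == g80["X"]    # claim E: G80** twice as fast in X
assert g80ss["Z"] * 2 == g80["Z"]    # claim E: and in Z
```

All the claims fall out of the max() directly, with no change to the register file anywhere in the model.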
 
My main point was to indicate that you're seemingly an order of magnitude out. You're saying, as far as I can tell, that excluding cache, 32 TAs and 64 TFs in G80 cost about 12M transistors.
Oh no, the scope of my statement is far narrower than that. I'm talking only about the math part of the double-speed INT8 filtering in G80's TMUs. Basically I'm saying that if each of G80's 32 TMUs were only as capable as an R600 TMU (i.e. still free FP16 filtering, but no free INT16 and no accelerated INT8 operations during trilinear, volume, or AF), then the filtering math logic saved would only be around 4M transistors. There's still all the other stuff (address offset calcs, faster tri/aniso calcs, more decompression, and especially cache granularity) which could quadruple that or more. I was really just pointing out that those 14 muls and 10 adds are a piece of cake, and that the chart is a poor basis from which to judge the relative costs of addressing and filtering.

In no way did that figure represent the number of transistors necessary for 32 TMUs. There's a ton of other stuff in a TMU. I was just discussing the incremental cost of adding extra INT8 filtering for certain situations, which you appear to think is huge. That was the root of our debate, remember? Way back on page 10? :LOL: Boy, we've been at it for a while.
 
I guess you didn't look at:

http://www.techreport.com/reviews/2007q2/radeon-hd-2900xt/index.x?pg=5

I'm just making an observation on that RightMark test; I don't know how it's implemented or why it's different from the multi-texturing test.
Sorry, I meant to address this graph the last time you brought it up.

If you look at the first point on that graph (bilinear), you see that the value is way below that of the multitexturing test. Clearly this test does not have 8 textures enabled, which is a shame. Bandwidth and fillrate are probably big factors. The orientation of the geometry in this test is also critical to the degree of AF needed for each pixel.

So not only is this test very poor at measuring what it's trying to, it also has very little applicability to any other situation. Even Digit-Life, who use RightMark synthetics extensively, ignore this one. The only thing you can really take from this graph is the scaling. As expected, a 1:2 TA:TF ratio halves the perf drop compared to 1:1.
 
Or having double the amount of filtering units will give you lower net-latency of ALUs waiting to be texture-fed. After all, a TA does not take nearly as long as it takes to fetch and filter the values - especially on a cache miss.

Or am i completely mistaken here?
It's more about throughput than latency. Equal latency doesn't help if you're getting values back from the TMU at half the speed.

Of course, half the throughput inevitably doubles the latency anyway. If a drive-through starts moving at half the speed due to a sleepy cashier, it doesn't matter whether the cooking time is the same or doubled. The length of time between ordering and getting food will double.
 
Sorry, I wasn't too clear there. I also meant for you to spend more time thinking about the next paragraph.
Fair enough, the read after write latency shouldn't be on the critical path for ALU throughput generally anyway. It's never been my focus in these discussions, just an adjunct (e.g. it is mildly relevant because it adds a constant latency overhead on texturing).

I know (21 FP32 scalars sounds better :p ). That limitation isn't holding back G80 much.
Games that are math-limited are hard to find...

The only time this issue would affect the double-ALU G80 and not the current G80 is if a shader is both heavily ALU limited and also frequently hits the read-after-write limitations of the register file.
But your proposed revision is more sensitive to these corner cases - it has less headroom.

Register file is cheap, comparatively. Memory is dense and easy to yield at 100%. G80's register file memory is somewhere in the region of 5% of the transistor cost, and way less in terms of area.

A bunch of dependent texture lookups is going to be texture throughput limited unless you have a huge swath of ALU instructions outside this portion of the shader, which makes "bunching up" less likely in the first place. That's not your typical game shader, and "statistics" also takes into account the variation in shaders out there. You don't double the register file solely to improve the performance of 0.01% of the fragments your hardware will ever see.
NVidia's playing a probability game when it sizes the hardware. The 768 threads/8192 fragments (in CUDA, at least) per multiprocessor seems, to me, to be too much for "classical free bilinear", even with 4:1 (scalar-instruction:texel) code. G71 is the same, there seems to be more latency-hiding than is necessary. Yet this is the way these GPUs are. It'd be nice to be able to explain that.

Also, keep in mind the scope of our debate. We are comparing two methods of doubling the ALU:TEX ratio of G80. Your way is to double the number of multiprocessors to 4 per cluster. My way is to double the SIMD width and warp size and nothing else - not even the register file. Clearly your way has some advantages in corner cases. However, it's also far more costly.
I agree, my proposal is significantly more costly. But the register file cost seems relatively easy to swallow - and so shouldn't be the centre of disagreement. The scheduling part is the costly part.

My contention is that overall the difference in performance will be minimal. Even in extreme cases of a 100% serially dependent instruction stream, as in my example, there will be no performance hit. Clearly texture latency hiding is not affected, and there is no pressing need to double the register file.
The trend is for ALU:TEX ratio to increase. As ALU:TEX ratio rises, register allocation also tends to rise, which puts increased pressure on the register file and reduces the chances of the ALU pipeline running at 100% - though it's still early days to draw much of a conclusion on scaling factors.

I thought of another way to summarize my argument against your claim:
Consider 3 different shaders:
-X: 200 scalar ALU, 10 TEX
-Y: 100 scalar ALU, 10 TEX
-Z: 100 scalar ALU, 5 TEX

I'll refer to my double-width, double warp size, equal RF modification as G80**. I claim:
A) G80** will run X as fast as G80 runs Y. With double the ALU instructions, we can now feed the double-width SIMD, hiding the same 10 texture instructions almost identically.
B) G80 will run Y at the same speed as Z (since we're ALU limited).
C) Both G80 and G80** will run Z twice as fast as X.
D) The purpose of doubling the ALU:TEX ratio can be summarized as having equal performance with double the ALU load. A shows we've done that.
E) From A, B, and C, we see that G80** is twice as fast as G80 in X and Z.

So whether you look at D or E, we've accomplished our goal without doubling the register file. If you dig further, you can see the fundamentals of this argument are the same as in my other example. Equal TMU throughput between G80** and G80 is the reason A is possible, and is the primary reason the thread count can remain equal.
I don't disagree with any of this. NVidia's modelling of these scenarios is more advanced, which is why G80 has 768 threads/8192 fp32s per multiprocessor (or 512 threads). In effect you're arguing that G80's register file could be half its current size or even less.

When you throw in things like vertex:pixel ratio; ratio of the shader lengths vertex:pixel; bandwidth consumption by render targets (hence increased texture latency); texture cache thrashing due to out-of-order threading; per-object register allocation, etc., the size of the register file can only grow from some "ideal".

Did you listen to Shebanow's presentation? It's worth it. Even if his parting words to the assembled mass were something along the lines of "I can't tell you all the parameters in G80 so that you can tune your code perfectly ... and you don't want to program that close to the hardware unless you like rewriting your programs."

Jawed
 
I was just discussing the incremental cost of adding extra INT8 filtering for certain situations, which you appear to think is huge. That was the root of our debate, remember? Way back on page 10? :LOL: Boy, we've been at it for a while.
OK, well my remaining problem is that filtering is seemingly about 10% of a TMU according to the sizings you're proposing. I'm struggling to find that credible. Seems we aren't going to get anywhere without more specifics.

I'm nearing the end of page 10 here...

Jawed
 
The only thing you can really take from this graph is the scaling. As expected, a 1:2 TA:TF ratio halves the perf drop compared to 1:1.
I've collected R600 and R580+ texturing data from the following pages, using the best-available scores for R580+ (they vary a fair bit for some reason):

Multitexturing:

http://www.techreport.com/reviews/2007q2/radeon-hd-2900xt/index.x?pg=4
http://www.techreport.com/reviews/2006q4/geforce-8800/index.x?pg=5

R600 versus R580:
  • single texture - 6900 v 5000 = 21% per clock
  • 8 textures - 9700 v 8400 = 1% per clock
Filtering:
http://www.techreport.com/reviews/2007q2/radeon-hd-2900xt/index.x?pg=5
http://www.techreport.com/reviews/2006q4/geforce-8800/index.x?pg=6

R600 versus R580:
  • bilinear - 8100 v 6000 = 18% per clock
  • 16xAF - 4600 v 3300 = 22% per clock
Jawed
 
You guys ought to just meet up in real life (TM) and do some serious brainstorming at the local coffee shop or whatever (taco bell?). Make sure the place has wifi so u can still gather evidence and references. Then report back with a grand, all-encompassing summary of G80 and R600. plz? ;)
 
Register file is cheap, comparatively. Memory is dense and easy to yield at 100%. G80's register file memory is somewhere in the region of 5% of the transistor cost, and way less in terms of area.
...
I agree, my proposal is significantly more costly. But the register file cost seems relatively easy to swallow - and so shouldn't be the centre of disagreement. The scheduling part is the costly part.
I take it you're assuming 7 transistors for each bit of register space, right? I don't know if it's that simple.

Anyway, it is not the centre of disagreement. You said double the register file is required to maintain efficiency during texturing, and I disagree. There is only one case where the double-SIMD-width will not give the gains expected for any doubled ALU:TEX ratio: Shaders with high register usage per thread, very few TEX lookups, and high read-after-write dependency.

(well, incoherent dynamic branching too, but that's another story.)
NVidia's playing a probability game when it sizes the hardware. The 768 threads/8192 fragments (in CUDA, at least) per multiprocessor seems, to me, to be too much for "classical free bilinear", even with 4:1 (scalar-instruction:texel) code. G71 is the same, there seems to be more latency-hiding than is necessary. Yet this is the way these GPUs are. It'd be nice to be able to explain that.
(I assume you meant 8192 FP32s.) A 4:1 ratio is extremely texture throughput limited for G80, so I don't see what that has to do with thread count. Also, I don't know how you can say they have more latency hiding than necessary when ATI has so much more.

Do you understand the basic equation I set out previously for hiding latency?

Consider a Cell SPU, for example, running scalar code on a bunch of "threads" (these aren't hardware threads I'm talking about). 4 instr/clock, and 7 cycle latency means you need 28 threads to maximize throughput regardless of dependency within each one. The same holds with GPUs and TMUs, except the latency is a random variable with a probability distribution. Sure, the ALUs give you a number too, but it's much smaller.
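That equation in runnable form (the SPU figures are from the paragraph above; the 200-clock TMU latency is an assumed, illustrative number, since real fetch latency is a distribution):

```python
# Little's Law sizing: in-flight work needed = issue throughput * result
# latency. The SPU numbers (4 instr/clock, 7-cycle latency) are from the post;
# the 200-clock TMU latency below is an assumed figure for illustration only.
def threads_needed(issue_per_clock, latency_clocks):
    return issue_per_clock * latency_clocks

assert threads_needed(4, 7) == 28        # the Cell SPU figure from the post

tmu_in_flight = threads_needed(2, 200)   # a TMU retiring 2 fetches/clock
assert tmu_in_flight == 400              # fetches that must be in flight
```

The same arithmetic applies whether the latency is a fixed pipeline depth or the mean of a distribution; the distribution just forces you to size for the tail rather than the mean.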
The trend is for ALU:TEX ratio to increase. As ALU:TEX ratio rises, register allocation also tends to rise
The latter rises at a far lower rate than that of the former. Texture fetches are held in registers far more than math results. When a shader has one float4 output (the vast majority of them), the only reason to have lots of registers is for reusing an intermediate value. Parallel instruction streams in a shader (which R600 loves) can be arranged serially to reduce register use (which G80 loves).

I see a very weak relationship between ALU:TEX ratio and register usage. If you've ever used an RPN calculator, you'd see that very long expressions can be evaluated with very few registers. It would be interesting to see what the Cg or HLSL compiler could do with randomly generated shaders of different length in terms of register use, and I suspect that the relationship would be much weaker than you think.
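The RPN point is easy to make concrete. A quick sketch (purely illustrative) that counts the peak stack depth of a postfix expression shows a left-to-right chain of adds never needs more than two live values, no matter how many ALU instructions it contains:

```python
# Sketch of the RPN point: a long left-associated sum "(((a+b)+c)+d)..."
# evaluated on a stack machine never holds more than 2 live values at once,
# so register demand need not track ALU instruction count.
def max_stack_depth(postfix):
    depth = peak = 0
    for tok in postfix:
        if tok == "+":
            depth -= 1       # pop two operands, push one result
        else:
            depth += 1       # push a value
        peak = max(peak, depth)
    return peak

# a+b+c+...: value, then (value, +) repeated
n = 100
serial = ["v"] + [t for _ in range(n - 1) for t in ("v", "+")]
assert max_stack_depth(serial) == 2   # 100-term sum, only 2 registers live
```

A balanced expression tree does need roughly log2(n) live values, but even that grows far slower than the instruction count, which matches the weak relationship described above.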
I don't disagree with any of this. NVidia's modelling of these scenarios is more advanced, which is why G80 has 768 threads/8192 fp32s per multiprocessor (or 512 threads). In effect you're arguing that G80's register file could be half its current size or even less.
AAAAH!!! No I'm not!!! Let's call this half-register-space chip G80-. I could not claim that G80- can run any shader as fast as G80.

You have to understand what I'm stating in claim A. G80** has the same number of threads as G80 for a given shader. It has the same number of registers per thread for every shader. Because the warp size is doubled, though, it has half the warps. Even so, it harms neither the ALU throughput (aside from the aforementioned pathological cases) nor the texture latency hiding.
bandwidth consumption by render targets (hence increased texture latency);
This is the kind of notion I'm trying to expunge from your mind. If bandwidth is the limiting factor, then throughput is reduced. You could have all the threads and registers in the world, and it wouldn't help. Take a look at the example I gave Quasar. The reduced throughput automatically increases your latency hiding.

#threads = latency * throughput!
 
OK, well my remaining problem is that filtering is seemingly about 10% of a TMU according to the sizings you're proposing. I'm struggling to find that credible. Seems we aren't going to get anywhere without more specifics.
Care to explain how you extracted that from anything I've said? The logic that does nothing but blend INT8 texels together is a tiny part of a TMU in a GPU. Which sections do you lump together under the label "filtering"? How many total sections do you have? Two, i.e. TA and TF?

A TMU must take texture coordinates and a texture identifier and:
-determine LOD and anisotropy
-figure out how the coordinates translate into texel positions (env. maps, projective textures, cylindrical coordinate seams, etc. aren't trivial)
-figure out where the texture is in memory
-check if it's in the cache
-if not:
---issue a request to the MC
---recognize when the request is fulfilled
---transfer the data to the cache (choosing what to overwrite)
-prioritize which samples are most important
-fetch data from the cache
-decompress and convert from the plethora of formats
-filter the samples (at various precisions)
-get them into the pixel shader again and put them with the right thread
-probably do other stuff I forgot about
-pipeline all of the above

I don't know how to quantify all this stuff myself, so I don't see how you could take my simple remark about multipliers and deduce that it implies filtering is 10% of a TMU.

G80's extra filtering per TMU compared to ATI is a matter of adding a little fixed-function integer math in a situation where nearly all the parts necessary to feed it are already there due to other design decisions (which you apparently agree with). There is nothing implausible about claiming that this incremental cost is cheap, nor does it imply anything about the rest of the TMU.
 
You guys ought to just meet up in real life (TM) and do some serious brainstorming at the local coffee shop or whatever (taco bell?).
In case you hadn't noticed, many of us are blessed by living in countries without "tacky bells". :devilish: B3D is the coffee shop, though you do have to BYOC.
 
I refuse to believe Taco Bell is not everywhere! There must be a local equivalent at least! :oops: Surely if you have rodents, cats, dogs, etc, there are Taco Bells around.

(btw, we prefer to call it "Toxic Hell" around here)
 
I refuse to believe Taco Bell is not everywhere! There must be a local equivalent at least! :oops: Surely if you have rodents, cats, dogs, etc, there are Taco Bells around.

(btw, we prefer to call it "Toxic Hell" around here)

Back in Australia we have Taco Bill, but they are proper mexican restaurants not this fast food garbage Taco Bell you have here.
 
The register file size isn't just derived from texturing latency hiding: it's any restriction that reduces the number of threads available to execute an instruction.

Do you understand the basic equation I set out previously for hiding latency?
Yeah. Shebanow covers it at length and then demurs when it comes time to give the audience the actual parameters they're working with. It's mildly comical, to be honest (plus the fact he's a graphics architecture guy not a CUDA guy). He spends the presentation saying how to work out if your shader is balanced (in terms of various ratios: critical limiters that affect balance) and boasting about the shader analysis tools they have in-house, then pulls the rug out from under the audience.

The latter rises at a far lower rate than that of the former. Texture fetches are held in registers far more than math results. When a shader has one float4 output (the vast majority of them), the only reason to have lots of registers is for reusing an intermediate value.
The 4-light Far Cry shader uses 12 vec4 registers on R5xx, but these newer GPUs allocate scalar registers and I don't know what allocation G80 and R600 have for this shader. I don't have access to the shader analysis tools for these GPUs...

Anyway, it's 95 ALU instructions in D3D assembly, 52 ALU instruction slots on ATI's SM3 hardware, with 7 texture fetches (7.57:1), but 19 cycles on R580, an overall 2.71:1 ALU:TEX ratio.

Admittedly, for all the extra ALU muscle, R580 is only about 11% faster than R520 (no-AA/no-AF) or 21% faster (4xAA/16xAF) per clock - Far Cry Research without HDR at 1600x1200:

http://www.xbitlabs.com/articles/video/display/radeon-x1900xtx_18.html#sect0

If we take the ALU:TEX ratio of this shader at face value then according to you this shader runs at peak performance on R520, and R580 shouldn't be any faster.

Overall, though, I agree register allocation is going to grow pretty slowly. At the same time, it's very hard to deny that R580 is faster because its increased ALU:TEX ratio with concomitant register file scaling enables it to hide the bottlenecks in this and like shaders.

When it comes to register allocation scaling in the future, I don't really understand why D3D10 specifies 4096 vec4 fp32s.

Jawed
 
The register file size isn't just derived from texturing latency hiding: it's any restriction that reduces the number of threads available to execute an instruction.
And what other restriction is there besides that 192 thread figure to saturate the ALUs?

Yeah. Shebanow covers it at length and then demurs when it comes time to give the audience the actual parameters they're working with.
Okay, so the key is to realize that when you have two multiprocessors sharing a texture unit and statistically similar workloads, lambda (Little's Law, slide 33) for the latter drops to 2 fetches per clock per multiprocessor (ignoring G to look at the worst case).

If we take the ALU:TEX ratio of this shader at face value then according to you this shader runs at peak performance on R520, and R580 shouldn't be any faster.
I'm not sure how "according to me" applies here. Moreover, your calculations/conclusions have so many assumptions that I don't know where to start...

When it comes to register allocation scaling in the future, I don't really understand why D3D10 specifies 4096 vec4 fp32s.
Yeah, that's a little nuts. Probably just to cover all possible programs for future GPUs. Large matrix multiplication could really use it, as you don't want to repeat any texture loads if possible.
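To make the matrix-multiplication point concrete, here's a hedged back-of-envelope (my own, not from the thread): with a b x b register tile per output block, each input element is fetched once per tile instead of once per output element, so fetch traffic drops by the tile factor, which is exactly what a big register allocation buys you.

```python
# Back-of-envelope (illustrative, not from the thread): input fetch traffic
# for an n x n matmul when outputs are computed in b x b register tiles.
def fetches(n, b):
    """Input elements fetched: each tile reads b rows of A and b cols of B."""
    tiles = (n // b) ** 2
    per_tile = 2 * b * n
    return tiles * per_tile

n = 512
naive = fetches(n, 1)        # one output at a time: 2 * n^3 fetches
blocked = fetches(n, 4)      # 4x4 register tile per output block
assert naive == 2 * n ** 3
assert naive // blocked == 4  # fetch traffic falls by the tile factor
```

Bigger tiles need more registers per thread (b*b accumulators plus operands), which is presumably why the spec leaves so much headroom.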
 
And what other restriction is there besides that 192 thread figure to saturate the ALUs?
Well the obvious ones are constant buffer fetches (which have variable latency through their 2 tier cache hierarchy from video memory) and dynamic branching.

You've also got a question of the ratio of vertices to pixels and the register allocation per vertex. There could be vertex fetch latency to hide, too. We don't know if vertex fetch and texture fetch both share a resource within G80 other than MCs.

Okay, so the key is to realize that when you have two multiprocessors sharing a texture unit and statistically similar workloads, lambda (Little's Law, slide 33) for the latter drops to 2 fetches per clock per multiprocessor (ignoring G to look at the worst case).
I presume you're referring to the worked example on slide 34. It's actually 1 fetch per clock per multiprocessor: 2 multiprocessors with 8 ALUs each = 16 ALU ops per clock, with 2 TEXs per clock since the TEX clock is "half" the ALU clock (actually 1/2.35 in the 8800 GTX). It's an 8:1 ALU:TEX ratio.

He's just sizing up the register file per multiprocessor to hide the latency associated with 2 groups. With 1 group you'd be talking about 50 warps = 800 fragments (he deals in 16-fragment warps). Supposedly in graphics mode, G80 supports 512 fragments...
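Running the #threads = latency * throughput equation backwards on those numbers (the 50 warps, 16-fragment warp size, and 1 fetch/clock figures are from the posts above; the implied latency is just the resulting arithmetic, not a measured value):

```python
# Inverting Little's Law: with the register file sized for 50 warps of 16
# fragments, and a texture issue rate of 1 fetch per clock per multiprocessor,
# the latency that sizing can cover falls straight out of the equation.
WARP_SIZE = 16               # Shebanow deals in 16-fragment warps
warps, fetch_rate = 50, 1.0  # fetches per clock per multiprocessor
fragments = warps * WARP_SIZE
implied_latency = fragments / fetch_rate   # clocks of texture latency hidden

assert fragments == 800
assert implied_latency == 800.0
```

At 512 fragments in graphics mode the same arithmetic covers only 512 clocks of fetch latency per multiprocessor, assuming every fragment has a fetch outstanding.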

G80** would certainly have an interesting time if pushed to these limits.

Jawed
 
Heh, maybe the mods should prune this discussion into another thread. But no apologies for our stampede of incomprehensible uber-nerd speak. :devilish:


Pick a post to move from and a title to call it, and it shall be done. . .
 
lol Geo..too much info in there to just start pruning willy-nilly, eh?

I tell ya what tho... the conversation, albeit a bit technical, deals primarily with a complete balance of both G80 and R600. Both Jawed and Mintmaster have picked out a lot of behaviors on each GPU that can affect (quite largely, I might add) how a programmer may end up writing his code.

Different types of rendering have different needs, resource-wise. If we take the "major" graphics engines in use right now (Source, id Tech, Unreal 3.0, Chrome, etc.), you find each has different needs from a GPU.

But the one thing I have not seen Jawed or Mintmaster mention is that R600 is basically 64 pipes, each pipe with 4 main ALUs and an additional 5th ALU with added functionality (integer multiplication and division, bit shifts, reciprocal, sqrt, rsqrt, log, exp, pow, sin, cos, and type conversion; source: HD2000 programming guide).

As the programming guide states, "Moderate use of these is OK, but if a shader is dominated by these operations, the other scalar units will go idle, resulting in much lower throughput."


Oops. :LOL: You see that math up there? It gets used a lot.

To take again from the programming guide, on R600,

float x = a + b + c;

Will run much slower than

float t = a + b;
float x = t + c;

Now, it's natural for a programmer to choose the first option, as it means less typing.

But the question that comes to my mind is,

how does G80 deal with the same code?


With the exclusion of cap bits in DX10, they can no longer substitute their own code via the driver when the code given by the API is not ideal. I think we now call this an "optimization"...

Anyway, given this, seemingly R600 likes simplified math. Can someone tell me about G80?
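The lane-idling point can be sketched as a toy packer (a greedy in-order bundler, purely illustrative; this is not ATI's real scheduler, and the dependency rules are simplified):

```python
# Toy VLIW-5 packer: greedily bundle up to 5 independent scalar ops per issue
# slot. Convention (mine): uppercase names are temporaries produced by earlier
# ops; lowercase names are shader inputs, always ready. A serial chain fills
# one lane per bundle and leaves the other four idle, which is the guide's
# "other scalar units will go idle" point.
def vliw_cycles(ops, width=5):
    """ops: list of (dst, src1, src2) tuples. Returns bundles issued."""
    done, cycles, pending = set(), 0, list(ops)
    while pending:
        bundle = []
        for op in pending:
            if len(bundle) == width:
                break
            if all(s in done or not s.isupper() for s in op[1:]):
                bundle.append(op)
        for op in bundle:
            pending.remove(op)
            done.add(op[0])
        cycles += 1
    return cycles

serial = [("T", "a", "b"), ("X", "T", "c")]              # t = a + b; x = t + c
parallel = [(chr(65 + i), "a", "b") for i in range(10)]  # 10 independent adds

assert vliw_cycles(serial) == 2             # chain: 4 of 5 lanes idle each bundle
assert vliw_cycles(parallel) == 2           # independent ops pack into 2 bundles
assert vliw_cycles(parallel, width=1) == 10  # scalar issue: one op per clock
```

On a scalar-issue machine the dependent chain and the independent stream both cost one op per clock, so (at least in this toy model) only the wide machine cares how the math is arranged.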
 