G80 vs R600 Part X: The Blunt & The Rich Feature

The proof is in the pudding. Take a heterogeneous collection of shader workloads that are not ROP bound, and show me an RV770 beating a GT200 by close to paper spec margin. On paper, it has a big theoretical advantage, but in the real world, it doesn't pan out. So either utilization is low, or they made a poor decision in spending too many transistors on ALUs and not enough on TMUs to balance out the demands of the workloads.
When you aren't TMU limited, it does almost pan out:
http://www.ixbt.com/video3/images/rv670/d3drm-ps-2.png
(320 SPs at 775MHz vs. 112 SPs at 1500 MHz. Still waiting on GT200 and RV770 reviews...)
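For reference, a back-of-the-envelope paper-spec comparison of those two parts (counting a MAD as two flops, and showing G92's co-issued MUL separately since its availability is exactly what's being argued about elsewhere in this thread):

Code:
/* Rough paper GFLOPS for the two parts in that chart.
   Assumes MAD = 2 flops; the extra MUL on G92 is shown separately. */
#include <stdio.h>

int main(void)
{
    double rv670_mad   = 320 * 0.775e9 * 2;  /* 320 SPs at 775 MHz      */
    double g92_mad     = 112 * 1.5e9   * 2;  /* 112 SPs at 1.5 GHz      */
    double g92_mad_mul = 112 * 1.5e9   * 3;  /* ...if the MUL co-issues */

    printf("RV670 (MAD):     %.0f GFLOPS\n", rv670_mad / 1e9);    /* ~496 */
    printf("G92 (MAD only):  %.0f GFLOPS\n", g92_mad / 1e9);      /* ~336 */
    printf("G92 (MAD + MUL): %.0f GFLOPS\n", g92_mad_mul / 1e9);  /* ~504 */
    return 0;
}

So the MAD-only paper margin is roughly 1.5x in RV670's favour, and close to parity if the MUL is counted.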

In terms of TMU count and poor design, you can easily look at it the other way around. NVidia devotes as much space to TMUs as it does to SPs, as you can see from the die shots. There will be many times when NVidia's TMUs will be sitting idle waiting for the SPs to reach a texture instruction.

What do you define as "paper spec margin"? I can say the GTX 260 should beat the HD4850 by 47% according to texture rate.
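For what it's worth, that 47% falls straight out of the commonly quoted specs (assuming 64 TMUs at 576 MHz for the GTX 260 and 40 TMUs at 625 MHz for the HD4850):

Code:
/* Bilinear texel rate from the usual published specs (assumed here). */
#include <stdio.h>

int main(void)
{
    double gtx260 = 64 * 576e6;   /* ~36.9 Gtexels/s */
    double hd4850 = 40 * 625e6;   /* ~25.0 Gtexels/s */
    printf("GTX 260 texture-rate advantage: %.0f%%\n",
           (gtx260 / hd4850 - 1.0) * 100.0);   /* prints ~47% */
    return 0;
}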

Jawed's pronouncements are a function of his personal aesthetics. I personally think NVidia's design is more elegant: from a compiler point of view it is much easier to optimize for and simpler to understand. Those are my aesthetics.
Like I said, I agree with you. I think there are plenty of posts in this thread from me arguing with Jawed.

I don't think NV got to where they are today by being too stupid (NV3x aside), so there must be a reason behind the decisions made for GT200 that we're not seeing yet. NVidia loves high-margin chips, and they clearly know how the yield calculus works out.
Basically nobody expected ATI to improve like this. My best guess was 32 TMUs, 480 SPs in 300 mm2. I gave up on ATI's competitive ability after R520, and though a little respect was restored with R580 (and lost with R600 and restored with RV670), I figured NVidia would always have the upper hand.

I'm sure NV figured R600 was way less competitive than it could have been given ATI's TMU and ALU density in R580, but once RV670 came out they probably thought all major improvements were exhausted. After all, NV couldn't get any significant density improvements from G80 to G92 to GT200 aside from process scaling.

There's no real mystery behind GT200. I'm sure they have plenty of fine-grained redundancy as well, and they could probably sell it to AIB partners for well over twice the price of G92. I think GT200 would have been high margin if RV770 wasn't around, but it came down upon them like a ton of bricks.

Now they're really screwed. Instead of the 9800GTX being worth $300 (maybe $80 per chip), it's worth $200 (maybe $50 per chip). Instead of the GTX 260 being worth $450 (maybe $150 per chip), it'll barely sell at $400. Etc, etc.

This repricing hits the huge chips the most for obvious reasons.
 
In terms of TMU count and poor design, you can easily look at it the other way around. NVidia devotes as much space to TMUs as it does to SPs, as you can see from the die shots. There will be many times when NVidia's TMUs will be sitting idle waiting for the SPs to reach a texture instruction.
Exactly my thoughts. Now if we could start to (at least) move texture filtering over to the shader cores, so that we can trade TMU area for ALU area... sorry, I couldn't resist :)
 
Exactly my thoughts. Now if we could start to (at least) move texture filtering over to the shader cores, so that we can trade TMU area for ALU area... sorry, I couldn't resist :)
Yeah, it's not happening. Transferring the data alone will require significantly more complex shader cores if you want to retain performance. In fact, I bet this would cost more than the filtering math.

Filtering arithmetic logic operates on fixed data paths, so it's really cheap. Even if you could remove it, you still need to keep the addressing, caching, decompression, LOD, aniso calcs, gamma correction, etc. All you're saving is a few 8-bit MULs and ADDs (plus whatever logic is needed for reduced speed FP16 and FP32) per texture unit.

My guess is <10M transistors for the lerping logic in the filter units of G92. A drop in the bucket. Moreover, having the filtering run in parallel is great for shader utilization when multiple cycles are needed.
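To put "a few 8-bit MULs and ADDs" in perspective, a single-channel bilinear filter boils down to three lerps (a plain C sketch; the real hardware does this on fixed-point texel data, in parallel, per channel):

Code:
/* Single-channel bilinear filter: three lerps, i.e. a handful of
   MULs and ADDs per channel.  This is the lerping logic under
   discussion -- tiny next to addressing, caching, decompression,
   LOD and aniso selection. */
static float lerp(float a, float b, float t)
{
    return a + (b - a) * t;
}

float bilinear(float t00, float t10, float t01, float t11,
               float fx, float fy)   /* fx, fy: sub-texel fractions */
{
    float top    = lerp(t00, t10, fx);
    float bottom = lerp(t01, t11, fx);
    return lerp(top, bottom, fy);
}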

EDIT: There is one glimmer of hope for ATI cards, though, because they spend a lot less area on TMUs than SPs. Each quad of TUs services 80 SPs. How do you feed them data for vertex shading? The point samplers are no longer there in the diagram...
 
Filtering arithmetic logic operates on fixed data paths, so it's really cheap. Even if you could remove it, you still need to keep the addressing, caching, decompression, LOD, aniso calcs, gamma correction, etc. All you're saving is a few 8-bit MULs and ADDs (plus whatever logic is needed for reduced speed FP16 and FP32) per texture unit.

IIRC RV670 and RV770 have full-speed FP16 filtering.
 
Is piling on SPs and then requiring an uber-smart compiler elegant? (Hello, Itanium.) Which GPU is more elegant: the one that does more with fewer SPs, or the one with smaller and simpler SPs but with many of them going to waste?
Apparently G92b and RV770 are about the same die size. RV770 includes GDDR5 functionality and will easily be considerably faster - currently RV770 in HD4850 form is being hobbled by available bandwidth.

So, RV770's ALUs are looking "free" to me (since code with heavier ALU doesn't depend on bandwidth) and NVidia appears to have spent way too many transistors on TUs for the available bandwidth.

I have never understood why you think the G8x design is brute force.
Because I think the ALU:TEX is too low (die could have been considerably smaller for same performance) and the Z rate flounders on available bandwidth. The quantity of TMUs remains my biggest gripe.

I've always felt the exact opposite. ATI's chips have far higher theoretical ALU power on paper, but have trouble keeping up with NVidia's chips when fed complex, general-purpose ALU workloads.
Example?

An 800SP chip should stomp all over a 240SP (or "480SP equivalent") chip if it were "elegant" IMHO, or at least efficient.
It mystifies me why you think ALU capability is the only measure of a GPU's performance. It also mystifies me why you're ignoring actual FLOPs. Perlin noise (which is very slightly ALU bound on ATI in the 3DMark06 version, not sure about Vantage) is a good example. We'll have to wait until HD4870 arrives to be sure (it's not clear what the texturing, and hence bandwidth, load is like)...

Vantage's GPU Cloth test runs very badly on ATI - is that down to poor ALU utilisation? Why is GTX280 only 28% faster than 9800GTX?

The ATI design IMHO depends too much on moving decisions from runtime to the driver, and there are simply limits to optimizations that can be performed on static code.
Which decisions?

One of the problems with this discussion is that these architectures are highly sensitive to the register count allocated. Once the shader's running on the GPU, there's no chance to re-evaluate the optimal register allocation for it. It's a compilation problem. I don't know if it's the dominant factor, but when people talk about longer shaders (more instructions) running faster, it's hard to conclude anything other than that the runtime can't fix compilation decisions.
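As a sketch of that trade-off (illustrative numbers only, not real G8x/RV770 figures): the compiler fixes registers-per-thread at compile time, and that directly caps how many batches the GPU can keep in flight to hide latency:

Code:
/* Illustrative only: assume a multiprocessor with an 8192-entry
   register file and 32-thread batches.  Registers per thread are
   fixed by the compiler; the GPU can't revisit that at runtime. */
#include <stdio.h>

int main(void)
{
    const int reg_file_entries = 8192;   /* assumption */
    const int batch_size       = 32;     /* assumption */

    for (int regs_per_thread = 8; regs_per_thread <= 64; regs_per_thread *= 2) {
        int threads = reg_file_entries / regs_per_thread;
        printf("%2d regs/thread -> %4d threads = %2d batches in flight\n",
               regs_per_thread, threads, threads / batch_size);
    }
    return 0;
}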

Dynamic instruction sequence selection during scheduling

Why is it, seemingly, that GT200 is the first chip in this series with a MUL that can actually be used? Why couldn't the compilers consistently deliver the MUL in shaders running on G8x/9x? Or, if you prefer, why wasn't the older hardware automatically/dynamically (or whatever you want to call it) maximising the utilisation of the MUL?

---

Anyway, the real crunch comes with R700/HD4870X2. I still think R700 is a big part of R600's ethos, in addition to stuff that's beyond D3D10. I still cringe at the unbalanced ALU/TU/Z configuration of R600, and I think people are too easily delighted with RV770, particularly its MSAA/Z performance, because that's what R600 should have delivered.

Jawed
 
It's not elegant, but when their quad of 5x1D processors takes half the space of NVidia's 8x1D processors...
G80 is 8 MADs + 8 interpolators. The 8 interpolators are supposed to work as MUL too. Each set of 4 interpolators also works as 1 transcendental.

Jawed
 
Take a heterogeneous collection of shader workloads that are not ROP bound, and show me an RV770 beating a GT200 by close to paper spec margin. On paper, it has a big theoretical advantage, but in the real world, it doesn't pan out.
Examples please.

Jawed
 
Maybe a Folding@Home war with hand-optimized CUDA vs CAL code on GT200 vs RV770 would settle things.
It'll certainly be very interesting. I presume that the "serial scalar" architecture of GT200 won't gain much advantage over RV770 simply because I imagine a lot of the code is vec3/vec4.
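A crude way to see that argument (a sketch with made-up utilisation accounting, ignoring co-issue and the compiler's ability to pack ops across instructions):

Code:
/* Lane utilisation for a stream of pure vecN ops.  A scalar machine
   issues N scalar ops back-to-back and stays fully busy; a 5-wide
   VLIW machine fills N of 5 slots per bundle unless the compiler
   packs something else alongside.  Illustrative accounting only. */
#include <stdio.h>

int main(void)
{
    for (int n = 1; n <= 4; n++) {
        printf("vec%d: scalar 100%%, VLIW5 %3.0f%% (before packing/co-issue)\n",
               n, 100.0 * n / 5.0);
    }
    return 0;
}

In other words, vec3/vec4-heavy code already keeps the wide machine reasonably fed, which shrinks the scalar architecture's utilisation advantage.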

Separately I have to say I'm a bit pessimistic about Brook+ - I get the impression that the right way to write performance-optimal code on ATI is to use CAL IL (which is basically HLSL assembly).

And if the local data shares in RV770's SIMDs are the equivalent of PDC in G80, then I imagine an RV7xx-specific branch in the code will be required to optimise usage. This, in my view, is bolstered by my impression that F@H performance on ATI is currently hampered by having to recompute stuff - because the cost of writing to video RAM outweighs the cost of re-computation.

Jawed
 
Yeah, it's not happening. Transferring the data alone will require significantly more complex shader cores if you want to retain performance. In fact, I bet this would cost more than the filtering math.
Hmm, bear in mind that RV770's RBEs should be sending 64 samples per clock to the register file. Sure there's even more texel data flowing from texel decompression into TF, but per ALU lane it doesn't seem like a bust to me (remember that register file bandwidth scales linearly with ALU lanes). Oh and maybe have a decompression instruction in the ALUs.

Anyway, still a few years off...

Filtering arithmetic logic operates on fixed data paths, so it's really cheap. Even if you could remove it, you still need to keep the addressing, caching, decompression, LOD, aniso calcs, gamma correction, etc. All you're saving is a few 8-bit MULs and ADDs (plus whatever logic is needed for reduced speed FP16 and FP32) per texture unit.
Look at the rate at which ALUs are scaling in comparison with bandwidth.

Also, not all those calculations are performed (e.g. gamma correction, decompression), so the TMUs as a whole are usually not 100% utilised. So the equivalent ALU capability required to achieve the same performance is lowered in comparison with the naive translation of the math.

Jawed
 
Would an entire SIMD's worth of ALUs be needed for the given TMU capacity?

If not, it's a bit of a waste.
In the interests of geography, it would seem more tempting for only the few processor clusters nearest the texture caches to be merged with the texture cluster.
 
G80 is 8 MADs + 8 interpolators. The 8 interpolators are supposed to work as MUL too. Each set of 4 interpolators also works as 1 transcendental.
Yeah, I know, I was just trying to keep my sentence short.

Hmm, bear in mind that RV770's RBEs should be sending 64 samples per clock to the register file. Sure there's even more texel data flowing from texel decompression into TF, but per ALU lane it doesn't seem like a bust to me (remember that register file bandwidth scales linearly with ALU lanes). Oh and maybe have a decompression instruction in the ALUs.
Hmm, good point. Nonetheless, it still complicates TMU design in the output to the shader, even if I was wrong about needing more inputs to the shaders.

As an aside, I still don't understand why the RBEs are used to feed data back to the ALUs. It's not like you can Load() from the current render target, and it just makes sense to use the TU because it really is just like a texture fetch. Moreover, aside from this functionality, data is always flowing out of the ALUs and into the RBEs.

Also, not all those calculations are performed (e.g. gamma correction, decompression), so the TMUs as a whole are usually not 100% utilised. So the equivalent ALU capability required to achieve the same performance is lowered in comparison with the naive translation of the math.
Maybe gamma correction is underused, but decompression absolutely must still be free to maintain performance. Same with LOD.

In any case, the point is that filtering math is a small part of a TMU. I only mentioned those things to dispel the notion that a significant chunk of the ~150mm2 devoted to "texturing" in the clusters (as per the diagram) is due to filtering logic. That's probably why nAo made the suggestion.
 
Because I think the ALU:TEX is too low (die could have been considerably smaller for same performance) and the Z rate flounders on available bandwidth. The quantity of TMUs remains my biggest gripe.

Low ALU:TEX = brute force (G8X)
Low TEX:ALU = elegant (R6xx)

/boggle
 
Mint: then we'd have to put those LOD/texture-addressing and texture-fetching units to some use :)
 
Examples please.

Jawed

Would a test count in your book where the pure pixel shaders have been grabbed from a real game, run through a fillrate meter, and then normalized relative to a given card?
 
Would an entire SIMD's worth of ALUs be needed for the given TMU capacity?

If not, it's a bit of a waste.
In the interests of geography, it would seem more tempting for only the few processor clusters nearest the texture caches to be merged with the texture cluster.
Hmm, that's like arguing that only a few clusters nearest the RBEs should do MSAA resolve.

In terms of connectivity, texels would be going from cache into the register file, or into those funky, shiny new LDSs.

Jawed
 
As an aside, I still don't understand why the RBEs are used to feed data back to the ALUs. It's not like you can Load() from the current render target, and it just makes sense to use the TU because it really is just like a texture fetch. Moreover, aside from this functionality, data is always flowing out of the ALUs and into the RBEs.
You can't Load() from the current render target until the Colour+Z compression has been decoded by the RBEs. Only the RBEs have access to the tile tag tables in order to make sense of the data in memory.

So to convert RBE-formatted render targets into something that can be fetched by the TUs, you have to run a decompression pass. If you're going to run a decompression pass (within the RBEs) and you're going to do bog-standard MSAA resolve, why ask the RBEs to write the decompressed data to a new target for the TUs to consume when it's possible to transmit the data directly to the register file?
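For context, here's roughly what the "bog-standard MSAA resolve" amounts to once the decompressed samples have been delivered to the register file (a sketch assuming a plain box filter over N sub-samples; the actual data paths are obviously fixed-function and not this literal C):

Code:
/* Box-filter resolve of one pixel: average the N sub-samples.
   The output is small and regular, which is why handing the
   decompressed data straight to the register file beats writing a
   whole new render target for the TUs to re-fetch. */
typedef struct { float r, g, b, a; } rgba;

rgba resolve_pixel(const rgba *samples, int n)
{
    rgba out = { 0.0f, 0.0f, 0.0f, 0.0f };
    for (int i = 0; i < n; i++) {
        out.r += samples[i].r;
        out.g += samples[i].g;
        out.b += samples[i].b;
        out.a += samples[i].a;
    }
    out.r /= n; out.g /= n; out.b /= n; out.a /= n;
    return out;
}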

Maybe gamma correction is underused, but decompression absolutely must still be free to maintain performance. Same with LOD.
Apparently RV770 (well 2 of them I guess) is doing real time wavelet decompression on the ALUs for the new Ruby demo.

Jawed
 