> The proof is in the pudding. Take a heterogeneous collection of shader workloads that are not ROP bound, and show me an RV770 beating a GT200 by close to the paper-spec margin. On paper, it has a big theoretical advantage, but in the real world, it doesn't pan out. So either utilization is low, or they made a poor decision in spending too many transistors on ALUs and not enough on TMUs to balance out the demands of the workloads.

When you aren't TMU limited, it does almost pan out.
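The "paper spec margin" being argued over can be pinned down with a back-of-envelope calculation. A sketch using the commonly quoted launch specs (HD 4870's RV770: 800 SPs at 750 MHz counting the MAD as 2 flops; GTX 280's GT200: 240 SPs at 1296 MHz counting 3 flops per SP, which assumes the MUL actually co-issues); the exact clocks and the dual issue are assumptions, not measurements:

```python
# Back-of-envelope theoretical shader throughput; paper specs, not measured results.
def gflops(alus, clock_mhz, flops_per_clock):
    return alus * clock_mhz * flops_per_clock / 1000.0

rv770 = gflops(800, 750, 2)    # HD 4870: MAD = 2 flops per lane
gt200 = gflops(240, 1296, 3)   # GTX 280: MAD + co-issued MUL = 3 flops per SP

print(f"RV770 ~{rv770:.0f} GFLOPs, GT200 ~{gt200:.0f} GFLOPs, "
      f"paper ratio {rv770 / gt200:.2f}x")
```

On paper that is roughly a 1.3x advantage for RV770; the argument in the thread is that measured shader benchmarks rarely show that margin.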
> Jawed's pronouncements are a function of his personal aesthetics. I personally think NVidia's design is more elegant: from a compiler point of view it is much easier to optimize for, and simpler to understand. Those are my aesthetics.

Like I said, I agree with you. I think there are plenty of posts in this thread from me arguing with Jawed.
> I don't think NV got to where they are today by being too stupid (NV3x aside), so there must be a reason behind the decisions made for GT200 that we're not seeing yet. NVidia loves high-margin chips, and they clearly know how the yield calculus works out.

Basically nobody expected ATI to improve like this. My best guess was 32 TMUs, 480 SPs in 300 mm². I gave up on ATI's competitive ability after R520, and though a little respect was restored with R580 (and lost with R600, and restored with RV670), I figured NVidia would always have the upper hand.
> In terms of TMU count and poor design, you can easily look at it the other way around. NVidia devotes as much space to TMUs as it does to SPs, as you can see from the die shots. There will be many times when NVidia's TMUs will be sitting idle, waiting for the SPs to reach a texture instruction.

Exactly my thoughts. Now if we could start to (at least) move texture filtering over to the shader cores so that we can trade TMU area for ALU area... sorry, I couldn't resist.
> Exactly my thoughts. Now if we could start to (at least) move texture filtering over to the shader cores so that we can trade TMU area for ALU area... sorry, I couldn't resist.

Yeah, it's not happening. Transferring the data alone would require significantly more complex shader cores if you want to retain performance. In fact, I bet this would cost more than the filtering math.
Filtering arithmetic logic operates on fixed data paths, so it's really cheap. Even if you could remove it, you still need to keep the addressing, caching, decompression, LOD, aniso calcs, gamma correction, etc. All you're saving is a few 8-bit MULs and ADDs (plus whatever logic is needed for reduced speed FP16 and FP32) per texture unit.
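To put the "few 8-bit MULs and ADDs" in concrete terms: bilinear filtering of one 8-bit channel is three fixed-point lerps per output sample. A minimal sketch in integer math (the 0..256 weight format is an assumption for illustration, not a claim about how any specific TMU encodes its weights):

```python
def lerp8(a, b, w):
    # w is a 0..256 fixed-point weight: one small multiply-accumulate per lerp
    return (a * (256 - w) + b * w) >> 8

def bilinear8(t00, t10, t01, t11, wx, wy):
    # 3 lerps per channel; an RGBA8 texel needs 12 of these per filtered sample
    top = lerp8(t00, t10, wx)
    bot = lerp8(t01, t11, wx)
    return lerp8(top, bot, wy)

print(bilinear8(0, 255, 0, 255, 128, 128))  # midpoint of a 0/255 checker
```

The math really is this small; the expensive parts of the TMU are everything around it (addressing, caching, decompression, LOD and aniso selection).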
> IIRC RV670 and RV770 have full-speed FP16 filtering.

RV670 has.
> Is piling on SPs and then requiring an uber-smart compiler elegant? (Hello, Itanium.) Which GPU is more elegant: the one that does more with fewer SPs, or the one with smaller and simpler SPs but with many of them going to waste?

Apparently G92b and RV770 are about the same die size. RV770 includes GDDR5 functionality and will easily be considerably faster; currently RV770 in HD4850 form is being hobbled by available bandwidth.
> I have never understood why you think the G8x design is brute force.

Because I think the ALU:TEX ratio is too low (the die could have been considerably smaller for the same performance) and the Z rate flounders on available bandwidth. The quantity of TMUs remains my biggest gripe.
> I've always felt the exact opposite. ATI's chips have far higher theoretical ALU power on paper, but have trouble keeping up with NVidia's chips when fed complex, general-purpose ALU workloads.

Example?
> An 800 SP chip should stomp all over a 240 SP (or "480 SP equivalent") chip if it were "elegant" IMHO, or at least efficient.

It mystifies me why you think ALU capability is the only measure of a GPU's performance. It also mystifies me why you're ignoring actual FLOPs. Perlin noise (which is very slightly ALU bound on ATI in the 3DMk06 version; not sure about Vantage) is a good example. We'll have to wait until HD4870 arrives to be sure (it's not clear what the texturing, and hence bandwidth, load is like)...
> The ATI design IMHO depends too much on moving decisions from runtime to the driver, and there are simply limits to the optimizations that can be performed on static code.

Which decisions?
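The static-code limit can be illustrated with a toy compile-time packer for a 5-slot VLIW bundle (hypothetical code, loosely modelled on the 5-wide ALUs discussed in this thread): independent ops pack densely, but a dependent scalar chain forces one op per bundle no matter how clever the compiler is.

```python
def pack_vliw(num_ops, deps, width=5):
    """Greedy list scheduler: place each op in the earliest bundle after
    all of its dependencies; returns the number of bundles issued."""
    bundles, placed = [], {}
    for op in range(num_ops):
        slot = max((placed[d] + 1 for d in deps.get(op, [])), default=0)
        while slot < len(bundles) and len(bundles[slot]) >= width:
            slot += 1
        if slot == len(bundles):
            bundles.append([])
        bundles[slot].append(op)
        placed[op] = slot
    return len(bundles)

independent = pack_vliw(10, {})                              # no deps: 2 full bundles
chained = pack_vliw(10, {i: [i - 1] for i in range(1, 10)})  # serial chain: 10 bundles
print(independent, chained)
```

With a fully serial chain only 1 of the 5 slots does work each issue, which is exactly the utilization question being argued about.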
> It's not elegant, but when their quad of 5x1D processors takes half the space of NVidia's 8x1D processors...

G80 is 8 MADs + 8 interpolators. The 8 interpolators are supposed to work as MULs too. Each set of 4 interpolators also works as 1 transcendental.
> Take a heterogeneous collection of shader workloads that are not ROP bound, and show me an RV770 beating a GT200 by close to the paper-spec margin. On paper, it has a big theoretical advantage, but in the real world, it doesn't pan out.

Examples please.
> Maybe a Folding@Home war with hand-optimized CUDA vs CAL code on GT200 vs RV770 would settle things.

It'll certainly be very interesting. I presume that the "serial scalar" architecture of GT200 won't gain much advantage over RV770, simply because I imagine a lot of the code is vec3/vec4.
> Yeah, it's not happening. Transferring the data alone would require significantly more complex shader cores if you want to retain performance. In fact, I bet this would cost more than the filtering math.

Hmm, bear in mind that RV770's RBEs should be sending 64 samples per clock to the register file. Sure, there's even more texel data flowing from texel decompression into TF, but per ALU lane it doesn't seem like a bust to me (remember that register-file bandwidth scales linearly with ALU lanes). Oh, and maybe have a decompression instruction in the ALUs.
> Filtering arithmetic logic operates on fixed data paths, so it's really cheap. Even if you could remove it, you still need to keep the addressing, caching, decompression, LOD, aniso calcs, gamma correction, etc. All you're saving is a few 8-bit MULs and ADDs (plus whatever logic is needed for reduced speed FP16 and FP32) per texture unit.

Look at the rate at which ALUs are scaling in comparison with bandwidth.
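Rough numbers for that scaling, using commonly quoted figures for HD 3870 (RV670: ~497 GFLOPs over ~72 GB/s) and HD 4870 (RV770: ~1200 GFLOPs over ~115 GB/s); treat the exact specs as assumptions for illustration:

```python
# Theoretical ALU flops available per byte of memory bandwidth, one generation apart.
def flops_per_byte(gflops, gb_per_s):
    return gflops / gb_per_s

rv670 = flops_per_byte(497, 72)    # HD 3870: 320 SPs * 775 MHz * 2 flops; GDDR4
rv770 = flops_per_byte(1200, 115)  # HD 4870: 800 SPs * 750 MHz * 2 flops; GDDR5

print(f"RV670 ~{rv670:.1f} flops/byte, RV770 ~{rv770:.1f} flops/byte")
```

Even with the jump to GDDR5, ALU throughput grew roughly 1.5x faster than bandwidth in a single generation.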
> G80 is 8 MADs + 8 interpolators. The 8 interpolators are supposed to work as MULs too. Each set of 4 interpolators also works as 1 transcendental.

Yeah, I know. I was just trying to keep my sentence short.
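Taking that description at face value, the per-cluster, per-clock budget works out as below (a sketch; whether the interpolator MUL is actually co-issuable in practice is the open question in this thread):

```python
# One G80 cluster per the description above: 8 SPs plus 8 interpolators.
sps = 8
interps = 8
mad_flops = sps * 2          # each MAD counts as a multiply and an add
mul_flops = interps * 1      # each interpolator doubles as a MUL
sfu_ops = interps // 4       # every 4 interpolators form 1 transcendental unit

print(mad_flops + mul_flops, sfu_ops)  # 24 flops and 2 transcendentals per clock
```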
> Hmm, bear in mind that RV770's RBEs should be sending 64 samples per clock to the register file. Sure, there's even more texel data flowing from texel decompression into TF, but per ALU lane it doesn't seem like a bust to me (remember that register-file bandwidth scales linearly with ALU lanes). Oh, and maybe have a decompression instruction in the ALUs.

Hmm, good point. Nonetheless, it still complicates TMU design in the output to the shader, even if I was wrong about needing more inputs to the shaders.
> Also, not all those calculations are always performed (e.g. gamma correction, decompression), so the TMUs as a whole are usually not 100% utilised. So the equivalent ALU capability required to achieve the same performance is lower than the naive translation of the math would suggest.

Maybe gamma correction is underused, but decompression absolutely must still be free to maintain performance. Same with LOD.
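For a sense of what "free" decompression covers: a DXT1/S3TC block packs 16 texels into 8 bytes, and the per-block work is a small palette derivation plus 2-bit index lookups, which fixed-function hardware does per 4x4 tile. A minimal sketch of the decode (illustrative, not a production decoder; the alpha-punchthrough handling is omitted beyond the black colour):

```python
import struct

def rgb565_to_rgb888(c):
    r, g, b = (c >> 11) & 0x1F, (c >> 5) & 0x3F, c & 0x1F
    # replicate high bits into low bits to expand each channel to 8 bits
    return ((r << 3) | (r >> 2), (g << 2) | (g >> 4), (b << 3) | (b >> 2))

def decode_dxt1_block(block8):
    """Return the 16 RGB texels of one 4x4 DXT1 block (8 bytes)."""
    c0, c1 = struct.unpack_from("<HH", block8, 0)
    p0, p1 = rgb565_to_rgb888(c0), rgb565_to_rgb888(c1)
    if c0 > c1:  # 4-colour mode: two interpolated palette entries
        p2 = tuple((2 * a + b) // 3 for a, b in zip(p0, p1))
        p3 = tuple((a + 2 * b) // 3 for a, b in zip(p0, p1))
    else:        # 3-colour mode: one midpoint entry plus black
        p2 = tuple((a + b) // 2 for a, b in zip(p0, p1))
        p3 = (0, 0, 0)
    palette = (p0, p1, p2, p3)
    bits = struct.unpack_from("<I", block8, 4)[0]
    return [palette[(bits >> (2 * i)) & 3] for i in range(16)]

# white-to-black block: c0 = 0xFFFF, c1 = 0x0000, first four indices 0,1,2,3
texels = decode_dxt1_block(bytes([0xFF, 0xFF, 0x00, 0x00, 0xE4, 0, 0, 0]))
print(texels[:4])
```

Per texel that is one table lookup after a handful of adds and shifts per block, which is why doing it on the ALUs instead would be such a poor trade.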
> Would an entire SIMD's worth of ALUs be needed for the given TMU capacity? If not, it's a bit of a waste. In the interests of geography, it would seem more tempting that only a few processor clusters nearest the texture caches would be merged with the texture cluster.

Hmm, that's like arguing that only a few clusters nearest the RBEs should do MSAA resolve.
> As an aside, I still don't understand why the RBEs are used to feed data back to the ALUs. It's not like you can Load() from the current render target, and it just makes sense to use the TU because it really is just like a texture fetch. Moreover, aside from this functionality, data is always flowing out of the ALUs and into the RBEs.

You can't Load() from the current render target until the colour+Z compression has been decoded by the RBEs. Only the RBEs have access to the tile tag tables needed to make sense of the data in memory.
> Maybe gamma correction is underused, but decompression absolutely must still be free to maintain performance. Same with LOD.

Apparently RV770 (well, 2 of them, I guess) is doing real-time wavelet decompression on the ALUs for the new Ruby demo.
> Oh, and maybe have a decompression instruction in the ALUs.

How is that supposed to improve anything?