G80 vs R600 Part X: The Blunt & The Rich Feature

The colour rate is what seems over the top, which then leads to the apparent superfluity of Z. Since NVidia has bound the ROPs and MCs tightly, that's the way the cookie crumbles (last time we discussed ROPs, this was the conclusion I came to). My interpretation (as before) is that NVidia did this to reduce the complexity/count of crossbars within G80.
Cookie crumbles? How?

G80 has the BW to reach near its peak colour rate. It makes a lot of sense to pair ROPs with memory controllers because that's how you evenly distribute the biggest memory client among the memory channels. Proper interleaving of the framebuffer will ensure that the load is similar for each ROP.
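
To make that interleaving idea concrete, here's a toy sketch (my own made-up tile-modulo scheme; the real G80 address swizzle obviously isn't public) that just counts how evenly screen tiles map onto six ROP/MC partitions:

#include <stdio.h>

/* Toy framebuffer interleaving: assign each screen tile to one of six
   ROP/MC partitions so colour traffic spreads roughly evenly.  The tile
   size, partition count and modulo scheme are illustrative only. */
#define NUM_PARTITIONS 6
#define TILE_SIZE      16

static int partition_for_pixel(int x, int y)
{
    int tile_x = x / TILE_SIZE;
    int tile_y = y / TILE_SIZE;
    return (tile_x + tile_y) % NUM_PARTITIONS;
}

int main(void)
{
    int counts[NUM_PARTITIONS] = {0};
    for (int y = 0; y < 1200; ++y)
        for (int x = 0; x < 1600; ++x)
            counts[partition_for_pixel(x, y)]++;
    for (int p = 0; p < NUM_PARTITIONS; ++p)
        printf("partition %d: %d pixels\n", p, counts[p]);
    return 0;
}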

Thinking of the pixel-instructions:colour-write ratio, it is headed skywards (prettier pixels). D3D10's increase to 8 MRTs seems to run counter to that. It'd be interesting to see what sort of colour rate a deferred renderer achieves during G-buffer creation. Clearly a D3D10-specific DR could chew through way more than a DX9-based one.
DR is one use of more ROPs, but devs are also realizing that many simple pixels can be very valuable too. Look at scenes like Crysis' foliage, for example, or the grass/weed in COD4 and DiRT.

Filling in a VSM is another place where you have very simple pixels, and higher fillrate = higher-resolution shadow maps. Cascade a few 2Kx2K shadow maps with 4xAA and you have awesome shadows.
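
Back-of-the-envelope on the fill cost of that setup (my assumptions: three cascades and two FP16 moments per VSM sample; only the 2Kx2K and 4xAA figures come from the paragraph above):

#include <stdio.h>

int main(void)
{
    /* Rough fill cost of a cascaded VSM update.  Cascade count and
       sample format are assumptions, not from the post. */
    const long long cascades = 3;
    const long long res      = 2048;      /* 2K x 2K per cascade */
    const long long msaa     = 4;         /* 4xAA */
    const long long bytes_px = 2 * 2;     /* depth + depth^2, FP16 each */

    long long samples = cascades * res * res * msaa;
    long long bytes   = samples * bytes_px;
    printf("samples filled per shadow update: %lld\n", samples);      /* ~50M   */
    printf("approx. data written: %lld MB\n", bytes / (1024 * 1024)); /* ~192 MB */
    return 0;
}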

The marginal advantage of more math per pixel isn't that great, but we still have uses for more pixels and more textures.
 
I think it's a fallacy that prettier pixels == pure ALU, i.e. that the bigger the ALU:TEX ratio, the prettier the pixel. That's the procedural rendering fallacy.

What you'll find is that prettier pixels actually require fetching MORE texels and more offline artwork, and using a handful of ALU ops to decide the next texel to fetch, and how to combine the fetched value.

We're going from albedo + shadowmap, to albedo + spec map + irradiance volume/spherical harmonics + shadow + normal + ... As a result, the ALU load is going up, but so is the TEX load.
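
As a rough scalar sketch of what such a pixel looks like (the map names and stub fetch functions are purely illustrative, and a real shader would of course be HLSL/GLSL): five texture fetches glued together by a dozen or so ALU ops, so both loads rise together.

#include <stdio.h>

typedef struct { float r, g, b; } vec3;

/* Stub "texture fetches" so the sketch is self-contained; a real shader
   would sample actual albedo/spec/normal/irradiance/shadow maps here. */
static vec3  fetch3(float u, float v)                { vec3 t = { u, v, 0.5f }; return t; }
static float fetch_shadow(float u, float v, float z) { return (u + v) * 0.5f > z ? 1.0f : 0.3f; }

static vec3 shade_pixel(float u, float v, float z, vec3 light_dir)
{
    vec3  albedo = fetch3(u, v);               /* TEX 1 */
    vec3  spec   = fetch3(u + 0.1f, v);        /* TEX 2 */
    vec3  normal = fetch3(u, v + 0.1f);        /* TEX 3 */
    vec3  irrad  = fetch3(u * 0.5f, v * 0.5f); /* TEX 4: irradiance volume slice */
    float shadow = fetch_shadow(u, v, z);      /* TEX 5 */

    /* A handful of ALU ops to combine what was fetched. */
    float ndotl = normal.r * light_dir.r + normal.g * light_dir.g + normal.b * light_dir.b;
    if (ndotl < 0.0f) ndotl = 0.0f;

    vec3 out;
    out.r = shadow * (albedo.r * (ndotl + irrad.r) + spec.r * ndotl);
    out.g = shadow * (albedo.g * (ndotl + irrad.g) + spec.g * ndotl);
    out.b = shadow * (albedo.b * (ndotl + irrad.b) + spec.b * ndotl);
    return out;
}

int main(void)
{
    vec3 light = { 0.0f, 0.0f, 1.0f };
    vec3 c = shade_pixel(0.3f, 0.7f, 0.4f, light);
    printf("%.3f %.3f %.3f\n", c.r, c.g, c.b);
    return 0;
}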

There is also a need for multiple render targets, fast Z/stencil rates, etc.

ATI is banking on the ratio being, what, 7:1? I think NVidia banked on the fact that texture and ROP loads continue to go up in next-gen titles. Last couple of presentations I looked at (KillZone 2, Team Fortress 2) seem to suggest that texture and ROP loads are still rising.
 
I can only agree. For example, take bump mapping. It would make so much sense to abandon normal maps altogether and use height maps: more texture lookups, but less memory needed = better texture quality.
 
What you'll find is that prettier pixels actually require fetching MORE texels and more offline artwork, and using a handful of ALU ops to decide the next texel to fetch, and how to combine the fetched value.
I completely agree. Unless we go to raytracing with photon mapping or radiosity (which, IMO, has very poor IQ/perf ratio compared to our current realtime paradigm), math alone is the wrong way forward.

We've pretty much done as much as we can using normal vectors, light vectors, and a few material parameters. For the next few years, the best improvements will come from better art rather than from shaders. The next big step is using tons of offline data, whether captured from real life (Debevec style) or from simulation, and using that data in the rendering. Right now I think spherical harmonics are the best way to compress and use this data, but there are many possibilities.
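
For what it's worth, once the SH data has been fetched, evaluating it is cheap: an order-2 (9 coefficient) irradiance probe is just a short weighted sum per colour channel. A sketch, with made-up coefficients (in practice they'd be baked offline or captured):

#include <stdio.h>

/* Order-2 (9 coefficient) spherical-harmonic evaluation for one colour
   channel, using the standard real SH basis.  Coefficients are made up;
   real ones would come from offline baking or captured data. */
static float sh_eval(const float c[9], float x, float y, float z)
{
    return c[0] * 0.282095f
         + c[1] * 0.488603f * y
         + c[2] * 0.488603f * z
         + c[3] * 0.488603f * x
         + c[4] * 1.092548f * x * y
         + c[5] * 1.092548f * y * z
         + c[6] * 0.315392f * (3.0f * z * z - 1.0f)
         + c[7] * 1.092548f * x * z
         + c[8] * 0.546274f * (x * x - y * y);
}

int main(void)
{
    float coeffs[9] = { 0.8f, 0.1f, 0.3f, -0.1f, 0.05f, 0.0f, 0.1f, -0.05f, 0.02f };
    printf("irradiance along +Z: %f\n", sh_eval(coeffs, 0.0f, 0.0f, 1.0f));
    return 0;
}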

This is why I think RAM and texturing are the two most important factors for better graphics. Granted, you need math to use the textures, but I don't think we need a larger ratio than what we have in G80. Well, at least for the long term. We're pretty entrenched in our current methods for now. To me, focussing on math and flops is only useful for GPGPU stuff.
 
I can only agree. For example, take bump mapping. It would make so much sense to abandon normal maps altogether and use height maps: more texture lookups, but less memory needed = better texture quality.
You're only saving a factor of two there (1.4x the texels in each direction), so it's not going to be that big a deal. Interestingly, fetch4 should let you get away with no extra texture lookups. A better example is how the height maps used in parallax occlusion mapping use lots of texture samples to improve image quality.
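
Roughly, a POM march looks like the sketch below: one height-map fetch per step until the ray dips below the surface (the step count, scale and the stub height function are my own; a real implementation samples an actual height texture).

#include <stdio.h>

/* Fake height map: a single bump in the middle of the tile. */
static float height_at(float u, float v)
{
    float du = u - 0.5f, dv = v - 0.5f;
    float h = 1.0f - 4.0f * (du * du + dv * dv);
    return h > 0.0f ? h : 0.0f;
}

/* March the view ray through the height field; each step is one TEX fetch. */
static void pom_offset(float u, float v, float view_dx, float view_dy,
                       float scale, int steps, float *out_u, float *out_v)
{
    float layer_h = 1.0f;                 /* start at the top layer */
    float dh      = 1.0f / (float)steps;
    float du      = view_dx * scale / (float)steps;
    float dv      = view_dy * scale / (float)steps;

    for (int i = 0; i < steps; ++i)
    {
        if (height_at(u, v) >= layer_h)   /* ray has gone below the surface */
            break;
        u += du; v += dv; layer_h -= dh;
    }
    *out_u = u; *out_v = v;
}

int main(void)
{
    float u, v;
    pom_offset(0.2f, 0.2f, 0.5f, 0.5f, 0.08f, 16, &u, &v);
    printf("shifted UV: %.4f %.4f\n", u, v);
    return 0;
}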

I think the need for more data will be part of a more fundamental shift from simple single-light models with a few parameters to data-based approximation of the full lighting equation. Well, I don't expect the former to disappear, but the latter will augment it.
 
I can only agree. For example, take bump mapping. It would make so much sense to abandon normal maps altogether and use height maps: more texture lookups, but less memory needed = better texture quality.
I thought a fair bit about this semi-recently. AFAICT, quality would be lower, because normal maps are currently generated via Sobel filters etc. that give better quality than a mere box filter...

However, I suspect the only reason a Sobel filter even makes sense in the first place is that the lighting model sucks. In the real world, lighting isn't 'computed' based on the average normal over a large area! Some lighting models that consider things such as micro-crevasses (or whatever the right term for that is) might reduce the need for Sobel filters etc. - but it's not like I've tried it, so I'm obviously not at all sure about that.
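
For reference, the usual Sobel-based bake looks roughly like this (a minimal sketch over a tiny procedural height field; the "strength" factor and wrap addressing are my assumptions):

#include <math.h>
#include <stdio.h>

#define W 8
#define H 8

/* Tiny procedural height field standing in for a real height map. */
static float height(int x, int y)
{
    return 0.5f + 0.5f * sinf(0.8f * (float)x) * cosf(0.8f * (float)y);
}

/* Derive a per-texel normal from the height field with 3x3 Sobel filters,
   which is (roughly) how normal maps get baked from height data. */
static void sobel_normal(int x, int y, float strength, float n[3])
{
    #define HT(i, j) height(((x + (i)) + W) % W, ((y + (j)) + H) % H)
    float gx = -HT(-1,-1) - 2.0f * HT(-1,0) - HT(-1,1)
             +  HT( 1,-1) + 2.0f * HT( 1,0) + HT( 1,1);
    float gy = -HT(-1,-1) - 2.0f * HT(0,-1) - HT(1,-1)
             +  HT(-1, 1) + 2.0f * HT(0, 1) + HT(1, 1);
    #undef HT
    float nx = -gx * strength, ny = -gy * strength, nz = 1.0f;
    float len = sqrtf(nx * nx + ny * ny + nz * nz);
    n[0] = nx / len; n[1] = ny / len; n[2] = nz / len;
}

int main(void)
{
    float n[3];
    sobel_normal(3, 4, 2.0f, n);
    printf("normal at (3,4): %.3f %.3f %.3f\n", n[0], n[1], n[2]);
    return 0;
}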
 
Hm, as I played with bump mapping, I generally found Blinn-style bump mapping better looking than normal maps. To be honest, I never cared about the underlying math, I just tried to write my shaders so that they would look "good". If I remember correctly, I used derivatives to determine the sample positions, which greatly increased the quality.

And, in case it wasn't clear from my earlier post: I didn't mean to suggest that Blinn bump mapping has better quality than normal mapping. But it allows you to save memory and thus use larger (= better quality) textures.

And generally, one can put the height into the alpha channel of an RGB texture, saving lots of space. If one additionally uses a compression technique, like DXT5 with the YCoCg colour space, it is possible to pack colour+height at a 4:1 compression ratio. Basically, a 1k x 1k colour/height texture will only need 1 MB.
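
The arithmetic checks out, assuming 8-bit RGBA before compression:

#include <stdio.h>

int main(void)
{
    /* 1k x 1k, RGB colour + height in alpha, 8 bits per channel. */
    long long uncompressed = 1024LL * 1024LL * 4LL;   /* 4 MB                 */
    long long dxt5         = uncompressed / 4LL;      /* DXT5 is 4:1 vs RGBA8 */
    printf("uncompressed: %lld KB, DXT5: %lld KB\n",
           uncompressed / 1024, dxt5 / 1024);          /* 4096 KB vs 1024 KB */
    return 0;
}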
 
Of course TEX and ROP loads rise too, but not as fast as the ALU load.

I think there is a strict length limit to the usefulness of pure ALU code clauses, with the exception of running simulations on the GPU.

You've also got to consider the penalty of running up against texturing/ROP limits vs ALU limits. What's the pipeline penalty of waiting for a texture/ROP unit to become free, vs spending another cycle executing another ALU instruction?

Let's say you've got shader code with a 1 TEX : 1 ALU ratio, and a GPU that runs it at full rate (60 fps). I then double the TEX workload but quadruple the ALU workload, so my new workload is 2 TEX : 4 ALU. For GPU 1, I leave TEX capacity unchanged and simply quadruple the ALU capacity; for GPU 2, I double TEX capacity but only double ALU capacity. Let's say all ALU instructions are independent.

Which one is going to run the new workload better, GPU 1 or GPU 2?
 
Last couple of presentations I looked at (KillZone 2, Team Fortress 2) seem to suggest that texture and ROP loads are still rising.
Of course TEX and ROP loads rise too, but not as fast as the ALU load.
So, ATi is increasing ALU performance primarily by increasing unit quantity, and TEX+ROP performance primarily by increasing their frequencies. Conversely, nVidia is increasing ALU frequencies and TEX+ROP quantity...

ATi's way seems clever, because TEX+ROP requirements are rising more slowly than ALU loads; nVidia's method is clever too, because ALU clocks can be increased more easily than TMU clocks.
 
Let's say you've got shader code with a 1 TEX : 1 ALU ratio, and a GPU that runs it at full rate (60 fps). I then double the TEX workload but quadruple the ALU workload, so my new workload is 2 TEX : 4 ALU. For GPU 1, I leave TEX capacity unchanged and simply quadruple the ALU capacity; for GPU 2, I double TEX capacity but only double ALU capacity. Let's say all ALU instructions are independent.

Which one is going to run the new workload better, GPU 1 or GPU 2?

In theory, both should produce 30 fps, as would a GPU 3 with 1 TEX and 2 ALU. Weird.
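
That follows if you assume the bottleneck is simply whichever unit is most oversubscribed (TEX and ALU overlapping perfectly, no latency effects); a quick sanity check:

#include <stdio.h>

/* Simplistic throughput model: fps is limited by the most oversubscribed
   unit, assuming TEX and ALU overlap perfectly and nothing else stalls. */
static float fps(float base_fps, float tex_load, float alu_load,
                 float tex_cap, float alu_cap)
{
    float tex_rate = tex_cap / tex_load;
    float alu_rate = alu_cap / alu_load;
    return base_fps * (tex_rate < alu_rate ? tex_rate : alu_rate);
}

int main(void)
{
    /* Baseline: 1 TEX : 1 ALU at 60 fps.  New workload: 2 TEX : 4 ALU. */
    printf("GPU 1 (1 TEX, 4 ALU): %.0f fps\n", fps(60, 2, 4, 1, 4)); /* 30 */
    printf("GPU 2 (2 TEX, 2 ALU): %.0f fps\n", fps(60, 2, 4, 2, 2)); /* 30 */
    printf("GPU 3 (1 TEX, 2 ALU): %.0f fps\n", fps(60, 2, 4, 1, 2)); /* 30 */
    return 0;
}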
 
I have some (a bit OT) questions.

R600 is 80nm. It's over 15% smaller than the competition (G80), but requires a higher baseline core clock. 65nm production wasn't chosen because ATi wanted to launch the card sooner, and 80nm GT didn't hit the required clocks... so 65nm wasn't available, and 90nm was probably too power-demanding. But 80nm HS is leaky (and power-demanding, too). My question is: why didn't they choose 90nm back-bias + strained silicon? A 90nm R600 would be as big as G80, but what about its power consumption? If I remember correctly, 90nm BB+SS had lower power requirements and achieved higher clocks than 80nm GT. ATi was quite successful with the "luxury" 130nm low-k. Why didn't they choose 90nm BB+SS? Would it have been too expensive - so much so that the overall economic success of R600 would have been even worse than on 80nm HS?

A lot of people avoid the HD2900XT just because of its high power consumption and the related noisy cooling. The speed potential of 80nm HS couldn't be reached because of the enormous power drain at high clocks, so the only advantage over 90nm is a ~16% smaller die...
 
I think any graphics programmer worth his salt could easily find interesting things to do with either more ALU power or more TMU/ROP power. That's not the question. The real question is, from a theoretical perspective, what kind of unit ratio could potentially bring you the best possible image quality in a closed environment such as a console.

My opinion on this is that very high ALU:TEX ratios make the most sense when ALU and TEX performance are already very high anyway. The number of interesting things you can do with tons of ALU ops is, IMO and AFAICT, much higher with very high instruction counts than with moderate ones.

From that point of view, R580 and R600 were arguably a bit too early (and not just from a developer-adoption POV)... However, that is clearly not the primary problem for R600, since in terms of raw GFlops it's not *that* much above and beyond G80, and yet it has (very slightly) more transistors.
 
In theory, both should produce 30 fps, as would a GPU 3 with 1 TEX and 2 ALU. Weird.

I don't think the answer is so simple; it depends on the architecture, but maybe I'm having a brainfart. Think of NV30: when you exceeded register resources, the result wasn't always a nice linear drop-off in performance.

If the ALU ops are dependent on texture reads, then the chip has to keep spawning more threads/pixels in flight whilst waiting for texels to come back, so as to keep the ALUs busy. And where things become nonlinear is what happens when latency-hiding resources are exhausted. You can get graceful degradation, or perhaps you get something worse.

Seems to me you can get into a lot more trouble being bandwidth- or texture-limited rather than ALU-limited, trouble in the sense that it seems a lot easier to screw up (a design) when I/O is involved and you are under-resourced than to just issue another ALU instruction. Just look at all the complex "tweaking" that has to be done to GPU memory controllers, sometimes on a per-game basis.

So, the question to me is: what's the greater penalty? Doing another ALU op and having to wait for an ALU to become available, or trying to issue a texture op and having to wait for resources? Or, having finished computing your shader, being stalled waiting for a ROP to be free to write your results?
 
I have some (a bit OT) questions.

R600 is 80nm. It's over 15% smaller than the competition (G80), but requires a higher baseline core clock. 65nm production wasn't chosen because ATi wanted to launch the card sooner, and 80nm GT didn't hit the required clocks... so 65nm wasn't available, and 90nm was probably too power-demanding. But 80nm HS is leaky (and power-demanding, too). My question is: why didn't they choose 90nm back-bias + strained silicon? A 90nm R600 would be as big as G80, but what about its power consumption? If I remember correctly, 90nm BB+SS had lower power requirements and achieved higher clocks than 80nm GT. ATi was quite successful with the "luxury" 130nm low-k. Why didn't they choose 90nm BB+SS? Would it have been too expensive - so much so that the overall economic success of R600 would have been even worse than on 80nm HS?

A lot of people avoid the HD2900XT just because of its high power consumption and the related noisy cooling. The speed potential of 80nm HS couldn't be reached because of the enormous power drain at high clocks, so the only advantage over 90nm is a ~16% smaller die...
BB doesn't help with speed, and it doesn't help with load power. You only switch on BB with idle units/lowered clocks to reduce sub-threshold leakage. I think it should not be too difficult to do with chips made on bulk silicon, and I wouldn't be surprised if all their current chips already have this capability. SS helps with speed, but my guess is it tends to increase leakage current as well, just like almost everything that increases transistor speed.
 
If the ALU ops are dependent on texture reads, then the chip has to keep spawning more threads/pixels in flight whilst waiting for texels to come back, so as to keep the ALUs busy. And where things become nonlinear is what happens when latency-hiding resources are exhausted. You can get graceful degradation, or perhaps you get something worse.

If the ALUs are waiting, you're either TEX- or bandwidth-limited. If you're TEX-limited, then you'd be operating at the TEX rate. If you're bandwidth-limited (which would probably be the case if your "latency hiding resources" are exhausted), it doesn't matter how many TEX units you have. That's the most important argument against scaling up TEX as fast as ALU: bandwidth isn't growing as fast, so adding more TEX gives diminishing returns.
 
Just a small comment here.

There are no "moral victories" in the technology sector. Either your part performs well in real world operation, or it doesn't. ATI might have made a part in R600 which has mindblowing paper specs, but if it can't perform in the real world, and at launch or close to it, the game is over and they have lost. Nobody cares now that X1900XTX is faster than 7900GTX, because that was last generation, and a new generation comes out so often in the videocard industry that you're either performing great out of the gate, or you're dead meat. Similarly, there may be a day when there are drivers and games which allow HD 2900XT to outperform 8800GTS/GTX, but by the time that happens, we're already on the next generation and nobody will care about the "old" cards.

Nvidia designs cards to perform well on today's workloads, and they do. They don't care about next year's workloads, because that will be next year's design and next year's parts. ATI doesn't seem to understand this, and they have paid the price repeatedly for their inability to focus on what's going on right now.
 