Hardware MSAA

Thanks for the link!

It looks like their implementation only supports micropolygons. That's a severe limitation. The size and power consumption numbers also don't appear to include any buffers and wiring to the programmable cores.
Slides 12 and 13 illustrate my fears. A dedicated rasterizer for large as well as tiny polygons, which also supports motion and defocus blur, is going to be relatively large. Its average utilization will be pretty low though (zero for anything GPGPU). Low power consumption can compensate for that, but not infinitely.

I also don't think you can look at dedicated components individually. A few percent here and a few percent there quickly add up. And while you might be left with a highly power efficient design, effective performance will be low. So as the workloads become more diverse, you're forced to have more programmable hardware, even if it's less efficient at specific tasks.
 
Generic doesn't have to mean high power consumption. Dedicated hardware has to achieve high performance while being squeezed into a small area, leading to higher power consumption while at the same time other parts of the chip are idle. If instead you have more generically programmable cores, the tasks can use a larger portion of the chip, which can then be more optimized for power.
The act of making something "generic", as in just a more fully programmable resource, will result in it consuming more power for the same performance. Basically, you may see what looks like a good area trade-off in replacing a fixed-function unit with programmable units, but when you look at power it rarely proves to be a win.

Of course it's a delicate balancing act. But in the case of vertex and pixel pipeline unification it worked out rather well, especially since it also enabled new techniques. Note also that merely a decade ago people were practically declared mental for suggesting floating-point pixel processing. So while it's hard to predict exactly what developers will do with it, more generic programmability has always proven to be a success when it's introduced gradually.

The unification of vertex and pixel shading is a bad example as they were already both programmable units with very similar functionality, so it was logical to merge them. Conversely, things like rasterisation are very clearly and well defined, are a perfect fit for fixed function, and will always be best performed in dedicated HW.

Ultimately you only have to look at the already huge power consumption of modern desktop cards to see where excessive and completely unnecessary programmability is going to lead.

John.
 
This is below the rate of decrease from the early half of the 2000s.
Vdd is expected to be very difficult to scale further, as the threshold voltage is not scaling and the voltage margin is getting narrow.
Like I said, it's not just voltage; the gate and wire capacitances go down as well. So with every process shrink you still get a considerable transistor budget increase for the same power envelope.

Note how progress has practically stagnated with the 40 nm process node. We really need a shrink to leap forward. In fact AMD's biggest change was the move to VLIW4, to increase utilization!

So the real question is, when the process shrinks down do you just scale the old design, do you add more dedicated logic, or do you add more programmable logic? To me it seems vastly more interesting to have more programmable logic, and keep the clock frequency modest so the power consumption stays within limits. The variability in workloads demands more programmability with each generation.
My expectation is that designs will have a core amount of programmability, with adjunct specialized coprocessors or fixed-function blocks it can offload to or pick between.
Don't underestimate the storage and communication cost for decentralized processing.

A couple of specialized instructions can be as efficient as using dedicated components, without this overhead. Take for instance NVIDIA's SFUs which also double as input interpolators.
This is not a direct relationship, if we compare using an FP unit versus emulating FP in software.
Floating-point processing has evolved from being performed by a separate chip, to becoming fully unified into the execution core. Some designs even unify the integer and floating-point ALUs into one. Again, that's all due to the latency and bandwidth overhead. So the inefficiency of emulating floating-point operations with integer instructions really isn't an argument for heterogeneous dedicated processing.

I also don't think floating-point even counts as dedicated hardware. It's a fully generic mathematical operation that in and of itself doesn't really achieve anything.
We may need to define what we mean by utilization. If maximum utilization means the maximum number of switching transistors, then capping the design's general performance so that the chip cannot exceed its power envelope even in rare spikes will leave performance on the table, since that cap must be a conservative estimate.
Correct me if I'm wrong, but adjusting the clocks according to the power consumption for all programmable cores at once seems simpler than power gating all dedicated logic individually. You need the ability to adjust the clocks anyway to deal with heat issues (hot climate or failing fan), and for low power states.

So I definitely wasn't suggesting leaving any performance unused. On the contrary, adjusting the clocks of programmable logic allows you to always get the maximum performance for a given power consumption. Of course you could argue that with dedicated components you can also adjust the clocks, but this leads to hotspots, and you can't achieve optimal performance/Watt for the programmable part with half of the chip left idle. Having many clock domains also complicates a lot of things.
 
Large polygons could throw an exception and be rasterized by a shader program.
And what about medium sized polygons? Do you want to throw an exception for each of them? Switching between dedicated and programmable rasterization probably incurs a large context switching cost.

Switching between algorithms tuned for large/medium sized polygons, and micropolygons, is quite simple when doing it all in software. A couple of instructions to accelerate very common calculations can ensure high peak performance.
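To make that concrete, here's a rough sketch of the per-triangle dispatch I have in mind. The two rasterization routines are placeholder names for an edge-equation path and a micropolygon path, and the threshold is just a tuning parameter:

```cpp
#include <algorithm>

// Rough sketch: per-triangle dispatch between two software rasterization
// paths, chosen by the screen-space bounding box. The two routines below
// are placeholders for an edge-equation path and a micropolygon path.
struct Vec2 { float x, y; };
struct Triangle { Vec2 v[3]; };

void rasterizeEdgeEq(const Triangle&) { /* tiled edge-equation path */ }
void rasterizeMicro(const Triangle&)  { /* per-sample micropolygon path */ }

void rasterize(const Triangle& t)
{
    float minX = std::min({ t.v[0].x, t.v[1].x, t.v[2].x });
    float maxX = std::max({ t.v[0].x, t.v[1].x, t.v[2].x });
    float minY = std::min({ t.v[0].y, t.v[1].y, t.v[2].y });
    float maxY = std::max({ t.v[0].y, t.v[1].y, t.v[2].y });

    // Switching paths is just a branch per triangle; there is no pipeline
    // reconfiguration or context switch involved.
    const float kMicroThreshold = 2.0f;   // in pixels; a tuning parameter
    if (maxX - minX <= kMicroThreshold && maxY - minY <= kMicroThreshold)
        rasterizeMicro(t);
    else
        rasterizeEdgeEq(t);
}
```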
 
Slides 12 and 13 illustrate my fears. A dedicated rasterizer for large as well as tiny polygons, which also supports motion and defocus blur, is going to be relatively large. Its average utilization will be pretty low though (zero for anything GPGPU). Low power consumption can compensate for that, but not infinitely.

a) Slides 14 and 15 say pretty clearly that power efficiency demands fixed function.

b) 4 mm², by a bunch of grad students with a few man-months of effort at best, how is that large?

c) Utilization isn't a gating factor. Perf/W is.

I also don't think you can look at dedicated components individually. A few percent here and a few percent there quickly add up. And while you might be left with a highly power efficient design, effective performance will be low. So as the workloads become more diverse, you're forced to have more programmable hardware, even if it's less efficient at specific tasks.
FF rasterizers still beat sw rasterizers in terms of effective performance. Also, why choose between programmable hw and ff hw? Just keep both around. It's not like you gain anything by getting rid of the rasterizer, and possibly other ff units as well.
 
And what about medium sized polygons? Do you want to throw an exception for each of them? Switching between dedicated and programmable rasterization probably incurs a large context switching cost.
Use the uTri rasterizer when running in tessellation mode. People are unlikely to use uTri's as basic models.

Or, just switch rasterizers depending upon the size of tri's bounding box. Edge equation based rasterization wins for >2 (3?) pixel triangles anyway.

Even more naively, just tile your screen like Fermi and forget about tri sizes.
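For illustration, a bare-bones sketch of edge-equation coverage over one screen tile; the 8x8 tile size, winding convention and missing fill rules are simplifications for the example, nothing Fermi-specific:

```cpp
#include <cstdint>

// Minimal sketch of edge-equation coverage over one screen tile.
// A pixel is inside when all three edge functions are >= 0 (assuming
// counter-clockwise winding; top-left fill rules are omitted here).
struct Edge { float a, b, c; };            // E(x, y) = a*x + b*y + c

Edge makeEdge(float x0, float y0, float x1, float y1)
{
    return { y1 - y0, x0 - x1, x1 * y0 - x0 * y1 };
}

// Returns a coverage mask for the 8x8 tile whose origin is (tileX, tileY).
uint64_t coverTile(int tileX, int tileY,
                   const Edge& e0, const Edge& e1, const Edge& e2)
{
    uint64_t mask = 0;
    for (int y = 0; y < 8; ++y)
        for (int x = 0; x < 8; ++x)
        {
            float px = tileX + x + 0.5f;   // sample at pixel center
            float py = tileY + y + 0.5f;
            bool in = (e0.a * px + e0.b * py + e0.c >= 0.0f) &&
                      (e1.a * px + e1.b * py + e1.c >= 0.0f) &&
                      (e2.a * px + e2.b * py + e2.c >= 0.0f);
            if (in) mask |= uint64_t(1) << (y * 8 + x);
        }
    return mask;
}
```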
 
Actually, come to think of it ... if I had to design a rasterizer optimized for small triangles I'd first try flood fill. Rasterize the quad nearest the geometric center of the triangle, test the edges against the edges of the quad, flood fill if necessary quad by quad.

Probably a million reasons why it couldn't work, but still ... it would be one of the first things I'd try.
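Very roughly, something like this; the 2x2 quad size and the sample-center coverage test are simplifications, and as said a real version would also have to test the quad edges against the triangle edges:

```cpp
#include <cmath>
#include <queue>
#include <set>
#include <utility>

// Very rough sketch of the flood-fill idea: start at the 2x2 quad nearest
// the triangle's centroid and expand quad-by-quad while any sample in a
// quad is covered. The coverage test here is a plain edge-equation check;
// a real version would also test quad edges against triangle edges.
struct Tri { float x[3], y[3]; };

bool inside(const Tri& t, float px, float py)
{
    for (int i = 0; i < 3; ++i)
    {
        int j = (i + 1) % 3;
        float e = (px - t.x[i]) * (t.y[j] - t.y[i]) -
                  (py - t.y[i]) * (t.x[j] - t.x[i]);
        if (e < 0.0f) return false;        // assumes consistent winding
    }
    return true;
}

bool quadCovered(const Tri& t, int qx, int qy)
{
    for (int dy = 0; dy < 2; ++dy)
        for (int dx = 0; dx < 2; ++dx)
            if (inside(t, qx * 2 + dx + 0.5f, qy * 2 + dy + 0.5f))
                return true;
    return false;
}

void floodFillRasterize(const Tri& t)
{
    float cx = (t.x[0] + t.x[1] + t.x[2]) / 3.0f;
    float cy = (t.y[0] + t.y[1] + t.y[2]) / 3.0f;
    std::queue<std::pair<int, int>> open;
    std::set<std::pair<int, int>> visited;
    open.push({ int(std::floor(cx / 2)), int(std::floor(cy / 2)) });

    while (!open.empty())
    {
        auto q = open.front(); open.pop();
        if (!visited.insert(q).second) continue;    // already processed
        if (!quadCovered(t, q.first, q.second)) continue;
        // shadeQuad(q) would go here.
        open.push({ q.first + 1, q.second });
        open.push({ q.first - 1, q.second });
        open.push({ q.first, q.second + 1 });
        open.push({ q.first, q.second - 1 });
    }
}
```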
 
Like I said, it's not just voltage; the gate and wire capacitances go down as well. So with every process shrink you still get a considerable transistor budget increase for the same power envelope.
Gate capacitance and wire capacitance have increasing challenges as well. Transistor scaling is becoming more limited in the dimensions of shrinkage, so the traditional scaling there cannot be assumed.
Interconnect dielectric challenges and proximity effects have made scaling capacitance in the interconnect less guaranteed.

As a data point, the next process nodes have promised equivalent transistor performance at 20-30% power savings. This is in contrast with 40-45% logic density scaling. The high ends of those ranges may be optimistic.

Note how progress has practically stagnated with the 40 nm process node. We really need a shrink to leap forward. In fact AMD's biggest change was the move to VLIW4, to increase utilization!
AMD's primary motivation was to save space. The somewhat lower utilization of VLIW5 was the justification that distributing the T-unit's hardware over the remaining lanes wouldn't be too bad a hit to performance. The area ratio between programmable and fixed-function/specialized logic does not appear to have shifted significantly, and the move may even have reduced the proportion of programmable logic.

So the real question is, when the process shrinks down do you just scale the old design, do you add more dedicated logic, or do you add more programmable logic? To me it seems vastly more interesting to have more programmable logic, and keep the clock frequency modest so the power consumption stays within limits.
Why does specialized logic need to clock faster? Most of the examples I know of do not clock the specialized logic higher.
Nvidia's special function units are at half the programmable hot clock, and the various decoders and media processors run at significantly lower speeds than the more programmable parts.

Don't underestimate the storage and communication cost for decentralized processing.
Neither your scenario nor mine are centralized, unless there is a property of having a large number of general-purpose cores encompassing the whole chip that I am missing.

A couple of specialized instructions can be as efficient as using dedicated components, without this overhead. Take for instance NVIDIA's SFUs which also double as input interpolators.

Floating-point processing has evolved from being performed by a separate chip, to becoming fully unified into the execution core. Some designs even unify the integer and floating-point ALUs into one.
There isn't a 1:1 link between an instruction in the ISA and the hardware implementation. There can be, and most likely is, specialized hardware underneath it.

AMD's GPUs do this, for example. For power, latency, and performance reasons, they do not run emulation programs to perform the operations. They contain the hardware necessary to perform the operations as-is.

I also don't think floating-point even counts as dedicated hardware. It's a fully generic mathematical operation that in and of itself doesn't really achieve anything.
By the metric of utilization, it is a waste of silicon that could be twiddling bits with make-work. Why have circuits devoted to FP when a small program of tens of instructions could do the same thing on one or more integer pipes?

Correct me if I'm wrong, but adjusting the clocks according to the power consumption for all programmable cores at once seems simpler than power gating all dedicated logic individually. You need the ability to adjust the clocks anyway to deal with heat issues (hot climate or failing fan), and for low power states.
Adjusting clocks does not affect static leakage, which can leave 20-30% of the power budget on the table.
Within the remaining power budget, there are limits to what clocks can be reached, and in the absence of perfect load-balancing, there will be cores that need to clock higher and ones that can be idled.
Low-power states above complete shutdown involve putting significant parts of the core to sleep, so there is going to be fine-grained tracking of units.

So I definitely wasn't suggesting leaving any performance unused. On the contrary, adjusting the clocks of programmable logic allows you to always get the maximum performance for a given power consumption.
That is the maximum performance possible with that particular programmable implementation, not in an absolute sense, and not necessarily always. As I said, it is already possible to hit the power ceiling before running out of transistor budget, and the incremental costs of additional die area are falling. In cases of being pad-limited, the extra area has a very low incremental cost.

Programmable logic is not immune to hot spots, and there are going to be clock domains regardless.
Without local domains, clock distribution will take more energy than necessary, and there will be more extraneous activity than is needed.
There are dedicated units, such as multimedia engines, that are used because they are low-power.
 
The act of making something "generic", as in just a more fully programmable resource, will result in it consuming more power for the same performance. Basically, you may see what looks like a good area trade-off in replacing a fixed-function unit with programmable units, but when you look at power it rarely proves to be a win.
It all depends on the utilization. Clearly, when a dedicated component is never used, achieving the same performance with the rest of the chip costs more power than if the entire chip were available for the task. Note that reaching higher clock frequencies also requires increasing the voltage, so the power consumption increase can be substantial. So dedicated hardware has to have a certain utilization level before it becomes worthwhile. More software diversity means the average utilization of non-generic components is decreasing, though.

Also note that making things programmable enables clever software optimizations. Today's applications take a lot of detours just to fit the graphics pipeline. It's utterly ridiculous that on modern GPUs you often get no speedup from using LOD techniques. I'd rather use that headroom for something more useful, and I'm not alone.

Even simply using dynamic code generation with specialization can avoid a lot of useless work. "The fastest instruction is the one that never gets executed."
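As a toy illustration of what specialization buys you: a renderer using dynamic code generation would emit one of these variants at runtime once the render state is known, so the per-pixel state checks simply disappear. The template version below only shows the idea, it's not actual runtime code generation:

```cpp
#include <cstdint>

// Illustration of specialization: a JIT-based renderer would generate one
// of these variants on the fly for the current render state, so per-pixel
// state branches vanish. Here the compiler produces the variants instead.
struct State { bool alphaTest; bool fog; };

template <bool AlphaTest, bool Fog>
void shadeSpan(uint32_t* dst, const uint32_t* src, int n)
{
    for (int i = 0; i < n; ++i)
    {
        uint32_t c = src[i];
        if (AlphaTest && (c >> 24) < 128u) continue;   // folded away if unused
        if (Fog) c = (c & 0x00FFFFFF) | 0x80000000;    // stand-in for fog math
        dst[i] = c;
    }
}

// Pick the pre-specialized routine once per state change, not per pixel.
using SpanFn = void (*)(uint32_t*, const uint32_t*, int);
SpanFn selectSpan(const State& s)
{
    if (s.alphaTest) return s.fog ? shadeSpan<true, true>  : shadeSpan<true, false>;
    else             return s.fog ? shadeSpan<false, true> : shadeSpan<false, false>;
}
```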
The unification of vertex and pixel shading is a bad example as they were already both programmable units with very similar functionality, so it was logical to merge them.
Despite both being programmable, vertex and pixel processing were quite different (at the time of the unification). Vertices need high precision, they need streaming memory access, and they use lots of matrix transforms. Pixels can generally use lower precision, need lots of texture accesses, use exp/log for gamma correction, use interpolator data, etc.

Also from a power consumption point of view, unifying them certainly wasn't logical. The GeForce 7900 and GeForce 8800 had competitive performance for contemporary games, but the latter had a significantly higher power consumption.

Despite these things, they did unify vertex and pixel processing, and they never looked back. So the real reason for the unification was to enable new capabilities and new workloads. And nowadays developers are still pushing the hardware to the limits, by using it for things it was hardly designed for. So while it might be hard to imagine what exactly they'll do with it, more programmability is always welcomed. It's an endless cycle between hardware designers saying that dedicated hardware is more power efficient, and software developers saying they care more about the flexibility to allow more creativity.

Performance scales at a dazzling rate anyhow. So you might as well make sure that it can be used for something more interesting than what was done with the last generation of hardware. Obviously fixed-function hardware can't be replaced with programmable hardware overnight, but surely by the end of this decade hundreds of TFLOPS will be cheap and power efficient and there will be no point making it more expensive with fixed-function hardware, or making it even cheaper by crippling certain uses.
Conversely, things like rasterisation are very clearly and well defined, are a perfect fit for fixed function, and will always be best performed in dedicated HW.
I disagree. Rasterization is evolving, and even the very idea of polygon rasterization is crumbling. There's a lot of research on micropolygons, ray-tracing, volumetric rendering, etc., but IHVs had better think twice before they dedicate considerable die space to it. Some games may love the ability to have fine-grained displacement mapping and advanced blur effects, others prefer accurate interreflections and advanced refraction effects, and others don't care about these things at all (and I'm not necessarily talking about GPGPU / HPC applications).

You see, every application has some task that would be executed more efficiently with dedicated hardware, but you simply can't cater for all of them. It's overall more interesting to have a fully programmable chip, than something which prefers certain usage.

When John Carmack first proposed floating-point pixel processing, certain people (including some from this very forum) practically called him insane. But these people (understandably) never envisioned that by now we'd have Shader Model 5.0 and OpenCL, and that it would still be evolving toward allowing longer and more complex code. So with all due respect, I think it would be really shortsighted to think that ten years from now dedicated hardware will be as important as it is today.
Ultimately you only have to look at the already huge power consumption of modern desktop cards to see where excessive and completely unnecessary programmability is going to lead.
The existence of cards with very high power consumption doesn't mean those are the norm. It also doesn't mean the programmability is excessive. Nobody cares if Unreal Tournament 2004 could run faster and more efficiently with fixed-function hardware. People care about Crysis 2 and every other contemporary application, which depend heavily on the shading performance. The next generation of consoles will also spur the creation of more diverse games which demand generic computing cores.

The simple reality is that GPUs won't increase in theoretical performance as fast as they did before. They've only been able to exceed Moore's Law because they played catch-up in process technology, aggressively increased the die size, and made the GPU the hottest part of your system. The only way they're now able to significantly increase performance is by waiting for another process shrink to allow more transistors for a given power envelope. But do you want that to benefit particular applications, or do you want it to benefit all applications? So far the programmability has always increased, and I don't see this ending any time soon...
 
a) Slides 14 and 15 say pretty clearly that power efficiency demands fixed function.
That's the conclusion for the short term future, and I fully agree.

But in the long term fixed-function just isn't interesting. Features like bump mapping, which once wowed us all and demanded a hardware upgrade, are now ridiculously simple to 'emulate' in the shaders, and actually we now want high quality tessellation instead. There's no doubt that one day dedicating any hardware to tessellation will be as ridiculous as fixed-function bump mapping, alpha testing, fog tables, T&L, etc.

Keep in mind that for every hardware designer, there are probably hundreds of software developers. So if you dedicate transistors to feature A, and applications implement feature X, Y and Z in software, your GPU will either be slower than the more programmable chip of the competition, or it will have a higher power consumption to reach the same performance.
b) 4 mm², by a bunch of grad students with a few man-months of effort at best, how is that large?
That's for a micropolygon rasterizer only. What I said is that "a dedicated rasterizer for large as well as tiny polygons, which also supports motion and defocus blur, is going to be relatively large". And as I also noted before, their area and power consumption numbers are for the isolated rasterizer. An actual integrated rasterizer unit with all of its supporting circuitry will be larger and hotter. Their work also doesn't include any hardware to make the wide pixel pipeline process fragments from multiple micropolygons.

So I sincerely doubt that it would take only 4 mm² to make GPUs capable of efficiently processing micropolygons.
c) Utilization isn't a gating factor. Perf/W is.
Performance/Watt assumes 100% utilization. That makes all fixed-function hardware look great, even stuff that's practically useless!

So utilization really is a key factor in determining whether fixed-function hardware is a win or not.
FF rasterizers still beat sw rasterizers in terms of effective performance.
Not necessarily. Effective total performance depends on utilization. If you have long shaders, the rasterizer will be idle most of the time. So you're wasting space, which could have been used for more shader units instead. To achieve the same effective performance you'd have to overclock the available cores (frequency + voltage), which consumes more power than just having more shader units available.

If you completely get rid of the rasterizer, you obviously need the extra shader units to perform the rasterization, but as long as the average utilization is low this results in higher effective total performance.
Also, why choose between programmable hw and ff hw? Just keep both around. It's not like you gain anything by getting rid of the rasterizer, and possibly other ff units as well.
If the rasterizer is underutilized it may sound tempting to make it smaller and use the programmable hardware if the available throughput is exceeded, but the problem is it doesn't scale down well. The rasterizer needs access to several resources, for which additional ports/buffers/wires are required. Halving the required throughput does not reduce these to exactly half the size. The control logic also doesn't scale down.

Note that just "keeping it around" as the rest of the architecture expands is practically the same as making it smaller. At a certain point it simply complicates things more to fit dedicated hardware into your architecture than to just do it in software.
 
Even more naively, just tile your screen like Fermi and forget about tri sizes.
I'm not aware of how Fermi does its screen tiling. Would you be so kind as to update me on this technology? I'm quite interested to know how this would allow one to forget about triangle sizes.
 
You see, every application has some task that would be executed more efficiently with dedicated hardware, but you simply can't cater for all of them. It's overall more interesting to have a fully programmable chip, than something which prefers certain usage.
"Jack of all trades, master of none"?
 
"Jack of all trades, master of none"?
"Jack of all trades, master of none, though ofttimes better than master of one".

Ultimately the question is whether IHVs want to create successful GPGPU devices. Multi-core, AVX, FMA and gather/scatter are making the CPU significantly more powerful for high-throughput tasks than its ancestors. GPUs on the other hand struggle to become more efficient at tasks other than graphics, because they hang on to a significant amount of fixed-function hardware and to hiding all latency through threading. This is why things like GPU physics have so far had mediocre success at best.

Personally I think they should bite the bullet and make the GPU highly generic. This shifts a lot of the responsibility to the software, but the potential is huge. It may not reach the same peak performance and/or power efficiency for certain legacy tasks, but it's more efficient when the utilization varies wildly, it allows things to be done completely differently, and it even enables totally new applications. Engine developers want the API to go away, both because of the overhead and because they want to create their own frameworks.

So yes, I believe a Jack of all trades is way better than a master of one. Any GPU can already do satisfactory rasterization graphics. Now we need GPUs to evolve into something capable of and efficient at generic computing.
 
I think the aim should be flexibility, not genericness.

You need to have caches specialized for narrow accesses (texture caches need to be accessible in 32 bit aligned chunks of 64 bits). You need to have processor cores truly specialized for highly parallel long latency operations (ie. 4 way SMT ain't going to hack it, you can reduce instruction latencies all you like but texturing can't be overcome by OoOE, and dumping the register set in cache is a very poor alternative).

Heterogeneous architectures are the future ... most processor cores should be able to do anything, just some things better than others. Most cache pools should be able to store anything, just be suited more for some things than others (and with different ways of guaranteeing coherency).
 
I think the aim should be flexibility, not genericness.
Could you give me an example of something that increased flexibility, but clearly wasn't a step toward genericness?
You need to have caches specialized for narrow accesses (texture caches need to be accessible in 32 bit aligned chunks of 64 bits).
The TEX:ALU ratio is still going down. At the same time, more generic memory accesses are needed. So instead of having multiple specialized caches with varying utilization, you could have one cache hierarchy which doesn't penalize any usage.

Texture reads don't require narrow accesses. You just need gather/scatter.
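For example, something along these lines, using AVX2's gather as a stand-in for the kind of wide gather I'm assuming, and a plain linear RGBA8 layout for simplicity:

```cpp
#include <cstdint>
#include <immintrin.h>

// Sketch: fetch eight independent texels with one gather instead of eight
// narrow scalar loads. Addressing is simplified to a linear RGBA8 texture;
// swizzled/tiled layouts would only change how the indices are computed.
__m256i gatherTexels(const uint32_t* texture, int pitch,
                     __m256i u, __m256i v)
{
    // index = v * pitch + u  (texel index into a 32-bit texel array)
    __m256i idx = _mm256_add_epi32(
        _mm256_mullo_epi32(v, _mm256_set1_epi32(pitch)), u);
    // Scale of 4: indices are in texels, the base pointer is in bytes.
    return _mm256_i32gather_epi32(reinterpret_cast<const int*>(texture), idx, 4);
}
```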
You need to have processor cores truly specialized for highly parallel long latency operations (ie. 4 way SMT ain't going to hack it, you can reduce instruction latencies all you like but texturing can't be overcome by OoOE, and dumping the register set in cache is a very poor alternative).
And yet 2-way SMT works like a charm for SwiftShader. Combined with out-of-order execution and prefetching, memory access latency is not an issue.

The number of threads is becoming a significant problem for GPUs. A lot of GPGPU applications achieve only a fraction of the core utilization because they just don't have enough data to process in parallel. And it's not getting any better as the core counts increase! Also, with ever longer code the context for each thread requires lots of on-die storage.

The only solution is reducing latencies. And because it allows a smaller register set it doesn't have to cost a lot, or even anything at all. Plus it makes the architecture capable of running a wider variety of workloads.
Heterogeneous architectures are the future ... most processor cores should be able to do anything, just some things better than others. Most cache pools should be able to store anything, just be suited more for some things than others (and with different ways of guaranteeing coherency).
I don't think you can gain a lot by specializing generically programmable cores. You can cut out something here and there, but it's only a minor reduction in size while making some workloads run horribly.
 
Texture reads don't require narrow accesses. You just need gather/scatter.
If you really want to just throw away most of the data you retrieve and get pathetic throughput you can put a gather layer on a generic wide cache like Larrabee did .... but it's not efficient. If you actually want to get decent throughput on the gathers then at least the top level cache needs to deal with them with narrow ports, instead of chopping them up into consecutive accesses (at lower levels you can spend a few cycles trying to combine multiple gathers into wide accesses, assuming the processor has enough threads to actually keep issuing them of course). Unless you want an architecture only suitable for rasterizing large triangles efficiently.

Similar for the scatters.
And yet 2-way SMT works like a charm for SwiftShader. Combined with out-of-order execution and prefetching, memory access latency is not an issue.
Okay, say a trace of Crysis Warhead with a texture mod ... what percentage of the time does it stall?
 
If you really want to just throw away most of the data you retrieve and get pathetic throughput you can put a gather layer on a generic wide cache like Larrabee did .... but it's not efficient.
LRB3 has 64-byte cache lines and supposedly 64-byte gather/scatter per cycle. With good coherence (as typical for texture accesses), hardly any data gets thrown away.
If you actually want to get decent throughput on the gathers then at least the top level cache needs to deal with them with narrow ports...
Fast caches with lots of ports are expensive. You can duplicate the data in multiple banks, but that obviously also isn't cheap. And narrow cache lines means a larger percentage is overhead.

So you really want to take advantage of spatial coherence with gather/scatter (and you can still combine it with multi-port or multi-bank designs to handle less coherent accesses).
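As a back-of-the-envelope illustration of that coherence, here's a count of how many cache lines a 2x2 bilinear footprint touches; the 4x4 tiling scheme is just an assumption for the example:

```cpp
#include <cstdint>
#include <set>

// Back-of-the-envelope check: count how many distinct 64-byte cache lines
// a 2x2 bilinear footprint touches in a 4x4-tiled RGBA8 texture
// (4x4 texels * 4 bytes = 64 bytes, so one tile == one cache line).
uint64_t lineOf(int u, int v, int widthInTiles)
{
    return uint64_t((v / 4) * widthInTiles + (u / 4));
}

int linesTouched(int u, int v, int widthInTiles)
{
    std::set<uint64_t> lines;
    for (int dv = 0; dv < 2; ++dv)
        for (int du = 0; du < 2; ++du)
            lines.insert(lineOf(u + du, v + dv, widthInTiles));
    return int(lines.size());   // 1 most of the time, up to 4 at tile corners
}
```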
Okay, say a trace of Crysis Warhead with a texture mod ... what percentage of the time does it stall?
Crysis 2 runs only 6% faster when forcing the texture LOD to the smallest mipmap (on a Core i7 920). And that's not just due to reduced misses but also bandwidth reduction. So again, memory latency is not an issue when you have SMT, out-of-order execution and prefetching.
 
LRB3 has 64-byte cache lines and supposedly 64-byte gather/scatter per cycle. With good coherence (as typical for texture accesses), hardly any data gets thrown away.
Supposedly, according to what citable source?
64 bytes per cycle fits within the released data for LRB1 if all loads hit the same cache line. That number does not seem feasible if the scatter/gather crosses cache line boundaries.
Most formats tend to favor density, not alignment, which was one cited reason behind having TMUs separated from LRB's cores.

Fast caches with lots of ports are expensive. You can duplicate the data in multiple banks, but that obviously also isn't cheap. And narrow cache lines means a larger percentage is overhead.
Architectures with fast caches and multiple ports already exist. Non-specialized architectures with fast caches and scatter/gather do not.

So you really want to take advantage of spatial coherence with gather/scatter (and you can still combine it with multi-port or multi-bank designs to handle less coherent accesses).
Can you clarify this? If there were already spatial coherence, why the need to gather/scatter?
Just use wide loads and shuffle.
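Something like this is what I mean, for the case where the needed elements already sit within one aligned 32-byte block (the lane indices are assumed to be known):

```cpp
#include <immintrin.h>

// Sketch of "wide loads and shuffle": when the elements you need already
// sit within one 32-byte block, a single wide load plus a permute replaces
// a gather. The in-block lane indices are assumed to be known or computed.
__m256 loadAndShuffle(const float* block32ByteAligned, __m256i laneIdx)
{
    __m256 wide = _mm256_load_ps(block32ByteAligned);   // one wide access
    return _mm256_permutevar8x32_ps(wide, laneIdx);     // pick the 8 lanes
}
```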

It isn't the case that any vector instruction is quite as generic as a scalar one, nor is a gather/scatter as generic as a load/store. They can potentially draw close, but there are corner cases that mean the scalar and vector ops cannot be fully interchangeable, and scalar ops are the baseline for being generic (within the defined data types and operand widths they support).

I'm not certain why we should waste silicon on a gather/scatter. It is readily implemented by generic loads and stores, and imagine the higher silicon utilization. Much better than having the locality checks and multiplexing hardware that would just sit there idle with scalar loads.

Crysis 2 runs only 6% faster when forcing the texture LOD to the smallest mipmap (on a Core i7 920). And that's not just due to reduced misses but also bandwidth reduction. So again, memory latency is not an issue when you have SMT, out-of-order execution and prefetching.
Settings (res,AF,AA,etc)? FPS, in-game, benchmark, etc?
 
Texture accesses have good spatial coherence, but they will not generally fall on the same cache line. A straightforward large-triangle rasterizer using scanlines, for instance, will access the texture along a pretty much random angled axis in UV space. Add tiled storage and the chances of all work items hitting the same cache line rapidly go to 0 without some rather complex sorting of work items by texture accesses.

PS. Just for reference, would it be possible to determine what percentage of texture accesses stall during normal operation? You obviously do a lot more work in between texture accesses, which distorts the picture a bit.
 