Hardware MSAA

And bigger "generic" power supplies and cooling systems?
Generic doesn't have to mean high power consumption. Dedicated hardware has to achieve high performance while being squeezed into a small area, leading to higher power consumption while at the same time other parts of the chip are idle. If instead you have more generically programmable cores, the tasks can use a larger portion of the chip, which can then be more optimized for power.

Of course it's a delicate balancing act. But in the case of vertex and pixel pipeline unification it worked out rather well, especially since it also enabled new techniques. Note also that merely a decade ago people were nearly declared mental for suggesting floating-point pixel processing. So while it's hard to predict exactly what developers will do with it, more generic programmability has always proven to be a success when it's introduced gradually.
 
So in this respect a GPU is likely to have a much higher percentage of its total transistors actively switching at any one moment, right? And given that both AMD and Nvidia have hit a hard cap with respect to overall power usage, could one expect GPU die area to fall with every process node transition henceforth?

I am not sure if the ratio is comparable between an ASIC and an MPU, since the designs are on different processes and have different targets.
If one allows for the fact that GPUs on discrete cards can have higher TDPs than most mainstream socketed processors, maybe yes. On the other hand, the transistor counts tend to be much higher for the big chips, so the ratio may be lower than the TDP may suggest.
If one goes by the lamentation of people who think we can move completely over to software and finally utilize all those idling transistors, no.

The designs devote transistors to different things as well. OoOE, speculation, many pipeline stages, and low-latency forwarding add transistors. These transistors are part of the execution process, so they are active on any task being performed.

Wider throughput designs that rely on less aggressive scheduling and relaxed latency can spend fewer transistors on the meta-work of execution, often at the cost of utilization.

GPU and CPU designers are striving hard to add more fine-grained clock gating and power management. With the addition of power gating, even more area is being devoted to power reduction. Many of these features actually increase area, so die size doesn't need to go down with future chips.
 
Out-of-order execution doesn't imply speculation. You can still use multiple threads to hide branching latencies (or any other latency for that matter). But generic caches and out-of-order execution can dramatically lower the average latency and thus allow the instruction pointer to advance much faster. This in turn means less on-die storage is wasted on thread contexts, and workloads don't have to be ridiculously data parallel to get good efficiency.
A generic cache in and of itself doesn't mean you need OOE. If you have a single level of cache and a miss hurts you far beyond the ability of OOE to help you ... then it won't. The impact OOE can make in a desktop CPU, with very wide issue and no qualms about leaving instruction slots empty just to get a little more single-thread speed, is also not really comparable to what it can do for a narrow-issue CPU which never wants an instruction slot empty and has plenty of threads. Of course you can add SMT for that, at the cost of more complexity.
 
Now that many of the major developers of 3D engines have chosen a deferred shading pipeline, should we expect hardware-accelerated MSAA to be dropped from future architectures? I'm not referring so much to rasterization or MSAA buffer generation but to the compression, blending and resolve steps of the process.

That hardware appears to be pretty much useless with the latest engine technology, so why keep it around? I expect new hardware to be fast enough to eat the bandwidth and compute costs of emulating these steps on the shader units for older forward renderers.

Thoughts?

How about some food for thought instead?

http://bps10.idav.ucdavis.edu/talks/11-raganKelley_DecoupledSampling_BPS_SIGGRAPH2010.pdf

http://bps10.idav.ucdavis.edu/talks/10-fatahalian_MicropolygonsRealTime_BPS_SIGGRAPH2010.pdf

There's a lot more research material available, but any of those ideas (or even a clever combination of several of them) would depend not just on future hw changes but also on sw changes, and not necessarily on DX12 (about which I of course have no clue what it could contain).

A more generic question for the actual topic here would be the dilemma between continuing to pick the still low-hanging fruit for efficiency of existing hw, or taking the much higher risk of completely different approaches too soon.
 
That's wide-issue, which is orthogonal to OoO.
Technically, yes. Practically, no. I fail to see any point whatsoever behind a single issue ooo core.

The paths would need to be wider, but the scheduler wouldn't need to care about it beyond the location identifier.
True, but my point was that the overall cost of ooo for very wide vectors is hardly minor.
 
Generic doesn't have to mean high power consumption. Dedicated hardware has to achieve high performance while being squeezed into a small area, leading to higher power consumption while at the same time other parts of the chip are idle. If instead you have more generically programmable cores, the tasks can use a larger portion of the chip, which can then be more optimized for power.
A bunch of grad students presented a paper (@HPG10, iirc) which implemented a ~3B/sec micropolygon rasterizer with some pretty fancy features not found in gpu rasterizers. It used 4.1 mm2 in TSMC 40nm. Just how many more programmable cores do you think you will be able to buy in that much area? How many will you be able to buy in 10mm2? If you can get a decent rasterizer in, say, 2mm2, I'd say put one in each core.

Idle hw is perfectly fine, if its area and power consumption are tuned well.
 
This is just about the easiest problem to scale with process technology. Putting aside that dedicated or specialized hardware is almost always smaller than a generic replacement, transistor budget for a device of a given price is about the only thing Moore's law covers.
Yes, but despite that the graphics pipeline has become ever more programmable, and there's no end in sight. That's because consumers aren't interested in running last year's game at 120 FPS. They want the latest game to run at 60 FPS (plus they want to run GPGPU applications). More diversity means more flexibility is required. So dedicated hardware, power efficient as it may be, is useless if it's going to be underutilized.
This is a mischaracterization of the situation. Power consumption has been a limiter for years now. Perhaps for so long that it has become an accepted part of reality and no longer noticed.
Chips expend swaths of die space to cut power consumption, because power is the limiter of performance.
Power consumption is absolutely a limiter. No argument there. But I still think that cost is at least as big a limiter. You can't justify having logic that is going to be idle much of the time.
This is an inversion of current trends. Power concerns are significantly capping performance growth, while transistor counts per device have gone up in line with Moore's law.
In four node transitions, future chips could house 16x the transistors, but may only cut power consumption per transistor by half.
At what frequency and voltage? It seems to me that it still results in higher performance to use all of those transistors at less than their maximum performance, versus leaving a large part idle and using the other part at maximum performance.
It's already the case that the vast majority of a given CPU must be inactive in a given clock cycle, because no modern performance IC as of perhaps the .13 or .18 micron node can actually be fully "on" for a sustained period of time.
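A quick back-of-the-envelope with the round numbers from above makes the point (purely illustrative, not a projection for any particular process):

Code:
# Rough "dark silicon" arithmetic with the round numbers quoted above (illustrative only).
transistor_growth = 16.0       # assumed: four node transitions, 2x transistors each
power_per_transistor = 0.5     # assumed: power per transistor only halves over the same span

# If every transistor stayed as active as today, total power would scale by:
power_scale = transistor_growth * power_per_transistor
print(power_scale)             # 8.0x -- far above a fixed TDP

# Fraction of the chip that can be active at today's activity level under a fixed power cap:
print(1.0 / power_scale)       # 0.125 -- the rest has to idle, clock down, or be gated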
And yet even if I stress all of my CPU's threads, Turbo Mode still clocks it higher.
 
A generic cache in and of itself doesn't mean you need OOE...
I wasn't implying that. Actually I implied the reverse. Out-of-order execution lowers the latency of arithmetic code, but that's not very helpful if the thread is still suspended for long periods when accessing memory. So the average memory access latency has to be lowered, to achieve good benefit from out-of-order execution.

Fortunately they reinforce each other. Lowering the latencies means needing fewer threads in flight, which means less storage for contexts and also less cache thrashing. So you don't really need extra area for larger caches.

The peak performance doesn't change, but you get much better efficiency at complex workloads with limited data parallelism.
 
Yes, but despite that the graphics pipeline has become ever more programmable, and there's no end in sight. That's because consumers aren't interested in running last year's game at 120 FPS.
They are interested in running last year's game at over 12 FPS with non-embarrassing settings without needing a quad-socket board to do so.

They want the latest game to run at 60 FPS (plus they want to run GPGPU applications). More diversity means more flexibility is required. So dedicated hardware, power efficient as it may be, is useless if it's going to be underutilized.
It is useless only if it is never used, and this is only a problem if there is something really compelling that could be put in the area it takes up.

Power consumption is absolutely a limiter. No argument there. But I still think that cost is at least as big a limiter. You can't justify having logic that is going to be idle much of the time.
Cost is not a straightforward thing to calculate, and the cost of logic that is idle much of the time has been cut in half every 18 months or so.

At what frequency and voltage? It seems to me that it still results in higher performance to use all of those transistors at less than their maximum performance, versus leaving a large part idle and using the other part at maximum performance.
Frequency will depend on the design and process. Voltage in the near to mid-term is going to be rather close to what it is now. I'm not sure if voltage scaling has been declared dead, but it has dramatically leveled off and I am not sure anything has been promised on a process roadmap that changes things. Since voltage scaling is a critical (quadratic) component of power improvement, losing it has made power per transistor lag far behind the number of transistors.

As far as using transistors at less than their max performance goes, I don't quite follow.
A transistor is switching and burning power, it is stable and leaking, or in some designs the block it is in may be gated off.
I'm not sure what metric of performance you mean. In terms of switching activity, specialized logic can get away with fewer clock cycles, fewer transistors, and less area for a given task.

It is not always better to spread the work out, particularly in the case of chips with power gating. Since leakage can take up to a third of total power consumption, there are situations where it is better to concentrate work in one area and power down the rest. Since the process of gating circuits involves some power devoted to switching the large power gates, and has a latency cost, it helps if the idle periods are long and predictable.
In that situation, spreading a task around and having multiple blocks working at half-speed means that they may not be able to turn as much off.

Intel desktop chips since Nehalem have devoted an entire microcontroller just for the purpose of managing the speed and gating of cores.

And yet even if I stress all of my CPU's threads, Turbo Mode still clocks it higher.
I'm not sure which version of Turbo it is, but the answer is that there are loads that can conceivably heat the core to the point that it would downclock below Turbo mode.
The idea is to cut as close to TDP as possible if performance is needed. There are workloads that would constrain turbo functionality, or introduce corner cases depending on the case environment and cooling solution.
Turbo is not Standard because the manufacturer cannot guarantee that speed at all times for all of its customers.
Even without that, there are workloads and loops that can heavily utilize your CPU enough to force throttling. Intel and AMD have internal apps that do just that for testing, just in case something out there happens to do the same.
 
I fail to see any point whatsoever behind a single issue ooo core.
It means you can process individual threads more quickly. Instead of waiting ~25 cycles to execute the next instruction (which may or may not be dependent), you can start executing independent instructions from the same thread every cycle. This means you can have fewer threads in flight, which in turn means less on-die storage is wasted on context data.

It allows much longer shaders/kernels without running out of registers. It also allows a bigger call stack. And it allows algorithms which don't parallelize very well to still run efficiently.
...my point was that the overall cost of ooo for very wide vectors is hardly minor.
It doesn't have to cost much at all. GF104 already supports superscalar issue, meaning it can check dependencies between two consecutive instructions. It doesn't take a lot of extra control logic to extend this window to several more instructions. First simple scoreboarding can be used, and later this can be extended with register renaming (Tomasulo). Even the most straightforward and tiny implementation can significantly reduce the number of threads required to be in flight. It doesn't have to be as aggressive as in a CPU.
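For what it's worth, here's a toy sketch of the kind of minimal scheduling meant here: each cycle, issue the oldest instruction in a small window whose source registers aren't still pending. The instruction encoding and window are made up for illustration, nothing like the real GF104 pipeline, and WAR/WAW hazards are ignored (that's what renaming would add).

Code:
# Toy scoreboard-style issue: each cycle pick the oldest not-yet-issued instruction in a
# small window whose sources are ready. Illustrative only; ignores WAR/WAW hazards,
# forwarding, and anything resembling a real GPU pipeline.
from collections import namedtuple

Instr = namedtuple("Instr", "name dst srcs latency")

def issue_oldest_ready(window):
    ready_at = {}      # register -> cycle its value becomes available (default 0 = valid)
    issued = {}        # name -> cycle the instruction was issued
    cycle = 0
    while len(issued) < len(window):
        for ins in window:
            if ins.name in issued:
                continue
            if all(ready_at.get(r, 0) <= cycle for r in ins.srcs):
                issued[ins.name] = cycle
                ready_at[ins.dst] = cycle + ins.latency
                break  # at most one issue per cycle from this thread
        cycle += 1
    return issued

# i1 -> i2 is a dependent chain; i3 is independent and slots in while i2 waits.
window = [Instr("i1", "r1", ["r0"], 4),
          Instr("i2", "r2", ["r1"], 4),
          Instr("i3", "r3", ["r0"], 4)]
print(issue_oldest_ready(window))   # {'i1': 0, 'i3': 1, 'i2': 4}

An in-order machine would stall i3 behind i2; with enough independent instructions in the window, the same thread keeps the ALU busy, which is exactly what lets you hold fewer threads resident.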

For things like result forwarding there's also a choice between not forwarding at all, forwarding at one or two stages, or aggressively forwarding at every possible stage. The aggressive choice doesn't make sense for GPUs, but modest forwarding can eventually offer significant benefit. In fact I wonder what the power consumption implication is of keeping certain results in the pipeline for much longer than necessary...
 
A bunch of grad students presented a paper (@HPG10, iirc) which implemented a ~3B/sec micropolygon rasterizer with some pretty fancy features not found in gpu rasterizers. It used 4.1 mm2 in TSMC 40nm. Just how many more programmable cores do you think you will be able to buy in that much area?
That's hardly a relevant question. You have all of the programmable cores available for any task, exactly when you need them and for as long as you need them. So the area, power consumption and performance of dedicated versus programmable rasterization is a complex function of the utilization and other factors. Simply comparing the area for dedicated hardware versus programmable cores tells you nearly nothing.
Idle hw is perfectly fine, if its area and power consumption are tuned well.
Sure, but that's a big "if". I didn't find that paper you're referring to, but if they used an aggressive clock frequency to achieve that performance level then it might just be more efficient to spread the rasterization workload over many power optimized programmable cores.

Here's a presentation which suggests that efficient software rasterization is within reach: https://attila.ac.upc.edu/wiki/images/9/95/CGI10_microtriangles_presentation.pdf
 
Voltage in the near to mid-term is going to be rather close to what it is now. I'm not sure if voltage scaling has been declared dead, but it has dramatically leveled off and I am not sure anything has been promised on a process roadmap that changes things.
According to the ITRS ORTC-6 table, the voltage continues to drop by a factor of 0.9 every three years. The capacitance also decreases with every process node, so power per area remains nearly constant. You shouldn't waste area on logic with low utilization though; that would cost performance, or money for a bigger chip.
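A quick sanity check on what that works out to, with idealized per-node factors I'm assuming here (a 0.7x linear shrink, capacitance per gate scaling with it, frequency held constant), not ITRS data:

Code:
# Illustrative scaling check, P ~ C * V^2 * f per gate. All factors below are assumptions
# for one node transition, not ITRS data.
C_scale = 0.7            # capacitance per gate (assumed ~linear shrink)
V_scale = 0.95           # roughly 0.9x every three years spread over ~two nodes (assumed)
f_scale = 1.0            # frequency held constant (assumed)
density = 1 / 0.7**2     # ~2x gates per mm^2

power_per_gate = C_scale * V_scale**2 * f_scale
power_per_area = power_per_gate * density
print(round(power_per_gate, 2), round(power_per_area, 2))   # ~0.63 per gate, ~1.29 per mm^2
# To keep power per area flat, switching activity or frequency per gate has to drop by
# roughly 20-25%, which is the "run more transistors at less than max performance" argument.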
Since voltage scaling is a critical (quadratic) component of power improvement, losing it has made power per transistor lag far behind the number of transistors.
That appears to be a premature conclusion. But anyhow, whether you're using dedicated or programmable hardware, doubling the performance requires (at least) double the switching activity. In other words, the increasing transistor budget does not change the power consumption ratio between them.

So I think it mainly comes down to utilization, and as the workloads get more complex and more diverse this favors programmable hardware.
It is not always better to spread the work out, particularly in the case of chips with power gating. Since leakage can take up to a third of total power consumption, there are situations where it is better to concentrate work in one area and power down the rest. Since the process of gating circuits involves some power devoted to switching the large power gates, and has a latency cost, it helps if the idle periods are long and predictable.
In that situation, spreading a task around and having multiple blocks working at half-speed means that they may not be able to turn as much off.
With programmable cores you'd have near full utilization all the time. So you can design it to stay within the power envelope at maximum utilization, without requiring power gating.
Intel desktop chips since Nehalem have devoted an entire microcontroller just for the purpose of managing the speed and gating of cores.
That's because some CPU workloads are single-threaded and would thus allow higher operating speeds for certain cores. With a GPU all workloads are multi-threaded so you don't need such complex speed, voltage and gating management.
 
I wasn't implying that. Actually I implied the reverse. Out-of-order execution lowers the latency of arithmetic code, but that's not very helpful if the thread is still suspended for long periods when accessing memory. So the average memory access latency has to be lowered, to achieve good benefit from out-of-order execution.
Still, if you have only a single "reasonable" cache latency to deal with and everything else is too long for OOE to make a dent then you can just use static scheduling for it, assuming a hit, and swap threads on a miss. Only with variable range of latencies which can all be covered by OOE will it improve significantly on static scheduling.

So you need multilevel caches where more than 1 level has a "small enough" latency.
 
It means you can process individual threads more quickly. Instead of waiting ~25 cycles to execute the next instruction (which may or may not be dependent), you can start executing independent instructions from the same thread every cycle. This means you can have fewer threads in flight, which in turn means less on-die storage is wasted on context data.
The ALU latency is ~25 cycles. The memory latency is ~700 cycles (and growing?). Reducing ALU latency will not reduce the number of in-flight threads required to hide memory latency. Reducing on-chip storage will cripple pretty much all present-day GPU workloads. And history shows us that no architecture has succeeded which sacrifices performance of existing workloads for the sake of workloads that do not yet exist.
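Roughly, with Little's-law style arithmetic (the latencies are the ones above; the ops-per-memory-access figure is an assumption purely for illustration):

Code:
# Toy latency-hiding estimate: how many threads must be resident so the ALUs never starve?
alu_latency = 25          # cycles before a dependent ALU result is usable (from above)
mem_latency = 700         # cycles for a memory access (from above)
alu_ops_per_mem = 10      # assumed: independent ALU instructions issued per memory access

def threads_to_hide(latency, work_per_thread_in_cycles):
    # Work in flight has to cover the latency being hidden (ceiling division).
    return -(-latency // work_per_thread_in_cycles)

print(threads_to_hide(alu_latency, 1))                 # ~25 threads to cover ALU latency
print(threads_to_hide(mem_latency, alu_ops_per_mem))   # ~70 threads to cover memory latency
# Even if ILP/OoO covered the ALU term entirely, the ~70 contexts needed for memory
# latency (and their registers) don't go away.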

It doesn't have to cost much at all. GF104 already supports superscalar issue, meaning it can check dependencies between two consecutive instructions.
And for some reason, nv decided to keep the register file fetch bandwidth of GF100. How long do you think a few in-pipe registers can sustain multiple issue for complex workloads, even if ILP is there?

Besides, it is not clear if the dependency resolution happens in hw or statically.
 
That's hardly a relevant question. You have all of the programmable cores available for any task, exactly when you need them and for as long as you need them. So the area, power consumption and performance of dedicated versus programmable rasterization is a complex function of the utilization and other factors. Simply comparing the area for dedicated hardware versus programmable cores tells you nearly nothing.

PAPER: http://graphics.stanford.edu/papers/hwrast/

The power/perf situation is pretty clear. You will lose a lot of performance by going to sw rasterization. You will gain almost nothing in power or area to compensate for that. Even for micropolygons, you get ~1 tri/clk, much better than the ~30 ops per sample which would take close to ~10 cycles even on a VLIW4 architecture, while switching vast swathes of die compared to a rasterizer.
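To put the comparison in concrete terms (the ops/sample figure is the one above; the clock and lane count are assumptions purely for illustration):

Code:
# Illustrative throughput comparison; clock and lane count are assumptions.
clock_hz        = 0.8e9   # assumed core clock
hw_tris_per_clk = 1       # dedicated micropolygon rasterizer: ~1 tri/clk (from above)
sw_ops_per_tri  = 30      # ~30 ALU ops per sample/micropolygon (from above)
alu_lanes       = 64      # assumed: ALU lanes you are willing to take away from shading

hw_rate = hw_tris_per_clk * clock_hz             # ~0.8 Gtri/s from a few mm^2
sw_rate = alu_lanes * clock_hz / sw_ops_per_tri  # ~1.7 Gtri/s, but those lanes can no
print(hw_rate / 1e9, sw_rate / 1e9)              # longer run shader code while rasterizing
# Whether that trade is worth it is exactly the utilization vs. perf/W/$ argument here.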

Utilization isn't everything. Perf/W/$ is.

Also see http://bps10.idav.ucdavis.edu/talks...panelFutureFixedFunction_BPS_SIGGRAPH2010.pdf

Sure, but that's a big "if". I didn't find that paper you're referring to, but if they used an aggressive clock frequency to achieve that performance level then it might just be more efficient to spread the rasterization workload over many power optimized programmable cores.
Historical trends of power optimized cores suggest narrow issue, in-order cores clocked lower.

Here's a presentation which suggests that efficient software rasterization is within reach: https://attila.ac.upc.edu/wiki/images/9/95/CGI10_microtriangles_presentation.pdf
Nice find. Thanks for this.
 
According to the ITRS ORTC-6 table, the voltage continues to drop by a factor of 0.9 every three years.
This is below the rate of decrease from the early half of the 2000s.
Vdd is expected to be very difficult to scale further, as the threshold voltage is not scaling and the voltage margin is getting narrow. There are elements of the chip that are more difficult to lower voltage for, such as SRAM.

That appears to be a premature conclusion. But anyhow, whether you're using dedicated or programmable hardware, doubling the performance requires (at least) double the switching activity. In other words, the increasing transistor budget does not change the power consumption ratio between them.
What is the ratio, however? The ceiling for power consumption is not changing, so which has room to double?

So I think it mainly comes down to utilization, and as the workloads get more complex and more diverse this favors programmable hardware.
My expectation is that designs will have a core amount of programmability, with adjunct specialized coprocessors or fixed function blocks it is able to offload or pick between.

With programmable cores you'd have near full utilization all the time.
I don't want that unless the utilization=overall performance.
This is not a direct relationship, if we compare using an FP unit versus emulating FP in software.

So you can design it to stay within the power envelope at maximum utilization, without requiring power gating.
We may need to define what we mean by utilization. If you mean max possible utilization=max number of switching transistors, setting the design's general performance cap so that the chip cannot exceed its power envelope in rare spike instances will leave performance on the table, since this must be a conservative estimate.

That's because some CPU workloads are single-threaded and would thus allow higher operating speeds for certain cores. With a GPU all workloads are multi-threaded so you don't need such complex speed, voltage and gating management.
The latest AMD GPUs already have gained power management capability that is somewhat comparable to what AMD is doing for its CPUs. We'll be getting the functionality anyway.
 
The power/perf situation is pretty clear. You will lose a lot of performance by going to sw rasterization.
... if all you're doing is rasterizing (in the way that the FF rasterizer in question is designed to do), sure. Questions like these are always only relevant in the context of a full workload. Saying "doing task X in hardware is lower power than in software" is not interesting by itself. Hell a full implementation of UE3 in hardware would use less power than it does in software ;) The more complex bit is identifying sub-tasks for implementation in fixed-function hardware: ones that are done exactly the same way frequently enough, and are somewhat inefficient when implemented in software.

Texturing is an obvious example, as it uses unique enough data types/sizes and memory access patterns to be worth a dedicated implementation. Rasterization is *probably* the same, but it's less clear for several reasons.
1) We do want to do more complex things than current rasterizers can do. Conservative and logarithmic rasterization come to mind, with the former actually coming up in people implementing software rasterizers that run on the CPU in their games...
2) Scheduling out of the rasterizer being fixed is not really ideal. This is a large reason why people turn to deferred shading right now... just to decouple it from the rasterizer!
3) Rasterization in ALUs is not actually that inefficient. The math fits pretty well actually. The data paths are more complex in software, but probably not intractable.

Anyways I'm not making the argument that we need to switch to software rasterizers now or anything... if that switch ever does happen (and it may not) it'll be because at some point everyone has just started using software versions for the flexibility and the HW ones are just wasting area. More likely though is both software and hardware rasterization are used in most titles for various things.

My real point though is that the discussion of the utility of FF hardware can never be made in a vacuum where you test just that HW. It always has to be "for algorithm X in engine Y", etc. or similar. The key is it must be analyzed relative to some complete workload.
 
... if all you're doing is rasterizing (in the way that the FF rasterizer in question is designed to do), sure. Questions like these are always only relevant in the context of a full workload. Saying "doing task X in hardware is lower power than in software" is not interesting by itself. Hell a full implementation of UE3 in hardware would use less power than it does in software ;) The more complex bit is identifying sub-tasks for implementation in fixed-function hardware: ones that are done exactly the same way frequently enough, and are somewhat inefficient when implemented in software.

Texturing is an obvious example, as it uses unique enough data types/sizes and memory access patterns to be worth a dedicated implementation. Rasterization is *probably* the same, but it's less clear for several reasons.
1) We do want to do more complex things than current rasterizers can do. Conservative and logarithmic rasterization come to mind, with the former actually coming up in people implementing software rasterizers that run on the CPU in their games...
2) Scheduling out of the rasterizer being fixed is not really ideal. This is a large reason why people turn to deferred shading right now... just to decouple it from the rasterizer!
3) Rasterization in ALUs is not actually that inefficient. The math fits pretty well actually. The data paths are more complex in software, but probably not intractable.

Anyways I'm not making the argument that we need to switch to software rasterizers now or anything... if that switch ever does happen (and it may not) it'll be because at some point everyone has just started using software versions for the flexibility and the HW ones are just wasting area. More likely though is both software and hardware rasterization are used in most titles for various things.

My real point though is that the discussion of the utility of FF hardware can never be made in a vacuum where you test just that HW. It always has to be "for algorithm X in engine Y", etc. or similar. The key is it must be analyzed relative to some complete workload.

Let me rephrase. There's not much to be gained by getting rid of the hw rasterizer. And since it's there, and it is so fast, might as well use it. And sw rasterization needs a binning step as well, possibly including a geometry stream out. If you are doing geometry binning, might as well put a mini-rasterizer in each core. :) Under sw control of course.

Conservative and log rasterization seem useful, but hw is unlikely to do them unless API support is there.
 
If you are doing geometry binning, might as well put a mini-rasterizer in each core. :) Under sw control of course.
Why? The tough part is the data and once it's "in the core" adding some extra interface for fixed-function hardware is just overhead. Evaluating edge functions works great on SIMD stuff... at best you need a couple fancy instructions.
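To make "edge functions on SIMD" concrete, here's a toy NumPy sketch of the classic half-space test over a pixel tile (illustrative only; fill rules, fixed-point snapping and hierarchical tile tests are all omitted):

Code:
# Half-space / edge-function coverage test for one triangle over a small pixel tile.
# This inner loop is embarrassingly data-parallel, which is why it maps onto SIMD lanes.
import numpy as np

def rasterize_tile(v0, v1, v2, tile_w=16, tile_h=16):
    """Boolean coverage mask for pixel centers of a tile_w x tile_h tile (CCW triangle)."""
    xs, ys = np.meshgrid(np.arange(tile_w) + 0.5, np.arange(tile_h) + 0.5)

    def edge(a, b):
        # Signed area of (b - a) x (p - a); >= 0 means the pixel is inside this half-space.
        return (b[0] - a[0]) * (ys - a[1]) - (b[1] - a[1]) * (xs - a[0])

    return (edge(v0, v1) >= 0) & (edge(v1, v2) >= 0) & (edge(v2, v0) >= 0)

mask = rasterize_tile((1.0, 1.0), (14.0, 2.0), (4.0, 13.0))
print(mask.sum(), "covered pixels in the tile")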

Again I'm not saying we necessarily get rid of "hardware rasterizers", which is a nebulous term. If you're measuring how tiny they are, you're probably talking about just the part that does edge functions and such, and that's not complicated, nor do we need dedicated hardware for that. The more interesting part is all of the queues and scheduling around them (like I said, data-paths), and that stuff does take up more space.

Conservative and log rasterization seem useful, but hw is unlikely to do them unless API support is there.
Sure but that's the chicken/egg of all features.
 