Hardware MSAA

trinibwoy

Now that many of the major developers of 3D engines have chosen a deferred shading pipeline, should we expect hardware-accelerated MSAA to be dropped from future architectures? I'm not referring so much to rasterization or MSAA buffer generation as to the compression, blending and resolve steps of the process.

That hardware appears to be pretty much useless on the latest engine technology so why keep it around? I expect new hardware to be fast enough to eat the bandwidth and compute costs of emulating these steps on the shader units for older forward renderers.

Thoughts?
 
Eventually, yes, I would expect all fixed-function hardware to go away. Developers are using the hardware in ways it was not originally designed for. So it makes more sense to just make things fully generic and highly programmable.

It's going to take many years though. High-end hardware may have bandwidth and compute resources to spare to implement the fixed-function functionality in software for older applications, but the average consumer doesn't buy such high-end hardware. So it takes a slow evolution on both the hardware and software side.

But looking at how vastly different graphics chips were just a decade ago, I'm fairly confident that we're going to see Larrabee-like architectures from NVIDIA and AMD before the end of this decade. And in the low-end market there's also clearly going to be a collision with CPUs that feature gather/scatter support.
 
That's a fair point, but what good is fixed-function hardware on mainstream cards when the software is no longer making use of it? The future will present an option of compute or bust. I'm not sure how fixed-function units can be slowly phased out when upcoming engines simply won't use them.
 
I'm not referring so much to rasterization or MSAA buffer generation as to the compression, blending and resolve steps of the process.
Compression is still useful when doing deferred MSAA - in fact it's even more useful than with forward MSAA since you're rendering to wider buffers! Not sure what you mean by "blending" wrt MSAA. Resolve could very well go away (or more likely just be implemented in software), but it's not very complex hardware anyways so it's not like it's a huge win to get rid of it. Also note that fixed-function hardware isn't that costly when it can be completely powered off when not in use. Area is not as much of a concern as power these days.

I severely doubt that the hardware that "does MSAA" (i.e. coverage samples from non-grid locations) will go away. It's far too useful and while all the rage is on the screen-space reconstruction stuff right now, there's very little chance that it will be used exclusively in the future. There's a very good reason why you need to super-sample visibility and an even better reason why you don't do it in an ordered grid.
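
For reference, a resolve is little more than a gather and average per pixel, which is why doing it "in software" is cheap. A minimal C++ sketch of what a 4x box-filter resolve computes (the interleaved buffer layout and names here are made up for illustration; a compute shader would do the same math, one pixel or tile per thread):

```cpp
#include <cstdint>
#include <vector>

// Illustrative layout: samples stored interleaved per pixel, RGBA8.
struct RGBA8 { uint8_t r, g, b, a; };

// Box-filter resolve of an N-sample MSAA color buffer into a single-sample image.
void resolve_box(const std::vector<RGBA8>& msaa, std::vector<RGBA8>& out,
                 int width, int height, int samples)
{
    out.resize(size_t(width) * height);
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            int sumR = 0, sumG = 0, sumB = 0, sumA = 0;
            const RGBA8* px = &msaa[(size_t(y) * width + x) * samples];
            for (int s = 0; s < samples; ++s) {
                sumR += px[s].r; sumG += px[s].g;
                sumB += px[s].b; sumA += px[s].a;
            }
            out[size_t(y) * width + x] = {
                uint8_t(sumR / samples), uint8_t(sumG / samples),
                uint8_t(sumB / samples), uint8_t(sumA / samples) };
        }
    }
}
```

A deferred renderer typically wants to shade each sample (or only detected edge pixels) before averaging rather than averaging raw G-buffer data, which is exactly why the fixed-function version of this step goes unused there.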
 
Compression is still useful when doing deferred MSAA - in fact it's even more useful than with forward MSAA since you're rendering to wider buffers! Not sure what you mean by "blending" wrt MSAA. Resolve could very well go away (or more likely just be implemented in software), but it's not very complex hardware anyways so it's not like it's a huge win to get rid of it. Also note that fixed-function hardware isn't that costly when it can be completely powered off when not in use. Area is not as much of a concern as power these days.

Blending was probably the wrong word. I was referring to updates on the MSAA buffer with samples from overlapping geometry. Point taken on compression, didn't realize it was generally applicable outside of hardware MSAA.

I severely doubt that the hardware that "does MSAA" (i.e. coverage samples from non-grid locations) will go away. It's far too useful and while all the rage is on the screen-space reconstruction stuff right now, there's very little chance that it will be used exclusively in the future. There's a very good reason why you need to super-sample visibility and an even better reason why you don't do it in an ordered grid.

Yeah I agree, that's why I made the distinction earlier between rasterization and downstream processing.
 
Yeah I agree, that's why I made the distinction earlier between rasterization and downstream processing.
Ah ok. Yeah I imagine we'll keep the rasterizer + ROP/compression logic (so you can basically rasterize a compressed MSAA buffer of arbitrary data) but it's not important to have fixed-function resolve. My guess is it will just be generalized to allow the programmable stuff to deal with the compressed data slightly more efficiently.
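
To make the compression point concrete, here's a toy per-pixel scheme in C++ (purely illustrative, not any actual hardware format): most pixels are fully covered by a single triangle, so all their samples hold the same value and only one copy needs to touch memory.

```cpp
#include <array>
#include <cstdint>

// Toy per-pixel compression for a 4x MSAA color buffer (illustrative only).
// If all samples match, only one value is stored; otherwise fall back to all four.
struct CompressedPixel {
    bool     allSamplesEqual;   // "fully covered by one primitive" fast path
    uint32_t color[4];          // only color[0] is meaningful when allSamplesEqual
};

CompressedPixel compress(const std::array<uint32_t, 4>& samples)
{
    CompressedPixel c{};
    c.allSamplesEqual = samples[1] == samples[0] &&
                        samples[2] == samples[0] &&
                        samples[3] == samples[0];
    if (c.allSamplesEqual) {
        c.color[0] = samples[0];                    // 1 value instead of 4
    } else {
        for (int s = 0; s < 4; ++s) c.color[s] = samples[s];
    }
    return c;
}

std::array<uint32_t, 4> decompress(const CompressedPixel& c)
{
    std::array<uint32_t, 4> out;
    for (int s = 0; s < 4; ++s)
        out[s] = c.allSamplesEqual ? c.color[0] : c.color[s];
    return out;
}
```

Real hardware tracks this per tile with a few metadata bits rather than per pixel, but the bandwidth saving is the same whether the consumer is a fixed-function resolve or a deferred pass reading the samples back - which is the earlier point about wider buffers.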
 
Eventually, yes, I would expect all fixed-function hardware to go away. Developers are using the hardware in ways it was not originally designed for. So it makes more sense to just make things fully generic and highly programmable.
With power consumption being the number one constraint these days it's likely fixed function HW will keep us company for a long long time..
 
I imagine we'll keep the rasterizer...
Not likely. Tessellation is making some triangles really tiny, while other triangles remain pretty large. So you'd need to spend a large portion of the die area on a dedicated rasterizer to sustain the maximum triangle rate, but it's going to be idle much of the time. So eventually it's more efficient to just replace the rasterizer with more shader cores and get high utilization all the time.

Other tasks also benefit more from having extra programmable cores versus a bulky rasterizer.

It's very similar to the vertex and pixel pipeline unification that took place several years ago. Applications were held back by the ratio of vertex and pixel pipelines. Unification fixed this and also enabled new uses. Programmable rasterization is one of the next steps to ensure you can throw almost anything at the GPU and have it processed efficiently.
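
To put "programmable rasterization" in perspective, the arithmetic itself is tiny - three edge-function tests per sample. A naive C++ half-space sketch (no fill rules, clipping or hierarchical traversal; assumes one consistent triangle winding):

```cpp
#include <cstdio>

// Signed area term: positive when point (cx, cy) lies on one side of edge a->b.
static float edge(float ax, float ay, float bx, float by, float cx, float cy)
{
    return (bx - ax) * (cy - ay) - (by - ay) * (cx - ax);
}

// Naive half-space rasterizer: a sample is covered if it lies on the same
// side of all three edges. A real implementation evaluates this per tile
// and per sample position instead of looping over the whole screen.
void rasterize(float x0, float y0, float x1, float y1, float x2, float y2,
               int width, int height)
{
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            float px = x + 0.5f, py = y + 0.5f;   // pixel-center sample
            float w0 = edge(x1, y1, x2, y2, px, py);
            float w1 = edge(x2, y2, x0, y0, px, py);
            float w2 = edge(x0, y0, x1, y1, px, py);
            if (w0 >= 0 && w1 >= 0 && w2 >= 0)
                printf("covered: %d,%d\n", x, y);
        }
    }
}
```

A shader-based version would evaluate the same three edge equations per sample over coarsely binned tiles; the hard part, as noted further down in the thread, is the data paths feeding it rather than the math.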
...ROP/compression logic (so you can basically rasterize a compressed MSAA buffer of arbitrary data) but it's not important to have fixed-function resolve. My guess is it will just be generalized to allow the programmable stuff to deal with the compressed data slightly more efficiently.
Framebuffer compression can also perfectly be handled by programmable hardware. More local storage is needed though, and to make that available GPUs should reduce the number of threads they need to keep in flight, by reducing execution latencies (Fermi needs to hide at least 24 cycles, Larrabee only 4 cycles, and it can be further reduced with out-of-order execution - which isn't all that expensive when you have very wide vectors).

Rasterizers, ROPs and even texture samplers, can in time all be replaced by more generic cores.
nAo said:
With power consumption being the number one constraint these days it's likely fixed function HW will keep us company for a long long time..
While power consumption is definitely a big constraint, I don't think it's the number one constraint. You can't have too much dedicated hardware even when taking power consumption out of the equation, because the chip would just get too big (read: expensive). Performance/dollar still dominates performance/Watt.

Looking at the future evolution, peak performance/Watt will steadily improve with process technology, but effective performance/dollar improves more slowly. Cost determines the die size and power consumption determines the peak performance, but getting high effective performance requires high utilization.
 
Not likely. Tessellation is making some triangles really tiny, while other triangles remain pretty large. So you'd need to spend a large portion of the die area on a dedicated rasterizer to sustain the maximum triangle rate, but it's going to be idle much of the time. So eventually it's more efficient to just replace the rasterizer with more shader cores and get high utilization all the time.
Perhaps, but it's also possible we'll just reach a point where it can do "enough" triangles. I agree that the math/logic for doing rasterization isn't actually very hard or ill-suited to programmable hardware; the tougher bit is the data paths.

Framebuffer compression can also perfectly be handled by programmable hardware.
Hard to say - there are advantages to having logic nearer the memory hardware. We've seen this with ROPs/atomics and it's not totally clear that it won't continue to be a win vs. paying excessive instruction counts/penalties for amplification operations.

I do agree that latencies need to drop on GPUs though moving forward. I don't think it's going to continue to be feasible to fill such a wide machine with such tiny local memories/caches in the general case.
 
it can be further reduced with out-of-order execution - which isn't all that expensive when you have very wide vectors).
Meh, maybe ... but you're trading less storage for more wasted cycles (speculation wastes more cycles than vertical multithreading) and more processor hardware. Hardly a guaranteed win.
 
Not likely. Tessellation is making some triangles really tiny,
That, to me, implies that the tessellation algorithm is broken. :???:
Framebuffer compression can also perfectly be handled by programmable hardware.....
Rasterizers, ROPs and even texture samplers, can in time all be replaced by more generic cores.
And bigger "generic" power supplies and cooling systems?
 
While power consumption is definitely a big constraint, I don't think it's the number one constraint. You can't have too much dedicated hardware even when taking power consumption out of the equation, because the chip would just get too big (read: expensive).
This is just about the easiest problem to scale with process technology. Putting aside that dedicated or specialized hardware is almost always smaller than a generic replacement, transistor budget for a device of a given price is about the only thing Moore's law covers.

Performance/dollar still dominates performance/Watt.
This is a mischaracterization of the situation. Power consumption has been a limiter for years now. Perhaps for so long that it has become an accepted part of reality and no longer noticed.
Chips expend swaths of die space to cut power consumption, because power is the limiter of performance.

Looking at the future evolution, peak performance/Watt will steadily improve with process technology, but effective performance/dollar improves more slowly.
This is an inversion of current trends. Power concerns are significantly capping performance growth, while transistor counts per device have gone up in line with Moore's law.
In four node transitions, future chips could house 16x the transistors, but may only cut power consumption per transistor by half - meaning a fully active chip would draw something like 8x the power of today's parts.
It's already the case that the vast majority of a given CPU must be inactive in a given clock cycle, because no modern performance IC as of perhaps the .13 or .18 micron node can actually be fully "on" for a sustained period of time.
Both Bulldozer and Sandy Bridge have tech papers obsessing about power and variation (which affects power and performance). They both expend scads of die area on the problem.

Cost determines the die size and power consumption determines the peak performance, but getting high effective performance requires high utilization.
The only real direct relationship amongst the items in this listing is one not actually stated.
Power consumption is proportional to utilization. The rest are related to each other in many complicated ways.
 
Adding OoO for very wide vectors means adding that many more ports/banks to your register file. And they are not exactly cheap, especially when you have 32/64 wide vectors. So it is far from obvious that OoO is not very expensive, or even a net perf/mm win with very wide vectors.

Of course generic hw can do FB compression/decompression. The question is at what power cost? Idle hw is cheap and hardly a concern in this day and age. Especially if it is for a simple operation.

Also, if you tile your screen like Fermi, then you don't need bulky rasterizers.
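
A sketch of the kind of screen tiling being referred to (tile size and data structures are made up for illustration): a coarse pass only bins each triangle's bounding box into tiles, and the fine per-sample work then proceeds tile by tile on whatever units are free.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

constexpr int kTileSize = 16;   // 16x16 pixel tiles (illustrative)

struct Triangle { float x[3], y[3]; };

// Coarse binning: append the triangle index to every tile its bounding box touches.
// The fine, per-sample coverage test then runs independently per tile.
void bin_triangle(const Triangle& t, uint32_t triIndex,
                  int tilesX, int tilesY,
                  std::vector<std::vector<uint32_t>>& tileLists)
{
    float minX = std::min({t.x[0], t.x[1], t.x[2]});
    float maxX = std::max({t.x[0], t.x[1], t.x[2]});
    float minY = std::min({t.y[0], t.y[1], t.y[2]});
    float maxY = std::max({t.y[0], t.y[1], t.y[2]});

    int tx0 = std::max(0, int(minX) / kTileSize);
    int ty0 = std::max(0, int(minY) / kTileSize);
    int tx1 = std::min(tilesX - 1, int(maxX) / kTileSize);
    int ty1 = std::min(tilesY - 1, int(maxY) / kTileSize);

    for (int ty = ty0; ty <= ty1; ++ty)
        for (int tx = tx0; tx <= tx1; ++tx)
            tileLists[size_t(ty) * tilesX + tx].push_back(triIndex);
}
```

Tiny tessellated triangles mostly land in a single tile, so the binning cost scales with the work actually generated rather than with a fixed worst-case rasterizer width.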
 
The width of the vector register can be separated for the most part from the OoOE engine with a physical register file. Then, it's just passing around a pointer to the register data, instead of copying the value itself to various reservation stations.
Sandy Bridge's implementation allowed 256-bit registers to be added and the number of rename registers to be increased at the same time.
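
A rough C++ sketch of that indirection (all sizes invented for illustration): the wide data sits in one physical register file, while renaming and scheduling only ever pass around small indices, so widening the vectors doesn't widen the out-of-order machinery.

```cpp
#include <array>
#include <cstdint>

constexpr int kArchRegs = 16;    // architectural vector registers (illustrative)
constexpr int kPhysRegs = 144;   // physical vector registers (rename pool)
constexpr int kVecBytes = 32;    // 256-bit vector

// The physical register file holds the actual wide data...
std::array<std::array<uint8_t, kVecBytes>, kPhysRegs> physRegFile;

// ...while the rename table and the scheduler only move 8-bit indices
// into it, regardless of how wide the vectors get.
std::array<uint8_t, kArchRegs> renameTable;

struct ScheduledOp {
    uint8_t srcPhys[2];   // pointers (indices) to operands, not operand values
    uint8_t dstPhys;      // newly allocated physical register for the result
};

// Renaming an instruction: look up sources, allocate a destination.
ScheduledOp rename(uint8_t srcA, uint8_t srcB, uint8_t dst, uint8_t freePhysReg)
{
    ScheduledOp op{};
    op.srcPhys[0] = renameTable[srcA];
    op.srcPhys[1] = renameTable[srcB];
    op.dstPhys    = freePhysReg;
    renameTable[dst] = freePhysReg;   // later readers see the new mapping
    return op;
}
```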
 
But to make good use of OoO, wouldn't you need to be able to fetch multiple operands per cycle to the ALUs? Will that not need wider datapaths into and out of the register file? Also, wouldn't you need multiple ALUs to sustain multiple issue? Which will, ironically, lower utilization.

IOW, there's more to OoO than dependency resolution, isn't there?

Also, what about the area penalty needed to lower the ALU/register file latency from ~20 cycles to ~5 cycles? Will it come anywhere close to being compensated for?
 
But to make good use of OoO, wouldn't you need to be able to fetch multiple operands per cycle to the ALUs?
That's wide-issue, which is orthogonal to OoO. Wide in-order chips still need to read multiple operands.
If data width becomes a limitation, it would be in other parts of the pipeline, which could impact the need for an aggressive scheduler.

Will that not need wider datapaths into and out of the register file?
The paths would need to be wider, but the scheduler wouldn't need to care about it beyond the location identifier.

IOW, there's more to OoO than dependency resolution, isn't there?
There are things such as exception tracking and speculation recovery in addition to dependence tracking. For a given amount of throughput, it is actually cheaper to have 1 wide instruction versus multiple narrow ones, since this is per-register and per-instruction, not per-bit. Actually using the width is a separate matter.

Also, what about the area penalty needed to lower the ALU/register file latency from ~20 cycles to ~5 cycles? Will it come anywhere close to being compensated for?
An increase in width would make data movement more energy-intensive and increase the size of the data paths around the register file, which typically are already larger than the SRAM arrays themselves. That could worsen power and latency.
 
Perhaps, but it's also possible we'll just reach a point where it can do "enough" triangles.
That means it's either a bottleneck, or idle silicon. That would be ok if it was small and offered a substantial benefit, but that balance appears to be tipping as the workloads become ever more diverse.
I agree that the math/logic for doing rasterization isn't actually very hard or ill-suited to programmable hardware; the tougher bit is the data paths.
I'm actually more concerned about the data paths for dedicated hardware. There are ever more pipeline stages and different configurations. You can either have many different dedicated data paths, or a single general purpose data path using a cache hierarchy to pull data from and store results into, to be consumed by the next stage.
 
It's already the case that the vast majority of a given CPU must be inactive in a given clock cycle, because no modern performance IC as of perhaps the .13 or .18 micron node can actually be fully "on" for a sustained period of time.
Both Bulldozer and Sandy Bridge have tech papers obsessing about power and variation (which affects power and performance). They both expend scads of die area on the problem.


The only real direct relationship amongst the items in this listing is one not actually stated.
Power consumption is proportional to utilization. The rest are related to each other in many complicated ways.

So in this respect a GPU is likely to have a much higher percentage of its total transistors actively switching at any one moment, right? So given that both AMD and Nvidia have hit a hard cap with respect to overall power usage, could one expect GPU die area to fall with every process node transition henceforth?
 
Meh, maybe ... but you're trading less storage for more wasted cycles (speculation wastes more cycles than vertical multithreading) and more processor hardware. Hardly a guaranteed win.
Out-of-order execution doesn't imply speculation. You can still use multiple threads to hide branching latencies (or any other latency for that matter). But generic caches and out-of-order execution can dramatically lower the average latency and thus allow the instruction pointer to advance much faster. This in turn means less on-die storage is wasted on thread contexts, and workloads don't have to be ridiculously data-parallel to get good efficiency.
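
The storage trade-off is easy to put rough numbers on. A back-of-the-envelope C++ sketch using the 24-cycle vs 4-cycle figures quoted earlier in the thread (the per-context register size is invented purely for illustration):

```cpp
#include <cstdio>

// To keep an execution port busy you need roughly `latency` independent
// instructions in flight; with no ILP extracted within a thread, that
// means one resident thread context per cycle of latency to hide.
int contexts_needed(int latency_cycles, int independent_ops_per_thread)
{
    return (latency_cycles + independent_ops_per_thread - 1) / independent_ops_per_thread;
}

int main()
{
    const int bytes_per_context = 64 * 4 * 4;  // 64 lanes x 4 bytes x 4 live registers (made up)
    const int latencies[] = { 24, 4 };         // the Fermi-ish vs Larrabee-ish figures from the thread

    for (int latency : latencies) {
        int ctx = contexts_needed(latency, 1);
        printf("hide %2d cycles -> ~%2d contexts -> ~%d KiB of live register state\n",
               latency, ctx, ctx * bytes_per_context / 1024);
    }
    return 0;
}
```

Cutting the latency that has to be hidden from 24 to 4 cycles shrinks the resident register state by roughly 6x, and that freed storage is exactly what could become generic cache or let less embarrassingly parallel workloads run well.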
 