#1 |
|
Meh
Join Date: Mar 2004
Location: New York
Posts: 9,809
|
Now that many of the major developers of 3D engines have chosen a deferred shading pipeline, should we expect hardware-accelerated MSAA to be dropped from future architectures? I'm not referring so much to rasterization or MSAA buffer generation as to the compression, blending and resolve steps of the process.
That hardware appears to be pretty much useless on the latest engine technology, so why keep it around? I expect new hardware to be fast enough to eat the bandwidth and compute costs of emulating these steps on the shader units for older forward renderers. Thoughts?
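For reference, the resolve step being discussed is conceptually simple when done in software. The following is only a minimal sketch, assuming a plain box filter and uncompressed per-pixel sample lists; the function name `resolve_msaa` and the data layout are illustrative and not taken from any real API or hardware.

```python
def resolve_msaa(samples):
    """Box-filter resolve: average the colour samples stored for each pixel.

    `samples` is a list of pixels; each pixel is a list of (r, g, b) tuples,
    one per MSAA sample. This layout is an assumption made for illustration.
    """
    resolved = []
    for pixel in samples:
        n = len(pixel)
        # Average each colour channel independently across the samples.
        resolved.append(tuple(sum(s[c] for s in pixel) / n for c in range(3)))
    return resolved


# An edge pixel: two samples covered by a white triangle, two still black.
edge_pixel = [(1.0, 1.0, 1.0), (1.0, 1.0, 1.0), (0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
# An interior pixel: all four samples hold the same red colour.
interior_pixel = [(1.0, 0.0, 0.0)] * 4

print(resolve_msaa([edge_pixel, interior_pixel]))
```

A real shader-based resolve would also have to read the compressed sample storage, which is the part this sketch leaves out.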
__________________
What the deuce!? |
|
|
|
|
|
#2 |
|
Senior Member
Join Date: Jan 2003
Location: Ottawa, Ontario
Posts: 1,783
|
Eventually, yes, I would expect all fixed-function hardware to go away. Developers are using the hardware in ways it was not originally designed for. So it makes more sense to just make things fully generic and highly programmable.
It's going to take many years though. High-end hardware may have bandwidth and compute resources to spare to implement the fixed-function functionality in software for older applications, but the average consumer doesn't buy such high-end hardware. So it takes a slow evolution on both the hardware and software side. But looking at how vastly different graphics chips were just a decade ago, I'm fairly confident that we're going to see Larrabee-like architectures from NVIDIA and AMD before the end of this decade. And in the low-end market there's also clearly going to be a collision with CPUs that feature gather/scatter support. |
|
|
|
|
|
#3 |
|
Meh
Join Date: Mar 2004
Location: New York
Posts: 9,809
|
That's a fair point but what good is fixed function hardware on mainstream cards when the software is no longer making use of it? The future will present an option of compute or bust. I'm not sure how fixed function units can be slowly phased out on upcoming engines.
__________________
What the deuce!? |
|
|
|
|
|
#4 | |
|
AndyTX
Join Date: May 2004
Location: British Columbia, Canada
Posts: 1,841
|
Quote:
I severely doubt that the hardware that "does MSAA" (i.e. coverage samples from non-grid locations) will go away. It's far too useful and while all the rage is on the screen-space reconstruction stuff right now, there's very little chance that it will be used exclusively in the future. There's a very good reason why you need to super-sample visibility and an even better reason why you don't do it in an ordered grid.
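The ordered-grid point can be illustrated with a small sketch. It sweeps a vertical edge across one pixel and records which coverage fractions each 4-sample pattern can report. The sample positions below are illustrative assumptions, not the pattern of any particular GPU.

```python
# Four samples on an ordered 2x2 grid: only two distinct x positions.
ORDERED_GRID = [(0.25, 0.25), (0.75, 0.25), (0.25, 0.75), (0.75, 0.75)]
# Four samples on a rotated grid: every sample has its own x and y position.
ROTATED_GRID = [(0.375, 0.125), (0.875, 0.375), (0.125, 0.625), (0.625, 0.875)]


def coverage_levels(pattern, steps=64):
    """Sweep a vertical edge across the pixel and collect the distinct
    coverage fractions the sample pattern can produce."""
    levels = set()
    for i in range(steps + 1):
        edge_x = i / steps
        # A sample counts as covered when it lies to the left of the edge.
        covered = sum(1 for (x, _y) in pattern if x < edge_x)
        levels.add(covered / len(pattern))
    return sorted(levels)


print(coverage_levels(ORDERED_GRID))
print(coverage_levels(ROTATED_GRID))
```

With the same sample count, the rotated pattern distinguishes more intermediate coverage steps on a near-vertical edge, because the ordered grid's samples share x positions.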
__________________
The content of this message is my personal opinion only. |
|
|
|
|
|
|
#5 | ||
|
Meh
Join Date: Mar 2004
Location: New York
Posts: 9,809
|
Quote:
Quote:
__________________
What the deuce!? |
||
|
|
|
|
|
#6 |
|
AndyTX
Join Date: May 2004
Location: British Columbia, Canada
Posts: 1,841
|
Ah ok. Yeah I imagine we'll keep the rasterizer + ROP/compression logic (so you can basically rasterize a compressed MSAA buffer of arbitrary data) but it's not important to have fixed-function resolve. My guess is it will just be generalized to allow the programmable stuff to deal with the compressed data slightly more efficiently.
__________________
The content of this message is my personal opinion only. |
|
|
|
|
|
#7 |
|
Nutella Nutellae
Join Date: Feb 2002
Location: San Francisco
Posts: 4,297
|
With power consumption being the number one constraint these days, it's likely fixed-function HW will keep us company for a long, long time.
__________________
[twitter] More samples, we need more samples! [Dean Calver] The opinions expressed herein are my own personal opinions and do not represent my employer's view in any way |
|
|
|
|
|
#8 | ||
|
Senior Member
Join Date: Jan 2003
Location: Ottawa, Ontario
Posts: 1,783
|
Not likely. Tessellation is making some triangles really tiny, while other triangles remain pretty large. So you'd need to spend a large portion of the die area on a dedicated rasterizer to sustain the maximum triangle rate, but it's going to be idle much of the time. So eventually it's more efficient to just replace the rasterizer with more shader cores and get high utilization all the time.
Other tasks also benefit more from having extra programmable cores versus a bulky rasterizer. It's very similar to the vertex and pixel pipeline unification that took place several years ago. Applications were held back by the ratio of vertex and pixel pipelines. Unification fixed this and also enabled new uses. Programmable rasterization is one of the next steps to ensure you can throw almost anything at the GPU and have it processed efficiently. Quote:
Rasterizers, ROPs and even texture samplers, can in time all be replaced by more generic cores. Quote:
Looking at the future evolution, peak performance/Watt will steadily improve with process technology, but effective performance/dollar improves more slowly. Cost determines the die size and power consumption determines the peak performance, but getting high effective performance requires high utilization. |
||
|
|
|
|
|
#9 | ||
|
AndyTX
Join Date: May 2004
Location: British Columbia, Canada
Posts: 1,841
|
Quote:
Quote:
I do agree that latencies need to drop on GPUs though moving forward. I don't think it's going to continue to be feasible to fill such a wide machine with such tiny local memories/caches in the general case.
__________________
The content of this message is my personal opinion only. |
||
|
|
|
|
|
#10 |
|
Regular
|
Meh, maybe ... but you're trading less storage for more wasted cycles (speculation wastes more cycles than vertical multithreading) and more processor hardware. Hardly a guaranteed win.
__________________
Cinematic is the new streamlined. |
|
|
|
|
|
#11 | |
|
Tea maker
Join Date: Feb 2002
Location: In the Island of Sodor, where the steam trains lie
Posts: 4,382
|
That, to me, implies that the tessellation algorithm is broken.
Quote:
__________________
"Your work is both good and original. Unfortunately the part that is good is not original and the part that is original is not good." -(attributed to) Samuel Johnson "I invented the term Object-Oriented, and I can tell you I did not have C++ in mind." Alan Kay |
|
|
|
|
|
|
#12 |
|
Regular
|
How so? If you're not using displacement mapping you could have small tris at silhouettes and areas of high curvature and large tris in flat regions.
__________________
Cinematic is the new streamlined. |
|
|
|
|
|
#13 | ||||
|
Senior Member
Join Date: Sep 2003
Location: Well within 3d
Posts: 4,121
|
Quote:
Quote:
Chips expend swaths of die space to cut power consumption, because power is the limiter of performance. Quote:
In four node transitions, future chips could house 16x the transistors, but may only cut power consumption per transistor by half. It's already the case that the vast majority of a given CPU must be inactive in a given clock cycle, because no modern performance IC as of perhaps the .13 or .18 micron node can actually be fully "on" for a sustained period of time. Both Bulldozer and Sandy Bridge have tech papers obsessing about power and variation (which affects power and performance). They both expend scads of die area on the problem. Quote:
Power consumption is proportional to utilization. The rest are related to each other in many complicated ways.
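The scaling figures above (16x the transistors, half the power per transistor) can be turned into a back-of-the-envelope number. This is a sketch under stated assumptions: the power budget stays fixed and the result is relative to whatever fraction of the chip can be active today. The function name `active_fraction` is hypothetical.

```python
def active_fraction(transistor_scale, power_per_transistor_scale,
                    power_budget_scale=1.0):
    """Fraction of the future chip that can be switched on, relative to
    today's active fraction, for a given power budget."""
    # Power the chip would draw if the same share of it were active as today.
    scaled_power = transistor_scale * power_per_transistor_scale
    return min(1.0, power_budget_scale / scaled_power)


# Four node transitions: 16x transistors, each using half the power.
print(active_fraction(16, 0.5))
```

Under those assumptions the active share drops to one eighth of today's, which is why area spent on keeping idle logic cheap looks attractive.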
__________________
Dreaming of a .065 micron etch-a-sketch. |
||||
|
|
|
|
|
#14 |
|
Senior Member
|
Adding OoO for very wide vectors means adding that many more ports/banks to your register file. And they are not exactly cheap, especially when you have 32/64-wide vectors. So it is far from obvious that OoO is not very expensive, or even a net perf/mm win with very wide vectors.
Of course generic hw can do FB compression/decompression. The question is at what power cost? Idle hw is cheap and hardly a concern in this day and age. Especially if it is for a simple operation. Also, if you tile your screen like Fermi, then you don't need bulky rasterizers. |
|
|
|
|
|
#15 |
|
Senior Member
Join Date: Sep 2003
Location: Well within 3d
Posts: 4,121
|
The width of the vector register can be separated for the most part from the OoOE engine with a physical register file. Then, it's just passing around a pointer to the register data, instead of copying the value itself to various reservation stations.
Sandy Bridge's implementation allowed 256b registers to be added and the number of rename registers to be increased at the same time.
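A toy model of the idea, as a sketch only: with a physical register file, the rename stage hands out small integer tags and the wide data stays in one place. The class `RenameTable` and its methods are hypothetical names for illustration. Freeing registers at retirement and stalling when the free list runs out are both left out.

```python
class RenameTable:
    """Toy rename stage backed by a physical register file (PRF)."""

    def __init__(self, num_arch_regs, num_phys_regs):
        # Architectural register i starts out mapped to physical register i.
        self.mapping = list(range(num_arch_regs))
        # Remaining physical registers are available for new destinations.
        self.free_list = list(range(num_arch_regs, num_phys_regs))
        # The wide vector data would live only here; the scheduler sees tags.
        self.phys_regs = [None] * num_phys_regs

    def rename(self, dest, srcs):
        """Return (dest_tag, src_tags) for one instruction."""
        # Sources read the current mapping before the destination is remapped.
        src_tags = [self.mapping[s] for s in srcs]
        dest_tag = self.free_list.pop(0)
        self.mapping[dest] = dest_tag
        return dest_tag, src_tags


table = RenameTable(num_arch_regs=4, num_phys_regs=8)
first = table.rename(dest=1, srcs=[2, 3])   # r1 = r2 op r3
second = table.rename(dest=1, srcs=[1, 2])  # r1 = r1 op r2, reads the new r1
print(first, second)
```

The tags stay the same size whether the registers are 128 or 256 bits wide, which is the sense in which vector width is mostly decoupled from the scheduling logic.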
__________________
Dreaming of a .065 micron etch-a-sketch. |
|
|
|
|
|
#16 |
|
Senior Member
|
But to make useful use of OoO, wouldn't you need to be able to fetch multiple operands per cycle to the ALUs? Will that not need wider datapaths into and out of the register file? Also, wouldn't you need multiple ALUs to sustain multiple issue? Which will, ironically, lower utilization.
IOW, there's more to OoO than dependency resolution, isn't there? Also, what about the area penalty needed to lower the ALU/reg file latency from ~20 cycles to ~5 cycles? Will it come anywhere close to being compensated? |
|
|
|
|
|
#17 | ||||
|
Senior Member
Join Date: Sep 2003
Location: Well within 3d
Posts: 4,121
|
Quote:
If data width becomes a limitation, it would be in other parts of the pipeline, which could impact the need for an aggressive scheduler. Quote:
Quote:
Quote:
__________________
Dreaming of a .065 micron etch-a-sketch. |
||||
|
|
|
|
|
#18 | ||
|
Senior Member
Join Date: Jan 2003
Location: Ottawa, Ontario
Posts: 1,783
|
Quote:
Quote:
|
||
|
|
|
|
|
#19 | |
|
Beyond3d isn't defined yet
Join Date: Jan 2008
Location: New Zealand
Posts: 3,042
|
Quote:
__________________
It all makes sense now: Gay marriage legalized on the same day as marijuana makes perfect biblical sense. Leviticus 20:13 "A man who lays with another man should be stoned". Our interpretation has been wrong all these years! |
|
|
|
|
|
|
#20 |
|
Senior Member
Join Date: Jan 2003
Location: Ottawa, Ontario
Posts: 1,783
|
Out-of-order execution doesn't imply speculation. You can still use multiple threads to hide branching latencies (or any other latency for that matter). But generic caches and out-of-order execution can dramatically lower the average latency and thus allow the instruction pointer to advance much faster. This in turn means less on-die storage is wasted on thread contexts, and workloads don't have to be ridiculously data parallel to get good efficiency.
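To put rough numbers on the storage argument, here is a sketch that assumes latency is hidden purely by switching between threads, each issuing one instruction before it has to wait. The ~20 and ~5 cycle latencies are the figures mentioned earlier in the thread. The register count and register width are made-up assumptions, and both function names are hypothetical.

```python
import math


def contexts_needed(latency_cycles, issue_cycles_per_thread=1):
    """Thread contexts required to keep the ALU busy while each thread
    waits `latency_cycles` for its previous result."""
    return math.ceil(latency_cycles / issue_cycles_per_thread)


def context_storage_bytes(latency_cycles, regs_per_thread, bytes_per_reg,
                          issue_cycles_per_thread=1):
    """Register storage taken up by all resident thread contexts."""
    contexts = contexts_needed(latency_cycles, issue_cycles_per_thread)
    return contexts * regs_per_thread * bytes_per_reg


# Assumed: 32 registers per thread, 64 bytes each (16 lanes of 4 bytes).
print(context_storage_bytes(20, regs_per_thread=32, bytes_per_reg=64))
print(context_storage_bytes(5, regs_per_thread=32, bytes_per_reg=64))
```

Cutting the average latency by 4x cuts the context storage by the same factor in this model, though it says nothing about the area spent on achieving the lower latency.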
|
|
|
|
|
|
#21 |
|
Senior Member
Join Date: Jan 2003
Location: Ottawa, Ontario
Posts: 1,783
|
Generic doesn't have to mean high power consumption. Dedicated hardware has to achieve high performance while being squeezed into a small area, leading to higher power consumption while at the same time other parts of the chip are idle. If instead you have more generically programmable cores, the tasks can use a larger portion of the chip, which can then be more optimized for power.
Of course it's a delicate balancing act. But in the case of vertex and pixel pipeline unification it worked out rather well, especially since it also enabled new techniques. Note also that merely a decade ago people were nearly declared mental for suggesting floating-point pixel processing. So while it's hard to predict exactly what the developers will do with it, more generic programmability has always proven to be a success if it's introduced gradually. |
|
|
|
|
|
#22 | |
|
Senior Member
Join Date: Sep 2003
Location: Well within 3d
Posts: 4,121
|
Quote:
If one allows for the fact that GPUs on discrete cards can have higher TDPs than most mainstream socketed processors, maybe yes. On the other hand, the transistor counts tend to be much higher for the big chips, so the ratio may be lower than the TDP may suggest. If one goes by the lamentation of people who think we can move completely over to software and finally utilize all those idling transistors, no. The designs devote transistors to different things as well. OoOE, speculation, many pipeline stages, and low-latency forwarding add transistors. These transistors are part of the execution process, so they are active on any task being performed. Wider throughput designs that rely on less aggressive scheduling and relaxed latency can spend fewer transistors on the meta-work of execution, often at the cost of utilization. GPU and CPU designers are striving hard to add more fine-grained clock gating and power management. With the addition of power gating, even more area is being devoted to power reduction. Many of these features actually increase area, so die size doesn't need to go down with future chips.
__________________
Dreaming of a .065 micron etch-a-sketch. |
|
|
|
|
|
|
#23 |
|
Beyond3d isn't defined yet
Join Date: Jan 2008
Location: New Zealand
Posts: 3,042
|
Ahh, thanks!
__________________
It all makes sense now: Gay marriage legalized on the same day as marijuana makes perfect biblical sense. Leviticus 20:13 "A man who lays with another man should be stoned". Our interpretation has been wrong all these years! |
|
|
|
|
|
#24 | |
|
Regular
|
Quote:
__________________
Cinematic is the new streamlined. |
|
|
|
|
|
|
#25 | |
|
Epsilon plus three
Join Date: Feb 2002
Location: Chania
Posts: 7,767
|
Quote:
http://bps10.idav.ucdavis.edu/talks/...GGRAPH2010.pdf http://bps10.idav.ucdavis.edu/talks/...GGRAPH2010.pdf There's a lot more research material available, but any of those ideas (or even a clever combination of several) would probably depend not just on hw changes but also on future sw changes, and not necessarily DX12 (I of course don't have a clue what it could contain). A more general question for the actual topic here is the dilemma between picking the remaining low-hanging fruit to improve the efficiency of existing hw and taking the much higher risk of adopting completely different approaches too soon.
__________________
People are more violently opposed to fur than leather; because it's easier to harass rich ladies than motorcycle gangs. |
|
|
|
|