Software/CPU-based 3D Rendering

They should just be generic SIMD operations. Texturing is little more than a generic mipmap LOD calculation, a generic texel address calculation, a generic gather operation, and a generic filter operation. All of this can be done, and already has been done, in shaders. Likewise, programmable rasterization is currently a hot topic in graphics research.
From what I've gathered (not so much from application research as from common sense), processor designs are much more power-focused than area-focused. Take Nvidia dropping their hot clock, for example: an area-saving feature given up in favor of saving power.

With even smaller process geometries, you get an increasing number of transistors per area (in effect: per dollar), but less improvement in calculations per watt. So, unless you're in a business that does not care about power (yet), you might already be designing to meet specific power targets rather than specific area targets. In other words, you deliberately spend more area because you know that your processor cannot switch all its gates (in the cores) at once within your power budget anyway.
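To put rough numbers on that argument, here is a toy calculation; the per-node scaling factors are made-up illustrations, not data for any real process:

```cpp
// Illustrative dark-silicon arithmetic: per process node, the transistor budget
// roughly doubles while the energy per gate switch improves by less, so the
// fraction of the chip that can toggle at once within a fixed power budget shrinks.
// The scaling factors are assumptions for illustration, not real process data.
#include <algorithm>
#include <cstdio>

int main() {
    double transistors = 1.0;        // relative transistor budget
    double energy_per_switch = 1.0;  // relative energy per gate toggle
    const double power_budget = 1.0; // fixed power target (relative)

    for (int node = 0; node <= 4; ++node) {
        double active = std::min(1.0, power_budget / (transistors * energy_per_switch));
        printf("node %d: transistors x%4.1f, max active fraction %3.0f%%\n",
               node, transistors, 100.0 * active);
        transistors *= 2.0;         // assumed density doubling per node
        energy_per_switch *= 0.65;  // assumed ~35% lower switching energy per node
    }
    return 0;
}
```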
 
A pity you didn't attend SIGGRAPH / HPG where there were a few talks showing just how many times bigger a performance-equivalent programmable unit was compared to the fixed-function implementation.
That's comparing apples and oranges. Of course a unit which can do anything is much bigger. A fixed-function unit is a waste of hardware except for the times you can have it do the one single thing it was designed for, and it can be a bottleneck at other times.

So what was your point anyway? Everything should be fixed-function and GPUs are heading in the completely wrong direction?

Did those talks take SIMD into account? It makes programmable units many times more powerful, at a relatively low hardware cost. It is what has made it perfectly feasible for GPUs to become highly programmable thus far. Also, a lot of things can be optimized with a handful of new instructions. The area required for these new instructions is often negligible when you already have the rest of the programmable cores as a framework.
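To make the texturing decomposition from the top of the thread concrete, here is a minimal scalar sketch built from exactly those generic steps (single-channel texture and clamp addressing assumed for brevity; a real implementation would run this across SIMD lanes and add the mipmap LOD selection):

```cpp
// Minimal scalar sketch of bilinear texturing built from generic steps:
// address calculation, gather, and filtering. The texture layout and names
// are illustrative assumptions, not any particular hardware's scheme.
#include <algorithm>
#include <cmath>

struct Texture {
    const float* texels;  // single-channel texture, row-major
    int width, height;
};

static float fetch(const Texture& t, int x, int y) {
    // Clamp-to-edge addressing: a generic address calculation plus a gather.
    x = std::clamp(x, 0, t.width  - 1);
    y = std::clamp(y, 0, t.height - 1);
    return t.texels[y * t.width + x];
}

float sample_bilinear(const Texture& t, float u, float v) {
    // Map normalized coordinates to texel space (texel centers at half offsets).
    float x = u * t.width  - 0.5f;
    float y = v * t.height - 0.5f;
    int x0 = (int)std::floor(x), y0 = (int)std::floor(y);
    float fx = x - x0, fy = y - y0;

    // Gather the 2x2 footprint, then filter with three lerps.
    float c00 = fetch(t, x0,     y0);
    float c10 = fetch(t, x0 + 1, y0);
    float c01 = fetch(t, x0,     y0 + 1);
    float c11 = fetch(t, x0 + 1, y0 + 1);
    float top    = c00 + fx * (c10 - c00);
    float bottom = c01 + fx * (c11 - c01);
    return top + fy * (bottom - top);
}
```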

There's no point in having smaller units anyway. Today's GPUs have massive amounts of fully programmable computing power, but they're starting to run into the bandwidth wall. Future architectures can have lots more computing power, and will be completely bandwidth limited. So the choice is between programmable functionality while bandwidth limited, or fixed-function hardware while equally bandwidth limited. Not a hard pick.
 
They should just be generic SIMD operations. Texturing is little more than a generic mipmap LOD calculation, a generic texel address calculation, a generic gather operation, and a generic filter operation. All of this can be done, and already has been done, in shaders. Likewise, programmable rasterization is currently a hot topic in graphics research.
From what I've gathered (not so much from application research as from common sense), processor designs are much more power-focused than area-focused.
I didn't say area is an issue.
With even smaller process geometries, you get an increasing number of transistors per area (in effect: per dollar), but less improvement in calculations per watt. So, unless you're in a business that does not care about power (yet), you might already be designing to meet specific power targets rather than specific area targets. In other words, you deliberately spend more area because you know that your processor cannot switch all its gates (in the cores) at once within your power budget anyway.
None of this implies that fixed-function is the solution.

You're right that calculations per watt don't increase at the same rate as transistor count, but they still increase at a very substantial rate. In particular, they increase faster than bandwidth, at all levels. And this brings us back to power. Moving data around costs more power than performing operations on it. So no matter how power efficient a fixed-function unit is, sending data to it from a programmable core and back eventually costs more than performing the same operations with a few instructions.
 
I was under the (layman's) impression that fixed function hardware is more efficient than fully programmable units in almost every way except area.

Moving data around in a programmable core doesn't seem unproblematic either, since you'll be moving data from ultra-high-speed registers to cache and vice versa (at those ultra-high speeds). With texture data, whose latency can be hidden, you only need to move 1/4th of the data into the programmable cores (i.e. the bilinear-filtered sample instead of the raw texels) at that very energy-intensive speed.
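Rough arithmetic behind that 1/4th figure, assuming RGBA8 texels (the exact ratio obviously depends on the texture format):

```cpp
// Rough arithmetic behind the "1/4th of the data" figure, assuming RGBA8 texels.
#include <cstdio>

int main() {
    const int texel_bytes = 4;                   // RGBA8
    const int raw_bytes   = 2 * 2 * texel_bytes; // 2x2 footprint the filter reads
    const int out_bytes   = texel_bytes;         // one filtered sample returned to the core
    printf("raw footprint: %d bytes, filtered result: %d bytes (1/%d)\n",
           raw_bytes, out_bytes, raw_bytes / out_bytes);
    return 0;
}
```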

But surely there are more knowledgeable people than me to judge this. I'll just lean back and look at the trends in current power efficient hardware and upcoming generations.
 
Fixed function hardware should be more area efficient. Assuming you can power the programmable hardware, the primary tradeoff becomes how often the function is needed.
 
GPUs have steadily ditched many dedicated hardware blocks for various functions and effects, keeping only the most critical stages of the pipeline that we still see today -- primitive setup and rasterization, texturing, pixel writes/blends, etc. Functional units are cheap now: with the advances in the manufacturing process and the billions of transistors that can be crammed into a midrange chip, maintaining custom hardware logic is not a problem. Data movement and availability are what make or break a design now. Anyone -- big and small -- can do fine-grained power and clock gating of different parts of a chip, depending on the load type, but few can overcome the bandwidth limitations and the power drain of moving too much data in and out.
 
Fixed function hardware should be more area efficient.
I meant from a global point of view: FF units add area to the whole chip, most of the time more so than enabling the same functionality inside the programmable units would.
 
GPUs haven't "ditched" any significant FF HW in ages.
Several works at HPG/SIGGRAPH this year actually show renewed interest in adding FF HW from a few hardware/IP vendors, not to mention the ever-increasing amount of FF HW present on mobile SoCs.
 
Moving data around in a programmable core doesn't seem unproblematic either, since you'll be moving data from ultra-high-speed registers to cache and vice versa (at those ultra-high speeds). With texture data, whose latency can be hidden, you only need to move 1/4th of the data into the programmable cores (i.e. the bilinear-filtered sample instead of the raw texels) at that very energy-intensive speed.
The texture units are also fed from a cache. So you can't reduce the overall data movement by shifting the problem to a fixed-function unit.

Several years from now, for any given TDP you'll be able to have more programmable computing power than you can feed with data. So there's no point in having fixed-function units. It may still sound reasonable now, just to increase battery life even more or reduce cost, but in the future it would be as insane as suggesting a return to a Direct3D 7 architecture for the same reasons. Fully programmable hardware enables new possibilities that will eventually make current hardware look very restrictive.
 
GPUs haven't "ditched" any significant FF HW in ages.
I wouldn't call it ditching, but the fast evolution from fixed-function OpenGL ES 1.x to highly programmable OpenGL ES 3.0 does mean that proportionally less and less hardware is spent on fixed-function units. Project Logan, aka Mobile Kepler, also shows that desktop-level functionality isn't a major issue for power-restricted designs.

And like I said before, a lot of this is binary in nature. You either have the fixed-function unit, or you don't. So any evolution toward "ditching" a specific fixed-function unit will appear sudden. For instance the utilization of MSAA hardware has gone down steadily in recent years due to many image-based anti-aliasing techniques. Those use programmable shader cores instead (and they often want unfiltered, non-mipmapped, non-perspective texture samples). So sooner or later that dedicated hardware will 'suddenly' disappear and the programmable cores will be made more suitable to take over this functionality.
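As a toy illustration of the kind of shader work those image-based filters do (not any particular published technique; the constants are arbitrary), here is a sketch that reads raw, unfiltered neighbours and derives an edge blend weight from luma contrast:

```cpp
// Toy image-space anti-aliasing helper in the spirit of shader-based AA filters.
// It reads raw, unfiltered neighbours from a resolved colour buffer and returns
// a blend weight along detected luma edges. Thresholds are illustrative.
#include <algorithm>
#include <cmath>

struct Image {
    const float* rgb;   // packed RGB, row-major
    int width, height;
};

static float luma(const Image& img, int x, int y) {
    x = std::clamp(x, 0, img.width - 1);
    y = std::clamp(y, 0, img.height - 1);
    const float* p = &img.rgb[3 * (y * img.width + x)];
    return 0.299f * p[0] + 0.587f * p[1] + 0.114f * p[2];
}

// Returns a blend weight for pixel (x, y): 0 on flat areas, up to 0.5 on strong edges.
float edge_blend_weight(const Image& img, int x, int y, float threshold = 0.1f) {
    float c = luma(img, x, y);
    float contrast = std::max(
        std::fabs(luma(img, x - 1, y) - c) + std::fabs(luma(img, x + 1, y) - c),
        std::fabs(luma(img, x, y - 1) - c) + std::fabs(luma(img, x, y + 1) - c));
    return contrast < threshold ? 0.0f : std::min(0.5f, contrast);
}
```

Note that every fetch here is a plain, unfiltered load, which is exactly why the MSAA resolve hardware and the texture filter sit idle for this kind of pass.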
Several works at HPG/SIGGRAPH this year actually show renewed interest in adding FF HW from a few hardware/IP vendors...
That's no different from any other year. Researchers love to think that their technique is more important than anyone else's and deserves dedicated hardware. But you can't cater to all of them. So eventually they have to settle for an efficient shader-based implementation. If they're lucky they get a few new instructions.

What also has to be taken into account is that hardware manufacturers are scrambling to find the next 'killer app' that makes people want to buy a big piece of dedicated silicon. Integrated graphics is gaining market share, and with the CPU's throughput computing power skyrocketing it's only a matter of time before the integrated GPU gets unified. Unless GPU manufacturers can find something that even a million programmable units with specialized instructions can't handle and which consumers will value dearly...
...not to mention the ever-increasing amount of FF HW present on mobile SoCs.
That's not an increase.
 
A GeForce 4 wasn't good at physics at all. That's when Ageia was founded. GPUs became good at physics purely because graphics itself evolved toward generic computing. That evolution hasn't stopped, and at the same time CPUs are becoming good at graphics.

And GPUs are becoming even better at graphics than that. In addition, they're becoming better at things that were once CPU tasks like physics and video encoding.
 
A GeForce 4 wasn't good at physics at all. That's when Ageia was founded. GPUs became good at physics purely because graphics itself evolved toward generic computing. That evolution hasn't stopped, and at the same time CPUs are becoming good at graphics.
And GPUs are becoming even better at graphics than that.
Sure, but there will be a point where the difference between good and better no longer matters. We could have had an Ageia PhysX card that became even better at physics than the GPU has become, but it still went the way of the dodo because any minor advantage in efficiency didn't justify the cost.

The same thing is happening to graphics, albeit much more slowly. In fact the GPU is converging towards a more CPU-like architecture, so becoming "even" better at graphics than the CPU still means the gap is closing.
In addition, they're becoming better at things that were once CPU tasks like physics and video encoding.
That's just a consequence of graphics becoming more like generic computing. I mean, when two programmable architectures converge, for a while there is bound to be some overlap in the applications they run. It doesn't mean one has become superior to the other. Note that not all physics and video encoding is now happening on the GPU. The average consumer GPU is pretty low-end and doesn't outperform the CPU at these tasks.

So once again this is a case where becoming "better" at something just indicates convergence. It will eventually lead to unification.
 
The texture units are also fed from a cache. So you can't reduce the overall data movement by shifting the problem to a fixed-function unit.

Several years from now, for any given TDP you'll be able to have more programmable computing power than you can feed with data. So there's no point in having fixed-function units. It may still sound reasonable now, just to increase battery life even more or reduce cost, but in the future it would be as insane as suggesting a return to a Direct3D 7 architecture for the same reasons. Fully programmable hardware enables new possibilities that will eventually make current hardware look very restrictive.
I'm not arguing with your last couple of sentences, but with some before them.
While I also think it is true that you can have more math or programmable computing power than you can feed with data, I would tend towards a shift in paradigms as well. And I won't type out the same area-vs-power sermon for the nth time, because of...

...Note that not all physics and video encoding is now happening on the GPU. The average consumer GPU is pretty low-end and doesn't outperform the CPU at these tasks.
So once again this is a case where becoming "better" at something just indicates convergence. It will eventually lead to unification.
(my bold) Isn't that stuff in CPUs used for video encoding called QuickSync and mainly composed of FF hardware?
 
Isn't that stuff in CPUs used for video encoding called QuickSync and mainly composed of FF hardware?
The argument I was responding to is "[GPUs are] becoming better at things that were once CPU tasks like physics and video encoding". QuickSync doesn't support that. Nor does it support the opposite, for that matter...

Allow me to quote my previous answer to video decoding and such using fixed-function hardware: "Those are all excellent examples of I/O computing. There's no strong need to unify them, because there's no data locality between the components. The data flows in one direction, in or out. In effect, it's not collaborative heterogeneous computing."

So I'm not opposed to fixed-function hardware. All I'm saying is that programmable computing is converging and will eventually unify and enable new possibilities. The GPU has some remaining fixed-function hardware for graphics that makes this non-trivial, but it is becoming underutilized, its use cases are diversifying, and it can be efficiently replaced by specialized instructions. So that's not going to stop this convergence of programmable computing.

That said, QuickSync might also be a transitional thing. Note that MP3 decoding/encoding used to benefit from fixed-function hardware, but now the CPU can easily handle any (consumer) audio processing. Also, many new codecs have emerged that make dedicated hardware a waste and call for programmability instead. The same thing is happening with video. It used to take a lot of CPU power and there were only a few codecs, so it made sense to add dedicated hardware to offload the CPU for the most important use cases. Eventually video too will require only a fraction of the CPU's power, so the dedicated hardware can be dropped.
 
Interactive graphics will always be limited by computing resources.

I think that we will already know in this decade what rendering principle is most efficient and practical for infinitely scalable graphics rendering with classical computer architectures.
Additionally we will have to face the definite and abrupt end of the historic pace of falling prices for computing performance in microelectronics within two to six years.

Thus, the future undoubtedly belongs to fixed function hardware.
 
Interactive graphics will always be limited by computing resources.
Which itself is limited by bandwidth.
I think that we will already know in this decade what rendering principle is most efficient and practical for infinitely scalable graphics rendering with classical computer architectures.
Additionally we will have to face the definite and abrupt end of the historic pace of falling prices for computing performance in microelectronics within two to six years.
Yes, the CPU and GPU won't unify if process technology stops scaling or if the moon crashes into earth. Neither of those things is likely to happen in the foreseeable future though. A massive industry depends on it, with far-reaching consequences if it were to grind to a halt. Some might prefer the moon crashing into earth instead. Ok, I'm obviously being overly dramatic there, but I'm just trying to point out how terrible your argument is, in every way. A lot of things can be claimed to never happen if, figuratively, something 'fell out of the sky'.

Of course the laws of physics won't change just because a trillion dollar industry wants them to. But the end of semiconductor scaling has been claimed many times before and has never come true. It's like doomsday predictions; you might expect something to happen, but the day after ends up being just like the day before. People don't seem to learn from that though and keep coming up with new fatalistic events that will happen within their lifetime. Although the earth will certainly eventually stop turning, in all likelihood it won't happen in an event that is 'visible' to us. To get back on topic, Pat Gelsinger said something very similar about silicon scaling: "We see no end in sight and we've had 10 years of visibility for the last 30 years."

As long as you have a decade of visibility, it is highly likely that after this decade has passed, you have many more years of continued scaling. If all research labs in the world came to a conference empty-handed, that's when you have to start worrying. We're nowhere near that. Worst case, we're seeing that progress has slowed down a little. But it's worth noting that we're ahead of the curve for several things, so it's to be expected that things slow down until we're back to a more natural pace, especially in today's global economic climate. And even if the slowing down is an early sign of a gradual tapering off, we'll still have several decades of very significant progress ahead of us.

Which brings me to how much time might be required for the CPU and GPU to unify. If hypothetically the integrated GPU was dropped, an 8-core CPU with AVX-512 could be pretty mainstream at 14 nm, and would be capable of 2 TFLOPS. That would really push the limits of what the DDR4 bandwidth will be able to help sustain. That's plenty for doing all graphics on the CPU, for the low-end market. And while 14 nm certainly is too soon for something this disruptive to happen (because there's much work left to be done), the basic building blocks are theoretically available and we'll have 10, 7 or 5 nm by the time it's really the next evolutionary step.
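For what it's worth, the 2 TFLOPS figure is simple arithmetic; here is the back-of-envelope version, where the two FMA ports per core and the ~4 GHz clock are my assumptions rather than anything announced:

```cpp
// Back-of-envelope peak throughput for a hypothetical 8-core AVX-512 CPU.
// The per-core figures (2 FMA ports, ~4 GHz sustained) are assumptions.
#include <cstdio>

int main() {
    const double cores     = 8;
    const double fma_ports = 2;        // assumed AVX-512 FMA units per core
    const double lanes     = 512 / 32; // single-precision lanes per vector
    const double flops_fma = 2;        // multiply + add
    const double clock_ghz = 4.0;      // assumed sustained clock
    double gflops = cores * fma_ports * lanes * flops_fma * clock_ghz;
    printf("Peak: %.0f GFLOPS (~%.1f TFLOPS)\n", gflops, gflops / 1000.0);
    return 0;
}
```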
Thus, the future undoubtedly belongs to fixed function hardware.
So we're going back to DirectX 7 graphics?

Undoubtedly not. Programmability is a must since more computing power is worthless unless you can do something the previous generation couldn't, while still retaining legacy capabilities. And you can't cater to one technique, because chances are that developers want something different and your fixed-function block becomes a waste of money. But we can have the best of both worlds with 'specialized' instructions. This means unification can happen while still getting most of the same benefits that fixed-function hardware offers. Determining instructions that do a maximum of useful work while remaining generic enough to serve a diverse range of applications is the real challenge going forward.
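As one concrete example of what such a 'specialized' instruction already looks like, AVX-512 has a gather instruction that covers the texel-fetch step of texturing in a single operation; the texture layout and index math below are illustrative assumptions:

```cpp
// Sketch: using the AVX-512 gather instruction for the texel-fetch step of
// texturing inside a programmable core. Requires AVX-512F (compile with -mavx512f).
// The row-major single-channel layout is an assumption for illustration.
#include <immintrin.h>

// Gathers 16 single-channel texels given per-lane integer x/y coordinates.
__m512 gather_texels(const float* texels, int width, __m512i x, __m512i y) {
    // index = y * width + x, computed per SIMD lane
    __m512i idx = _mm512_add_epi32(_mm512_mullo_epi32(y, _mm512_set1_epi32(width)), x);
    // One gather instruction fetches all 16 texels (scale = 4 bytes per float).
    return _mm512_i32gather_ps(idx, texels, 4);
}
```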
 
Which itself is limited by bandwidth.
Bandwidth will always be a problem, but I think the latest FPGAs demonstrate that internal and external data bandwidth can be increased dramatically with modern IC manufacturing technologies.

Yes, the CPU and GPU won't unify if process technology stops scaling or if the moon crashes into earth. Neither of those things is likely to happen in the foreseeable future though. A massive industry depends on it, with far-reaching consequences if it were to grind to a halt. Some might prefer the moon crashing into earth instead. Ok, I'm obviously being overly dramatic there, but I'm just trying to point out how terrible your argument is, in every way. A lot of things can be claimed to never happen if, figuratively, something 'fell out of the sky'.

Of course the laws of physics won't change just because a trillion dollar industry wants them to. But the end of semiconductor scaling has been claimed many times before and has never come true. It's like doomsday predictions; you might expect something to happen, but the day after ends up being just like the day before. People don't seem to learn from that though and keep coming up with new fatalistic events that will happen within their lifetime. Although the earth will certainly eventually stop turning, in all likelihood it won't happen in an event that is 'visible' to us. To get back on topic, Pat Gelsinger said something very similar about silicon scaling: "We see no end in sight and we've had 10 years of visibility for the last 30 years."

As long as you have a decade of visibility, it is highly likely that after this decade has passed, you have many more years of continued scaling. If all research labs in the world came to a conference empty-handed, that's when you have to start worrying. We're nowhere near that. Worst case, we're seeing that progress has slowed down a little. But it's worth noting that we're ahead of the curve for several things, so it's to be expected that things slow down until we're back to a more natural pace, especially in today's global economic climate. And even if the slowing down is an early sign of a gradual tapering off, we'll still have several decades of very significant progress ahead of us.
The slowing pace of performance/price scaling for IC manufacturing is already an established fact of reality right now, and it will naturally worsen exponentially with time. What Intel says are just sweet, unspecific words to keep investors and their personal pride calm.

Which brings me to how much time might be required for the CPU and GPU to unify. If hypothetically the integrated GPU was dropped, an 8-core CPU with AVX-512 could be pretty mainstream at 14 nm, and would be capable of 2 TFLOPS. That would really push the limits of what the DDR4 bandwidth will be able to help sustain. That's plenty for doing all graphics on the CPU, for the low-end market.
Why would you want to stall a gigantic serial processing pipeline with massive parallel streaming/processing? Right! It makes no sense at all.

And while 14 nm certainly is too soon for something this disruptive to happen (because there's much work left to be done), the basic building blocks are theoretically available and we'll have 10, 7 or 5 nm by the time it's really the next evolutionary step.
There is already barely any cost advantage per transistor from the 32/22 nm step.

So we're going back to DirectX 7 graphics?

Undoubtedly not. Programmability is a must since more computing power is worthless unless you can do something the previous generation couldn't, while still retaining legacy capabilities. And you can't cater to one technique, because chances are that developers want something different and your fixed-function block becomes a waste of money. But we can have the best of both worlds with 'specialized' instructions. This means unification can happen while still getting most of the same benefits that fixed-function hardware offers. Determining instructions that do a maximum of useful work while remaining generic enough to serve a diverse range of applications is the real challenge going forward.
You should stop thinking about the classical OGL/D3D rendering pipeline. Computer graphics rendering is not some magic that needs endless iterations of new approaches.
The list of all rendering problems can be put on a single page, and the list of unique concepts needed to solve them can be condensed to a couple of lines.


Btw: Tim Sweeney should STFU and realize that he has never published or introduced a novel solution for interactive graphics. He says so many obviously wrong things that my eyes are bleeding.
 