Software/CPU-based 3D Rendering

All any processor can do is perform various arithmetic operations. PhysX cards filled a niche because compute shaders didn't exist back then.
A processor moves data and operates on it. The more interesting aspect of the PPU was the data-movement part it used to send state updates to and from the processing units.

I'm not sure the elements on a GPU or CPU are a perfect match to that.
 
I think in 20 years, computers won't have CPUs at all.

I think you're slightly mad. ;) Today even your mouse has a CPU. And you definitely need one to translate whatever data you want your GPU to push into the GPU-specific format. There's no way around it, really, unless you end up with a HW monopoly.
 
Of course I meant high-performance CPUs in personal computers. Your mouse can't run OpenGL.
 
The high-performance CPU I had in my personal computer 20 years ago was utterly incapable of doing what a simple mouse does today in real-time. Neither can/could run OpenGL, and of course that didn't/doesn't stop them being good at what they did/do.

Personally I think you lack depth of perspective. Perhaps I'm wrong.
 
What does a mouse do in real time that is so demanding?
 
Not that I agree with the no-CPUs-in-x-years theory, but the amount of processing done in today's optical mice must be quite substantial: AFAIK it's all processing of CCD images at very high refresh rates. (Though a major part is probably fixed-function image processing...)

Still, 20 years ago we were talking about a 486 at 33 MHz, right? It shouldn't be too hard to find tiny microcontrollers with more integer performance than that. ;)
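For a rough idea of the work involved, something like this (toy code; the names and the 18x18 sensor size are my own assumptions, not any vendor's firmware): the sensor compares two successive low-resolution frames and picks the shift with the smallest difference, thousands of times per second.

```cpp
#include <array>
#include <cstdint>
#include <cstdlib>
#include <climits>

constexpr int N = 18;  // assumed sensor resolution; real parts vary

using Frame = std::array<std::uint8_t, N * N>;

// Find the (dx, dy) shift that minimizes the sum of absolute differences
// between the previous and the current frame, searching +/-2 pixels.
// Real sensors do this (or a cross-correlation) in fixed-function logic
// at thousands of frames per second.
void estimateMotion(const Frame& prev, const Frame& cur, int& bestDx, int& bestDy)
{
    long bestSad = LONG_MAX;
    for (int dy = -2; dy <= 2; ++dy) {
        for (int dx = -2; dx <= 2; ++dx) {
            long sad = 0;
            for (int y = 2; y < N - 2; ++y)
                for (int x = 2; x < N - 2; ++x)
                    sad += std::abs(int(prev[y * N + x]) -
                                    int(cur[(y + dy) * N + (x + dx)]));
            if (sad < bestSad) { bestSad = sad; bestDx = dx; bestDy = dy; }
        }
    }
}
```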
 
What kind of resolution and refresh rate is that? 160x120 at 125 Hz? OK, the refresh rate can be higher, but maybe it's a lower resolution?

20 years ago we had that 486 at 33 MHz, or even a DX2/66, which is perhaps the first common CPU we can call powerful. Those beasts were used for real-time 3D rendering, sometimes with textures, or for high-end raycasted games like Doom and Duke3D.
 
The high-performance CPU I had in my personal computer 20 years ago was utterly incapable of doing what a simple mouse does today in real-time. Neither can/could run OpenGL, and of course that didn't/doesn't stop them being good at what they did/do.

Personally I think you lack depth of perspective. Perhaps I'm wrong.

And here you create an imaginary problem of a generational gap that completely misses my point.
 
You missed the point. It's not about the company disappearing, it's about the dedicated chips disappearing. They disappeared because of cost and because physics calculations merely consist of various arithmetic formulas. This is just generic computing, which can be handled by a CPU or a highly programmable GPU. Any kind of hardware specialization for certain types of physics calculations would create limitations that hamper innovation.
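To make that concrete, a minimal sketch (my own toy code, not Ageia's or anyone's actual solver): one rigid-body integration step is just a handful of multiply-adds per body, which any CPU core or programmable GPU handles without dedicated hardware.

```cpp
#include <vector>

// Minimal body state: position and velocity only (no rotation), purely to
// illustrate that "physics" boils down to generic arithmetic.
struct Body {
    float px, py, pz;   // position
    float vx, vy, vz;   // velocity
    float invMass;      // 1 / mass (0 = static body)
};

// One semi-implicit Euler step: v += (F / m) * dt, then p += v * dt.
// Plain multiply-add work that vectorizes trivially across bodies.
void integrate(std::vector<Body>& bodies, float fx, float fy, float fz, float dt)
{
    for (Body& b : bodies) {
        b.vx += fx * b.invMass * dt;
        b.vy += fy * b.invMass * dt;
        b.vz += fz * b.invMass * dt;
        b.px += b.vx * dt;
        b.py += b.vy * dt;
        b.pz += b.vz * dt;
    }
}
```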

You may remember the discrete FPU, maybe you were too young to know these.
You could say these have disappeared, but in fact they are still there...
 
Not sure what you mean. At least for Intel, the "FPU" is part of the same scheduler logic as the integer units, so it's not even remotely a discrete part.
 
That's the point. At one point in history it was a separate ASIC (the 8087).
 
And the PPU was redundant from the beginning because GPUs were already good at physics. CPUs aren't good at graphics.
A GeForce 4 wasn't good at physics at all. That's when Ageia was founded. GPUs became good at physics purely because graphics itself evolved toward generic computing. That evolution hasn't stopped, and at the same time CPUs are becoming good at graphics.

AVX-512 offers eight times more FLOPS per core than what most software renderers are currently using. Replacing the integrated GPU with more cores would practically double that, while TSX reduces the synchronization overhead. AVX-512's gather support, its 32 registers, and its optional exponential instructions should also make a significant difference. And if the vectors were extended to 1024-bit, the execution units could process them in two cycles to help hide latency and remove front-end bottlenecks while saving some power.
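As a simplified sketch of the instruction mix this enables (assuming an AVX-512 capable compiler and CPU; the function and parameter names are mine, not any actual renderer's inner loop): one gather fetches 16 arbitrarily addressed texels and one fused multiply-add shades 16 pixels' worth of floats at once.

```cpp
#include <immintrin.h>

// Shade 16 pixels at once: out = texel * lightIntensity + ambient.
// 'texels' is a flat float texture, 'offsets' holds 16 precomputed texel
// indices. Compile with AVX-512F support (e.g. -mavx512f).
void shade16(const float* texels, const int* offsets,
             float lightIntensity, float ambient, float* out)
{
    __m512i idx   = _mm512_loadu_si512(offsets);            // 16 texel indices
    __m512  texel = _mm512_i32gather_ps(idx, texels, 4);    // one gather, 16 texels
    __m512  light = _mm512_set1_ps(lightIntensity);
    __m512  amb   = _mm512_set1_ps(ambient);
    __m512  color = _mm512_fmadd_ps(texel, light, amb);     // one FMA, 16 lanes
    _mm512_storeu_ps(out, color);                           // write 16 shaded values
}
```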

So CPUs can become really, really good at graphics. And the possibilities for new APIs and algorithms are endless.
 
You may remember the discrete FPU, maybe you were too young to know these.
I was five years old when the 80387 was launched. Of course I remember it. Vividly.
You could say these have disappeared, but in fact they are still there...
Sure, but there is no area on any chip to date which can under any definition be designated as the Physics Processing Unit. It has completely vanished. That doesn't mean the functionality itself is lost. It just moved to software that is executed on programmable cores. Likewise I claim that in the distant future we will no longer be able to distinguish the GPU, and all functionality will move to software.

The FPU's history shows us that functional units and instructions do survive unification. We can observe that GPU-like SIMD units and their associated instructions are also finding their way into the CPU cores, and AVX-512 appears to be the next big evolutionary step. Everything else you might desire for graphics can be added as relatively generic instructions as well.
 
Some dedicated hardware is still there. For some years there's been that craze for H.264 decoders, first in graphics cards, then in cell phones, and we're almost done with it because nearly every piece of hardware integrates one by now. It's still not fully sorted out, though, as Linux is only seeing nascent support for them in the open-source drivers for AMD/ATI and NVIDIA cards/chipsets/APUs.
And then we'll need H.265 and/or VP9 decoding (unless everyone sticks to H.264).

That's one example, not the whole grand future of things, but look at what we find inside a phone SoC or even a PC CPU: CPUs, a GPU, image processors (in tons of phone/embedded chips, but also Intel QuickSync), audio codecs and other audio-related DSPs, video decoders, video encoders, hardware blocks dedicated to software radio, crypto accelerators, and TCP/IP offloading (in e.g. Gb and 10Gb Ethernet interfaces).

It's even increasing: Moore's law gives more transistor benefits than power benefits, so in critically power/battery-limited chips you have a lot of "dead silicon", i.e. units that are turned on occasionally and then power-gated for several milliseconds or more (down to turning off CPU cores, or an entire CPU cluster in the 4+1 or 4+4 arrangements).

The HSA Foundation basically exists to promote all this stuff working better together.
It's not necessarily incompatible with things also getting more generic; e.g. graphics cards gained GPGPU abilities at the same time their video decoding abilities were increasing (and soon shaders were doing scaler/filter work).


Regarding the external FPU example: we ended up with an FPU inside the CPU, rather than the CPU becoming exceptionally strong at emulating it with integer code. [edit: well, you're making that same point, Nick]
So now we have the GPU getting inside the CPU (which even brings the H.264 decoder and such along with it), but we're not quite getting games to run on a software renderer yet.

Maybe they will fuse a bit more, so we end up with those 512-bit-wide CPUs doing graphics duties. But who knows, there may be texture filtering units built right into the CPU pipeline, decompressors for S3TC and friends, or whatever critical stuff is needed.
 
The FPU's history shows us that functional units and instructions do survive unification. We can observe that GPU-like SIMD units and their associated instructions are also finding their way into the CPU cores, and AVX-512 appears to be the next big evolutionary step. Everything else you might desire for graphics can be added as relatively generic instructions as well.

If you look at the evolution of SoCs, you see a very different story.
Look at the floorplan of the dies: a lot of silicon area is dedicated to specialized non-CPU units.
 
...look at what we find inside a phone SoC or even a PC CPU: CPUs, a GPU, image processors (in tons of phone/embedded chips, but also Intel QuickSync), audio codecs and other audio-related DSPs, video decoders, video encoders, hardware blocks dedicated to software radio, crypto accelerators, and TCP/IP offloading (in e.g. Gb and 10Gb Ethernet interfaces).
Those are all excellent examples of I/O computing. There's no strong need to unify them, because there's no data locality between the components. The data flows in one direction, in or out. In effect, it's not collaborative heterogeneous computing.

This is also why I think discrete graphics will survive for a fairly long time to come. Things like conditional rendering (based on occlusion query results) and the ability to spawn new tasks on the GPU contribute to making graphics unidirectional. That said, there are many more motivations for unification than to benefit from data locality. The GPU's vertex and pixel processing will never split up again, despite the unidirectional dependency. Making the GPU more independent requires making it more CPU-like, which attracts more non-graphics workloads, which calls for closer integration...

So it's important not to draw the wrong parallels. Granted, my PPU example was about highly interactive collaborative generic computing so unification was clearly inevitable, but I'm just trying to illustrate that there's a (sliding) scale to these things and multiple arguments are at play. Low-end graphics is slowly but surely approaching unification.
 
If you look at the evolution of SoCs, you see a very different story.
Look at the floorplan of the dies: a lot of silicon area is dedicated to specialized non-CPU units.
Don't mistake that for an increase in heterogeneity. They're just bringing what used to be external chips onto the same die. So you should look at the entire system. This is convergence, not divergence!

Depending on the type of processing being done, this may lead to unification or not. Pure I/O functionality won't unify, but for all intents and purposes the FPU did unify. Graphics is somewhere in the middle of this scale. It evolves slowly, but the forces of convergence are stronger than those keeping things separate, and they get stronger with every generation.
 
Even though there's definitely a trend towards it, graphics is not just compute. Rasterization and texture filtering are two examples that come to mind, where dedicated units seem to compare favourably versus generic computation core usage.

You might say of course that both techniques themselves are crutches that will be done away with in the near future, but that's been proposed for quite a while as well and has yet to catch on in commercial production.
 
Even though there's definitely a trend towards it, graphics is not just compute. Rasterization and texture filtering are two examples that come to mind, where dedicated units seem to compare favourably versus generic computation core usage.
Not having dedicated units doesn't mean there shouldn't be sufficiently specialized hardware for common graphics operations!

They should just be generic SIMD operations. Texturing is little more than a generic mipmap LOD calculation, a generic texel address calculation, a generic gather operation, and a generic filter operation. All of this can and has been done in shaders already. Likewise programmable rasterization is currently a hot topic in graphics research.
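Here's a plain scalar sketch of those address/gather/filter steps for a single bilinear sample (names mine; LOD selection and wrapping modes omitted), i.e. the same generic operations a texture unit hardwires:

```cpp
#include <algorithm>
#include <cmath>

// Bilinear sample from a single-channel float texture of size w x h, with
// (u, v) in [0, 1]. Address calculation, four "gathered" texel reads, and
// two lerps -- nothing a generic SIMD core can't do.
float sampleBilinear(const float* texture, int w, int h, float u, float v)
{
    float x = u * w - 0.5f;
    float y = v * h - 0.5f;
    int x0 = std::clamp(int(std::floor(x)), 0, w - 1);
    int y0 = std::clamp(int(std::floor(y)), 0, h - 1);
    int x1 = std::min(x0 + 1, w - 1);
    int y1 = std::min(y0 + 1, h - 1);
    float fx = x - std::floor(x);   // fractional position between texels
    float fy = y - std::floor(y);

    float t00 = texture[y0 * w + x0], t10 = texture[y0 * w + x1];
    float t01 = texture[y1 * w + x0], t11 = texture[y1 * w + x1];

    float top    = t00 + (t10 - t00) * fx;   // lerp along x (top row)
    float bottom = t01 + (t11 - t01) * fx;   // lerp along x (bottom row)
    return top + (bottom - top) * fy;        // lerp along y
}
```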

Now why would someone replace a fixed-function pipeline that can do all of these operations in parallel and which produces one filtered texel per cycle, with programmable units that take multiple instructions? For the same reasons why this happened to the fixed-function vertex and pixel pipelines! Shaders can have hundreds of instructions, which are even broken up into scalar operations these days. But this is acceptable because instead of a few deep pipelines we now have a massive number of shallow ones. This has enabled programmability and all the goodness it has brought to the world of graphics, at a small cost that was well worth the added flexibility.

Programmability also eventually improves performance, especially for complex algorithms. It would be unthinkably slow or even practically impossible to try to implement some of today's graphics algorithms with fixed-function pipelines. Also, some of the classic graphics pipeline stages are more often than not disabled, and can be replaced with programmable operations which overall is a more efficient use of the hardware. In other cases fixed-function hardware is a bottleneck, which would disappear if all cores were capable of the necessary basic operations.
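In the same spirit, here's a sketch of what programmable rasterization boils down to (scalar, unoptimized, names mine, just to show the shape of the loop): per pixel it is a few multiply-adds and sign tests, which map straight onto wide SIMD lanes.

```cpp
#include <algorithm>
#include <cmath>

struct Vec2 { float x, y; };

// Signed area term for the edge from a to b, evaluated at point p.
static float edge(Vec2 a, Vec2 b, Vec2 p)
{
    return (p.x - a.x) * (b.y - a.y) - (p.y - a.y) * (b.x - a.x);
}

// Fill one triangle: a pixel is covered when all three edge functions
// share a sign (handles either winding). No hierarchical traversal, no
// fill-rule tie-breaking -- just the core arithmetic.
void rasterizeTriangle(Vec2 v0, Vec2 v1, Vec2 v2,
                       unsigned* framebuffer, int width, int height,
                       unsigned color)
{
    // Bounding box clipped to the render target.
    int minX = std::max(0,          int(std::floor(std::min({v0.x, v1.x, v2.x}))));
    int maxX = std::min(width - 1,  int(std::ceil (std::max({v0.x, v1.x, v2.x}))));
    int minY = std::max(0,          int(std::floor(std::min({v0.y, v1.y, v2.y}))));
    int maxY = std::min(height - 1, int(std::ceil (std::max({v0.y, v1.y, v2.y}))));

    for (int y = minY; y <= maxY; ++y) {
        for (int x = minX; x <= maxX; ++x) {
            Vec2 p = { x + 0.5f, y + 0.5f };   // pixel center
            float e01 = edge(v0, v1, p);
            float e12 = edge(v1, v2, p);
            float e20 = edge(v2, v0, p);
            bool inside = (e01 >= 0 && e12 >= 0 && e20 >= 0) ||
                          (e01 <= 0 && e12 <= 0 && e20 <= 0);
            if (inside)
                framebuffer[y * width + x] = color;
        }
    }
}
```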
You might say of course that both techniques themselves are crutches that will be done away with in the near future, but that's been proposed for quite a while as well and has yet to catch on in commercial production.
I am not denying that this will take a while. The bandwidth wall cometh but several viable solutions exist to keep scaling it a little while longer. Also keep in mind that this is a binary thing, like vertex and pixel processing unification: there's nothing in between. So just because you haven't seen anything in commercial products yet doesn't mean it's not getting nearer.
 
A pity you didn't attend SIGGRAPH / HPG where there were a few talks showing just how many times bigger a performance-equivalent programmable unit was compared to the fixed-function implementation.
 