I don't like CPUs. OK *spin*

See history of X360 hacks. :)

That was prior to the current concept of asynchronous compute on the GPU. The fact that this is a thing at all is specifically because current GPUs have been architected to not let that happen.
The appropriate response by the OS and CPU if this is bypassed is some kind of machine fault and shutdown.

The command processor is a "normal" CPU in all current GPU assemblies. Obviously the idea of using a CPU to control the GPU, and not the other way around, is not carved in stone.
Pretty much every peripheral or slave device has a "normal" CPU. Disk storage has one or more processors, bus controllers have processors, there are dozens on a GPU. The CPU die itself has multiple micro controllers.
Their place in the hierarchy hasn't been a topic of debate. Some controllers can be mostly autonomous, but they are similarly not permitted to range freely outside of their designated zones in system memory, and they don't interface with the full range of signals and microcode needed to interact with the system at large.
 
Aren't you in danger of turning a GPU into a collection of bloated mini CPU cores if you try to add all CPU functionality to each CU?

And if, in a "CPU-less" system, a command processor were able to perform some actions faster on its own than by issuing work to a CU... isn't that kind of a CPU?

Wouldn't it make sense to retain some degree of specialisation in PC processes as there are likely to be a range of different tasks to be worked on?
 
There would need to be a more pervasive change to the GPU and programming model to give any single component full CPU functionality.
I noted that the GPU may not exist below a certain layer of abstraction.
It really is up to a combination of API, machine commands, internal microcode, and a raft of simple processors to perform what appears to be a simple command on a command queue (running in memory allocated for that purpose at the discretion of the host system).

The GPU's cache system is not that complex, and it supports a very simplistic view of memory. This is allowed because it assumes the CPU will handle all the complexities of a memory system with varying permissions, cacheability settings, fault handling, and interrupts before the CU has to work with it.

Each step in the processing of a queue command has a number of hidden steps. The command processor is not a single processor, but a custom block of two or three processors, each handling a subset of the "ISA" of queue commands it draws from the command queue. There's dispatch hardware, at least one processor in each ACE, a processor of sorts per CU, and some amount of processing that might occur in the export process. A lot of this is tied together in some non-standard ways, like the subcomponents of the command processor and the way CUs depend on something else to initiate the contexts they will then run.

The GCN ISA document does not talk about the queue commands the front end processes, and the APIs we have do not talk about the ISA commands. GCN code cannot navigate the space of concerns involved in getting itself to run. It assumes that's someone else's problem, and a lot of that someone else is outside the GPU.
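
To make that "two separate vocabularies" point concrete, here is a deliberately toy sketch in plain C++ (not how any real front end is built; all packet names and fields are invented for illustration): the host writes high-level packets into a queue in memory, a front-end loop interprets them and launches work, and the "shader" code itself never sees any of that.

```cpp
// Toy model of the two levels involved: queue commands that the GPU front end
// consumes, and the shader program that the CUs actually execute.
// Purely illustrative; nothing here corresponds to a real packet format.
#include <cstdint>
#include <cstdio>
#include <queue>

enum class PacketType { SetState, Dispatch, Signal };

struct QueuePacket {                 // what the "command processor" reads
    PacketType type;
    uint32_t   groupsX, groupsY, groupsZ;
};

// Stand-in for compiled shader code: the ISA-level program knows nothing
// about the packets that caused it to run.
void shaderProgram(uint32_t groupId) {
    std::printf("workgroup %u running\n", groupId);
}

// Stand-in for the command processor / dispatch hardware: walks the queue,
// interprets packets, and launches work on the "CUs".
void frontEnd(std::queue<QueuePacket>& ring) {
    while (!ring.empty()) {
        QueuePacket p = ring.front();
        ring.pop();
        if (p.type == PacketType::Dispatch) {
            for (uint32_t g = 0; g < p.groupsX * p.groupsY * p.groupsZ; ++g)
                shaderProgram(g);    // real hardware schedules these onto CUs
        }
        // SetState / Signal etc. would be handled here, never by the shader.
    }
}

int main() {
    std::queue<QueuePacket> ring;    // the "command queue" sitting in memory
    ring.push({PacketType::SetState, 0, 0, 0});
    ring.push({PacketType::Dispatch, 2, 1, 1});
    frontEnd(ring);
}
```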
 
Aren't you in danger of turning a GPU into a collection of bloated mini CPU cores if you try to add all CPU functionality to each CU?

Not to all of them. You have pipelines, instruction decoders, OoOE and other logic (to emulate x86) on the CPU anyway right now. "Emulating a CPU" on the GPU (using similar logic) should not be a more complex task.
 
Not to all of them. You have pipelines, instruction decoders, OoOE and other logic (to emulate x86) on the CPU anyway right now.
What do pipelining and OoOE have to do with emulating x86? Those things have to do with your performance characteristics and have basically nothing to do with the ISA.

"Emulating CPU" on the GPU (using similar logic) should not be a more complex task.
Translating x86 to shaders would be doable, but it doesn't get around the performance characteristic issues; CUs are designed specifically for embarrassingly parallel computational tasks, and would be very slow at executing some of the stuff you're running on CPU, since they simply don't feature good sequential performance.

Things like CPU-style pipelining and OoOE would help get around that, but that's exactly what comments like "turning a GPU into a collection of bloated mini CPU cores" are referring to. The extreme efficiency of GPU hardware (in both power and die space) is only possible because the CUs don't have to worry about juggling tons of things with each individual execution thread.
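
A contrived CPU-side illustration of that contrast (plain C++, nothing GPU-specific, numbers picked arbitrarily): the first loop is a serial dependency chain where each step needs the previous result, which is exactly where deep pipelines, big caches and OoOE earn their keep; the second loop is embarrassingly parallel and is the shape of work CUs are built to chew through.

```cpp
// Dependency-chain work vs. independent work items.
#include <cstdio>
#include <vector>

int main() {
    const int N = 1 << 20;
    std::vector<float> data(N, 1.0f), out(N);

    // (1) Loop-carried dependency: step i needs the result of step i-1.
    // Extra lanes don't help; per-step latency dominates, so good sequential
    // (straight-line) performance is what matters here.
    float x = 0.0f;
    for (int i = 0; i < N; ++i)
        x = x * 0.999f + data[i];

    // (2) Embarrassingly parallel: every element is independent, so this maps
    // cleanly onto thousands of GPU lanes.
    for (int i = 0; i < N; ++i)
        out[i] = data[i] * 2.0f + 1.0f;

    std::printf("%f %f\n", x, out[0]);
}
```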
 
Aside from all the things GPUs currently are not allowed to do, at all, emulation of a different CPU architecture frequently requires an order of magnitude better straight-line performance than the implementation being emulated.
GPU straight-line performance is horrific relative to the CPU. It's asking for things to be ten times worse on top of an already clown-shoes situation. The DX12 vs DX11 comparisons weren't run on a single-core CPU with clocks listed in kHz.
 
IIRC there was a network card that could run its own OS, a cut-down version of Linux: the Killer NIC.

Right, and there are network switches, cars, traffic lights and hair-dryers that run their own cut-down version of Linux. But they all run it on CPUs, not GPUs.
 
Take for example Tomorrow's Children, they do 3 draw calls per each voxel in the scene! I think a PC would die very, very fast there. And they do it not because they want to cripple something, but because it gets them real-time GI with a fully dynamic environment (destruction and such), and it looks gorgeous.
Do you have a link to a presentation about this?
 
But you may need to do it for some "real next-gen" graphics. Take for example Tomorrow's Children, they do 3 draw calls per each voxel in the scene!
Lol! No...
Objects are voxelized using the hardware rasterizer by performing a draw call for each axis.
This is totally standard 3-plane voxelization. In fact it's not even clear if they are being conservative at all or just relying on the super-sampling to get them "close enough" to avoid holes.
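
For reference, the structure being described is roughly the following. This is a hypothetical host-side sketch only; the matrix type and the axisAlignedOrtho / drawSceneOrtho helpers are invented stand-ins, not anything from the presentation. The point is that it is three draw calls per voxelization pass, one per axis, not per voxel.

```cpp
// Host-side sketch of 3-plane voxelization: the scene is rasterized once
// along each major axis so that every triangle has good coverage along at
// least one of them.
#include <cstdio>

struct Mat4 { float m[16]; };                 // placeholder matrix type

// Hypothetical: builds an orthographic view-projection looking down `axis`
// (0 = X, 1 = Y, 2 = Z) across the voxel volume.
Mat4 axisAlignedOrtho(int axis) { (void)axis; return Mat4{}; }

// Hypothetical: one draw call of the whole scene with the given view-proj;
// the fragment shader would write occupied voxels into a 3D texture/buffer.
void drawSceneOrtho(const Mat4& viewProj) { (void)viewProj; }

void voxelizeScene() {
    // One draw call per axis; these are the "3 draw calls" in question.
    for (int axis = 0; axis < 3; ++axis) {
        Mat4 vp = axisAlignedOrtho(axis);
        drawSceneOrtho(vp);
    }
}

int main() {
    voxelizeScene();
    std::printf("issued 3 axis-aligned voxelization draws\n");
}
```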

Curiously psorcerer, do you write graphics code/algorithms or are you just reading through presentations?
 
Sounds pretty easy to implement a synchronous fallback path (possibly handy for Intel, which doesn't support async in its current GPUs?)
You don't need a fallback path, it will work regardless. It just won't necessarily be any faster on current Intel GPUs.

Note that the potential performance benefit from async compute is highly architecture dependent. Wider architectures and architectures that have more trouble keeping execution units busy (due to various constraints) will see a much higher benefit than architectures with less constrained scheduling.
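
As a structural sketch of what "no fallback path needed" means in a D3D12-style engine (D3D12 is assumed here; device, PSO, root-signature and resource setup are omitted, error checking is skipped, and the helper names are mine): the compute work can go onto a dedicated compute queue synchronized with a fence, or the same Dispatch() calls can simply be recorded into the graphics submission. The results are identical either way; only the potential overlap differs.

```cpp
// Minimal async-compute structure in D3D12. Whether any overlap actually
// happens is up to the driver and hardware, which is exactly the
// "architecture dependent" part mentioned above.
#include <windows.h>
#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

// Create a dedicated compute queue alongside the usual direct (graphics) queue.
ComPtr<ID3D12CommandQueue> createComputeQueue(ID3D12Device* device)
{
    D3D12_COMMAND_QUEUE_DESC desc = {};
    desc.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE;
    ComPtr<ID3D12CommandQueue> queue;
    device->CreateCommandQueue(&desc, IID_PPV_ARGS(&queue));
    return queue;
}

// Submit an already-recorded compute command list so it can overlap with
// graphics work on the direct queue; the direct queue waits (GPU-side) on a
// fence before consuming the results. The "fallback" is just recording the
// same dispatches into the graphics command list instead of calling this.
void submitAsyncCompute(ID3D12CommandQueue* computeQueue,
                        ID3D12CommandQueue* directQueue,
                        ID3D12CommandList* computeList,
                        ID3D12Fence* fence, UINT64 fenceValue)
{
    ID3D12CommandList* lists[] = { computeList };
    computeQueue->ExecuteCommandLists(1, lists);
    computeQueue->Signal(fence, fenceValue);   // mark compute work complete
    directQueue->Wait(fence, fenceValue);      // graphics waits on the fence
}
```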
 
I don't like CPUs.

Not even this one?

[Image: Intel's next-gen Knights Landing platform]
 
You don't need a fallback path, it will work regardless. It just won't necessarily be any faster on current Intel GPUs.

Excellent thanks, this is the answer I've been trying to get to for the last couple of days. Good to know.

Note that the potential performance benefit from async compute is highly architecture dependent. Wider architectures and architectures that have more trouble keeping execution units busy (due to various constraints) will see a much higher benefit than architectures with less constrained scheduling.

Is that your way of saying "we don't need it anyway"? ;) Good point though, it sounds like it's not a big deal for Intel right now.
 
Some interesting points about the use of async compute there:
  • Being used for graphics based tasks in this instance (rather than offloading the CPU)
  • Around a 15-18% speedup vs not using it
  • Sounds pretty easy to implement a synchronous fallback path (possibly handy for Intel, which doesn't support async in its current GPUs?)

From 18% to 30% speedup according to this dev post. Apparently the gains are better when the scene is more demanding.
 
Obviously all Turing-complete machines (which most/all GPU cores are now) can do the job of CPUs, but if they were built for GPU work they would suck majorly at being CPUs. Anybody who thinks otherwise is, at best, confused.
There are reasons that Intel, Apple, ARM, NVIDIA, etc. all spend lots of R&D on OoOE CPUs (for example). They all have GPUs, and if they could just drop the CPUs, they would. The power savings alone would make any software complexity worth it in the mobile markets! But they don't, and in fact CPUs get consistently more powerful every generation.
 
Is that your way of saying "we don't need it anyway"? ;) Good point though, it sounds like it's not a big deal for Intel right now.
I'll just say the relative priority of it compared to other features depends a lot on the architecture. It's obviously something we have looked at and will continue to look at though.

You can actually go take a look for yourself in GPA or similar analysis tools - for a given game the "EU Idle" cycles are the best you are ever going to be able to make use of, and that would be with a zero overhead implementation and an application that always has async compute work available (not realistic, but yeah). Additionally the win is less clear on power-constrained SoCs as idle execution units can be power gated. That's not to say that filling in more stuff and "racing to idle" probably still isn't somewhat better, but it's less of a win than on a discrete chip where those idle execution units are just wasted cycles.
 