Virtual machines, CPUs and (GP)GPUs

Frank

First, let's dissect an x86/64 CPU.

There is a byte stream of instruction code going in, which is taken apart and scheduled onto multiple, different execution units. The execution units themselves don't run x86/64 instructions directly. Even more so: the whole architecture of the execution units is quite different from what the semantics of the instruction code would make you believe. It's a virtual CPU.

What execution units does it contain?

It starts with the JIT compiler (also known as the instruction decoder). Next comes the dispatcher, which forwards the instructions to the right execution unit. Functions (also known as complex instructions) are unrolled, or dispatched to the API (also known as a complex instruction unit, which uses microcode and can issue new instructions). And after that we get the scheduler, which schedules the instructions onto the actual execution units. Those are basically the REAL processor cores.
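
To make that flow concrete, here is a toy sketch in C++ of the decode -> dispatch -> execute pipeline described above. The encoding is made up for illustration; it's not any real ISA:

```cpp
// Toy sketch of the decode -> dispatch -> execute flow described above.
// Instruction bytes come in, are decoded into micro-ops, and a dispatcher
// hands each micro-op to whichever execution unit handles it.
#include <cstdint>
#include <iostream>
#include <vector>

enum class UnitKind { ALU, FPU, LoadStore };

struct MicroOp {
    UnitKind unit;   // which execution unit this op needs
    uint8_t  opcode; // original opcode byte, kept for tracing
};

// "JIT compiler" / instruction decoder: translate raw bytes into micro-ops.
// Hypothetical encoding: 0x0X = integer op, 0x1X = FP op, 0x2X = memory op.
std::vector<MicroOp> decode(const std::vector<uint8_t>& bytes) {
    std::vector<MicroOp> ops;
    for (uint8_t b : bytes) {
        switch (b >> 4) {
            case 0x0: ops.push_back({UnitKind::ALU, b});       break;
            case 0x1: ops.push_back({UnitKind::FPU, b});       break;
            case 0x2: ops.push_back({UnitKind::LoadStore, b}); break;
            default:  /* unknown byte: skipped in this sketch */ break;
        }
    }
    return ops;
}

// Dispatcher/scheduler: route each micro-op to its execution unit.
void dispatch(const std::vector<MicroOp>& ops) {
    for (const MicroOp& op : ops) {
        switch (op.unit) {
            case UnitKind::ALU:       std::cout << "ALU <- op "; break;
            case UnitKind::FPU:       std::cout << "FPU <- op "; break;
            case UnitKind::LoadStore: std::cout << "MEM <- op "; break;
        }
        std::cout << int(op.opcode) << '\n';
    }
}

int main() {
    // A made-up byte stream: two integer ops, one FP op, one load/store.
    dispatch(decode({0x01, 0x02, 0x11, 0x21}));
}
```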

And the same goes for GPUs. While they do most of the post-processing in a driver that runs on the CPU, they use the same model. Both are virtual processors. The actual execution units are inside a black box, and completely vendor-specific in their actual implementation.



If you want to run multiple tasks at the same time, you can use basically two different models: processes and threads. The main difference is that each process runs in its own virtual machine, while threads run in the same virtual machine as the parent process that spawned them. But the interesting part is the virtual machine. Different processes are completely separated from each other. And it doesn't matter one bit which processor or execution unit they run on.

The whole multitasking model of modern computers is to give each process the illusion that it runs on its own computer. Not only are the processors virtually cloned, but the memory and subsystems as well. Although there are, of course, I/O devices that can be accessed by all, sequentially.
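
A minimal POSIX sketch of that separation (illustrative, nothing x86-specific): a thread writes straight into the parent's memory, while a fork()ed process only ever touches its own copy:

```cpp
// A thread shares the parent's address space; a fork()ed process gets its
// own virtual copy, so a write in the child is invisible to the parent.
#include <iostream>
#include <sys/wait.h>
#include <thread>
#include <unistd.h>

int shared_value = 0;

int main() {
    // Thread: same virtual machine, same memory.
    std::thread t([] { shared_value = 1; });
    t.join();
    std::cout << "after thread:  " << shared_value << '\n'; // prints 1

    // Process: its own virtual machine; the write stays in the child's copy.
    if (fork() == 0) {
        shared_value = 2; // only changes the child's address space
        _exit(0);
    }
    wait(nullptr);
    std::cout << "after process: " << shared_value << '\n'; // still 1
}
```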



If you put multiple processors on the same die, it's definitely much easier to just copy the cores. But would the processes notice if you simply spread out the execution units (the real processor cores) and had the dispatcher and scheduler handle the distribution? Or if you switched to a different kind of core and updated the JIT compiler? Of course not.

And adding a GPU is easy when it's running in a VM: use its pipelines to run FP, MMX, 3DNow!, SSEx or whatever instructions make sense. They're just more execution units. And the unified, scalar GPGPUs are ready for it.
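
As a sketch of that pool idea (names and structure made up for illustration): the scheduler doesn't care whether an FP-capable unit came from a CPU core or a GPU pipeline, it just grabs whichever compatible unit is free:

```cpp
// The scheduler treats CPU FPUs and GPU pipelines as one pool of units that
// can run FP work; it hands a task to the first compatible free unit.
#include <iostream>
#include <string>
#include <vector>

struct ExecUnit {
    std::string name;
    bool handles_fp;
    bool busy = false;
};

// Grab the first free unit that can run FP work, regardless of its origin.
ExecUnit* acquire_fp_unit(std::vector<ExecUnit>& pool) {
    for (ExecUnit& u : pool)
        if (u.handles_fp && !u.busy) { u.busy = true; return &u; }
    return nullptr; // caller waits (or spawns more threads) if the pool is full
}

int main() {
    std::vector<ExecUnit> pool = {
        {"cpu-alu0", false}, {"cpu-fpu0", true},
        {"gpu-simd0", true}, {"gpu-simd1", true},
    };
    for (int task = 0; task < 3; ++task)
        if (ExecUnit* u = acquire_fp_unit(pool))
            std::cout << "FP task " << task << " -> " << u->name << '\n';
}
```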



The only potential problem I see here is that you basically have to embed an OS on the chip to make it happen. Like a VMware ESX server: it's a black box that runs its own (Linux-derived), invisible OS, on top of which you can run other OSes and/or applications. Do Intel and AMD want that hassle?

Then again, what else is there to do than add more of the same (and less useful) cores every process step? Much cheaper in R&D, yes, but with fast-diminishing gains. And the whole industry is heading that way in either case.
 
Peakstream and Rapidmind kinda do what you are talking about already for GPUs. In Peakstream's case, you program using a C++ API, which then passes through a JIT and hits the target architecture, be it multi-core CPUs or GPUs. At a lower level, CTM really is a virtual machine which abstracts the GPU as a data-parallel processor array. The instruction stream is not JIT'ed or virtualized, just offloaded to the GPU.

In the end, it basically comes down to building JITs capable of (data-)parallel processing to match some of these architectures. It is hard for a compiler or a JIT to extract parallelism unless the original code is already expressed in a parallel fashion. However, there is lots of research into this area as a whole, as processors are becoming more parallel and traditional compilers and programming languages struggle to meet the challenge.
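
To illustrate the style (a hypothetical sketch, not Peakstream's or Rapidmind's actual API): the program expresses its work as element-wise operations on arrays, which leaves a runtime/JIT free to target either a multi-core CPU or a GPU:

```cpp
// Toy "parallel array": map() describes element-wise work; a real runtime
// would JIT this and pick a backend. Here it just loops on the CPU.
#include <functional>
#include <iostream>
#include <vector>

struct ParallelArray {
    std::vector<float> data;

    ParallelArray map(std::function<float(float)> f) const {
        ParallelArray out{std::vector<float>(data.size())};
        // Every element is independent, so a JIT could split this loop
        // across cores or compile it into a GPU kernel unchanged.
        for (size_t i = 0; i < data.size(); ++i)
            out.data[i] = f(data[i]);
        return out;
    }
};

int main() {
    ParallelArray a{{1.0f, 2.0f, 3.0f}};
    ParallelArray b = a.map([](float x) { return x * x + 1.0f; });
    for (float v : b.data) std::cout << v << ' ';   // 2 5 10
    std::cout << '\n';
}
```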
 
And the same goes for GPUs. While they do most of the post-processing in a driver that runs on the CPU, they use the same model. Both are virtual processors. The actual execution units are inside a black box, and completely vendor-specific in their actual implementation.
There's just one catch here: assuming your GPU is fully in-order, it will need 100-1000x+ more threads than a CPU to run at anything but laughable efficiency. The ALUs' latencies are, AFAIK, higher than on a CPU, and they're hidden exclusively by the vast number of threads in flight; there's no smart scheduling to save the day here. And the ALUs are 16-wide, each channel working on a separate thread.

You'll literally need a thousand threads to run at anything but laughably low efficiency on a modern GPU; and add a few more thousands if you want to hide that ridiculously high memory latency. In other words, the devil is in the details. And there certainly are plenty of other details (read: problems) I won't go into here.

So, sure, the model works, but you need to significantly revamp the way basically everything works. And most likely, the end result will be a fair bit less efficient too. yay?


Uttar
P.S.: Arguably, if your GPU is out-of-order (and there are patents for that, which rely on the compiler letting the GPU know which instructions depend on which other ones, believe it or not), in the absolute peak case of instruction-level parallelism, you'll only need 128 threads on a G80-like chip to run at full efficiency. How likely is it that tens of instructions in a row are independent in real-world workloads, though?
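
Rough back-of-the-envelope arithmetic behind those thread counts (the latencies below are illustrative guesses, not vendor specs):

```cpp
// With in-order ALUs, latency is hidden purely by switching threads: if each
// thread's next instruction depends on its previous one, you need roughly
// (number of ALU lanes) x (latency in cycles) threads in flight to keep
// every lane busy every cycle. All numbers here are illustrative.
#include <iostream>

int main() {
    const int lanes          = 128;  // G80-like count of scalar ALU lanes
    const int alu_latency    = 8;    // cycles before a result can be reused
    const int memory_latency = 400;  // cycles for a cache-missing load

    std::cout << "to hide ALU latency:    " << lanes * alu_latency    << " threads\n";
    std::cout << "to hide memory latency: " << lanes * memory_latency << " threads\n";
    // With perfect ILP (long runs of independent instructions per thread),
    // one thread per lane suffices: that's the 128-thread peak case above.
}
```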
 
Yes, but you don't need OOO on the hardware level, or so many threads. Most other execution units are pipelined as well, and you can have a pool of 'FP' processors you can attach to a task. It might make sense to use a quad for generic FP processing as well, and simply do register reordering for the next instruction (on a different member of the quad) if there are dependencies. And use the full quad for vector processing.

The actual scheduling is the same as with a CPU in that regard.
 
But in the case of GPUs, Uttar is correct that you need thousands of parallel executing tasks (threads) to run efficiently. You also can't branch efficiently. In short, you really want wide SPMD execution with high branch and memory coherence.
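
A small sketch of why branching hurts on wide hardware: a 16-wide group has to execute both sides of a divergent branch, with a mask selecting which lanes keep each result:

```cpp
// Simulating masked SIMD execution: when the 16 lanes of a group disagree
// on a branch, the hardware runs both paths and masks the results, so a
// divergent branch costs roughly the sum of both sides.
#include <array>
#include <iostream>

constexpr int WIDTH = 16; // one 16-wide ALU, each lane a separate thread

int main() {
    std::array<int, WIDTH> x{}, out{};
    for (int i = 0; i < WIDTH; ++i) x[i] = i;

    std::array<bool, WIDTH> mask;
    for (int i = 0; i < WIDTH; ++i) mask[i] = (x[i] % 2 == 0); // divergent!

    // "if" side: a full pass, result kept only where the mask is true.
    for (int i = 0; i < WIDTH; ++i) if (mask[i])  out[i] = x[i] * 2;
    // "else" side: a second full pass for the remaining lanes.
    for (int i = 0; i < WIDTH; ++i) if (!mask[i]) out[i] = x[i] + 100;

    // Both passes ran; a branch-coherent group would have skipped one.
    for (int v : out) std::cout << v << ' ';
    std::cout << '\n';
}
```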
 
But in the case of GPUs, Uttar is correct that you need thousands of parallel executing tasks (threads) to run efficiently. You also can't branch efficiently. In short, you really want wide SPMD execution with high branch and memory coherence.
Yes, but you don't need to use the whole GPU. Only a single execution unit. Pipelined.
 
To turn it around: you could just as well embed a CPU on a GPU. It doesn't really matter, as long as the JIT compiler supports both.

Forget about those labels. Think about execution units, threads, latency and pipelining instead. If you want heavy graphics, just assign all FP/vector units in the pool to that. And for each process, you run a different VM. You want to hide latency in a single process? Use more threads. Etc.
 
DiGuru, what exactly are you proposing?

Software instruction scheduling?
Dissect multiple CPUs and a GPU, distribute all their execution units over a die, and add one or more instruction decoders to schedule the load.
 
Dissect multiple CPUs and a GPU, distribute all their execution units over a die, and add one or more instruction decoders to schedule the load.

One issue with that is the difficulty in passing instructions and data between multiple decoders and many units.
If they are distributed like that, any decoder must be able to talk to any execution unit, and any execution unit must signal back. This all has to be done very quickly.

Long-distance parallel cross-communication like that usually has timing delays that scale with the square of the number of different networked units (or worse).

Multicore adds units, but it purposefully limits communications to stay within the core. This cuts down on an explosion of wiring delays and difficult routing.
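
Rough arithmetic behind that routing argument (the ratios are illustrative): a full crossbar between d decoders and u shared units needs on the order of d x u links, so it grows quadratically when both scale together:

```cpp
// Counting point-to-point links in a full crossbar between decoders and a
// shared sea of execution units, versus multicore's per-core wiring.
#include <iostream>

int main() {
    for (int n : {4, 8, 16, 32}) {
        int decoders = n, units = n * 4;      // illustrative ratio of units
        long links = 2L * decoders * units;   // request paths + result paths
        std::cout << decoders << " decoders, " << units
                  << " shared units -> ~" << links << " links\n";
    }
    // Multicore instead wires each decoder only to its own few units,
    // keeping the link count linear in the number of cores.
}
```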
 
Yes, but an x86/64 core already works that way. So it's definitely doable.
 
Btw, if you're running multiple processes/VMs, you can use a different decoder for each of them. And you can simplify the scheduling by assigning execution units to them, and use multiple schedulers as well.
 
Yes, but an x86/64 core already works that way. So it's definitely doable.

There's no physical reason why it can't be done, just physical reasons why it can't be done fast.

The limited number of execution units in a CPU and the limited number of instructions that can execute at the same time are due to these physical limits.

Btw, if you're running multiple processes/VMs, you can use a different decoder for each of them. And you can simplify the scheduling by assigning execution units to them, and use multiple schedulers as well.
That's just having multiple cores. The connected set of decoder+scheduler+(the dedicated set of execution units) is what makes up a core.
 
That's just having multiple cores. The connected set of decoder+scheduler+(the dedicated set of execution units) is what makes up a core.
Does it matter? It's a black box, containing one or more virtual processor cores.
 
Does it matter? It's a black box, containing one or more virtual processor cores.

It matters to the silicon.
If each decoder has a set of schedulers and execution units that only it can talk to, that's a core.
If that is your suggestion, then it's already being done.

What it looked like you were suggesting was a bunch of decoders and a sea of common execution units for the entire chip. That would be great for a software virtual machine, but a nightmare for the silicon.
 
What it looked like you were suggesting was a bunch of decoders and a sea of common execution units for the entire chip. That would be great for a software virtual machine, but a nightmare for the silicon.
Yes, that was what I was suggesting. One or more decoders, uploadable JIT ones if possible. Pick and use the right translation. And split them between the different VMs.
 