I was going through some articles when I found this statement by Charlie Demerjian:
"if you add in GPU functionality to the cores, not a GPU on the die, but integrated into the x86 pipeline, you have something that can, on a command, eat a GPU for lunch"
If and only if this happens, then I think this type of chip will be a gamer's only chip.
I wish I could ask just how tightly integrated he thinks it would be.
There are varying degrees of integration, some of which barely qualify as integration at all.
The least integrated ones (probably not the integration Charlie Demerjian was talking about):
1) The CPU and GPU are on the same chip, with no other changes. If the driver is aware of this, the software layer can route commands and data so that they stay in cache.
2) The CPU and GPU are on the same chip, and they share some kind of link: either the memory controller is aware of the command traffic and routes loads and stores appropriately, or the chips share a buffer, or they have common cache access.
This is faster and mostly transparent, though it isn't really integration to the degree the INQ writer seemed to be describing (a rough software-level sketch of what these two levels buy follows below).
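To make the difference concrete, here is a minimal C sketch of what levels 1 and 2 change for software. gpu_submit_copy() and gpu_submit_shared() are made-up driver entry points, invented purely for illustration; they are not a real API, and real driver plumbing would look nothing this simple.

#include <stddef.h>
#include <string.h>

/* Discrete card today: the driver has to copy command/vertex data across
 * the bus into GPU-visible memory before the GPU can touch it. */
void gpu_submit_copy(void *gpu_visible, const void *cpu_data, size_t len)
{
    memcpy(gpu_visible, cpu_data, len);  /* burns bus bandwidth and CPU time */
    /* ...then ring a doorbell register... */
}

/* Levels 1-2 (same die, shared buffer or common cache): the driver can hand
 * the GPU a pointer and rely on the shared cache/buffer to keep the data
 * close, so the bulk copy disappears. */
void gpu_submit_shared(const void *cpu_data, size_t len)
{
    (void)cpu_data; (void)len;
    /* ...write the pointer into a command queue the GPU watches... */
}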
More integrated, but still pretty much separate:
3) Everything from 2, but the x86 core is given a handful of special function calls and escape instructions that can shift a code and data stream to the GPU very quickly. The x86 pipeline has to absorb a few extra instructions and some really long-distance branches to the next portion of CPU code.
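For what such an escape might look like from the programmer's side, here is a hypothetical sketch. The gpu_escape() wrapper and the gpu_work layout are invented for illustration only; any real mechanism would be defined by the ISA extension itself.

#include <stdint.h>

/* Hypothetical wrapper around a level-3 escape instruction: the CPU packages
 * up a pointer to GPU work, the pipeline hands it off, and x86 execution then
 * takes the long branch to the next chunk of CPU code. Not a real intrinsic. */
struct gpu_work {
    uint64_t code_ptr;  /* GPU program to run            */
    uint64_t data_ptr;  /* data stream for it to chew on */
    uint64_t data_len;
};

static inline void gpu_escape(const struct gpu_work *w)
{
    (void)w;  /* stand-in for the single escape opcode the decoder would route */
}

void render_pass(const struct gpu_work *w)
{
    gpu_escape(w);  /* shift the code and data stream to the GPU "very quickly" */
    /* ...x86 keeps going; the long-distance branch to the next portion of
     * CPU code is the part the pipeline has to absorb... */
}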
More integrated still, but both still distinct:
4) Everything from 3, but there is a direct silicon link that can fastpath commands and stores to and from the GPU. The units are still separate, and aside from the link, you can easily tell them apart. The x86 pipeline is still essentially independent; the burden is a few extra commands that sit for a while in the ROB, which could be problematic without some special measures or multi-threading.
More integrated, some fuzziness:
5) Everything from 4, but there is a bit of overlap: some of the FP units in the CPU are shared with the GPU, and some scheduling ports now link to an interface with the GPU.
The integer pipeline is still plain x86; FP traffic is more complex. Possibly a few extra instructions in SSE5 or whatever.
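For a sense of what that FP traffic looks like today, the loop below is ordinary SSE code using real intrinsics; under scheme 5 this is the kind of packed-FP work that could land on shared CPU/GPU units, with the speculative "SSE5 or whatever" instructions doing the steering. Only the status quo is shown, since the routing instructions don't exist.

#include <xmmintrin.h>  /* SSE intrinsics */

/* Multiply-accumulate over float arrays, four lanes at a time; n must be a
 * multiple of 4. In scheme 5, packed-FP work like this might execute on FP
 * units shared between the CPU and GPU. */
void madd(float *dst, const float *a, const float *b, int n)
{
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        __m128 vd = _mm_loadu_ps(dst + i);
        vd = _mm_add_ps(vd, _mm_mul_ps(va, vb));
        _mm_storeu_ps(dst + i, vd);
    }
}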
Highly integrated, still some demarcation:
6) Everything from 5, though the routing instructions may be fewer. The GPU and CPU share more units, and both can issue calls to a crossbar that lets either one offload an instruction packet the other is better at. The two may run from the same cache.
Fully integrated from the ISA standpoint (microcode):
7) A common decoder handles both x86 and GPU instructions, with the GPU side becoming a heavy-duty parallel decode path, analogous to a microcode or vector path.
The back end may still show signs of demarcation (if it doesn't, go to 9; if it does, go to 10).
Fully integrated from the ISA standpoint (hardware):
8) The common decoder flits between x86 and GPU instructions on the fly. The back end may still show specialization (if it doesn't, go to 9; if it does, go to 10).
Fully integrated to a stupid, stupid, stupid degree:
9) The decoded x86 and GPU micro-ops go wherever they please, to any unit, and take up space in the ROB and scheduling hardware.
Fully integrated, but faking it like a porn actress:
10) The decoded x86 and GPU micro-ops go wherever they please, but the chip really isn't trying too hard to be that good at one of the target workloads; it's just mapping execution however it wants and will get back to you when it's done.
Change at the instruction decode level is not to be taken lightly. The x86 decoders are already monstrous, and the ISA is pretty bloated. No function, not even a routing one, should be added unless you are sure it's going to be used by a lot of users for a long time.
Merging decoders is downright "My name's Janet: Miss Jackson if you're nasty" nasty.
Creating a rift in x86 between the classes of chip is also a risky bet.
Merging scheduling is suicidal. The P4's ROB is ~120 micro-instructions with two thread contexts. A single cluster in the G80 has the context of hundreds of threads, with hundreds upon hundreds of in-flight instructions, any one of which can be massively long-latency.
The integer x86 side is going to be stalled stupid, so it's unlikely scheduling will merge.
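A back-of-the-envelope calculation shows the mismatch. All numbers below are assumed round figures (roughly 400 cycles of memory latency, a sustained issue rate of 2, a ~120-entry ROB), not vendor specs:

#include <stdio.h>

/* Little's law, roughly: in-flight work needed = issue rate * latency.
 * The inputs are assumed ballpark figures for illustration only. */
int main(void)
{
    const int mem_latency_cycles = 400;  /* assumed DRAM round trip      */
    const int issue_per_cycle    = 2;    /* assumed sustained issue rate */
    const int rob_entries        = 120;  /* roughly a P4-class ROB       */

    int in_flight_needed = mem_latency_cycles * issue_per_cycle;

    printf("micro-ops in flight needed to hide the latency: %d\n", in_flight_needed);
    printf("ROB entries available:                          %d\n", rob_entries);
    printf("the window is about %dx too small\n", in_flight_needed / rob_entries);
    return 0;
}

The GPU covers that gap with hundreds of thread contexts rather than a bigger ROB, and that context is exactly what a merged scheduler would have to carry.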
Merging units: might happen, sort of.
Some specialty hardware might be good to go. The CPU could borrow a SQRT unit or some transcendental functions that take hundreds or thousands of cycles anyway.
More than that, and the question arises about how bypass and communications will work at high speeds.
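As a rough illustration of the kind of code that would benefit, the loop below leans on transcendentals that already cost tens to hundreds of cycles in library or SSE sequences; these are the operations a CPU could plausibly farm out to a shared special-function unit without worrying much about the trip across the chip. Whether that sharing ever happens is pure speculation.

#include <math.h>

/* Transcendental-heavy inner loop: sinf/expf are long-latency anyway, so
 * shipping them to a shared special-function unit (if one existed) would
 * hide the extra communication cost better than, say, an add would. */
void attenuate(float *out, const float *phase, const float *decay, int n)
{
    for (int i = 0; i < n; ++i)
        out[i] = sinf(phase[i]) * expf(-decay[i]);
}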
The CPU needs high clocks; it can't rely on data parallelism to just make work appear. The tighter it is coupled to the GPU, the more it suffers.
Even the moderately integrated approaches make the probability of being CPU-limited higher.
The GPU needs wide fetch, wide data, and a lot of scheduling context. It tolerates huge latencies, because there's always another pixel to work on.
The more it has to cater to the CPU, the less it has to work with.
Sharing caches is problematic, especially if it is a shared L1. Even one additional port can be painful. Lower and slower levels are more likely candidates; the first cache level already shared between all cores is the most likely. However, doing this keeps the GPU and CPU cores rather distinct.
Sharing a socket: very painful for a top GPU.
The memory bandwidth will never be what it could be if the GPU had its own card.
Whatever can be done to increase the bandwidth of a socket can just as easily be done to increase it for a board, and the board can do it better.
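To put rough numbers on that, both figures below are ballpark, era-appropriate assumptions (a dual-channel DDR2-800 socket against a G80-class card's 384-bit GDDR3), not anything quoted from the article:

#include <stdio.h>

/* Peak bandwidth = bus width in bytes * transfer rate in GT/s.
 * All inputs are assumed round figures for illustration. */
int main(void)
{
    double socket_gbs = 2 * 8 * 0.8;      /* 2 channels * 8 bytes * 0.8 GT/s = 12.8 GB/s */
    double card_gbs   = (384 / 8) * 1.8;  /* 48 bytes * 1.8 GT/s             = 86.4 GB/s */

    printf("socket, dual-channel DDR2-800:   %.1f GB/s\n", socket_gbs);
    printf("card, 384-bit GDDR3 at 1.8 GT/s: %.1f GB/s\n", card_gbs);
    printf("the card has roughly %.0fx the bandwidth\n", card_gbs / socket_gbs);
    return 0;
}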
Since CPUs and GPUs differ in their market segmentation and their need for changeable memory configurations, CPUs aren't likely to give up on having a socket.
If I had my way, I'd have a full-size video card that also plugged into a Torrenza socket, just to get the (relatively) best of both worlds.