Is everything on one die a good idea?

You talk about 10 years in the future. And about 1TFLOPS being a big deal. Let's look 7 years in the past, and I see a GT200 breaking the 1TFLOPS barrier. On a worse process. Without full custom design.
AVX-512 is rumored to be supported by Skylake, scheduled for release next year. Not in 10 or 7 years from now! So you have to compare that quad-core with today's or next year's integrated graphics, and then you'll realize that it is indeed a big deal to have that amount of processing power in an otherwise modest CPU. Over the next ten years we still have AVX-1024 and higher core counts to come, so no need to worry about the raw computing power. The trickier part is power consumption, but Haswell and Broadwell already achieve impressive FLOPS/Watt, so I'm sure AVX-512 will close the remaining gap.
With dedicated resources for texture and ROP, something your Intel processor will have to do with the general FLOPS pool.
TEX:ALU ratios have been steadily going down. Also, several games now do part of the advanced filtering in the shaders, without much impact on performance. That's because of the Memory Wall: you can have lots of arithmetic operations per datum you load from RAM. In the case of the GeForce Titan, you can have over 60 operations on every floating-point value. Of course caches and compression can improve the effective bandwidth, but the fact remains that it's really hard to get arithmetic limited. And the Memory Wall is only getting worse over time. So I'm not worried about CPUs lacking dedicated sampler units. The only thing that makes a big difference is parallel gather support, and that's an AVX-512 feature. The low latency of a direct gather, instead of a full texture unit, is an added advantage that helps keep the thread count low and locality high, to defeat the Memory Wall.
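To make that concrete, here's a minimal sketch of what a masked parallel gather looks like with the AVX-512 intrinsics; the function and array names are just placeholders of mine, not anyone's actual code:

```cpp
#include <immintrin.h>

// Minimal sketch: gather 16 floats from arbitrary indices in one instruction,
// with a mask so that inactive SIMD lanes neither load nor fault.
__m512 gather_lookup(const float* table, __m512i indices, __mmask16 active)
{
    // Lanes where 'active' is 0 keep the value from the first argument (zero here).
    return _mm512_mask_i32gather_ps(_mm512_setzero_ps(), active, indices, table, 4 /* scale in bytes */);
}
```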

Likewise ROPs are not a big deal. Some mobile (!) chips don't even have them, and doing the blending in the shaders offers new capabilities. Fixed-function hardware is dissolving into the programmable units, and it brings us another step closer to unification.
You talk about a discrete GPU not being able to drive a retina display as if that somehow strengthens the argument about unification. I see it as exactly the reason why Intel is increasing GPU area in their dies instead of the other way around.
I didn't say it strengthens the unification theory. I'm saying it's not relevant in the long run. Yes, the integrated GPU has grown as a result of retina displays, but AVX-512 doubles the throughput of the CPU cores. So the CPU side isn't given any less attention. Also, there's dual-core and quad-core Iris chips, and there's dual-core and quad-core non-Iris chips as well. So cherry picking doesn't prove anything. The majority of laptops don't have a retina display yet, and the transition is happening quite slowly. And even though I'm sure retina will become standard everywhere, it's only a relatively minor increase in comparison to the increase in parallelism from Moore's Law. From about 640x480 to the resolutions we have today, the GPU has had the opportunity to outgrow the CPU many times over, but instead we observe that they're still pretty much in the same ballpark, due to the CPU growing its transistor count aggressively as well. So you're grasping at straws, and this is the last one.
And 1080p is supposed to remain the gold standard, but at the same time you're talking about a retina display on your laptop.
That's only anecdotal. It's a very expensive laptop, and does not represent the average. For what it's worth, my other laptop, which is also brand new, has a resolution of 1600x900. So it's not even Full HD.
It would be very interesting to see what you wrote 5 years ago about this upcoming unification, because I'm pretty sure that 5 years from now, we'll still be talking about GPUs the way we do right now.
Five years ago, we still didn't have CPUs and GPUs on the same die. Five years ago, we only had 128-bit SSE. Clearly we're talking about things very differently today. At this pace of change, unification in 10 years isn't a far fetched idea at all.
By carving out a very narrow part of the market, tailored to your argument, you can make everything work. Hell, business desktops have been able to get by without GPUs worth the name since forever. But unification? If it took more than 7 years for a CPU to barely catch up with a GPU, then the slowdown of Moore's Law is more likely to stall the march towards unification than to further it.
The time it took for CPU vectors to become wider than 128-bit was an artifact of legacy programming models, not of any technical impossibility. 512-bit is almost here, and while unification will probably require AVX-1024, scaling to that width should not be a major issue within this 10 year span. Also, this catching up in SIMD width only needs to happen once, and doesn't demand much from Moore's Law. If GPUs can do it, CPUs can too.
 
Caches are increasingly shared further and further up the chain... on Haswell already there's not really a concept of data "on the GPU" at the granularity you're talking about. Future APIs, OS updates and hardware will further blur these lines and drive down "latency". Ultimately there's no compelling reason to assume that communicating with the various execution units on the GPU is any more expensive than communicating with another core on the system... the hardware is mostly there already (at least big core Intel parts) and the software is on the path.
The problem here is the latency. Again :)

The GPU is often running at least half a frame late compared to the CPU (and DirectX can buffer up to 3 frames in the worst case by default). The 8 MB L3 caches in the high end consumer Haswells will get fully trashed several times during that half-frame period, so the CPU-generated command buffer will not remain in the cache long enough to save the memory traffic. And this is especially true when generating 500k draw calls. Each will be 4 x uint32 parameters + a command id (assuming 64 bits). Total = 500k * 24 bytes = 12 MB. That will trash the cache already. Compare that to a single multidraw command that pushes only 24 bytes to the command list. Much better for the caches :)
 
The GPU is often running at least half a frame late compared to the CPU (and DirectX can buffer up to 3 frames in the worst case by default).
Right but there's no compelling reason to do that in the (very near) future... honestly that amount of buffering is a symptom of the current APIs being designed for discrete, which is entirely my point :) There really is no reason why the GPU generating a buffer then consuming it is any better than the CPU generating it, then the GPU consuming it. Purely a software issue (at least on Intel) and one that is largely going away soon.
 
There really is no reason why the GPU generating a buffer then consuming it is any better than the CPU generating it, then the GPU consuming it.
There will be a good reason to do this for quite some time though as Intel doesn't address the mid to high end of the market and all other architectures I'm aware of will benefit from sebbbi's approach.
 
Right but there's no compelling reason to do that in the (very near) future... honestly that amount of buffering is a symptom of the current APIs being designed for discrete, which is entirely my point :) There really is no reason why the GPU generating a buffer then consuming it is any better than the CPU generating it, then the GPU consuming it. Purely a software issue (at least on Intel) and one that is largely going away soon.
It's always good to hear that companies are doing hard work to bring CPU->GPU->CPU latencies down. This is excellent news, and will result in reduced texture popping (virtual texture streaming latency) and reduced input lag :)

However the API model needs to completely change in order to ensure that the command data is in the caches when the GPU is ready to process the commands. In current APIs, the GPU and CPU are running asynchronously. The commands are inserted at the end of a buffer. In order to keep the GPU running smoothly, there should always be commands in the buffer. This means that at the point where the CPU adds the command, the GPU must first execute the previous commands already in the command buffer before it starts executing the new one. It's very hard to predict how much GPU time each command takes. For example a deferred lighting pass might take 5 milliseconds to complete, while a low triangle count geometry draw call that is completely hi-z culled is practically free. The CPU time to submit the deferred lighting pass (a 2D plane) is likely less than what it takes to set up the low-polygon geometry draw call (matrix setup, viewport culling, etc). Thus it's very hard to make the CPU and GPU run in lock step without stalling either of them.

Running in lockstep (or almost lockstep) is very hard on PC, since the CPU and GPU performances vary so much. Compared to a mid-class gaming PC, a high end Intel mobile CPU has roughly twice the CPU performance and half the GPU performance. Let's assume the game is designed to run at 60 fps on the mid-class "balanced" gaming PC, and it utilizes 16.6 milliseconds of both CPU and GPU time to render a frame. The game could be running roughly at CPU<->GPU lockstep on this setup. However when you run the same game on the Intel high end mobile part, the GPU time per frame is 33.3 ms and the CPU time per frame is 8.3 ms. This means that the CPU has submitted the whole frame to the command buffer 25 milliseconds earlier than the last GPU command finishes. There's no way that untouched data in 8 MB L3 cache shared by CPU and GPU lasts for 25 milliseconds. However it becomes really interesting if the L4 caches become big enough to cover all the memory accesses inside a frame. 128 MB (Crystalwell) is not yet enough, but we are talking about the future here.

An alternative model would fire a realtime priority CPU callback whenever the GPU ring buffer is "almost empty", and the CPU would put the commands and the needed data into the command buffer just before they are needed. This way the commands and the data would be in the L3 cache when the GPU execution starts. But there's a problem with this model as well. It's hard to define "almost empty" in a way that ensures that the GPU never runs out of work. If it's too long, then the CPU-generated data has already gone out of the caches. If it's too short, then the GPU might stall (for example when some object is fully hi-Z culled and its pixel cost was much less than expected).
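To make the "almost empty" model a bit more concrete, here's a rough sketch of the kind of ring buffer I have in mind. The command layout follows my 24-byte example above, but the watermark and the callback mechanism are purely hypothetical:

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical sketch of the "refill on almost empty" model described above.
// 24 bytes per command, as in the earlier example (64-bit id + 4 x uint32).
struct GpuCommand { uint64_t id; uint32_t params[4]; };

class CommandRing {
public:
    CommandRing(size_t capacity, size_t lowWatermark, std::function<void(CommandRing&)> refill)
        : buffer(capacity), low(lowWatermark), onAlmostEmpty(std::move(refill)) {}

    // CPU producer side (fullness check omitted for brevity).
    void push(const GpuCommand& cmd) { buffer[writePos++ % buffer.size()] = cmd; }

    // Called as the GPU front end consumes commands. The hard part is choosing
    // 'low': long enough that the GPU never starves, short enough that the
    // freshly written commands are still hot in the shared L3.
    void pop() {
        ++readPos;
        if (writePos - readPos < low)
            onAlmostEmpty(*this);   // would be a realtime-priority CPU callback
    }

private:
    std::vector<GpuCommand> buffer;
    size_t writePos = 0, readPos = 0;
    size_t low;
    std::function<void(CommandRing&)> onAlmostEmpty;
};
```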

These unsolved problems give more credibility to Nick's view of the future. A CPU core with wide vector processing units can shift instantly from serial "setup mode" to full blown wide execution mode. Data doesn't even need to leave L1 cache. Obviously we need some sort of hyperthreading to ensure that the wide vector execution units are properly utilized by another task when the other task is executing serial setup code. Two way HT might not be enough (Haswell), but Intel already has 4-way HT in Xeon Phi, so this is definitely not a problem when going forward. The AVX-512 instruction set is very well suited for executing GPU-like code (mask registers + masked gather+scatter). I also like the compress and expand instructions and the conflict detect instructions a lot. These are perfect for efficient branchless parallel processing.
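As an example of why I like the compress instructions for branchless code, here's a small sketch (in intrinsics rather than a shading language) that compacts the elements passing a test without a single branch on the data; the buffer names are made up:

```cpp
#include <immintrin.h>

// Sketch: branchless stream compaction with AVX-512.
// Writes only the elements of 'in' that are greater than 'threshold' to 'out',
// 16 floats per iteration, and returns the number of elements kept.
// 'n' is assumed to be a multiple of 16 to keep the example short.
size_t compact_greater(const float* in, float* out, size_t n, float threshold)
{
    const __m512 limit = _mm512_set1_ps(threshold);
    size_t kept = 0;
    for (size_t i = 0; i < n; i += 16) {
        __m512 v = _mm512_loadu_ps(in + i);
        __mmask16 keep = _mm512_cmp_ps_mask(v, limit, _CMP_GT_OQ);
        _mm512_mask_compressstoreu_ps(out + kept, keep, v);  // packed store of selected lanes
        kept += _mm_popcnt_u32(keep);
    }
    return kept;
}
```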
 
These unsolved problems give more credibility to Nick's view of the future. A CPU core with wide vector processing units can shift instantly from serial "setup mode" to full blown wide execution mode. Data doesn't even need to leave L1 cache. Obviously we need some sort of hyperthreading to ensure that the wide vector execution units are properly utilized by another task when the other task is executing serial setup code. Two way HT might not be enough (Haswell), but Intel already has 4-way HT in Xeon Phi, so this is definitely not a problem when going forward.
Instead of increasing the Hyper-Threading, which could harm data locality and cause more synchronization overhead, we could have AVX-1024 instructions which are issued on 512-bit units in two cycles. This way a single thread can keep the SIMD cluster very busy while only occupying half of the slots in the scheduler.

This would also make it feasible to have two SIMD clusters (for a total of four 512-bit FMA units, just like GCN). While that may appear to reverse the problem again, it retains the locality, latency hiding, and scheduling advantages. Also, I don't think we should reason in terms of "serial setup mode" and "wide execution mode". All loops with independent iterations are a candidate for vectorization, even in what would in legacy terms be considered setup code. Likewise vectorizable code still contains 30% uniform or affine operations. So regardless of what code you run, both the scalar and vector units should see good utilization. Note that compared to today's situation, where the CPU's SIMD units are often left completely unutilized and you always end up blocking the CPU or the GPU (sometimes both within the same frame), it's really not a shame to have some partial utilization at certain times. The important thing is to have the different resources instantly available when you need them.

Alternatively we could have two SIMD clusters with 1024-bit units, running at half the frequency. The issue rate would be the same, but power consumption could be lower (kind of like Fermi vs. Kepler).
 
You know, I had a thought. (wide) SIMD is something of a worst case for OOO.

First, consider OOO. For it to actually have a benefit, it needs to take advantage of dynamic out of order possibilities - we can assume that the compiler will optimize and reorder static cases. Which instructions are these?

ALU operations : NO! They have constant latency in the vast majority of cases, so static reordering can arrange them optimally, meaning there's nothing left for the OOO hardware to do.

Branches : KINDA, if the branching behavior is dynamic (compiler can't figure it out), but fairly predictable at runtime. Note that this is going through the branch predictor rather than "OOO hardware", so you could argue about whether it counts. It's not part of the hardware that's generally referred to when people say OOO, so I'm not going to count them for the purposes of my discussion.

Load/Store : YES! You don't know ahead of time whether something lives in the cache, and worst case latency is pretty nasty, so the compiler won't always get it right. This is actually more subtle than it appears, since the compiler's main strategy is to put loads as early as possible, and stores as late as possible. This means that the register file size is very important!

So about that register file: CPUs often use register renaming to effectively increase their register file size. The problem is that register renaming isn't as versatile as simply having more registers. It works for unrolling, but it can't stop register spilling once the compiler runs out of logical registers. GPUs have the advantage here, since they're not locked into a static ISA. If the GPU designer thinks more registers (or less, for that matter) will increase performance, they're free to add them, since binary compatibility is not needed; GPUs use various virtual machine languages that are JITed to a GPU-specific binary at runtime. You can't do this with a CPU, since most programs are compiled to binary and require backwards compatibility. This ultimately means the GPU architectures are more tolerant to variable latencies since the extra registers let the compiler put more code between the loads and the stores without having to spill to memory when it runs out of logical registers.

So, about OOO and (wide) SIMD not working together very well. Consider when you have gather/scatter (and you WILL if you want a flexible programming model!). What happens when just one of the lanes stalls from a cache miss? It stalls the entire instruction! The problem is, as you increase SIMD width, the frequency of some lane or other stalling at any given point increases, which puts it closer to the static worst case latency. Thus, the OOO hardware has less and less chance to be able to find anything the compiler can't, since you're losing dynamic OOO opportunities.

There's a clear benefit to having ALU operations OOO with respect to LS, since otherwise you stall whenever you hit a LS! This is simple to do (you just keep track of which registers are up to date and have no pending instructions pointing to them - if an instruction would read/write from a register that's not ready, stall at that point in the pipeline), and is done on pretty much any architecture with multiple instruction pipes (both CPUs and GPUs).

The ultimate problem is that full OOO hardware is notoriously expensive, both in terms of die space and especially power. This is due to the way it works, in that it has to do bookkeeping on several buffers in addition to everything it already has to do to execute the actual instruction. There's just no way around this. You can get away with adding it to a handful of large cores, but for many small cores your power use goes through the roof.
 
You know, I had a thought. (wide) SIMD is something of a worst case for OOO.

First, consider OOO. For it to actually have a benefit, it needs to take advantage of dynamic out of order possibilities - we can assume that the compiler will optimize and reorder static cases. Which instructions are these?

ALU operations : NO! They have constant latency in the vast majority of cases, so static reordering can arrange them optimally, meaning there's nothing left for the OOO hardware to do.
Optimal scheduling is NP-hard, so no optimizer wastes time trying to achieve it. Especially in JIT compilers. Of course GPUs avoid the issue by making all arithmetic operations have equal latency, but that’s throwing out the baby with the bath water. Also, CPUs do frequently change their instruction latencies. It is one of the reasons why x86 has thrived for over three decades now. Different vendors have different latencies too. And of course it only takes a single variable-latency instruction to completely mess up your static scheduling.

Out-of-order execution automatically provides good scheduling results, and adapts to any run-time circumstance. It not only improves performance, but it also increases developer productivity. The latter should not be underestimated. While in certain cases static scheduling can beat dynamic scheduling in performance/Watt, it takes great effort and it’s easier to get it wrong than right. Out-of-order execution provides a great deal of comfort by not having to worry about scheduling.

GPUs still have to catch up with CPUs in this regard. As the complexity of the software increases, you really don’t want to have to worry about these low-level issues. It’s the same reason why managed storage is making way for generic caches. Caches can’t achieve the same efficiency as optimally managed storage, but they still perform really well when the complexity increases and things get more dynamic.
Branches : KINDA, if the branching behavior is dynamic (compiler can't figure it out), but fairly predictable at runtime. Note that this is going through the branch predictor rather than "OOO hardware", so you could argue about whether it counts. It's not part of the hardware that's generally referred to when people say OOO, so I'm not going to count them for the purposes of my discussion.
Branch prediction is very much a part of out-of-order execution. Which branch is (assumed) taken instantly affects the scheduling. Out-of-order execution can deal with that, while static scheduling can’t. Also keep in mind that with SIMD you typically execute both branches.
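To illustrate what "executing both branches" means in practice, here's a tiny sketch with AVX-512 masking; both sides are computed for all 16 lanes and a mask picks the result per lane (the actual math is just filler of mine):

```cpp
#include <immintrin.h>

// Sketch of SIMD predication: both sides of the 'branch' are evaluated for all
// 16 lanes, then a mask selects per lane which result to keep. No branch
// prediction is involved; the cost is roughly the sum of both paths.
__m512 branchless_select(__m512 x)
{
    __mmask16 cond = _mm512_cmp_ps_mask(x, _mm512_setzero_ps(), _CMP_GE_OQ);
    __m512 thenPath = _mm512_sqrt_ps(x);                           // lanes where x >= 0
    __m512 elsePath = _mm512_mul_ps(x, _mm512_set1_ps(-2.0f));     // lanes where x < 0
    return _mm512_mask_blend_ps(cond, elsePath, thenPath);         // mask bit 0 -> elsePath, 1 -> thenPath
}
```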
Load/Store : YES! You don't know ahead of time whether something lives in the cache, and worst case latency is pretty nasty, so the compiler won't always get it right. This is actually more subtle than it appears, since the compiler's main strategy is to put loads as early as possible, and stores as late as possible. This means that the register file size is very important!

So about that register file: CPUs often use register renaming to effectively increase their register file size. The problem is that register renaming isn't as versatile as simply having more registers. It works for unrolling, but it can't stop register spilling once the compiler runs out of logical registers. GPUs have the advantage here, since they're not locked into a static ISA. If the GPU designer thinks more registers (or less, for that matter) will increase performance, they're free to add them, since binary compatibility is not needed; GPUs use various virtual machine languages that are JITed to a GPU-specific binary at runtime. You can't do this with a CPU, since most programs are compiled to binary and require backwards compatibility. This ultimately means the GPU architectures are more tolerant to variable latencies since the extra registers let the compiler put more code between the loads and the stores without having to spill to memory when it runs out of logical registers.
First of all, you definitely can JIT-compile SIMD code on the CPU. That’s what I’ve always done. Secondly, the GPU really doesn’t have many more registers. It has to share its register file with lots of threads. If the threads need more registers, it means you can only have a limited number of threads and your ability to hide latency suffers. On the CPU, no matter how many temporary variables you need, the most frequently used ones can reside in a register and the rest is quite efficiently spilled to / restored from memory.

So once again the CPU is much more efficient at dealing with complex cases. And as the software complexity of GPU code increases, they have no choice but to adopt similar techniques. The GPU’s logical register file more closely resembles cache memory anyway, and operands are read from it over multiple cycles. The ALUs read from and write into the operand register file, which is small and constantly has to spill to / restore from the larger SRAM. Also note that while the CPU’s physical register file is larger than the logical one, many operands come from the more efficient bypass network.
So, about OOO and (wide) SIMD not working together very well. Consider when you have gather/scatter (and you WILL if you want a flexible programming model!). What happens when just one of the lanes stalls from a cache miss? It stalls the entire instruction! The problem is, as you increase SIMD width, the frequency of some lane or other stalling at any given point increases, which puts it closer to the static worst case latency. Thus, the OOO hardware has less and less chance to be able to find anything the compiler can't, since you're losing dynamic OOO opportunities.
GPUs also suffer from divergent memory accesses. When they're out of ready threads, they stall. CPUs have both out-of-order execution and Hyper-Threading to keep going for a while. Furthermore, automatic prefetching detects access patterns so future misses are less likely. Last but not least, large L3 or even L4 caches keep a lot of data close with reasonable access latency.
The ultimate problem is that full OOO hardware is notoriously expensive, both in terms of die space and especially power. This is due to the way it works, in that it has to do bookkeeping on several buffers in addition to everything it already has to do to execute the actual instruction. There's just no way around this. You can get away with adding it to a handful of large cores, but for many small cores your power use goes through the roof.
That is simply not true. Mobile CPUs now use out-of-order execution, and consume just a couple Watts or less. Even Core M, derived from Intel’s desktop architecture, consumes just 4.5 Watt. The power consumed by out-of-order execution logic is a fraction of that (there’s memory controllers, an integrated GPU, caches, ALUs, etc.). Furthermore, the power consumption is amortized by SIMD width. With AVX-1024, each CPU core would have as many lanes as a GCN core. The GPU’s bookkeeping logic for scheduling threads does not come for free either. So it’s not unimaginable for them to pay the small price of out-of-order execution and benefit from better data locality to achieve lower power consumption.
 
Branch prediction is very much a part of out-of-order execution. Which branch is (assumed) taken instantly affects the scheduling. Out-of-order execution can deal with that, while static scheduling can’t.

There's a lot of wrong in your reply, but I want to address this point specifically. If branch prediction makes a processor "out of order", then the majority of what are considered "in order" machines need to be rebranded.

I think you're confusing "out of order execution" with "speculative execution". The latter can (and is) applied in "in order" processors. In-order processors can (and typically do) process some part of one of the branch paths, then roll back the execution on a branch mispredict, just like out-of-order processors can. It may just be the instruction prefetching, but more commonly it is also the start of the execution of one of the branch paths.

As an example, the original Pentium (http://en.wikipedia.org/wiki/Intel_P5) has branch prediction and is typically considered to be an in-order processor.
 
I'll make this short. Speculation is inefficient in terms of dynamic power. It's often cheaper to just stall and clock gate - the "race to sleep" isn't always true. Therefore, the only future where OoOE makes sense for graphics is in a dystopian world where leakage is something like ~99% of total power. That will nearly certainly not happen (ignore the process doom mongering and consider innovations like fine-grained power gating as well). While you can always trade area for power at the implementation level (especially in terms of voltage) you're certainly not saving enough area with your ideas to do so.

There may be bandwidth benefits to unification but I'm skeptical it's anywhere near enough to justify such an architecture. There are certain kinds of unifications that I do like (e.g. MIMD-SIMD hybrid ala NVIDIA's Einstein, tightly coupled scalar core beyond AMD GCN, etc.) and these may have very exciting programming model benefits, but what you are advocating is too inefficient to ever become viable IMO. There is no way the future of graphics will sacrifice power efficiency for the sake of architectural simplicity.

I'm going to go one step further and say that excessively IPC-focused CPUs are a temporal fluke that only matter so much due to the lack of innovation in the programming language world. I cringe every time someone talks about Amdahl's Law for problems that are not inherently serial (but only become serial because of the lackluster tools or design philosophy used to approach them).

The earlier discussion on parallel reduction is a good example - Andrew is right that it's a perfect example of Amdahl's Law, but Sebbi is also right that in the vast majority of cases, there is some other workload you could run in parallel - making it a serial bottleneck for that algorithm, but not the overall program. We don't need as much "serial" performance as you think we do; we just need it in the right places.
 
Optimal scheduling is NP-hard, so no optimizer wastes time trying to achieve it. Especially in JIT compilers. Of course GPUs avoid the issue by making all arithmetic operations have equal latency, but that’s throwing out the baby with the bath water. Also, CPUs do frequently change their instruction latencies. It is one of the reasons why x86 has thrived for over three decades now. Different vendors have different latencies too. And of course it only takes a single variable-latency instruction to completely mess up your static scheduling.

Out-of-order execution automatically provides good scheduling results, and adapts to any run-time circumstance. It not only improves performance, but it also increases developer productivity. The latter should not be underestimated. While in certain cases static scheduling can beat dynamic scheduling in performance/Watt, it takes great effort and it’s easier to get it wrong than right. Out-of-order execution provides a great deal of comfort by not having to worry about scheduling.
NVIDIA's Denver seems to beat the OoO competition according to their internal benchmarks (I know this is a questionable result until we get third party results). And it doesn't seem to have huge dips in any of the common benchmarks they used. It is based on JIT compiling (and static JIT scheduling). OoO machinery schedules/renames the same hot code (inside inner loops) thousands of times every frame. It's a nice idea to do this once (and fine tune the scheduling by periodic feedback from the execution units). Obviously you can't react to the most erratic cache misses this way, but at least you know which instructions cause the biggest misses. This might provide the JIT compiler enough information to add prefetch instructions and the necessary extra code. I have done this manually with a profiler so many times on in-order PPC cores that I believe a monkey (= JIT compiler) could do most of it automatically.
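For reference, the kind of prefetch insertion I mean looks roughly like this. The 16-element lookahead is a made-up tuning value that would normally come from profiler feedback, and __builtin_prefetch is the GCC/Clang spelling of a software prefetch:

```cpp
#include <cstddef>

// Rough sketch of manual prefetch insertion in an indexed (gather-like) loop.
// PREFETCH_DISTANCE is the hand-tuned (or JIT-tuned) lookahead; on an in-order
// core it has to cover the full miss latency, because nothing else will hide it.
constexpr std::size_t PREFETCH_DISTANCE = 16;

float sum_with_prefetch(const float* data, const int* indices, std::size_t n)
{
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        if (i + PREFETCH_DISTANCE < n)
            __builtin_prefetch(&data[indices[i + PREFETCH_DISTANCE]], /*rw=*/0, /*locality=*/1);
        sum += data[indices[i]];   // the load that would otherwise miss
    }
    return sum;
}
```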
I'm going to go one step further and say that excessively IPC-focused CPUs are a temporal fluke that only matter so much due to the lack of innovation in the programming language world. I cringe every time someone talks about Amdahl's Law for problems that are not inherently serial (but only become serial because of the lackluster tools or design philosophy used to approach them).
I couldn't have said this better myself :). People are always assuming that you need or want a big serial core for scheduling work for many smaller parallel cores. But scheduling isn't a serial task. You can do all the scheduling you need on the parallel cores.
 
Branch prediction is very much a part of out-of-order execution. Which branch is (assumed) taken instantly affects the scheduling. Out-of-order execution can deal with that, while static scheduling can’t.
There's a lot of wrong in your reply, but I want to address this point specifically. If branch prediction makes a processor "out of order"...
That's not what I meant. I didn't word it very well, but it should have been clear from the context: keldor314 was leaving branch prediction out of the discussion about the benefits of out-of-order execution, while I think it's very much part of it. It can schedule instructions from before and after a branch (even multiple ones), while the compiler cannot schedule past a branch (or it makes a guess and gets it all wrong when a different branch is taken at run-time).
 
Optimal scheduling is NP-hard, so no optimizer wastes time trying to achieve it.

OOO hardware does not by any means schedule optimally! It effectively uses a crude greedy algorithm, which is likely worse than what a compiler could do offline.

Of course GPUs avoid the issue by making all arithmetic operations have equal latency, but that’s throwing out the baby with the bath water.

I doubt it - where did you hear this?

Also, CPUs do frequently change their instruction latencies. It is one of the reasons why x86 has thrived for over three decades now. Different vendors have different latencies too. And of course it only takes a single variable-latency instruction to completely mess up your static scheduling.

Or you could do a JiT stage and schedule specifically for the processor in question. Instruction scheduling is cheap - so much so that it can be done dynamically at runtime millions of times for each instruction in a loop.

Out-of-order execution automatically provides good scheduling results, and adapts to any run-time circumstance. It not only improves performance, but it also increases developer productivity. The latter should not be underestimated. While in certain cases static scheduling can beat dynamic scheduling in performance/Watt, it takes great effort and it’s easier to get it wrong than right. Out-of-order execution provides a great deal of comfort by not having to worry about scheduling.

GPUs still have to catch up with CPUs in this regard. As the complexity of the software increases, you really don’t want to have to worry about these low-level issues. It’s the same reason why managed storage is making way for generic caches. Caches can’t achieve the same efficiency as optimally managed storage, but they still perform really well when the complexity increases and things get more dynamic.

Who's writing in assembler and dealing with this? That's the job of the compiler optimizer / JiTter, and is done completely behind the scenes without the developer having to lift a finger.

Also keep in mind that with SIMD you typically execute both branches.

...Which would completely defeat the purpose of branch prediction.

This isn't really true anyway - a well optimized algorithm can organize execution so that quite a bit of the data falls into uniform branches.

First of all, you definitely can JIT-compile SIMD code on the CPU. That’s what I’ve always done.

You're missing the point. In order for the ISA to be tweaked, for instance, to add more registers, ALL programs must be JiT-compiled. Any that aren't will break! The CPU ecosystem is nowhere near this point.

Secondly, the GPU really doesn’t have many more registers. It has to share its register file with lots of threads. If the threads need more registers, it means you can only have a limited number of threads and your ability to hide latency suffers. On the CPU, no matter how many temporary variables you need, the most frequently used ones can reside in a register and the rest is quite efficiently spilled to / restored from memory.

That's not completely true. Even at fairly high occupancy, a GPU has 32ish registers per thread. If your problem needs more immediate storage, you can lower the thread count and increase the register count per thread quite a bit - new Nvidia cards support up to 255 registers per thread! Of course, this does decrease the amount of hyperthreading you have available, but the important part is that the developer gets to tune it themselves!

So once again the CPU is much more efficient at dealing with complex cases. And as the software complexity of GPU code increases, they have no choice but to adopt similar techniques. The GPU’s logical register file more closely resembles cache memory anyway, and operands are read from it over multiple cycles. The ALUs read from and write into the operand register file, which is small and constantly has to spill to / restore from the larger SRAM. Also note that while the CPU’s physical register file is larger than the logical one, many operands come from the more efficient bypass network.

I can't speak to how GPUs efficiently deal with large register files. The vendors in question are too secretive.:cry:

GPUs also suffer from divergent memory accesses. When they're out of ready threads, they stall. CPUs have both out-of-order execution and Hyper-Threading to keep going for a while. Furthermore, automatic prefetching detects access patterns so future misses are less likely. Last but not least, large L3 or even L4 caches keep a lot of data close with reasonable access latency.

Prefetching costs power and bandwidth and pollutes the cache if it guesses wrong...

That is simply not true. Mobile CPUs now use out-of-order execution, and consume just a couple Watts or less. Even Core M, derived from Intel’s desktop architecture, consumes just 4.5 Watt. The power consumed by out-of-order execution logic is a fraction of that (there’s memory controllers, an integrated GPU, caches, ALUs, etc.). Furthermore, the power consumption is amortized by SIMD width. With AVX-1024, each CPU core would have as many lanes as a GCN core. The GPU’s bookkeeping logic for scheduling threads does not come for free either. So it’s not unimaginable for them to pay the small price of out-of-order execution and benefit from better data locality to achieve lower power consumption.

The boost number I've heard bandied about for OOOe is 30%. Keep in mind that this is for a superscalar architecture - for something like a GPU, it would be considerably less since there are fewer instructions in flight, and since it has hyperthreading (the 30% was referring to an ARM architecture with no hyperthreading). I suspect it would end up around 10% or less. Is that worth the increase in die size and power use? If it was, I'm sure Nvidia and AMD would be using it right now.
 
How does it happen that half of all threads end up in Nick's personal "I want my 2-way interleaved SIMD-1024" threads? I thought he had his own thread for that...

That is simply not true. Mobile CPUs now use out-of-order execution, and consume just a couple Watts or less. Even Core M, derived from Intel’s desktop architecture, consumes just 4.5 Watt.

You are talking about watts. Where I work we are talking about milliwatts, microwatts and picojoules.

"Couple of watts" is a _lot_ of power for many cases.


But OOE is really needed to get good performance for latency-critical workloads with lots of dynamic behavior.
 
I'll make this short. Speculation is inefficient in terms of dynamic power. It's often cheaper to just stall and clock gate - the "race to sleep" isn't always true. Therefore, the only future where OoOE makes sense for graphics is in a dystopian world...
Except GPUs don't normally stall and clock gate. They swap threads, and thus push the data the previous thread was using out of the nearest, most power-efficient storage. People are often horrified by the ~5% of cases where speculation is a loss, but forget about the ~95% of the time it works perfectly and keeps the data right where you want it.
There may be bandwidth benefits to unification but I'm skeptical it's anywhere near enough to justify such an architecture. There are certain kinds of unifications that I do like (e.g. MIMD-SIMD hybrid ala NVIDIA's Einstein, tightly coupled scalar core beyond AMD GCN, etc.) and these may have very exciting programming model benefits, but what you are advocating is too inefficient to ever become viable IMO. There is no way the future of graphics will sacrifice power efficiency for the sake of architectural simplicity.
They already have. I mentioned before the example of caches versus managed storage. Likewise the GPU’s thread scheduling is dynamic to improve the worst case, but it costs power efficiency over the best case static scheduling. Clearly there is value in making things easier for developers so their average results are better. And you can’t arbitrarily draw the line somewhere, so out-of-order execution should be considered a possibility. Note that mobile CPUs, for which power efficiency is sacred too, are all switching to out-of-order execution!

RAM accesses cost up to half the power consumption of a GPU, so cutting that back even a little by making better use of caches can easily compensate for the cost of out-of-order execution, and allows programmers to focus on functionality.
I'm going to go one step further and say that excessively IPC-focused CPUs are a temporal fluke that only matter so much due to the lack of innovation in the programming language world. I cringe every time someone talks about Amdahl's Law for problems that are not inherently serial (but only become serial because of the lackluster tools or design philosophy used to approach them).
Crying about it doesn’t help. You either have to develop tools to help the application developers, or create hardware that’s better at executing mediocre code. You’ll quickly find that improving all the software in the world is a lot more work than improving the hardware that runs it. And unless you guarantee improvement on all fronts, you’ll have to rewrite all software again with every major architectural change. That's not workable. AMD tried to lower IPC in favor of a higher core count with Bulldozer by expecting a major breakthrough in software development, and failed miserably.

There’s a strong desire to make the GPU a first-class citizen of the system, with guarantees about compatibility with a virtual ISA and a certain degree of memory coherency. It’s the only way to achieve some form of fruitful software ecosystem where you can build on previous components instead of reinventing the wheel every time. But that’s going to expose the GPU to a lot more software than a handful of game engines. Because there won't be much of a driver left, they’ll have to run all legacy code on their latest architecture at least as efficiently as on their previous ones. This means exploiting more ILP, more DLP, more TLP, and more data locality. Even if it may seem excessive at some point, that will just be the norm from which there is no way back.

CPUs already have a fruitful software ecosystem and IPC has already pretty much reached its limit, so next up is AVX-512 and then AVX-1024, together with things like TSX to keep scaling the number of cores. So it’s all evolving in the same direction, and unification of the CPU and GPU is inevitable.
The earlier discussion on parallel reduction is a good example - Andrew is right that it's a perfect example of Amdahl's Law, but Sebbi is also right that in the vast majority of cases, there is some other workload you could run in parallel - making it a serial bottleneck for that algorithm, but not the overall program. We don't need as much "serial" performance as you think we do; we just need it in the right places.
For argument’s sake let’s say “the vast majority of cases” is 80%. That leaves 20% for which you can’t easily find another workload to run in parallel. No big deal, right? It’s only a minority of applications. Then the next generation there’s again 80% of the applications for which there are still no scaling issues because you can add more workload. But that’s now 80% of 80%... and after the third generation you’re only left with half of the applications for which Amdahl’s Law isn't an issue, and it keeps getting worse.
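Just to spell out the arithmetic (p is the parallel fraction, N the number of lanes; the 80% figure is of course arbitrary):

```latex
S(N) = \frac{1}{(1 - p) + \dfrac{p}{N}}
\qquad\qquad
\text{share still scaling freely after } g \text{ generations} = 0.8^{g}:
\quad 0.8^{1} = 0.80,\;\; 0.8^{2} = 0.64,\;\; 0.8^{3} \approx 0.51
```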

So reality is situated somewhere between Amdahl’s Law and Gustafson’s Law. Which one applies the most depends on the type of application, but eventually Amdahl’s Law does catch up with you. We’re nearing the end of pixel scaling for most devices, and we also can’t afford any more increase in latency. So to continue scaling performance by adding more ALUs, you have to eventually find the parallelism at the instruction level. Hence out-of-order execution for GPUs is bound to happen.
 
There’s a strong desire to make the GPU a first-class citizen of the system, with guarantees about compatibility with a virtual ISA and a certain degree of memory coherency. It’s the only way to achieve some form of fruitful software ecosystem where you can build on previous components instead of reinventing the wheel every time. But that’s going to expose the GPU to a lot more software than a handful of game engines. Because there won't be much of a driver left, they’ll have to run all legacy code on their latest architecture at least as efficiently as on their previous ones. This means exploiting more ILP, more DLP, more TLP, and more data locality. Even if it may seem excessive at some point, that will just be the norm from which there is no way back.

No, there is no need to extract more ILP when executing embarrassingly parallel code. It does not matter that it's old code; when the intermediate language and the program are made for embarrassingly parallel code, ILP is not needed. Just add more cores or SIMD/SIMT lanes in the future to get better performance.

CPUs already have a fruitful software ecosystem and IPC has already pretty much reached its limit, so next up is AVX-512 and then AVX-1024, together with things like TSX to keep scaling the number of cores. So it’s all evolving in the same direction, and unification of the CPU and GPU is inevitable.

Because IPC is reaching its limits, in order to get the last remaining percents of that serial performance the CPUs have to throw even more resources into things like branch prediction.

We’re nearing the end of pixel scaling for most devices, and we also can’t afford any more increase in latency. So to continue scaling performance by adding more ALUs, you have to eventually find the parallelism at the instruction level. Hence out-of-order execution for GPUs is bound to happen.

Display resolutions have been increasing, and 4K displays are coming. There is quite a lot of parallelism available to draw all of those pixels; you can just add wider SIMD/SIMT or put in more cores.

Another point of view:
Frames have to be drawn 60 times per second. With a 1.5 GHz clock rate that means there are 25 million clock cycles per frame. You can run quite complicated shader programs in those 25 million cycles, even with quite bad IPC, if you have your own processor (or SIMD lane) for every pixel.
 
One of the (big!) problems that no one's mentioned WRT CPU GPU unification is clock rate. For good serial performance you need a high clock rate. The problem is that power usage increases *quadratically* with clock rate. So if you take a 1 GHz GPU and scale it to 4 GHz you will use 16x the power! Normalizing performance (assume 1 GHz to 4 GHz makes it 4x faster, and that the longer pipelines you need are offset by needing fewer threads), you still have 4x the power for an equal amount of compute!

 
Thus it's very hard to make the CPU and GPU run in lock step without stalling either of them.

Running in lockstep (or almost lockstep) is very hard on PC, since the CPU and GPU performances vary so much.
...
This means that the CPU has submitted the whole frame to the command buffer 25 milliseconds earlier than the last GPU command finishes. There's no way that untouched data in 8 MB L3 cache shared by CPU and GPU lasts for 25 milliseconds.
This is actually partly a response to some of the earlier discussion on transactional memory, that I was too slow to think about.
The first point on lockstep is interesting, in the sense that true lockstep execution is generally avoided outside of high-RAS setups where cores are purposefully slaved together at the level of silicon or hypervisor/firmware, and it's a non-trivial thing to accomplish.
I think those developing for discrete PCs were insulated by the horrendous latencies, but I'm not certain even the current console APUs have peeled away quite enough of that latency to see the gnarly reality underneath. Perhaps "lockstep" here is more tolerant up to some number of stall cycles.

To segue into the transactional memory bit: the discussion of short-lived 8MB cache.
At least for architectures like GCN, I'm not certain transactional memory is a win with GPUs as we know them. Looking at Intel's somewhat mottled introduction of limited transactional support, we have transactions that currently rely on keeping their write set in core-local storage with a likely short wall clock period for when a transaction becomes globally visible.

GCN is not a very good fit for that. Its concept of coherence relies on missing the CU-local L1 and writing to a very dumb L2. Intel's transactions will fail if there's an eviction outside of the limited storage domain (not always if it's a line in the read set, not sure what that's about), and GPUs miss a lot.
IBM's Blue Gene implementation has an L2 that is capable of monitoring and invalidating a smaller number of L1 caches, which is a much more robust memory system. Its memory subsystem is probably faster at operations like draining queues and otherwise preparing for transactions compared to what it takes for GCN GPUs to drain their queues.

Because GPUs do so well on miss-prone and/or bandwidth hungry workloads, the likely amount of data and the length of time it takes to make anything globally visible makes an Intel or IBM transaction look a little shaky to me. There's such a wide window of time for transaction-aborting events on so much data, and there would be overheads associated with getting a pristine transaction set when memory is so inconsistent.
The SIMT paradigm is another headache, or at least could be. I'm really not sure what AMD's position is on SIMT or SPMD. If AMD continued on insisting each work item were a thread, the chance of 1 of 64 "threads" annihilating a transaction when everything else operates at the level of a wavefront sounds unappealing.
If the hierarchy doesn't change, it might work better if versioning were built into specific areas of memory on-die or physical addresses at the memory controller level, although this still seems like it's going to incur additional latencies.


You know, I had a thought. (wide) SIMD is something of a worst case for OOO.
The SIMD instructions themselves should be well-behaved for rename or the tracking of hazards and operand bypass.* Vector pipelines and FPUs also typically operate in a less precise manner than the more unforgiving standard the integer pipeline is held to.

*Unless you start playing with the number of operands, or start to get cute with the scheduling.
That the integer pipeline in various workloads does get a decent workout when the SIMD is running well can encourage a broader design than a core that can target one more than the other.

The memory subsystem is one point where the sides of the core don't always share the same tradeoffs, what with one being happy with smaller data elements and intolerant of latency, while the other latency tolerant, wide, and possibly involving some shifting or conversion as part of the process.
It's not actually required that they share all of it. Itanium, for example, had FP loads and stores go directly to the L2, but it's a slippery slope for those who want to maintain some kind of aesthetic purity. I mean, once you start differentiating one of the dominant architectural facets of a von Neumann machine, what else might you be tempted to split off...?

So about that register file: CPUs often use register renaming to effectively increase their register file size. The problem is that register renaming isn't as versatile as simply having more registers.
The renamer doesn't have to stop at the same points as a compiler. It can rename whatever registers the path of the CPU's dynamic execution touches, without caring whether said instructions are in a function the compiler could not optimize.
Optimizations that bloat the instruction stream can lead to more instruction cache misses. It's something that early dynamic optimization schemes like Dynamo could take into account since they could target an OoO core. A loop the optimizer can leave as-is can mean multiple cache lines not at the risk of thrashing, or a powered-down front end in the case it fits in a loop buffer.

Being able to handle loops that can be renamed relatively painlessly is one area I think Denver could have some architecture-specific optimizations for, or could significantly improve on if it did go OoO someday.

The size of the register file and how it's designed is also an important consideration if a core is targeting serial performance, which at least I hope some will still try for in the future.

I'm going to go one step further and say that excessively IPC-focused CPUs are a temporal fluke that only matter so much due to the lack of innovation in the programming language world.
This is begging the question. If it's excessive, why should a thing persist?
It's also a bit too late to call decades of increasing single-threaded performance a fluke; up until this most recent chapter, the whole of modern computing was underpinned by it.
Reality keeps asking our idealized architectures to run code that wasn't written by the binary angels, so I admire hardware that isn't perfectly circumscribed by its glass jaws.

I don't see a zero-sum situation, although I may arbitrarily declare things unacceptable if the whimsy strikes me. ;)


One counterpoint to having everything on one die is a physical one. AMD's CPU cores have suffered mightily transitioning from a physical process that was tailored for them, and on the flip side the GPU portions of their APUs have had a progressively easier time as the CPUs have lost ground.
There are still gains to be had from recognizing different physical requirements, but the relative cost of going off-die these days is quite high.
For AMD, and probably any consumer device manufacturer other than Intel, one-die makes even more sense because it's unlikely they can have a bleeding edge CPU or the process+engineering to go with it.

Intel is at an odd point, currently.
Intel's new process has tightened the pitch of its interconnect layers, which has closed a density gap with foundry processes it's had since 65nm or so.
Broadwell-Y shows a changed physical target for its design, which may have figured into the claim that Intel had shifted to a 2:1 perf/power tradeoff requirement, which leaves me antsy for a non-mobile variant or the advent of Skywell.


OOO hardware does not by any means schedule optimally! It effectively uses a crude greedy algorithm, which is likely worse than what a compiler could do offline.
The exact workings of today's high-end CPUs can be more complicated than that. It's not something that frequently gets disclosed, but if there were things that deserve the label of "secret sauce", the internal settings, design tweaks, and heuristics that have been iterated and implemented in the real world for decades would be good candidates.


I doubt it - where did you hear this?
GCN is pretty heavy on the 4-cycle VALU loop, and most operations in the ALU category tend to have a fixed latency within an implementation (though it can vary between designs, like DP).
Nvidia's are pretty regular as well, although last I checked they had more variety.

GCN appears to divide its CU issue capability down into a series of imperfectly abstracted domains, where certain sets of operations with reasonably similar behaviors get managed semi-independently.
Certain things do break the veil, and necessitate explicit wait states.

CPU ops are more free to have varying latencies, but those that may vary a lot sometimes are only partly pipelined or get handled by microcode.

Or you could do a JiT stage and schedule specifically for the processor in question. Instruction scheduling is cheap - so much so that it can be done dynamically at runtime millions of times for each instruction in a loop.
At least current designs are going to have a problem. At some point the system is going to write the new optimized instructions out to memory and at some point a core may need to fetch them.
The optimizer needs to balance the cost of its own optimization and the cost in time and energy incurred by having to invalidate cached instructions and fetching them.

The thing about doing things in silicon is that they probably will mess up from time to time, but their mistakes in many cases can be small or may converge very rapidly with the lowest external cost.
 
There are multiple nice new parallel languages in development, but I haven't had time to properly experiment with any of them.

Personally I like C++11 a lot. It improved things nicely over the old C++ standard, and C++14 seems to be continuing the trend. C++ is moving forward nicely. You can even finally create variable parameter functions that are type safe (thanks to variadic templates) :)

There are many nice parallel programming libraries developed on top of modern C++. There are CPU based ones such as Intel TBB and Microsoft PPL. Both can be used to implement modern task/job based systems, but unfortunately neither is cross platform compatible (Windows, Linux, Mac, XB1, PS4, iOS, Android, etc), so we game developers still need to write our own systems. Nothing bad about that either, since C++ is a very flexible language, so you can usually add things that the core language doesn't support. One thing that is not possible to implement on top of straight C++ (without super excessive expression template hackery that takes years to compile) is writing SPMD-style programs that could compile to optimized SoA AVX2 vector code (and be automatically multithreaded). For this you need to use something like the Intel SPMD Program Compiler (ISPC), and that's not cross platform compatible either.
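As a trivial example of the task/parallel-loop style that TBB (and similarly PPL) gives you; note that this spreads the loop across cores but does not SoA-vectorize the body, which is exactly the gap the SPMD compiler fills:

```cpp
#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>
#include <vector>

// Minimal TBB parallel loop: the library splits the range into tasks and
// schedules them on its worker threads.
void scale_all(std::vector<float>& data, float factor)
{
    tbb::parallel_for(tbb::blocked_range<size_t>(0, data.size()),
        [&](const tbb::blocked_range<size_t>& r) {
            for (size_t i = r.begin(); i != r.end(); ++i)
                data[i] *= factor;
        });
}
```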

C++AMP is nice for writing GPU code inside your CPU code. It uses standard C++11 primitives (lambdas) to present kernels, and it's fully C++11 compatible in syntax with templates and all (one extra language keyword added). Auto-completion and all the other modern IDE productivity improvements work properly with it. You could also theoretically compile C++AMP code to standard CPU code by implementing a small wrapper library. However this wouldn't automatically SoA vectorize your code (but you could multithread it easily). C++AMP debugging is also fully integrated to Visual Studio. You can step inside the code easily and inspect the memory and variables and go through call stacks. I would prefer writing all my GPU code in C++AMP instead of HLSL, but unfortunately C++AMP is only available on Windows, and that's a big show stopper.
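For those who haven't used C++AMP, the lambda-as-kernel style looks roughly like this (a trivial scaling kernel of my own, not production code):

```cpp
#include <amp.h>
#include <vector>

// Trivial C++AMP kernel: a standard C++11 lambda, plus the one extra keyword (restrict(amp)).
void scale_on_gpu(std::vector<float>& data, float factor)
{
    concurrency::array_view<float, 1> av(static_cast<int>(data.size()), data);
    concurrency::parallel_for_each(av.extent,
        [=](concurrency::index<1> idx) restrict(amp) {
            av[idx] *= factor;
        });
    av.synchronize();   // copy results back to the host vector
}
```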

If you compare C++AMP to the ideal solution, it sorely lacks the ability to call other kernels inside a kernel. Kepler now has that feature in the latest CUDA version (Dynamic Parallelism). This feature is sorely needed for all GPU hardware and all GPU APIs. Being able to spawn other shaders (proper function calls) inside a shader is one of the things that has limited the ability to run general purpose code on GPUs.

AMD Announces Heterogeneous C++ AMP Language for Developers. So now there's C++ AMP for Windows and Linux.
 
I'm not certain transactional memory is a win with GPUs as we know them. Looking at Intel's somewhat mottled introduction of limited transactional support, we have transactions that currently rely on keeping their write set in core-local storage with a likely short wall clock period for when a transaction becomes globally visible.

GCN is not a very good fit for that. Its concept of coherence relies on missing the CU-local L1 and writing to a very dumb L2. Intel's transactions will fail if there's an eviction outside of the limited storage domain (not always if it's a line in the read set, not sure what that's about), and GPUs miss a lot.
GPUs do miss a lot. However what I am suggesting here is basically an extended atomic operation (that writes up to two 64 byte cache lines). An operation like this would make it possible to generate many data structures more efficiently on the GPU (parallel creation where thousands of threads add / remove data from the same structure). GPUs are designed to handle long memory latencies, so it shouldn't matter if the GPU instead needs to wait for another core to finish the atomic 2 cache line operation. The CU could just schedule some other waves/warps until the "transaction" is finished. I don't think this would need much extra hardware.
If AMD continued on insisting each work item were a thread, the chance of 1 of 64 "threads" annihilating a transaction when everything else operates at the level of a wavefront sounds unappealing.
I would prefer serialization. An extended atomic (that can simultaneously write to two separate cache lines) instead of a transaction that can fail. GPUs manage memory stalls quite elegantly (and serialization is already used for other conflict cases).
This is begging the question. If it's excessive, why should a thing persist?
It's also a bit too late to call decades of increasing single-threaded performance a fluke; up until this most recent chapter, the whole of modern computing was underpinned by it.
Reality keeps asking our idealized architectures to run code that wasn't written by the binary angels, so I admire hardware that isn't perfectly circumscribed by its glass jaws.
Game developers have been transitioning to data oriented models (some started this already in the PS2 era, but it has recently become more common). These models heavily reduce the need for pointer indirection and branches, and can often be nicely executed as parallel for loops (and/or vectorized). At the same time these models make program maintenance and expansion easier. MMOs in particular have been using these models to ensure that the code stays clean during the long post production.

This is a quite nice explanation about data-oriented design (made by a long time game developer): http://www.dataorienteddesign.com/dodmain/

I recommend using the Readability (https://www.readability.com/) plugin if you are using Google Chrome. It makes that page so much more usable (small text on a gray background).
At some point the system is going to write the new optimized instructions out to memory and at some point a core may need to fetch them.
The optimizer needs to balance the cost of its own optimization and the cost in time and energy incurred by having to invalidate cached instructions and fetching them.

The thing about doing things in silicon is that they probably will mess up from time to time, but their mistakes in many cases can be small or may converge very rapidly with the lowest external cost.
It would be nice if NVIDIA was as open as Intel and AMD regarding their CPU architectures. The Denver core is quite big for an in-order core, approximately the size of two A15 cores according to the side-by-side comparisons in their Tegra K1 slides (http://blogs.nvidia.com/blog/2014/08/11/tegra-k1-denver-64-bit-for-android/). Denver has much bigger L1 caches (4x instruction cache, 2x data cache), but that alone shouldn't explain the massive difference. I wonder how much custom hardware they have for code optimization. Their optimizer is most likely a mixed HW and SW solution. Hopefully NVIDIA releases new whitepapers once Denver launches.
 