Is everything on one die a good idea?

It would be nice if NVIDIA was as open as Intel and AMD regarding their CPU architectures. The Denver core is quite big for an in-order core, approximately the size of two A15 cores according to side-by-side comparisons in their Tegra K1 slides (http://blogs.nvidia.com/blog/2014/08/11/tegra-k1-denver-64-bit-for-android/). Denver has much bigger L1 caches (4x instruction cache, 2x data cache), but that alone shouldn't explain the massive difference. I wonder how much custom hardware they have for code optimization. Their optimizer is most likely a mixed HW and SW solution. Hopefully NVIDIA releases new whitepapers once Denver launches.

Starting with the cache, here's an example of an A15 die shot:

http://www.techinsights.com/inside-samsung-galaxy-s4/

I'm not looking at pixels exactly or anything, but I'd give a really rough estimate that the icache + dcache + tags constitute about a third of the core area (this isn't including L2, which is off to the side somewhere). Multiply that cache area by 3 and you've got about 5/3 the space - not quite 2x, but it's getting pretty close. I expect L1 ITLB and DTLB sizes to roughly scale with L1 cache sizes.
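
Spelled out, under the rough assumption that the L1 arrays really are a third of the core and everything else stays the same size:

\[ \tfrac{1}{3} \times 3 + \tfrac{2}{3} = \tfrac{5}{3} \approx 1.67\times \text{ the A15 core area} \]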

nVidia also claims that branch prediction will be better (not just a lower misprediction penalty, which they also claim, but a lower misprediction rate). This tends to imply bigger, sometimes much bigger data structures for tracking different types of branch histories and voting on the results. Loosely related is the 1K optimized branch target cache which normal CPUs won't have.

Denver will save by not needing as many ARM decoders, renamers, a ROB, schedulers, etc. But if it's anything like Efficeon in this regard, I expect it to have larger speculation buffers than a normal in-order core so it can speculate things like alias prediction over larger and less dynamic regions than a pipeline shadow. This will need various buffers and potentially some big structures to look up conflicts. Likewise, it will need some buffers for holding additional state for runahead execution, if it indeed supports that.

Now for a totally made up guess as to how the compilation and optimization process could work. I emphasize, this is all totally made up, but it's too annoying to annotate everything with "I'm heavily speculating on an idea or something very very vaguely similar" so please just read that implicitly :p

I'm thinking something like dynamic statistics for the top N blocks, where a new block enters the list and increments an execution counter, while all the execution counters are periodically decremented so that all but the hotly executed blocks fall off the list. Blocks that pass an execution threshold will move to another list, while also being marked for recompilation.

The second list contains some profile statistics for the next top N recompiled blocks. These blocks will be continually monitored for catastrophic events, big slow things that fall outside the critical path like cache and/or TLB misses, branch mispredictions, alias mispredictions, I/O accesses, and so on. But the amount of actual data that can be stored is limited because multiple blocks are being tracked simultaneously. Like the execution counters, the numbers go down over time automatically to emphasize the hottest offenders. When a block passes a threshold, it's flagged to be compiled again.
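
To make the shape of that bookkeeping concrete, here's a made-up sketch in C++ (the class, thresholds, and decay policy are all invented for illustration; nothing here is NVIDIA's actual design):

```cpp
#include <cstdint>
#include <unordered_map>

// Hypothetical two-level "hot block" bookkeeping, as speculated above.
struct BlockStats {
    uint32_t exec_count  = 0;   // bumped on each execution, decayed periodically
    uint32_t event_score = 0;   // cache/TLB misses, mispredictions, etc.
    bool     recompiled  = false;
};

class HotBlockTracker {
    std::unordered_map<uint64_t, BlockStats> blocks_;   // keyed by block address
    static constexpr uint32_t kHotThreshold   = 1000;   // promote to the second list
    static constexpr uint32_t kEventThreshold = 64;     // flag for (re)compilation
public:
    // Called (conceptually) every time a translated block is entered.
    void on_execute(uint64_t pc) {
        auto& b = blocks_[pc];
        if (++b.exec_count > kHotThreshold && !b.recompiled)
            flag_for_recompile(pc);
    }
    // Called when the hardware reports a "catastrophic" event for a block.
    void on_event(uint64_t pc, uint32_t weight) {
        auto& b = blocks_[pc];
        if ((b.event_score += weight) > kEventThreshold)
            flag_for_recompile(pc);            // recompile even if already optimized
    }
    // Periodic decay so only persistently hot/troublesome blocks stay on the lists.
    void decay() {
        for (auto& [pc, b] : blocks_) {
            b.exec_count  >>= 1;
            b.event_score >>= 1;
        }
    }
private:
    void flag_for_recompile(uint64_t pc) { blocks_[pc].recompiled = true; /* enqueue pc */ }
};
```

A real implementation would presumably keep these counters in dedicated hardware rather than a hash map, but the shape of the bookkeeping would be similar.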

A block that's flagged for compilation or recompilation will run through the plain ARM decoders again, but this time will also collect more in-depth profiling data stored in one buffer just for that block. After the block is executed (or maybe after a few or all iterations if it's a loop?) the core will trap to the recompiler, which will have the recent live statistics in addition to the accumulated statistics to inform its decisions.

For the very hottest blocks that benefit from it the most, there could be multiple copies following long traces spanning multiple branches. It's possible that the branch prediction hardware itself works with the software via a mechanism that reduces or eliminates the overhead of picking traces based on future branch paths.
 
NVIDIA's Denver seems to beat the OoO competition according to their internal benchmarks (I know this is a questionable result until we get third-party numbers). And it doesn't seem to have huge dips in any of the common benchmarks they used. It is based on JIT compiling (and static JIT scheduling). OoO machinery schedules/renames the same hot code (inside inner loops) thousands of times every frame. It's a nice idea to do this once (and fine-tune the scheduling by periodic feedback from the execution units). Obviously you can't react to the most erratic cache misses this way, but at least you know which instructions cause the biggest misses. This might provide the JIT compiler enough information to add prefetch instructions and the necessary extra code. I have done this manually with a profiler so many times on in-order PPC cores that I believe a monkey (= JIT compiler) could do most of it automatically.
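For concreteness, a minimal sketch of the kind of manual prefetch insertion being described, using GCC/Clang's __builtin_prefetch (the prefetch distance is a made-up tuning constant of the sort you'd get from profiling):

```cpp
// The profiler shows which loads miss, so you prefetch a few iterations ahead of them.
// __builtin_prefetch is the GCC/Clang portable spelling (dcbt on PPC, PREFETCHT0 on x86).
float gather_sum(const float* data, const int* indices, int n) {
    const int kAhead = 8;                    // tuned from profiling, not computed
    float sum = 0.0f;
    for (int i = 0; i < n; ++i) {
        if (i + kAhead < n)
            __builtin_prefetch(&data[indices[i + kAhead]]);  // hide the gather miss
        sum += data[indices[i]];
    }
    return sum;
}
```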
The same results can be obtained with profile-guided optimization. The only advantage of doing it at run-time is that it can optimize legacy binaries for newer architectures. That's very helpful, and I mentioned it as one of the advantages of out-of-order execution as well, but it just can't achieve the performance of out-of-order execution for real-world applications. You don't need the memory latencies to be very erratic to gain from out-of-order execution. With static scheduling you can't take any variation into account. So if you predict data to be in L2, then you end up stalling for the cases where it's in L3 or RAM, and for the times it's actually in L1 you're unnecessarily delaying instructions, which might affect performance.

Also, startup time has a big effect on the user experience, and that's where dynamic code generation makes things worse. The dynamic code optimization thread can preempt foreground programs, and that adds unwanted variability in execution time. JIT-compiled languages are typically used by less performance critical applications, but when it's done as part of the architecture, everything gets penalized. Out-of-order execution does not make things worse for software that is already highly optimized.

It's an interesting variation on Transmeta technology, but I'm highly skeptical it's going to float this time around. To quote Linus Torvalds: it's BS.

Don't get me wrong. I would love for there to be an easier path toward CPU-GPU unification. Denver certainly is a step closer to the in-order architecture of the GPU, and perhaps NVIDIA will achieve unification that way. I just don't think it will be at the same practical performance level, and achieving that level will require out-of-order execution, or an equal amount of complexity. I also think dynamic code generation is very valuable, but it should be used selectively in a controllable or predictable way.

Denver will have to compete against Broadwell, which features a turbo clock of up to 2.6 GHz, AVX2, and 5% higher IPC. Meanwhile NVIDIA's white paper on Denver doesn't exhibit a whole lot of confidence: "It's a big gamble for the company and we're about to see if it paid off". My bets are on the sure thing.
 
The same results can be obtained with profile-guided optimization.

Not really. Profile guided optimization has two big limitations vs runtime recompilation:

1) The statistics are based on the run of one profiler instead of a specific user's inputs and actions
2) The statistics provide one static data point which is supposed to characterize a best case for all runs

And the more you try to generalize #1 the more it'll make #2 a smeared input.

Note that nVidia's approach can include recompiling the same code multiple times at runtime as conditions change. Transmeta has in fact described some conditions where they've done this.

There are some residual effects too, like profiling overhead potentially influencing the execution shape of what you're profiling.
 
Except GPUs don't normally stall and clock gate. They swap threads and thus push the data the previous thread was using out of the nearest, most power-efficient storage. People are often horrified by the ~5% of cases where speculation is a loss, but forget about the ~95% of the time it works perfectly and keeps the data right where you want it.
While you're correct that GPUs often push out data the previous thread was using, GPUs do clock gate quite often. If they didn't have this opportunity, peak power would be reached more often.
 
I couldn't have said this better myself :). People are always assuming that you need or want a big serial core for scheduling work for many smaller parallel cores. But scheduling isn't a serial task. You can do all the scheduling you need on the parallel cores.
It's not about the scheduling, it is literally about tracing the path through the ideal data dependency graph. The graph being wide doesn't change the fact that you ultimately still need something to rip through that path. I have never been talking about bad code here, but about the ideal complexities of these algorithms. As I've qualified, this is not likely to be a fundamental issue in what are still relatively narrow GPUs today, but the trajectory is clearly towards it getting worse. This isn't conjecture... you can measure how much time these sorts of critical paths take relative to the throughput of the GPUs and it's obviously getting worse over the years (i.e. they are taking up a larger percentage of the total frame time).

I also regularly need to point out to people that even distributed scheduling and "lockless" data structures obviously require atomics, which is fundamentally synchronization too. It's obviously the way to go and in a good cache system it does localize the conflicts somewhat, but it really just changes the constants, not the overall argument. TSX is similar - it just helps make synchronization dynamic and finer grained (i.e. cache line level) which is often difficult to do statically (as efficiently) with data-driven dependencies.
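
As a minimal illustration that "lockless" still means synchronizing through atomics: a textbook lock-free list push, where the CAS retry loop is exactly where cores contend on the cache line:

```cpp
#include <atomic>

struct Node { Node* next; };

std::atomic<Node*> g_head{nullptr};

// Lock-free push onto a shared LIFO free list. There is no mutex, but the
// compare-and-swap is still hardware synchronization on g_head's cache line.
void push(Node* n) {
    Node* old = g_head.load(std::memory_order_relaxed);
    do {
        n->next = old;                        // speculatively link to current head
    } while (!g_head.compare_exchange_weak(   // retry if another core won the race
                 old, n,
                 std::memory_order_release,
                 std::memory_order_relaxed));
}
```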

So while it's not something you guys necessarily need to think about for a while, it absolutely is something people designing hardware have to think about :) I think it's questionable to say that these critical paths will always be short enough that you can - for instance - take a big hit in serial performance and still be fine 10-20 years from now.

Anyways enough said on that as we're sort of off in the weeds, but I wanted to clarify the high level point.

Note that nVidia's approach can include recompiling the same code multiple times at runtime as conditions change. Transmeta has in fact described some conditions where they've done this.
Theory in this space tends to fall on deaf ears I'm afraid... you can always make the argument that a magic optimizing compiler with perfect knowledge can JIT the ideal schedule for any given situation. I've even made similar arguments myself over the years :) Unfortunately these are not new arguments and previous attempts to apply the concepts haven't exactly met with stunning success.

If NVIDIA has somehow done something fundamentally different and it works way better, that's great. But it's fair to say that scepticism is highly warranted until we get to see these chips in the wild. Unfortunately these sorts of architectures are also the type that make benchmarks almost completely useless in terms of predicting the performance in arbitrary workloads. Incidentally that's not all that different from GPUs I suppose :)
 
Theory in this space tends to fall on deaf ears I'm afraid... you can always make the argument that a magic optimizing compiler with perfect knowledge can JIT the ideal schedule for any given situation. I've even made similar arguments myself over the years :) Unfortunately these are not new arguments and previous attempts to apply the concepts haven't exactly met with stunning success.

If NVIDIA has somehow done something fundamentally different and it works way better, that's great. But it's fair to say that scepticism is highly warranted until we get to see these chips in the wild. Unfortunately these sorts of architectures are also the type that make benchmarks almost completely useless in terms of predicting the performance in arbitrary workloads. Incidentally that's not all that different from GPUs I suppose :)

Hey, I'm not under any illusion here that nVidia's approach can close the gap between typical in-order (or even worse, exposed pipeline VLIW) and proper OoOE, particularly for a wide variety of use cases. I'm just saying, it's not right at all to say that runtime compilation and recompilation offers no opportunities beyond traditional PGO.

In nVidia's case I think this is going to be less about making up all the difference with the compiler and more about trying to make it up by spending the saved power budget on other things (more cache, better branch prediction, aggressive prefetchers, run-ahead execution, etc.). But for the time being I also don't put any real stock in their numbers, much less would I consider them representative for many programs.
 
GPUs do miss a lot. However what I am suggesting here is basically an extended atomic operation (one that writes up to two 64-byte cache lines).
Are you speaking of a double compare and swap?
Is it a double compare and swap of two locations at arbitrary addresses?
Since we are mixing discussion between CPU and GPU, by cache line do you mean it updates in chunks of 64 contiguous and aligned bytes (a CPU packed operation or a GCN vector op with sequential addresses), or that the scheme works through the cache subsystem at line granularity but doesn't have to actually write those lines fully (divergent/masked/serialized at the lane level)?

I thought the discussion of TSX was arguing for transactional memory in the vein of Intel's, rather than DCAS.
DCAS has the distinction of being both beyond what CPUs support for their atomic operations and yet less ambitious than what AMD flirted with, what Intel is bringing out, or what IBM uses.
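
For clarity, DCAS in this sense would conditionally update two independent words as one atomic step. No mainstream ISA exposes it directly, so the sketch below only pins down the intended semantics, emulated with a lock (which is of course exactly what the hardware discussion is trying to avoid):

```cpp
#include <atomic>
#include <cstdint>
#include <mutex>

// Semantics-only sketch of DCAS (double compare-and-swap) on two unrelated addresses.
// Real hardware support is the open question in this thread; the mutex is just a stand-in.
std::mutex g_dcas_lock;

bool dcas(std::atomic<uint64_t>& a, uint64_t expect_a, uint64_t new_a,
          std::atomic<uint64_t>& b, uint64_t expect_b, uint64_t new_b) {
    std::lock_guard<std::mutex> guard(g_dcas_lock);
    if (a.load(std::memory_order_relaxed) != expect_a ||
        b.load(std::memory_order_relaxed) != expect_b)
        return false;                        // either location changed: fail as a whole
    a.store(new_a, std::memory_order_relaxed);
    b.store(new_b, std::memory_order_relaxed);
    return true;                             // both updated as one indivisible step
}
```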

If that subset of full transactional memory support is what works for what you want, then fine, although it doesn't resolve the unknown as to whether GPUs as we know them are capable of giving what you want.
GCN doesn't have the instruction format, and the ISA docs are dodgy on the ordering behavior of separate writes.
The Sea Islands ISA doc is particularly untrustworthy because AMD forgot whether it was talking about GCN or VLIW in that section, which has been pointed out on this forum and I'm sure AMD is rushing to fix it while we all make a point of politely pretending it is still early 2013.

GPUs are designed to handle long memory latencies, so it shouldn't matter if the GPU instead needs to wait for another core to finish the atomic two-cache-line operation. The CU could just schedule some other waves/warps until the "transaction" is finished. I don't think this would need much extra hardware.
There seems to be some assumed mechanism and method for this DCAS implementation that I think should be spelled out.
An Intel-like formulation of transaction processing involves tracking whether a line or lines have been accessed/modified/evicted in core-private cache, and upon completion of the transaction the write set atomically becomes globally visible. GCN's L2 is the one point that provides something close to this, but it is shared and lacks significant tracking or promises of atomicity or ordering.
The long latencies bring into question the atomicity of transaction commitment and raise the chances of an untoward event cancelling the transaction.

Having a wavefront wait while some other CU's transaction finishes is a thornier problem, because the question is how the CU knows a transaction is in progress, and whether the design is shifting from silently discarding failed transactions to some kind of assertion of ownership over lines in a globally visible shared last-level cache.
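
For reference, the Intel-style formulation being described maps onto the usual RTM lock-elision idiom on x86; the write set becomes globally visible atomically at _xend(), otherwise the whole thing rolls back and a fallback path runs. This is a sketch of that idiom, not anything GCN offers:

```cpp
#include <immintrin.h>   // Intel RTM intrinsics (_xbegin/_xend/_xabort); build with -mrtm
#include <atomic>

std::atomic<bool> g_fallback_lock{false};

void transfer(long& from, long& to, long amount) {
    if (_xbegin() == _XBEGIN_STARTED) {
        if (g_fallback_lock.load(std::memory_order_relaxed))
            _xabort(0xff);                 // someone holds the fallback lock: bail out
        from -= amount;                    // both lines join the transactional write set
        to   += amount;
        _xend();                           // commit: both updates appear at once
        return;
    }
    // Aborted (conflict, capacity, interrupt...): fall back to a plain spinlock.
    while (g_fallback_lock.exchange(true, std::memory_order_acquire)) { /* spin */ }
    from -= amount;
    to   += amount;
    g_fallback_lock.store(false, std::memory_order_release);
}
```

The point of reading the fallback lock inside the transaction is that it joins the read set, so a concurrent lock holder automatically aborts the transaction instead of racing with it.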


It would be nice if NVIDIA was as open as Intel and AMD regarding their CPU architectures.
There are various reasons why they might not, since the trends these days are usually not towards openness, even for AMD and Intel.

The Denver core is quite big for an in-order core, approximately the size of two A15 cores according to side-by-side comparisons in their Tegra K1 slides (http://blogs.nvidia.com/blog/2014/08/11/tegra-k1-denver-64-bit-for-android/).
I have doubts about relying on marketing slides for this.

In nVidia's case I think this is going to be less about making up all the difference with the compiler and more about trying to make it up by spending the saved power budget on other things (more cache, better branch prediction, aggressive prefetchers, run-ahead execution, etc.). But for the time being I also don't put any real stock in their numbers, much less would I consider them representative for many programs.
Also, as was pointed out for Transmeta, a software translation layer is great at covering up problems with the implementation.
The code morphing software did things that pointed to interlock problems or hardware bugs, and at a hardware level it would have been unacceptably primitive and dangerous to use without the interpreter--if things could be made functional.
A modern core isn't a trivial undertaking and time isn't on Nvidia's side.
An honest to goodness competitive OoO core would have been a challenge to get right without an extended history of making such cores.

edit:

One of the (big!) problems that no one's mentioned WRT CPU-GPU unification is clock rate. For good serial performance you need a high clock rate. The problem is that power usage increases *quadratically* with clock rate. So if you take a 1 GHz GPU and scale it to 4 GHz you will use 16x the power! Normalizing performance (assume 1 GHz to 4 GHz makes it 4x faster, that the longer pipelines you need are offset by needing fewer threads), you still have 4x the power for an equal amount of compute!
It's generally a quadratic relationship with voltage, and linear with clock speed.
Since higher clocks eventually necessitate voltage bumps, it actually translates to a cubic relationship over the whole voltage/clock range of a design. The extremely high peak at heavy overclocks is what happens when a design is being pushed past its speed and power targets.
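
For reference, the standard first-order CMOS power model behind that statement (textbook form, not taken from the post above):

\[ P \;\approx\; \underbrace{\alpha\, C\, V^{2} f}_{\text{dynamic}} \;+\; \underbrace{V\, I_{\text{leak}}}_{\text{static}} \]

Once f is pushed far enough that V has to rise roughly in proportion, the dynamic term trends toward f³ over the design's full voltage/clock range, which is the cubic relationship described above.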
 
Also, as was pointed out for Transmeta, a software translation layer is great at covering up problems with the implementation.
The code morphing software did things that pointed to interlock problems or hardware bugs, and at a hardware level it would have been unacceptably primitive and dangerous to use without the interpreter--if things could be made functional.
A modern core isn't a trivial undertaking and time isn't on Nvidia's side.
An honest to goodness competitive OoO core would have been a challenge to get right without an extended history of making such cores.

I wouldn't rule out Nvidia pulling something like this off. They have the resources and a solid product base to fund their research. Of course, I don't expect them to get it all right the first generation either:p

It's generally a quadratic relationship with voltage, and linear with clock speed.
Since higher clocks eventually necessitate voltage bumps, it actually translates to a cubic relationship over the whole voltage/clock range of a design. The extremely high peak at heavy overclocks is what happens when a design is being pushed past its speed and power targets.

I think the big thing that's done to stop/reduce the voltage hikes is to increase pipeline length, which is in fact precisely why all these latency reduction technologies came to be! To maintain voltage, I believe you'd have to increase pipeline length linearly with clock speed, so that each rank of transistors in a given stage has the same amount of time to stabilize. However, I would strongly suspect that increasing the pipeline length four times to go from 1 GHz to 4 GHz would impact latency well beyond what you could fix with OOO etc.

In fact, I think we may be making a mistake comparing latency optimizations needed at 1 GHz to those at 4 GHz, ESPECIALLY if you add superscalar to the 4 GHz case! Between the more and longer pipelines, you need quite a bit more ILP to cover it!
 
OOO hardware does not by any means schedule optimally! It effectively uses a crude greedy algorithm, which is likely worse than what a compiler could do offline.
Indeed out-of-order execution in and of itself doesn't result in optimal scheduling. But out-of-order execution doesn't mean you can no longer statically schedule. Out-of-order execution will further improve on any static schedule whenever latencies are not as predicted. In addition, register renaming allows the static scheduling to ignore false register dependencies, which is something you can’t do for an in-order architecture.
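
As a toy illustration of static scheduling: both functions below do the same work, but the second exposes four independent dependency chains so even an in-order machine can overlap the FP latency, while an out-of-order core with renaming will still clean up whatever the static schedule got wrong at runtime.

```cpp
// One serial dependency chain: every add must wait for the previous one.
float dot_serial(const float* a, const float* b, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; ++i)
        s += a[i] * b[i];                    // each iteration depends on the last
    return s;
}

// Statically scheduled: four independent accumulators expose ILP up front.
float dot_scheduled(const float* a, const float* b, int n) {
    float s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    for (; i < n; ++i) s0 += a[i] * b[i];    // remainder
    return (s0 + s1) + (s2 + s3);
}
```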

I believe that in theory out-of-order scheduling is optimal when you have infinite execution units and an infinite instruction window, for any order of the instructions and any latencies. For in-order execution this isn't true, even with infinite resources, so it's always at a disadvantage. And for all practical purposes, Haswell's eight execution ports (four arithmetic) and 192-instruction scheduling window get pretty close to optimal scheduling when you make a reasonable effort at static scheduling (e.g. trace scheduling).

So your downplaying of out-of-order execution isn't really justified. I’m not denying that practical implementations aren't 100% optimal, but it gets darn close and in-order execution essentially can’t compete with that. Mobile CPUs use out-of-order execution too now, and there’s no turning back from that. With all due respect NVIDIA’s first attempt at designing its own CPU will probably be remembered in the same way as NV1: using fundamentally the wrong architecture.
Or you could do a JiT stage and schedule specifically for the processor in question. Instruction scheduling is cheap - so much so that it can be done dynamically at runtime millions of times for each instruction in a loop.
Scheduling is cheap in hardware, not in software. Out-of-order execution essentially uses CAMs which perform hundreds of compares per cycle. In software it’s so expensive some JIT compilers don’t (re)schedule, and those that do use cheap heuristics that aren't as thorough as out-of-order scheduling hardware.

Trying to improve on out-of-order execution is great, but it should use out-of-order execution as the starting point, not in-order execution. There’s some research on dual-stage scheduling which uses a slow big scheduler and a fast small scheduler to save on power while increasing the total scheduling window, but as far as I’m aware that hasn't been used in practice yet, probably due to direct improvements on single-stage scheduling. There are also ideas to switch to in-order execution when there’s good occupancy due to Hyper-Threading, but when you have lots of parallelism you probably want wide SIMD instead, and the cost of out-of-order execution gets amortized.
Who's writing in assembler and dealing with this? That's the job of the compiler optimizer / JiTter, and is done completely behind the scenes without the developer having to lift a finger.
First of all you don't have to write in assembler to have to deal with scheduling for an in-order architecture. Secondly, compilers are written by developers too, and are in a constant state of flux. Scheduling is not a solved problem, and between all the other things that are expected of a compiler, it is hard to keep up to date. Also, again, JIT compilation has a very limited time budget for scheduling. This is a serious concern for run-time compiled shaders and compute kernels, which shifts some of the problem to the application developer. Out-of-order execution makes things a lot easier for everybody. You can do a somewhat sloppy job and still get great results, and have your legacy code run faster without recompilation.

You can curse at developers for this, but the reality is that they have a lot of other issues on their mind, mostly high-level ones, to not want to bother with low-level issues. The increase in productivity you get from out-of-order execution is something that shouldn't be underestimated, and is a significant factor in its widespread success.
Also keep in mind that with SIMD you typically execute both branches.
...Which would completely defeat the purpose of branch prediction.

This isn't really true anyway - a well optimized algorithm can organize execution so that quite a bit of the data falls into uniform branches.
Yes, my argument is that you can have it either way, or both. If the branching is uniform and predictable and the conditional blocks are large, you can have a real jump instruction. If the branching is less uniform and the blocks are small, just execute both branches. Note that with AVX-512’s predicate masks, it can improve upon the latter by skipping instructions for uniformly non-taken branches. In theory it should be able to skip four such instructions per cycle, so the threshold for wanting to use an actual jump becomes quite high. On the other hand if you’re not sure about the branching behavior it’s still safe to have a real jump because prediction will help you out in 95% of the cases. You don’t get that luxury with in-order non-speculative execution.
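
As a sketch of what "just execute both branches" looks like with AVX-512 masking (the kernel and constants here are invented for illustration; remainder handling is omitted):

```cpp
#include <immintrin.h>   // AVX-512F intrinsics

// Branch-free form of: x[i] = (x[i] > t) ? x[i] * a : x[i] + b;
// Both sides are issued, and the not-taken lanes are simply masked off.
void both_branches(float* x, int n, float t, float a, float b) {
    const __m512 vt = _mm512_set1_ps(t);
    const __m512 va = _mm512_set1_ps(a);
    const __m512 vb = _mm512_set1_ps(b);
    for (int i = 0; i + 16 <= n; i += 16) {
        __m512    v   = _mm512_loadu_ps(x + i);
        __mmask16 gt  = _mm512_cmp_ps_mask(v, vt, _CMP_GT_OQ);   // lanes taking branch 1
        __mmask16 ngt = static_cast<__mmask16>(~gt);             // lanes taking branch 2
        __m512    mul = _mm512_mask_mul_ps(v, gt, v, va);        // branch 1 under mask
        __m512    out = _mm512_mask_add_ps(mul, ngt, v, vb);     // branch 2 under ~mask
        _mm512_storeu_ps(x + i, out);
    }
    // (scalar remainder loop omitted for brevity)
}
```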
First of all, you definitely can JIT-compile SIMD code on the CPU. That’s what I've always done.
You're missing the point. In order for the ISA to be tweaked, for instance, to add more registers, ALL programs must be JiT-compiled. Any that aren't will break! The CPU ecosystem is nowhere near this point.
That’s not true. x86-64 doubled the number of registers without breaking backward compatibility for x86-32 applications. AVX-512 will double the SIMD register count without affecting any prior code. Also, when you compare different ISAs it’s clear that it’s not so critical in the first place. Micro-architectural features and backward compatible extensions have a bigger effect on performance than any overhauls that would break compatibility.

Dynamic compilation is a valuable technique, but you don’t need/want it everywhere.
That's not completely true. Even at fairly high occupancy, a GPU has 32ish registers per thread. If your problem needs more immediate storage, you can lower the thread count and increase the register count per thread quite a bit - new Nvidia cards support up to 255 registers per thread!
The problem is that even though 32 registers might seem plenty compared to x86-64’s 16 registers, the CPU can leave several megabytes of data on the stack. Want a local array of 16k elements? No problem on the CPU, impossible on the GPU. Want to recurse the same function a few thousand times? No problem on the CPU, impossible on the GPU. Of course if the data is shared you could explicitly use shared memory and manage access to it, and you can convert recursion into a loop, but that’s putting a heavy burden on the developer.
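
For what it's worth, both of the things mentioned are completely unremarkable C++ on a CPU, because the stack is just cached memory (a hypothetical sketch; there is nothing special about these functions):

```cpp
#include <cstddef>

// A 16K-element local array (~128 KB) lives on the stack and spills to L1/L2 as needed.
long scratch_sum(const long* src, std::size_t n) {    // assumes n > 0
    long local[16384];
    for (std::size_t i = 0; i < 16384; ++i)
        local[i] = src[i % n];
    long s = 0;
    for (long v : local) s += v;
    return s;
}

// A few thousand recursive calls are just a few thousand small stack frames.
long count_down(long depth) {
    return depth == 0 ? 0 : 1 + count_down(depth - 1);
}
```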

Microsoft increased the register limit from 32 in Shader Model 3.0 to 4096 in Shader Model 4.0. SM3 is ancient nowadays. And while you won't run into the 4096 limit that easily yet with the relatively small bits of code the GPU is expected to run, anything in between results in lower occupancy.

There’s a reason GPUs continue to increase their register set or lower the latencies every generation. It allows them to run more complex code without going back to the CPU. The end goal is to be able to start executing parallel code in the middle of sequential code, even when you already have tens of functions’ stack frames on the stack. That’s only going to be achieved by a unified architecture.
Of course, this does decrease the amount of hyperthreading you have available, but the important part is that the developer gets to tune it themselves!
You don’t really get to choose how many local variables you need, or how deep your stack gets. For GPUs to have a flourishing software ecosystem you have to be able to call other people’s GPU code libraries. For this they have to be able to park stack frames in generic memory instead of precious registers, and they have to lower the thread count so that this data will still be in the caches by the time the call returns. And to achieve that you need out-of-order execution.
Prefetching costs power and bandwidth and pollutes the cache if it guesses wrong…
If and when they guess wrong, yes. But automatic prefetching is quite conservative (it really needs a strong pattern), and if it predicts wrongly it immediately aborts. This can be tuned as desired for best performance/Watt. Not prefetching results in stalls or switching threads, and the latter also ‘pollutes’ the cache for the first thread. Research shows that GPUs should prefetch too to improve power efficiency.
The boost number I've heard bandied about for OOOe is 30%. Keep in mind that this is for a superscalar architecture - for something like a GPU, it would be considerably less since there are fewer instructions in flight, and since it has hyperthreading (the 30% was referring to an ARM architecture with no hyperthreading). I suspect it would end up around 10% or less. Is that worth the increase in die size and power use? If it was, I'm sure Nvidia and AMD would be using it right now.
First of all, the speedup from out-of-order execution greatly depends on the rest of the micro-architecture, especially the execution width. In-order architectures simply don’t come as wide as out-of-order ones, so it gets hard to compare. Secondly, don’t trust singular results. It is obviously going to greatly depend on the software, data set, and usage pattern how much you gain.

The Atom Z3770 is well over two times faster than the Z2760. Even if it was just 30% when eliminating all other differences, that’s not something anyone is willing to give up. So unification will require the scalar part to use out-of-order execution. Strictly speaking the SIMD part could use in-order execution, but since the cost is amortized over a great vector width, you might as well stick with out-of-order execution. Alternatively it could use less aggressive out-of-order execution. In fact that’s exactly what I’m proposing by executing AVX-1024 on 512-bit units. It takes two cycles to issue each instruction, so you can have two SIMD clusters and have a single scheduler alternate between them. Of course that’s the same cost as issuing them on 1024-bit units each cycle when you have just one SIMD cluster, but that wouldn't come with the latency-hiding benefits. Also, if you pin each of the two 512-bit clusters to a single thread, you can really use out-of-order execution to augment the utilization.
 
Running in lockstep (or almost lockstep) is very hard on PC, since CPU and GPU performance varies so much. Compared to a mid-class gaming PC, a high-end Intel mobile CPU has roughly twice the CPU performance and half the GPU performance. Let's assume the game is designed to run at 60 fps on the mid-class "balanced" gaming PC, and it utilizes 16.6 milliseconds of both CPU and GPU time to render a frame. The game could be running roughly in CPU<->GPU lockstep on this setup. However when you run the same game on the Intel high-end mobile part, the GPU time per frame is 33.3 ms and the CPU time per frame is 8.3 ms. This means that the CPU has submitted the whole frame to the command buffer 25 milliseconds before the last GPU command finishes. There's no way that untouched data in an 8 MB L3 cache shared by the CPU and GPU lasts for 25 milliseconds. However it becomes really interesting if the L4 caches become big enough to cover all the memory accesses inside a frame. 128 MB (Crystalwell) is not yet enough, but we are talking about the future here.

An alternative model would fire a realtime-priority CPU callback whenever the GPU ring buffer is "almost empty", and the CPU would put the commands and the needed data into the command buffer just before they are needed. This way the commands and the data would be in the L3 cache when the GPU execution starts. But there's a problem with this model as well. It's hard to define "almost empty" in a way that ensures the GPU never runs out of work. If the window is too long, then the CPU-generated data has already gone out of the caches. If it's too short, then the GPU might stall (for example some object is fully hi-Z culled, and the pixel cost was much less than expected).

I think it would be much simpler to just hold the CPU<->GPU command buffers in dedicated SRAM on chip, which the OS can control. There is no reason to communicate here via DRAM when the cores themselves are on the same die.
 
I wouldn't rule out Nvidia pulling something like this off. They have the resources and a solid product base to fund their research. Of course, I don't expect them to get it all right the first generation either:p
Which part did they pull off?
Not an OoO core--referencing Denver, and most probably not a complex CPU core with no errata--referencing everyone.

Barring a crippling flaw that somehow breaks the software layer, Nvidia can do what Transmeta did and compensate for vulnerabilities in the hardware by putting the necessary workarounds in the translated code cache.

I think the big thing that's done to stop/reduce the voltage hikes is to increase pipeline length, which is in fact precisely why all these latency reduction technologies came to be!
Additional pipeline stages can be added to allow for better voltage scalability in critical paths, but beyond selective application to things like the memory pipeline it can threaten the performance of the design. Heavier pipelining also increases transistor counts that directly contribute to the static power figure in the graph cited earlier.

To maintain voltage, I believe you'd have to increase pipeline length linearly with clock speed, so that each rank of transistors in a given stage has the same amount of time to stabilize. However, I would strongly suspect that increasing the pipeline length four times to go from 1 GHz to 4 GHz would impact latency well beyond what you could fix with OOO etc.
You're likely to run into a number of roadblocks with extreme pipelining for the sake of not bumping voltage.
One is performance tanking in absolute terms and per-watt.
Then there is the already mentioned increase in transistor count that will increase the static component of leakage in that graph.
Another is that while wire delay is typically a bigger obstacle these days than transistor switch times, splitting things up and adding drive stages cuts wire delay at the cost of an accumulation of small-but-not-zero transistor delays, and the proportion of delay sources is not constant throughout a design.

Then there's the cost of adding a pipe stage, since each stage involves passing signals through logic that takes data from one latch and outputs to the next.
The latches are a non-trivial contributor to pipe delay, and every reduction of the amount of logic work per stage also increases the fraction of a pipe stage taken up by stage overhead.

Just as an example:
http://www.realworldtech.com/cell/3/

The CELL processor had a very aggressively pipelined design whose cycle time was approximately 50% pipe stage overhead, and CELL still had to bump voltage to reach the upper range of its clock envelope.
http://www.realworldtech.com/cell/8/

Taken to the furthest extreme, you will simply run out of logic layers that do any work if you add enough stages, and the chip that can do nothing is probably not going to hit the super high clock range in that graph without a voltage bump.
 
No, there is no need to extract more ILP when executing embarrassingly parallel code. It does not matter that it's old code; when the intermediate language and the program are made for embarrassingly parallel code, ILP is not needed. Just add more cores or SIMD or SIMT lanes in the future to get better performance.
Wrong. Just adding more parallelism will run you into the Bandwidth Wall. The GeForce 8800 GTX could do 518 GFLOPS and had 86.4 GB/s of bandwidth, while a GTX 680 can do 3090 GFLOPS with 192.25 GB/s. That's six times the computing power but just twice the bandwidth to feed it. That’s only possible because of larger on-die caches to feed the additional units, and lower instruction latencies and dual-issue to improve locality. In effect they had to increase per-thread IPC (or rather decrease the CPI).
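
Putting those numbers in arithmetic-intensity terms:

\[ \frac{518\ \text{GFLOPS}}{86.4\ \text{GB/s}} \approx 6\ \text{FLOP/byte} \qquad \text{vs.} \qquad \frac{3090\ \text{GFLOPS}}{192.25\ \text{GB/s}} \approx 16\ \text{FLOP/byte} \]

so each byte fetched from DRAM has to feed roughly 2.7x more arithmetic, and that difference has to come from on-die reuse.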

The Bandwidth Wall also translates into a Power Wall. Bandwidth can’t be increased aggressively because it costs a lot more power, and unlike raw computing throughput it’s not helped much by the silicon process scaling. So ever more bandwidth will have to be delivered by caches, and the more you can fetch data from a closer cache level, the better the power consumption. To achieve that you have to run a low number of threads. Note that even Hyper-Threading with just two threads can result in no gains due to cache thrashing. So GPUs will have to continue to extract more ILP to stay efficient.
Because IPC is reaching its limits, in order to get the last remaining percent of that serial performance the CPUs have to throw even more resources into things like branch prediction.
Sure, but as long as it’s achieving a good performance : power ratio, that’s fine. Intel’s Broadwell architecture is claimed to uphold a 2:1 ratio or more for each new enhancement. And that’s for scalar code. For SIMD the scaling from 128-bit to 512-bit will result in even much better ratios. The effective GFLOPS/Watt is increasing spectacularly.
Display resolutions have been increasing, and 4K displays are coming. There is quite a lot of parallelism available to draw all of those pixels; you can just add wider SIMD/SIMT or put in more cores.
When you quantify what you call “quite a lot”, and its rate of adoption, it’s really rather bleak compared to the rate at which computing power is increasing. Also, people don’t just want more pixels, they want more realism per pixel. Meanwhile the total frame latency can’t increase, and should ideally be decreased. So the only way to achieve all that is to compute things faster on a per pixel / per thread level. If you add more ALUs to increase parallelism, a good chunk of that has to go to instruction level parallelism.
Another point of view:
Frames have to be drawn 60 times per second. With a 1.5 GHz clock rate that means there are 25 million clock cycles per frame. You can run quite complicated shader programs in those 25 million cycles, even with quite bad IPC, if you have a dedicated processor (or SIMD lane) for every pixel.
But then you’re wasting hardware by having an aggregate IPC much lower than 1, and you're ignoring locality issues due to running one thread per core. Obviously we don’t want to waste anything, so the choices are between out-of-order execution to increase the single-threaded IPC, or SMT to achieve an equivalent aggregate utilization. Again bandwidth and power consumption become your enemies as you scale up. You can only increase the number of cores up to the point where bandwidth is saturated at any level. And since you want the bandwidth further away from the cores to be as low as possible to minimize power consumption, the only way to have a high core count is to have a high locality of accesses. Out-of-order execution achieves a higher locality than SMT, so it will prevail.
 
Wrong. Just adding more parallelism will run you into the Bandwidth Wall. The GeForce 8800 GTX could do 518 GFLOPS and had 86.4 GB/s of bandwidth, while a GTX 680 can do 3090 GFLOPS with 192.25 GB/s. That's six times the computing power but just twice the bandwidth to feed it. That’s only possible because of larger on-die caches to feed the additional units, and lower instruction latencies and dual-issue to improve locality. In effect they had to increase per-thread IPC (or rather decrease the CPI).

Before you go extrapolating about increased IPC using Kepler as a base, consider Maxwell. Maxwell is basically Kepler *without* the new IPC and with a larger L2. How did it perform? Quite well, thank you, *especially* in compute benchmarks!

I'm somewhat limited in providing concrete data for this since pretty much all the compute benchmarks are OpenCL based. Given the state of Nvidia's OpenCL stack, it's not clear how much of the gains come from "Maxwell performs much better" as opposed to "Kepler's OpenCL support sucked, but Nvidia put a bit more work into it for Maxwell". Anyone have some Cuda benchmark comparisons?

What this really goes to show is that the IPC micro-architecture first introduced with GF104 was a dead end. At the very least, it shows that it's better to spend the space/power you would use for OOO or something and put it into more cache and/or more processors.
 
One of the (big!) problems that no one's mentioned WRT CPU-GPU unification is clock rate. For good serial performance you need a high clock rate. The problem is that power usage increases *quadratically* with clock rate. So if you take a 1 GHz GPU and scale it to 4 GHz you will use 16x the power! Normalizing performance (assume 1 GHz to 4 GHz makes it 4x faster, that the longer pipelines you need are offset by needing fewer threads), you still have 4x the power for an equal amount of compute!

[Image: power consumption vs. clock speed curve]
I don't think that's entirely the case. There's a constant, a scalar and a quadratic component to it, and while that makes it O(n²), the constant + scalar component actually dominates as long as you stay within the design space of the process. In fact, as you can see from the image, for most of the curve you can get twice the clock speed for much less than twice the power consumption. Also, overclocking a 3 GHz design to 5 GHz will result in worse power than if it had been designed for 5 GHz in the first place.

Last time I checked, the performance/Watt results for CPUs vs GPUs were not off by 4x despite the big difference in clock speed. AVX-512 should further diminish that gap, and we'd need AVX-1024 for the fairest comparison. Either way, with a CPU designed for 4 GHz the power consumption for the performance you get out of it isn't bad at all.

And then there are several ways to actually lower the frequency in case that would be necessary to achieve unification. Downclocking to keep the core within the power budget while executing wide SIMD code could improve performance/Watt if the voltage was also lowered. The same design techniques used for near-threshold voltage computing can keep things stable at a moderate voltage to optimize for performance/Watt. And while that would affect absolute scalar performance, it could ramp up again when no SIMD code has been executed for a while. In fact Intel is already balancing the power budget between the CPU and iGPU in a similar fashion. Alternatively we could have two 1024-bit SIMD clusters per core, but clock them at half the frequency of the scalar part. That would have the downside of impacting legacy SIMD code which doesn't take advantage of the full width, but maybe that's quite acceptable, especially when the two clusters would keep multi-threaded execution on par.

Also note that while GPU core frequencies are relatively low, the RAM can run at up to 6 GHz! For CPUs it's practically the reverse situation. It appears that future GPUs will use lower clocked stacked RAM but use wider interfaces, which might allow them to increase core frequency, whereas CPUs will switch to higher clocked (but also power efficient) DDR4 soon.
 
Before you go extrapolating about increased IPC using Kepler as a base, consider Maxwell. Maxwell is basically Kepler *without* the new IPC and with a larger L2. How did it perform? Quite well, thank you, *especially* in compute benchmarks!
"dependent arithmetic instruction latencies have been significantly reduced"
"scheduler still has the flexibility to dual-issue"

Kepler couldn't actually dual-issue 4-operand instructions, due to register port conflicts. Also, whether or not you can dual-issue depends on the instruction dependencies. So dual-issue rates probably weren't that great. Instead a significant reduction in latencies, and the ability to still dual-issue different instruction types, must result in greater single-threaded performance. Ergo, another small convergence step toward the CPU.
What this really goes to show is that the IPC micro-architecture first introduced with GF104 was a dead end. At the very least, it shows that it's better to spend the space/power you would use for OOO or something and put it into more cache and/or more processors.
It seems more like one step back to make two steps forward in single-threaded performance. You can't conclude from this that out-of-order execution won't ever be viable. Also note that you have to cross the valley in one go. We won't actually see the architectures converge to the same point before unification. Instead they'll have competitive performance and power characteristics and CPUs will suddenly be able to take on graphics computing. Still, GPUs clearly can't wander off to the other extreme and have to incrementally achieve better access locality and hit rates through lower latencies and larger caches.
 
I have done this manually with a profiler so many times on in-order PPC cores that I believe a monkey (= JIT compiler) could do most of it automatically.
The same results can be obtained with profile-guided optimization.
Not really. Profile guided optimization has two big limitations vs runtime recompilation:

1) The statistics are based on the run of one profiler instead of a specific user's inputs and actions
2) The statistics provide one static data point which is supposed to characterize a best case for all runs

And the more you try to generalize #1 the more it'll make #2 a smeared input.

Note that nVidia's approach can include recompiling the same code multiple times at runtime as conditions change.
Note that sebbbi was talking about doing it manually, and thus statically. Anyway, yes, in theory the same code can be recompiled multiple times. But that's not novel either. In fact proponents of JIT compilation keep bringing it up as a reason why JIT-compiled languages should be faster than statically compiled ones. In reality they're not. The hotspots of performance-critical applications are typically optimized by the developers themselves, already utilizing specialized paths for changing conditions. You can't really rely on a JIT compiler to do that for you. Where JIT actually can outperform static compilation is when the number of conditions to specialize for is too high to do it manually, but once again in reality JIT compilers are all too conservative about potentially spewing out a lot of code. I'm not expecting Denver to be any different.

With Reactor I'm generating exactly the specialized code that I want at run-time. So even for cases where there are lots of conditions it's feasible and preferable to have the developer guide the process.
 
Denver isn't comparable to a JITed VM for a language like Java because it's translating machine code, not unoptimized or poorly optimized bytecode. The hotspots of performance-critical applications will be optimized by developers exactly as much as they always have been. Maybe not targeting Denver's microarchitecture specifically, but on Android apps have very little in the way of uarch-specific optimization, not least of all because there's one binary per architecture without dispatching. While Denver is technically running a separate architecture in between, I'm sure it has been tuned to fit ARM well.
 
I won't claim we know all there is to know, but from a language design perspective there are clearly issues that no one who has been using them from day one really disagrees with. For instance, I think there was a panel at SIGGRAPH last week that pretty much uniformly agreed that shared memory was a bad idea.
Can you mention what the panelists had to say on shared memory? I.e., why did they dislike it? And what are your own views on shared memory?
 
Also note that while GPU core frequencies are relatively low, the RAM can run at up to 6 GHz!
For CPUs it's practically the reverse situation. It appears that future GPUs will use lower clocked stacked RAM but use wider interfaces, which might allow them to increase core frequency, whereas CPUs will switch to higher clocked (but also power efficient) DDR4 soon.

You mean 6 Gbps, right?
GDDR5 doesn't clock its bus to 6GHz. Since it is DDR, the data bus clock is half that.
The upper tiers of DDR4 may eventually bump into the lower range of GDDR5.

High bandwidth solutions may be the very wide HBM or ultra-short-reach HMC. The USR variant of HMC is targeting 10 Gbps. The next iteration of that standard is apparently aiming for 30 Gbps.
 