OOO hardware does not by any means schedule optimally! It effectively uses a crude greedy algorithm, which is likely worse than what a compiler could do offline.
Indeed, out-of-order execution in and of itself doesn't result in optimal scheduling. But it doesn't mean you can no longer schedule statically: out-of-order execution will further improve on any static schedule whenever latencies are not as predicted. In addition, register renaming allows static scheduling to ignore false register dependencies, which is something you can't do on an in-order architecture.
I believe that in theory out-of-order scheduling is optimal when you have infinite execution units and an infinite instruction window, for any order of the instructions and any latencies. For in-order execution this isn't true, even with infinite resources, so it's always at a disadvantage. And for all practical purposes, Haswell's eight execution ports (four arithmetic) and 192-entry instruction window get pretty close to optimal scheduling when you make a reasonable effort at static scheduling (e.g. trace scheduling).
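To make the static-scheduling side of this concrete, here's a minimal C sketch (the function and the assumed multi-cycle load-use latency are made up for illustration): unrolling with independent temporaries keeps several loads in flight on an in-order core, while an out-of-order core with register renaming recovers much of the same overlap even from the plain one-element-at-a-time loop.

```c
#include <stddef.h>

/* Naive version: on an in-order core each multiply waits for its own load,
 * so the (assumed) multi-cycle load-use latency is exposed every iteration.
 * An out-of-order core overlaps the iterations on its own. */
void scale_naive(float *dst, const float *src, size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] * 2.0f;
}

/* Statically scheduled version: four independent temporaries map to four
 * distinct registers, so four loads are in flight before the first multiply
 * needs its operand. This is the kind of schedule an in-order core depends
 * on the compiler (or the programmer) to produce. */
void scale_scheduled(float *dst, const float *src, size_t n) {
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        float t0 = src[i + 0];
        float t1 = src[i + 1];
        float t2 = src[i + 2];
        float t3 = src[i + 3];
        dst[i + 0] = t0 * 2.0f;
        dst[i + 1] = t1 * 2.0f;
        dst[i + 2] = t2 * 2.0f;
        dst[i + 3] = t3 * 2.0f;
    }
    for (; i < n; i++)                /* remainder */
        dst[i] = src[i] * 2.0f;
}
```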
So your downplaying of out-of-order execution isn't really justified. I'm not claiming that practical implementations are 100% optimal, but they get darn close, and in-order execution essentially can't compete with that. Mobile CPUs use out-of-order execution too now, and there's no turning back from that. With all due respect, NVIDIA's first attempt at designing its own CPU will probably be remembered in the same way as NV1: using fundamentally the wrong architecture.
Or you could do a JiT stage and schedule specifically for the processor in question. Instruction scheduling is cheap - so much so that it can be done dynamically at runtime millions of times for each instruction in a loop.
Scheduling is cheap in hardware, not in software. Out-of-order execution essentially uses CAMs which perform hundreds of compares per cycle. In software it’s so expensive some JIT compilers don’t (re)schedule, and those that do use cheap heuristics that aren't as thorough as out-of-order scheduling hardware.
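As an illustration of what such a heuristic looks like (a toy sketch, not any particular compiler's implementation): a greedy critical-path list scheduler, run on a made-up six-instruction dependence graph with invented latencies and a hypothetical two-wide in-order target.

```c
#include <stdio.h>

#define N      6      /* instructions in the toy dependence graph */
#define WIDTH  2      /* issue width of the hypothetical in-order target */

/* dep[i][j] != 0 means instruction j consumes the result of instruction i.
 * Toy graph: i2 = i0 * i1, i3 = i2 + c, i5 = i4 + 1 (i0, i1, i4 are loads). */
static const int dep[N][N] = {
    /* i0 */ {0, 0, 1, 0, 0, 0},
    /* i1 */ {0, 0, 1, 0, 0, 0},
    /* i2 */ {0, 0, 0, 1, 0, 0},
    /* i3 */ {0, 0, 0, 0, 0, 0},
    /* i4 */ {0, 0, 0, 0, 0, 1},
    /* i5 */ {0, 0, 0, 0, 0, 0},
};
static const int lat[N] = {4, 4, 3, 1, 4, 1};   /* result latencies */

/* Priority = longest latency path to the end of the graph (critical path). */
static int priority(int i) {
    int best = lat[i];
    for (int j = 0; j < N; j++)
        if (dep[i][j]) {
            int p = lat[i] + priority(j);
            if (p > best) best = p;
        }
    return best;
}

int main(void) {
    int ready_at[N] = {0};    /* cycle when each result becomes available */
    int done[N] = {0};
    int remaining = N;

    for (int cycle = 0; remaining > 0; cycle++) {
        for (int slot = 0; slot < WIDTH; slot++) {
            int pick = -1, best = -1;
            for (int i = 0; i < N; i++) {
                if (done[i]) continue;
                int ready = 1;
                for (int p = 0; p < N; p++)
                    if (dep[p][i] && (!done[p] || ready_at[p] > cycle))
                        ready = 0;   /* operand not available yet */
                if (ready && priority(i) > best) {
                    best = priority(i);
                    pick = i;
                }
            }
            if (pick < 0)
                break;               /* nothing ready this cycle */
            done[pick] = 1;
            ready_at[pick] = cycle + lat[pick];
            remaining--;
            printf("cycle %2d: issue i%d (latency %d)\n", cycle, pick, lat[pick]);
        }
    }
    return 0;
}
```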
Trying to improve on out-of-order execution is great, but it should use out-of-order execution as the starting point, not in-order execution. There’s some research on dual-stage scheduling which uses a slow big scheduler and a fast small scheduler to save on power while increasing the total scheduling window, but as far as I’m aware that hasn't been used in practice yet, probably due to direct improvements on single-stage scheduling. There are also ideas to switch to in-order execution when there’s good occupancy due to Hyper-Threading, but when you have lots of parallelism you probably want wide SIMD instead, and the cost of out-of-order execution gets amortized.
Who's writing in assembler and dealing with this? That's the job of the compiler optimizer / JiTter, and is done completely behind the scenes without the developer having to lift a finger.
First of all, you don't have to write in assembler to have to deal with scheduling for an in-order architecture. Secondly, compilers are written by developers too, and are in a constant state of flux. Scheduling is not a solved problem, and among all the other things expected of a compiler, it is hard to keep up to date. Also, again, JIT compilation has a very limited time budget for scheduling. This is a serious concern for run-time compiled shaders and compute kernels, which shifts some of the problem onto the application developer. Out-of-order execution makes things a lot easier for everybody: you can do a somewhat sloppy job and still get great results, and your legacy code runs faster without recompilation.
You can curse at developers for this, but the reality is that they have plenty of other issues on their minds, mostly high-level ones, and don't want to be bothered with low-level details. The increase in productivity you get from out-of-order execution shouldn't be underestimated, and it's a significant factor in its widespread success.
Also keep in mind that with SIMD you typically execute both branches.
...Which would completely defeat the purpose of branch prediction.
This isn't really true anyway - a well optimized algorithm can organize execution so that quite a bit of the data falls into uniform branches.
Yes, my argument is that you can have it either way, or both. If the branching is uniform and predictable and the conditional blocks are large, you can have a real jump instruction. If the branching is less uniform and the blocks are small, just execute both branches. Note that AVX-512's predicate masks improve upon the latter by letting the hardware skip instructions for uniformly non-taken branches. In theory it should be able to skip four such instructions per cycle, so the threshold for wanting an actual jump becomes quite high. On the other hand, if you're not sure about the branching behavior, it's still safe to use a real jump because prediction will help you out in 95% of the cases. You don't get that luxury with in-order, non-speculative execution.
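As a concrete sketch of the "both ways" point (assuming AVX-512F; the function and its data are invented for illustration): the conditional body runs under a predicate mask, and a cheap, well-predicted branch skips it entirely whenever the mask turns out to be all-zero for a vector.

```c
#include <immintrin.h>

/* Add a per-element bonus, but only to elements above a threshold.
 * Divergent vectors are handled by predication (masked add); uniformly
 * non-taken vectors are skipped with a real, easily predicted branch.
 * Tail elements (n not a multiple of 16) are omitted for brevity. */
void add_bonus(float *v, const float *bonus, int n) {
    const __m512 threshold = _mm512_set1_ps(100.0f);
    for (int i = 0; i + 16 <= n; i += 16) {
        __m512 x = _mm512_loadu_ps(v + i);
        __mmask16 m = _mm512_cmp_ps_mask(x, threshold, _CMP_GT_OQ);
        if (m == 0)
            continue;                        /* branch: nothing to do here */
        __m512 b = _mm512_loadu_ps(bonus + i);
        x = _mm512_mask_add_ps(x, m, x, b);  /* predication: only lanes in m */
        _mm512_storeu_ps(v + i, x);
    }
}
```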
First of all, you definitely can JIT-compile SIMD code on the CPU. That’s what I've always done.
You're missing the point. In order for the ISA to be tweaked, for instance, to add more registers, ALL programs must be JiT-compiled. Any that aren't will break! The CPU ecosystem is nowhere near this point.
That's not true. x86-64 doubled the number of registers without breaking backward compatibility for x86-32 applications. AVX-512 will double the SIMD register count without affecting any prior code. Also, when you compare different ISAs it's clear that it's not so critical in the first place. Micro-architectural features and backward-compatible extensions have a bigger effect on performance than any overhauls that would break compatibility.
Dynamic compilation is a valuable technique, but you don’t need/want it everywhere.
That's not completely true. Even at fairly high occupancy, a GPU has 32ish registers per thread. If your problem needs more immediate storage, you can lower the thread count and increase the register count per thread quite a bit - new Nvidia cards support up to 255 registers per thread!
The problem is that even though 32 registers might seem plenty compared to x86-64’s 16 registers, the CPU can leave several megabytes of data on the stack. Want a local array of 16k elements? No problem on the CPU, impossible on the GPU. Want to recurse the same function a few thousand times? No problem on the CPU, impossible on the GPU. Of course if the data is shared you could explicitly use shared memory and manage access to it, and you can convert recursion into a loop, but that’s putting a heavy burden on the developer.
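For the sake of illustration, two throwaway C functions showing what "no problem on the CPU" means in practice; both rely on nothing more exotic than the multi-megabyte per-thread stack.

```c
#include <stddef.h>

/* A 64 KB scratch array lives happily on the CPU stack; on a GPU this
 * either spills to slow local memory or wrecks occupancy. */
float local_array_example(const float *data, size_t n) {
    float scratch[16384];
    size_t m = n < 16384 ? n : 16384;
    float sum = 0.0f;
    for (size_t i = 0; i < m; i++)
        scratch[i] = data[i] * 0.5f;
    for (size_t i = 0; i < m; i++)
        sum += scratch[i];
    return sum;
}

/* Non-tail recursion a few thousand levels deep: each call keeps a small
 * frame alive, which a default CPU stack of a few megabytes absorbs
 * without trouble. */
float deep_recursion_example(const float *data, size_t n) {
    if (n == 0)
        return 0.0f;
    return data[0] + deep_recursion_example(data + 1, n - 1);
}
```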
Microsoft increased the register limit from 32 in Shader Model 3.0 to 4096 in Shader Model 4.0. SM3 is ancient nowadays. And while the relatively small bits of code the GPU is expected to run won't hit the 4096 limit that easily yet, anything in between results in lower occupancy.
There's a reason GPUs continue to increase their register sets or lower latencies every generation: it allows them to run more complex code without going back to the CPU. The end goal is to be able to start executing parallel code in the middle of sequential code, even when you already have tens of functions' stack frames on the stack. That's only going to be achieved by a unified architecture.
Of course, this does decrease the amount of hyperthreading you have available, but the important part is that the developer gets to tune it themselves!
You don't really get to choose how many local variables you need, or how deep your stack gets. For GPUs to have a flourishing software ecosystem you have to be able to call other people's GPU code libraries. For this they have to be able to park stack frames in generic memory instead of precious registers, and they have to lower the thread count so that this data will still be in the caches by the time the call returns. And to achieve that you need out-of-order execution.
Prefetching costs power and bandwidth and pollutes the cache if it guesses wrong…
If and when it guesses wrong, yes. But automatic prefetching is quite conservative (it really needs a strong pattern), and if it predicts wrongly it immediately aborts. This can be tuned as desired for the best performance/Watt. Not prefetching results in stalls or in switching threads, and the latter also ‘pollutes’ the cache for the first thread. Research shows that GPUs should prefetch too, to improve power efficiency.
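To illustrate what "a strong pattern" means (the functions are made up; the hardware prefetcher itself is invisible to source code): a constant-stride stream is exactly what a prefetcher locks onto after a few misses, while pointer chasing gives it nothing to predict, so every miss becomes a stall or a thread switch.

```c
#include <stddef.h>

/* Constant stride: after a handful of misses the hardware prefetcher
 * recognizes the stream and later cache lines arrive ahead of use. */
float sum_array(const float *a, size_t n) {
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Pointer chasing: the next address is only known once the current load
 * completes, so there is no pattern to detect and nothing to prefetch. */
struct node { float value; struct node *next; };

float sum_list(const struct node *p) {
    float s = 0.0f;
    for (; p != NULL; p = p->next)
        s += p->value;
    return s;
}
```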
The boost number I've heard bandied about for OOOe is 30%. Keep in mind that this is for a superscalar architecture - for something like a GPU, it would be considerably less since there are fewer instructions in flight, and since it has hyperthreading (the 30% was referring to an ARM architecture with no hyperthreading). I suspect it would end up around 10% or less. Is that worth the increase in die size and power use? If it was, I'm sure Nvidia and AMD would be using it right now.
First of all, the speedup from out-of-order execution greatly depends on the rest of the micro-architecture, especially the execution width. In-order architectures simply don't come as wide as out-of-order ones, so it's hard to compare. Secondly, don't trust a single result: how much you gain obviously depends greatly on the software, the data set, and the usage pattern.
The Atom Z3770 is well over twice as fast as the Z2760. Even if it was just 30% when eliminating all other differences, that's not something anyone is willing to give up. So unification will require the scalar part to use out-of-order execution. Strictly speaking the SIMD part could use in-order execution, but since the cost is amortized over a great vector width, you might as well stick with out-of-order execution. Alternatively it could use less aggressive out-of-order execution. In fact that's exactly what I'm proposing by executing AVX-1024 on 512-bit units: it takes two cycles to issue each instruction, so you can have two SIMD clusters and a single scheduler alternating between them. Of course that's the same cost as issuing them on 1024-bit units each cycle when you have just one SIMD cluster, but that wouldn't come with the latency-hiding benefits. Also, if you pin each of the two 512-bit clusters to a single thread, you can really use out-of-order execution to boost utilization.