It mispredicts 1% of the time. The 30% comes from a single-threaded core running down the wrong path for the length of the pipeline after a mispredicted branch.
That was my bad wording. It should have said that 30% of the work is wasted across both threads, which I interpret as meaning that on average about 1/3 of all ROB entries are not committed.
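To pin down what's being counted, here's a minimal sketch in Python with placeholder numbers of my own (not measurements): if a misprediction occurs every C committed instructions and flushes W wrong-path instructions from the window, the non-committed fraction is roughly W / (W + C).

# Placeholder figures, purely illustrative of the mechanism.
branch_density  = 0.20   # assume 1 in 5 instructions is a branch
mispredict_rate = 0.01   # "mispredicts 1% of the time" (of branches)
wrong_path_work = 64     # wrong-path instructions in flight when the branch resolves

committed_between = 1 / (branch_density * mispredict_rate)                 # ~500
wasted_fraction = wrong_path_work / (wrong_path_work + committed_between)
print(f"wasted ~= {wasted_fraction:.0%}")   # ~11% with these placeholders

Branchier code, higher misprediction rates, or a fuller window at resolution push that fraction up toward the kind of figure being discussed here.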
Anyway, what you appear to be missing is that the misprediction penalty is lower with SMT because each thread doesn't fetch as far ahead, while the total number of mispredictions stays the same.
Nehalem still fetches pretty far ahead, up to 64 instructions ahead per thread in SMT mode. The decode rate per thread would be about half, which I can see providing some benefit by buying time for a branch that only needs a small window to resolve.
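As a minimal sketch of that trade-off (figures assumed for illustration, not measured): splitting the window halves the worst-case wrong-path work flushed per misprediction, while the number of mispredictions stays the same.

# Assumed figures, illustration only.
window_single = 128   # entries visible to one thread without SMT
window_smt    = 64    # per-thread share in SMT mode ("up to 64 ahead")
mispredicts   = 10    # per some fixed amount of committed work; same either way

print(mispredicts * window_single)   # 1280 wrong-path slots flushed, worst case
print(mispredicts * window_smt)      # 640 per thread, worst case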
Speculation can even be zero: let's say you have two threads where consecutive branches are at least 64 instructions apart.
That seems like a pretty restrictive example, and not one a silicon designer can count on.
Then whenever a branch is encountered, the CPU can switch to the other thread to avoid executing speculative instructions.
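As a sketch of what that proposal looks like in the abstract (an illustration of the idea only, not how any shipping SMT processor schedules), in Python:

# Toy switch-on-branch scheduler over two made-up instruction streams.
streams = {
    "T0": ["add", "mul", "br", "load", "add", "br"],
    "T1": ["load", "add", "sub", "br", "mul", "add"],
}
pcs = {"T0": 0, "T1": 0}
current = "T0"
trace = []
while any(pcs[t] < len(streams[t]) for t in streams):
    if pcs[current] >= len(streams[current]):
        current = "T1" if current == "T0" else "T0"   # this stream is finished
        continue
    op = streams[current][pcs[current]]
    trace.append((current, op))
    pcs[current] += 1
    if op == "br":
        # Branch seen: yield to the other thread instead of fetching past it,
        # so nothing issued after the branch is speculative.
        current = "T1" if current == "T0" else "T0"
print(trace)

The idea in the quoted example is that with branches at least 64 instructions apart, the other thread always has enough non-speculative work to cover branch resolution.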
Which processors with SMT do that?
It's an awfully GPU-like thing to switch on a branch, if we also set aside for a moment that GPUs, from the POV of the warp or wavefront abstraction, actually switch on sets of 32 or 64 branches at a time.
Larrabee's raster method will branch in granularities of at least 16 lanes to start with (or rather, the fiber pretends to).
Switching at the branch also negates the point of branch prediction: putting the thread to sleep there means halting its instruction fetch, and keeping fetch going past branches is why we predict them in the first place.
Also, because the IPC isn't constantly 4, the situation is actually even better. So my expectation is that 4-way SMT should be sufficient to make branch misprediction no longer a significant issue. This is confirmed by research. There is no need to switch threads every cycle like GPUs do; that's a waste of context storage.
The paper indicates that SMT leaves the overall CPI insensitive to branch prediction accuracy (most of the figures are normalized to a YAGS scheme).
The discussion of a branch misprediction as a long-latency event was not particularly in-depth, and their treatment of the components of that penalty is unclear to me.
It's not particularly helpful in teasing out information on my point of emphasis: the amount of work an aggressively speculating OoO chip wastes.
I'd need baseline numbers that don't start with an SMT chip.
Going by the general observation of 30% overall wastage for an OoO chip (probably an oldish figure, but likely close), how much lower is the wastage with SMT?
Relative insensitivity is not interesting to me, if it means little change from an already wasteful baseline.
The register space of GPUs has grown ever since we started calling them GPUs. If you want to run generic code, it has to increase even further. Developers go to great lengths to ensure that their algorithms run in a single 'pass' to avoid wasting precious bandwidth. So more intermediate data has to stay on chip. Like I said before, running out of register space is disastrous for performance.
Register pressure is a constant threat; in some high-occupancy cases we might see a minuscule 8 registers per thread, which CPU coders will tell you is a disaster...
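To put a rough number on that (the figures below are hypothetical, chosen only to be in a plausible order of magnitude for a GPU multiprocessor), here's the basic occupancy arithmetic in Python:

# Hypothetical register file and thread limits; the point is the trade-off,
# not any particular chip's specs.
regfile_entries = 16384   # 32-bit registers shared by all resident threads
regs_per_thread = 8       # the "minuscule 8 registers" case
max_threads     = 1024    # hardware cap on resident threads

threads_that_fit = regfile_entries // regs_per_thread   # 2048
resident = min(max_threads, threads_that_fit)            # capped at 1024
print(resident)   # full occupancy, but only by starving each thread of registers

Raise regs_per_thread to 32 and only 512 threads fit, so occupancy halves; that's the sense in which running out of register space hurts.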
But the intermediate data only has to stay on chip for as long as the final result isn't ready. A GPU takes a comparatively long time to finish computing. Even if the next instruction in a thread could start the very next cycle, a GPU will first execute the current instruction for every other thread. So it has to keep a lot more data around. The solution is to adopt more CPU-like features...
There are problems with the way GPUs work, but I don't see anything wrong with simply having a lot of data on chip. It's actually preferable in many ways.
FQuake is not using the remaining 20% because of dynamic load balancing granularity and because primitive processing is single-threaded. So I think my math is accurate. Computationally intensive code reaches an IPC well above 1 on modern architectures.
That doesn't help the rest of the code that needs to be run, which silicon has no choice but to also cater to.
So you'd rather have a chip that is twice as large and idle half the time than a chip that has high utilization but uses a bit of speculation?
All else being equal, no.
The chip that's twice as large can handle an order of magnitude more peak resources, so it's not all equal.
A little speculation on 10 times as many units is a lot more expensive.
Utilization of total resources without knowing the total is not enough to make a judgement call.
Current CPU architectures are not optimal, but neither are current GPU architectures. Both have something to learn from each other. Larrabee's got most of it right, but it's not the end of the convergence.
Larrabee is a garbage desktop CPU. Assuming it hits 2.0 GHz, to most consumer applications it will appear as a 250 MHz Pentium.
One core and 1/4 threading for single-threaded apps, and single-issue for everything not using its vector instructions.
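For what it's worth, the arithmetic behind that figure, with the assumptions spelled out (in Python; counting the Pentium as dual-issue is my own assumption about the comparison point):

# Back-of-envelope for the "250 MHz Pentium" comparison; inputs are assumptions.
clock_ghz        = 2.0   # "assuming it hits 2.0 GHz"
threads_per_core = 4     # a single-threaded app gets 1/4 of one core's slots
issue_penalty    = 2     # scalar code is single-issue vs. a dual-issue Pentium

print(clock_ghz * 1000 / threads_per_core / issue_penalty)   # 250.0 MHz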
First you claim IPC is around 1, and now you're saying it's so high that SMT can't yield a 30% improvement?
Yes, if resource contention is a problem or there are no unused instruction slots, SMT can't manufacture extra performance.
Developers already cringe at the idea of having to split up workloads over 8 threads to get high performance out of a Core i7. So how could they possibly consider running things on a GPU with hundreds of threads, other than graphics?
I'm not sure they'd want to.
I don't particularly care if they don't apply it to anything other than ridiculously parallel tasks.
Scheduling long-running tasks is not efficient, and a lot of them don't work on parallel data. Larrabee will be far better at running a wide variety of workloads, and CPUs are increasing the number of cores and widening the SIMD units to catch up...
Larrabee will be far better than GPUs at certain loads, yes.
Better than crap isn't necessarily good.