Looks like Core i7 does that. It seems like an obvious thing to do when facing the task of selecting an instruction from threads that are otherwise equivalent. And frankly I care little about what's done today and more about what could be done in the future.
All the descriptions I've seen indicate SMT mode has each thread take turns fetching instructions at the front end.
That's just one cycle of hidden latency for a branch that won't resolve for another 15 cycles (best case).
Branch prediction is still very useful because there can be other causes of latency (most notably cache misses) that force it to run speculative instructions anyway.
Branch prediction doesn't help with data cache misses, and an instruction cache miss is going to cause a stall either instantly or in a cycle if there's some kind of target instruction buffer.
In those cases it's good to know that you have a 99% chance you're still computing something useful. 4-way SMT would not suffice if you got rid of branch prediction altogether. But going to extremes to avoid any speculation means slowing down your threads to a crawl and introducing all sorts of other issues.
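For a rough sense of why a few SMT threads can't stand in for prediction on their own, here's a back-of-the-envelope sketch; the latency, fetch width, branch density, and the independence assumption are all mine, not figures from any real chip:

```python
# Back-of-the-envelope: with no branch prediction, a thread stalls its own
# fetch until each branch resolves. How busy can the shared front end stay as
# SMT threads are added? All numbers below are assumptions, not measurements.

RESOLVE_LATENCY = 15   # cycles from fetch to branch resolution (assumed)
FETCH_WIDTH = 4        # instructions delivered per fetch slot (assumed)
BRANCH_DENSITY = 0.1   # branches per instruction (assumed)

# Fraction of cycles a single unpredicted thread could actually use the fetcher.
branches_per_slot = FETCH_WIDTH * BRANCH_DENSITY
single_thread_ready = 1 / (1 + branches_per_slot * RESOLVE_LATENCY)

# Treat threads as independent (a simplification); a fetch slot is wasted only
# when every thread is stalled waiting on an unresolved branch.
for n in (1, 2, 4, 8, 16):
    utilization = 1 - (1 - single_thread_ready) ** n
    print(f"{n:2d} thread(s): front end busy ~{utilization:.0%} of cycles")
```

Under those assumptions, 4 threads keep the front end busy well under half the time; you need something closer to a barrel processor's thread count before prediction stops mattering.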
For a subset of the workloads out there, this is a worthwhile compromise.
Furthermore, single-threaded performance will remain important. Even the most thread-friendly application inevitably has sequential code, and you want your processor to execute that as fast as possible.
The point of conflict I see here is that the design compromises necessary for executing serial code as fast as possible impact execution in all non-serial cases and have knock-on effects on other parts of the system.
One of my former professors did some research on Characterizing the Branch Misprediction Penalty.
A rough approximation would be: branch density * misprediction rate * IPC * pipeline length. So let's say we have on average a branch every 10 instructions, a misprediction rate of 10%, an IPC of 2, and a pipeline length of 16. That's roughly 30% wasted work, for the worst-case scenario you'll find in practice.
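For concreteness, the same approximation worked through in a few lines (same assumed numbers as above, nothing measured):

```python
# Wasted work ≈ branch density * misprediction rate * IPC * pipeline length.

branch_density = 1 / 10   # one branch every 10 instructions
mispredict_rate = 0.10    # per branch
ipc = 2                   # instructions completed per cycle
pipeline_length = 16      # stages' worth of work flushed on a mispredict

wasted = branch_density * mispredict_rate * ipc * pipeline_length
print(f"wasted work ≈ {wasted:.0%}")   # ≈ 32%, i.e. roughly 30%
```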
The focus of that paper was on latency and performance impact, which is not exactly what I am focusing on.
As to the approximation you are using:
What is the mispredict rate you've chosen? Is it 10% chance of misprediction per individual branch, or the cumulative probability of a misprediction somewhere in a 64-instruction window with 6 branches in that range?
Successive branches compound: even with more reasonable misprediction rates per branch instruction, the cumulative misprediction probability can leave 30% of ROB entries uncommitted.
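To illustrate the compounding, a small sketch with assumed numbers (6 branches resident in a 64-instruction window, treated as independent predictions):

```python
# A modest per-branch miss rate still means a sizable chance that *something*
# in flight is down the wrong path.

branches_in_window = 6   # assumed: 6 branches in a 64-instruction window

for per_branch_miss in (0.02, 0.05, 0.10):
    p_any_miss = 1 - (1 - per_branch_miss) ** branches_in_window
    print(f"per-branch miss rate {per_branch_miss:.0%} -> "
          f"chance of a miss somewhere in the window {p_any_miss:.0%}")
```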
What do you mean by IPC in this case?
Barring an I-cache miss, a good 4-wide speculative processor is going to push close to 4 instructions through the front end of the pipeline every clock.
Referencing the stacked penalty model in that paper, the front-end penalty is going to be over three times higher than with the 5-stage front end chosen in the model.
100% of speculatively issued instructions go through the pipeline up to the final point of execution, where a misprediction is finally detected. In terms of ROB entries that do not commit, that's 30% of instructions passing through hardware that, as a percentage of the non-cache core area, comprises close to two-thirds or more of the active logic.
This is a fixed power cost 100% of the time.
Some as-yet-undetermined percentage of the wasted instructions will go so far as to execute and have their results pending in the ROB when they are squashed, depending on the situation. So there is the decode+schedule power that every speculated instruction draws, plus a certain amount of execution-unit consumption, for which I have no figures and which will vary with the operation type.
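To make that concrete, here's a toy energy split with placeholder weights; none of these fractions are measured, they just show how the front-end cost is paid by every speculated instruction while the execute cost is only paid by the fraction that gets that far:

```python
# Toy split: every speculated instruction pays the fetch/decode/rename/schedule
# energy, and only the portion that reaches the execution units before the
# flush also pays the execute energy. All weights are placeholders.

wasted_fraction = 0.30    # ROB entries that never commit (figure from above)
frontend_share = 0.6      # share of per-instruction dynamic energy (assumed)
execute_share = 0.4       # share spent in the execution units (assumed)

for reached_execute in (0.25, 0.5, 1.0):
    wasted_energy = wasted_fraction * (frontend_share
                                       + reached_execute * execute_share)
    print(f"{reached_execute:.0%} of squashed ops reach execute -> "
          f"~{wasted_energy:.0%} of dynamic energy thrown away")
```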
Loads and stores in i7 are very heavily speculated, more so than is already the case in other OoO chips.
Heck, even a 30% wastage didn't sound all that terrible to me. How much does a GPU waste by having a minimum batch size of 32 or 64?
As GPUs probably try to do, and as Larrabee's VPU has been documented as doing, a known-invalid lane is clock-gated.
That's different from running an instruction through the pipeline and not knowing it is invalid until the end.
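A toy software analogue of that difference (my example, not how any particular VPU implements it): with an up-front mask the dead lanes never do the work, whereas a misspeculated instruction does the work and is only found useless at the end.

```python
# Lanes with mask == False are known dead before execution, so the add is
# simply skipped for them -- the software equivalent of gating that lane.

def masked_vector_add(a, b, mask):
    return [x + y if m else None for x, y, m in zip(a, b, mask)]

print(masked_vector_add([1, 2, 3, 4], [10, 20, 30, 40],
                        [True, False, True, False]))
# -> [11, None, 33, None]
```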
Drawing a 10-pixel character in the distance takes as long as drawing a wall that covers the entire screen, because you have 90% of the chip sitting idle.
While not desirable, this is a situation where silicon is relatively cheap.
If all the chip is doing is drawing a wall or a 10-pixel character, whatever time it takes will be sufficiently fast for the purposes of the GPU's target market.
The figures given by OTOY are quite shocking. Also, try running Crysis with SwiftShader at medium quality at 1024x768. It may only run at 3 FPS, but that's merely a factor of 10 away from being smoothly playable, on a processor that has no texture samplers, no special function units, no scatter/gather abilities, only 128-bit vectors, and damned speculation...
That doesn't sound particularly compelling, in the face of other engineered solutions that aren't searching for a problem to solve.
So clearly there is such a thing as keeping too much data on chip: it's wrong, and not preferable. You should only keep as much data around as is necessary for hiding actual latency. Today's GPU architectures are far better in this respect than NV40, but they're still not ideal. I agree with Voxilla that they need a cached stack, both for allowing register spilling without decimating performance, and to have unlimited programmability.
Right, CPUs don't keep much data on chip, they just send it off to the 8-12 MB of cache--oh I see what we have here.