Using an in-order core to perform out-of-order code search.

wildcardx

Newcomer
Is it possible to use an in-order core to perform the task of finding code that can be run out of order at the hardware level, or even to use it to run software that searches for such code and feeds an in-order core (IOC)?

Why or why not?
And if so what are the limitations for each method?
 
The time taken up by such an algorithm run in software would probably cancel out any performance advantage. A more reasonable alternative would be a good compiler that reorders the code so that instruction latencies are not so poorly hidden. Most any modern compiler does this, though if we'd never switched to out-of-order execution (with the P6), compilers might have gone farther on this front. Also, reordered instructions alone will help hide small stalls, but not big ones. OOO cores don't take code into the issue window out of order; instead they continue to issue non-dependent instructions when stalls occur, so that instructions *complete* out of order (something that software OOOE won't really be able to do). Reordering instructions to reduce consecutive data dependencies is therefore still necessary.
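A minimal sketch of what such compiler reordering buys on an in-order core. This is an illustration I'm adding, not code from the thread: the two functions compute the same sum, but the second breaks the chain of consecutive data dependencies so an in-order pipeline can overlap the adds.

```cpp
#include <cassert>

// Serial version: every add depends on the result of the previous one,
// so an in-order core stalls for the full add latency each iteration.
float sum_serial(const float* a, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; ++i) s += a[i];
    return s;
}

// Interleaved version: two independent accumulators, so a second add can
// issue while the first is still in the pipeline -- the kind of
// rescheduling a good compiler does for you.
float sum_interleaved(const float* a, int n) {
    float s0 = 0.0f, s1 = 0.0f;
    int i = 0;
    for (; i + 1 < n; i += 2) {
        s0 += a[i];      // independent of the s1 chain
        s1 += a[i + 1];  // independent of the s0 chain
    }
    for (; i < n; ++i) s0 += a[i];  // leftover element, if n is odd
    return s0 + s1;
}
```

Note that this only shortens per-iteration dependency chains; as the post says, it hides small stalls (ALU latency), not big ones (a cache miss still stalls everything).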

Moreover, you're talking about something that happens at the machine code level. It'd be damn annoying to manually write something that has to decode instructions and determine the dependencies between them, which in general is not so different from emulating the processor you're running on anyway.
 
Alpha_Spartan said:
With a good compiler, it's not necessary.
Compiler decisions are static, which just cannot be as good as doing it dynamically at runtime.
Consider data-dependent branches. A branch hint (which is the best a compiler can do anyway) that works well for one data set might be counterproductive for another. Further, while conditional branch outcomes are more often than not coherent, the path you're going to take most of the time may change over the lifetime of the application.

E.g.
Code:
while (1)
{
   float delta_t = time_since_last_frame();
   if (bullet_time_key_down()) delta_t *= 0.25f;
   world->animate(delta_t);
}
The outcome of this branch doesn't just switch erratically. You'll much more likely have dozens to hundreds of frames in a row where it's taken, and then another run of many frames where it's not taken.
A dynamic branch predictor as implemented in current x86 CPUs takes just a few mispredicts to completely turn around and predict this branch correctly again. A static compiler "best guess" can't adapt at all.
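A sketch of how quickly that turnaround happens, using the classic 2-bit saturating-counter scheme (real x86 predictors are far more elaborate, so take this as a simplified model I'm adding for illustration). Simulating the bullet-time branch as long runs of taken frames alternating with long runs of not-taken frames shows only a couple of mispredicts per run switch:

```cpp
#include <cassert>

// Classic 2-bit saturating counter: values 0..3, predict "taken" when >= 2.
// It takes two consecutive wrong outcomes to flip the prediction.
struct TwoBitPredictor {
    int counter = 1;  // start weakly not-taken
    bool predict() const { return counter >= 2; }
    void update(bool taken) {
        if (taken) { if (counter < 3) ++counter; }
        else       { if (counter > 0) --counter; }
    }
};

// Model the bullet-time branch: alternating long runs of taken / not-taken.
int count_mispredicts(int run_length, int runs) {
    TwoBitPredictor p;
    int mispredicts = 0;
    for (int r = 0; r < runs; ++r) {
        bool taken = (r % 2 == 0);  // whole run has the same outcome
        for (int i = 0; i < run_length; ++i) {
            if (p.predict() != taken) ++mispredicts;
            p.update(taken);
        }
    }
    return mispredicts;
}
```

With runs of 100 frames, four runs cost only a handful of mispredicts out of 400 branches; a static compiler guess would be wrong for half of them.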

Or take even a cache hit vs. a cache miss on a simple memory access. You cannot statically schedule such things in an optimal fashion.
Alpha_Spartan said:
Secondly, virtual machines have a lot of overhead. That would do more harm than good.
Depends on what kind of virtual machine you're referring to.
E.g. x86-on-x86-type VMs have zero overhead for executing garden-variety code. User-level instructions can just execute as is. There is significant overhead, though, if the hosted code hits something that needs to be emulated (privileged resources such as hardware registers, descriptors, etc.).
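A toy sketch of that split, which I'm adding as an illustration (the opcodes and handler are entirely made up): ordinary user-level operations run on the cheap direct path, while privileged operations divert to an emulation routine that carries the extra bookkeeping cost.

```cpp
#include <cassert>
#include <vector>

// Hypothetical two-opcode "ISA": ADD is a garden-variety user-level op,
// LOAD_CR stands in for a privileged control-register read that a VM
// cannot let run natively and must emulate instead.
enum Op { ADD, LOAD_CR };

struct ToyVM {
    long acc = 0;
    long emulated_ops = 0;  // how many times we took the slow trap path

    void run(const std::vector<Op>& code) {
        for (Op op : code) {
            switch (op) {
                case ADD:     acc += 1; break;                  // direct execution, no VM overhead
                case LOAD_CR: acc += emulate_load_cr(); break;  // trap-and-emulate path
            }
        }
    }

    long emulate_load_cr() {
        ++emulated_ops;  // in a real VM: decode, check privileges, fake the register
        return 100;      // made-up value the "control register" returns
    }
};
```

The point of the post in code form: if the instruction mix is almost all ADD-like ops, the overhead counter stays near zero regardless of how expensive the emulation path is.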
 
zeckensack said:
A branch hint (which is the best a compiler can do anyway) that works well for one data set might be counterproductive for another.
If a branch hint can be dynamic, and if a compiler has enough instructions to schedule ahead to make the hint effective, you have a win-win situation; the hint is never counterproductive.
 
nAo said:
zeckensack said:
A branch hint (which is the best a compiler can do anyway) that works well for one data set might be counterproductive for another.
If a branch hint can be dynamic, and if a compiler has enough instructions to schedule ahead to make the hint effective, you have a win-win situation; the hint is never counterproductive.
I'm not sure I understand what you're saying.
I only know of branch hints that are tied to a dynamic branch instruction. When the machine encounters the instruction, it continues issuing the instructions on the most likely branch, and (later) rolls back if that turned out wrong. I don't know of any machines that a) leave any window for further instructions between a branch hint and the corresponding branch, or b) allow the hints themselves to change over time (without resorting to self-modifying code).

Of course you can combine compiler hints with dynamic hardware predictors, to the effect that the hint given by the compiler stacks with the predictor's current idea of the branch outcome. In that context I agree that there wouldn't be much of a misprediction problem. But then, that's not an entirely static model anymore. IMO the adaptive portion is much more important than the compiler hint. It doesn't hurt all that much to omit the hint if you have an adaptive hardware predictor.
 
zeckensack said:
nAo said:
zeckensack said:
A branch hint (which is the best a compiler can do anyway) that works well for one data set might be counterproductive for another.
If a branch hint can be dynamic, and if a compiler has enough instructions to schedule ahead to make the hint effective, you have a win-win situation; the hint is never counterproductive.
I'm not sure I understand what you're saying.
I only know of branch hints that are tied to a dynamic branch instruction. When the machine encounters the instruction, it continues issuing the instructions on the most likely branch, and (later) rolls back if that turned out wrong. I don't know of any machines that a) leave any window for further instructions between a branch hint and the corresponding branch, or b) allow the hints themselves to change over time (without resorting to self-modifying code).

Of course you can combine compiler hints with dynamic hardware predictors, to the effect that the hint given by the compiler stacks with the predictor's current idea of the branch outcome. In that context I agree that there wouldn't be much of a misprediction problem. But then, that's not an entirely static model anymore. IMO the adaptive portion is much more important than the compiler hint. It doesn't hurt all that much to omit the hint if you have an adaptive hardware predictor.


He's suggesting that the processor has a "hint" instruction that can be used before the branch to minimise the mispredict penalty.

I think branch prediction is actually the lesser of the evils for an in-order core. Hiding cache and instruction latency is a much larger issue, and it isn't necessarily trivial. The mechanisms are understood, and I keep hearing "but a good compiler....", but I have yet to see a compiler do anything more than make a half-assed attempt at it.

By that definition of "good compiler" all compilers suck.

Look at how long VCL takes to hide 4-cycle latency in trivial VU programs, and it still doesn't deal with the general case; put a branch inside a loop and it simply fails.
 
zeckensack said:
I'm not sure I understand what you're saying.
I only know of branch hints that are tied to a dynamic branch instruction.
I was talking about a hint instruction which takes dynamic values (registers, of course ;) ) as arguments, so it can know in advance whether a branch will be taken or not.
CELL's SPE processors should employ a mechanism like that.
 
ERP said:
By that definition of "good compiler" all compilers suck.
I admit I have zero experience writing compilers, but it doesn't seem to me an insurmountable problem to take advantage of the SPE's hint instruction. Am I asking too much? :)
Look at how long VCL takes to hide 4-cycle latency in trivial VU programs, and it still doesn't deal with the general case; put a branch inside a loop and it simply fails.
Eheh..it takes AGES..I know :(
Even if branchy loops produce sucky code, it's not entirely VCL's fault.. not much better can be done once you have a loop filled with instructions from iterations n-1, n, n+1 and so on..
I try to avoid branches in my loops by designing all my code to be composed of many branchless passes.
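A hypothetical sketch of what such a branchless pass can look like, tied to the lighting example discussed later in the thread (the `ComputeFrontLighting`/`ComputeBackLighting` placeholders are mine, not real routines): rather than branching per element on the sign of fSide, compute both results and blend them with a 0/1 mask derived from the comparison.

```cpp
#include <cassert>

// Placeholder lighting routines, simplified to scalars for illustration.
float ComputeFrontLighting(float x) { return x * 2.0f; }
float ComputeBackLighting(float x)  { return x * 0.5f; }

// Branchless select: the comparison produces a 0.0/1.0 mask, and the two
// candidate results are blended arithmetically. A compiler will typically
// turn this into a compare + select, not a conditional jump, so there is
// nothing for a branch predictor to mispredict.
float shade(float fSide, float x) {
    float mask = (fSide >= 0.0f) ? 1.0f : 0.0f;
    return mask * ComputeFrontLighting(x) + (1.0f - mask) * ComputeBackLighting(x);
}
```

The trade-off: both paths are always computed, so this only wins when each path is cheap relative to a mispredict (or, on a VU/SPE-style core, when it lets the scheduler fill the pipeline with independent work).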
 
It probably just does an exhaustive search.

They need to start with something like CoSy this time, not gcc.
 
nAo said:
MfA said:
It probably just does an exhaustive search.
It does
They need to start with something like CoSy this time, not gcc.
Ehhe..VCL is just an advanced preprocessor anyway :?

The only reason I brought up VCL is to demonstrate that it's not a trivial problem, despite first appearances. It has simple scheduling rules and short programs. Yet it can't solve the general case, and it takes forever to generate close-to-optimal solutions.
 
As long as the compiler can run in a distributed mode you can always just buy a cluster to speed up compilation :)
 
nAo said:
zeckensack said:
I'm not sure I understand what you're saying.
I only know of branch hints that are tied to a dynamic branch instruction.
I was talking about a hint instruction which takes dynamic values (registers, of course ;) ) as arguments, so it can know in advance whether a branch will be taken or not.
CELL's SPE processors should employ a mechanism like that.
I still don't get it. To take the correct path, the branch predicate must be known. Likewise, to give the correct hint, the same (?) predicate must be known. I don't understand how that's beneficial. You're just introducing a second dependency on the branch condition. It must be computed anyway for the branch.

Can you whip up some (pseudo) code example to shed some light on the situation?
 
nAo seems to be mixing up split branches with branch hints; we don't know whether there will be dynamic branch hints, AFAIK. I'm not sure a dynamic branch hint would even be useful.

What we do know is that it uses split branches (just like PowerPC). So as long as you have enough work to do after you know the predicate, you can simply use an unconditional branch; no branch prediction necessary.

EDIT: Actually, reading the ISSCC paper, they say "it inserts branch hint instructions to identify branches and load the probable targets into the SMBTB". Hmmm, that can mean a lot of things. On the face of it, it sounds like it isn't really a split-branch architecture as such, but instead has an instruction which both hints (probably statically) which way the next branch will go and what its target is. Since the target is dynamic, though, you would be able to use that type of instruction for a dynamic hint ... use "branch taken" as the static hint, and if at run time you think it won't be taken, store the address of the instruction after the branch in the BTB (assuming it's smart enough not to rewind even though the direction hint was wrong).

Anyway, that is all irrelevant; the important thing is that if you have work which can be done between the predicate being known and the branch, then you will never waste time ... unlike with, say, the x86 (where you are at the mercy of the branch prediction).

Oh, BTW, on the example ... there's no way the history for that branch would still be available after rendering an entire frame :)
 
zeckensack said:
I still don't get it. To take the correct path, the branch predicate must be known. Likewise, to give the correct hint, the same (?) predicate must be known.
Right
I don't understand how that's beneficial. You're just introducing a second dependency on the branch condition. It must be computed anyway for the branch.
To tell the truth, the two branch predicates are one and the same, of course.
It's beneficial to dynamically tell the CPU which path will be taken if you can do it many cycles before that path is taken, and you have some useful work to schedule between the hint and its associated branch.
This way one can move the branch 'misprediction' penalty earlier, so it can be hidden if we have some useful work to do.
Can you whip up some (pseudo) code example to shed some light on the situation?
Sure:
//////////////////////////////////////////////////////////////
float fSide = Dot3(vLocalEye, vLocalNormal);

// Hint can be placed here..
// A branch will be taken in the future according to fSide's value

// All these matrix-vector muls can be scheduled after the hint and before the branch
vEye    = mLocal2World * vLocalEye;
vNormal = mLocal2World * vLocalNormal;

// Here we have a branch..
if (fSide >= 0.0f) vColor = ComputeFrontLighting(vEye, vNormal, vLight);
else               vColor = ComputeBackLighting(vEye, vNormal, vLight);
//////////////////////////////////////////////////////////////

If the vEye and vNormal computation comprises enough instructions to hide our processor's pipeline length (+ hint instruction latency), we'd have no branch penalty at all, even if fSide is a completely random value.
I think no BPU can beat that :)
 