GPUs and Branch Delay Slot?

I've seen discussion here about how GPU branching works, and is it accurate to say that it's a form of the branch delay slot technique? Except the major difference is that instead of being filled at compile time, the delay slots are filled with instructions from a different batch (i.e. executing a second batch until the first batch's branches have exited the pipeline)?

Now I also know that the GPU manufacturers don't publish much documentation like this, but does anyone know of any papers describing something similar to this?
 
I've seen discussion here about how GPU branching works, and is it accurate to say that it's a form of the branch delay slot technique? Except the major difference is that instead of being filled at compile time, the delay slots are filled with instructions from a different batch (i.e. executing a second batch until the first batch's branches have exited the pipeline)?
You kind of answered your own question. ;) It's not really accurate to say it's a form of branch delay slots. That technique executes instructions from the same thread to hide the latency of the branch instruction. So indeed there's an important difference: delay slots have to be taken into account at compile time. A GPU doesn't execute instructions from the same thread/batch before taking the branch. Branch delay slots frequently require several NOPs, while GPUs keep the execution units maximally utilized.
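To make the contrast concrete, here's a toy issue trace in plain Python (the two-cycle branch latency, register names, and batch contents are all invented for illustration, not taken from any real GPU). With delay slots, the instructions after the branch come from the same thread and were placed there by the compiler; the GPU-style approach instead issues instructions from a different batch until the branch resolves.

```python
# Toy issue trace contrasting delay slots with batch switching.
# The 2-cycle branch latency is an invented, illustrative number.
BRANCH_LATENCY = 2

# Independent work from a *different* batch that a GPU-style scheduler can issue.
other_batch = ["B: mad r0, r1", "B: mad r2, r3"]

def delay_slot_trace():
    # Delay slots: the instructions after the branch belong to the SAME thread
    # and were placed there by the compiler; if it found nothing useful, a nop.
    return ["A: mul r3, r4",
            "A: beq r5, L1",
            "A: add r1, r2   # delay slot, filled at compile time",
            "A: nop          # compiler found nothing else to move here",
            "A: L1: sub r6, r7"]

def batch_switch_trace():
    # GPU-style: while A's branch resolves, issue instructions from batch B.
    trace = ["A: mul r3, r4", "A: beq r5, L1"]
    trace += other_batch[:BRANCH_LATENCY]          # hardware picks another batch
    trace.append("A: L1: sub r6, r7   # A resumes once the branch has resolved")
    return trace

print("\n".join(delay_slot_trace()))
print("---")
print("\n".join(batch_switch_trace()))
```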
Now I also know that the GPU manufacturers don't publish much documentation like this, but does anyone know of any papers describing something similar to this?
You could try looking for patents, although that requires decrypting skills. However the Sun Niagara CPU executes instructions from another thread to hide memory and branching latency, and might be better documented.
 
It's not really accurate to say it's a form of branch delay slots. That technique executes instructions from the same thread to hide the latency of the branch instruction.

Does the source of the instruction really matter in whether or not it is a branch delay slot technique? At least to me the intent is merely to occupy the slots with useful instructions. The way I've seen this taught is static scheduling by the compiler, but does it stop being branch delay slots if it's done dynamically, either through instruction re-ordering or, more simply, with instructions from another thread?

Which I guess once again answers one of my initial questions about whether or not it is a form of branch delay slots. And so is this effectively the method used in GPUs?

You could try looking for patents, although that requires decrypting skills. However the Sun Niagara CPU executes instructions from another thread to hide memory and branching latency, and might be better documented.

I've only started looking into it, but they don't seem to use the terminology "branch delay slot". If there is another name for the dynamic version that I'm not aware of, I certainly wouldn't mind hearing it :).
 
Does the source of the instruction really matter in whether or not it is a branch delay slot technique? At least to me the intent is merely to occupy the slots with useful instructions. The way I've seen this taught is static scheduling by the compiler, but does it stop being branch delay slots if it's done dynamically, either through instruction re-ordering or, more simply, with instructions from another thread?
It matters a lot what instruction is executed during the branch delay. For an out-of-order CPU architecture a significant amount of extra logic is required. Not just for branch prediction but also to be able to invalidate results from mispredicted branches and properly handle exceptions and interrupts. Delay slots were born out of the observation that the pipeline doesn't have to be stalled while executing a branch instruction, if the compiler can schedule some extra instructions. But it's very architecture dependent. You can't execute code compiled for three delay slots on a CPU with two delay slots, and it doesn't scale well to superpipelined architectures.

GPUs don't rely on the compiler to hide branching latency. Neither do they have much trouble with somewhat longer pipelines. And the architecture is totally different. Delay slots cost practically no extra logic (in fact it could be less, because the pipeline doesn't have to be stalled). Out-of-order architectures put a lot of effort into finding more instructions to execute. And GPUs require extra logic to manage the threads and their data. So although the intent is the same, the implementation is totally different.
Which I guess once again answers one of my initial questions about whether or not it is a form of branch delay slots. And so is this effectively the method used in GPUs?
Just to be explicit: No. Delay slots are for free, while it takes considerable effort to hide branching (and other) delays in a GPU.
I've only started looking into it, but they don't seem to use the terminology "branch delay slot". If there is another name for the dynamic version that I'm not aware of, I certainly wouldn't mind hearing it :).
They don't use that term because that's not the technique they use. But I don't know if the technique(s) they do use have any particular name. The terminology of CPU technology doesn't really apply. GPUs only started to support branching a couple of years ago, so the detailed information is really still in the patents or just secret. There are even significant differences between generations like G70 and G80...

So, what kind of information are you really looking for? Any specific reason or just curiosity? ;)
 
I suggest you look at real SMT threading (Alpha 21464, Sun Niagara, and some academic processors) and the way the Tera machine was designed (and even back to the HEP, but that's really old school).

Basically, if you assume a threading model, a branch is just a type of stall, much like memory read/write or other dependency stalls. If you have a scoreboard, you can use it to match available resources to runnable threads and find something else to do until the stall can be resolved. (Yes, this is a massive simplification...) Modern GPUs run LOTS of threads per processor, effectively overcommitting the processors so they almost always have something else they can do on a stall. Where it gets complicated is that threads are issued in batches, so it's really managing the resources and execution of a block of threads and switching to other blocks of threads as needed. The other main complication is that you have to be able to store the context (register file and pending reads/writes) for all threads in flight on the machine.
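Here's a rough sketch of that idea in Python. The thread count, stall latencies, and instruction mix are all made up for illustration; a real GPU schedules batches/warps rather than single threads and has far more bookkeeping. The point is just that a scoreboard marks each thread runnable or stalled, the issue logic picks any runnable thread each cycle, and with enough threads in flight a stall on one thread rarely leaves the unit idle.

```python
import random

random.seed(0)

NUM_THREADS = 8
STALL_CYCLES = {"branch": 2, "load": 10}   # assumed latencies, purely illustrative

class Thread:
    def __init__(self, tid):
        self.tid = tid
        self.ready_at = 0                  # cycle at which this thread is runnable again

    def issue(self, now):
        """Issue one instruction; branches and loads stall this thread for a while."""
        kind = random.choice(["alu", "alu", "alu", "branch", "load"])
        if kind in STALL_CYCLES:
            self.ready_at = now + STALL_CYCLES[kind]
        return kind

threads = [Thread(t) for t in range(NUM_THREADS)]
idle_cycles = 0

for cycle in range(200):
    runnable = [t for t in threads if t.ready_at <= cycle]   # the 'scoreboard' check
    if runnable:
        runnable[0].issue(cycle)           # pick any runnable thread
    else:
        idle_cycles += 1                   # every thread is stalled: a wasted cycle

print(f"idle cycles with {NUM_THREADS} threads: {idle_cycles} / 200")
```

Dropping NUM_THREADS to 1 or 2 makes the idle count shoot up, which is the whole argument for overcommitting the processor with threads.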
 
Branch delay slots are an instruction-level decision made for hardware with highly exposed execution semantics.

In the case of code that's run through a driver's compiler, there is no way to make sure an instruction stays in the branch delay slot from the POV of the hardware. Individual instructions have such long latencies, and the memory behavior is so unpredictable, that a few cycles for branch target calculation are lost in the noise.

Considering that GPUs either stall a thread or predicate and run both sides when encountering branches, the branch slot doesn't gain anything.

The CTM spec implies behavior where both sides of a branch are run through by the array processors. A delay slot in that case would buy little.
 
It seems like I need to clarify what exactly I'm meaning and the way I understand things...

Branch Delay Slot: Cycles following a branch in which the branch outcome and branch target address are unknown. Without any additional form of scheduling other useful instructions, these cycles need to be filled with nops.

Compiler (Static) Scheduling: In this case the number of delay slots for the target architecture needs to be known at compile time, and the sources for the instructions are from before the branch, the target of the branch, and the fall-through from the branch. Any delay slots that can't be filled with useful instructions need to be filled with nops.

Hardware (Dynamic) Scheduling: In this case I assume the compiler schedules for no delay slots. In a single-threaded processor the hardware would have to schedule from the same sources as the static form. In a multithreaded processor other threads are added as a source of useful instructions.


So the dynamic form is likely troublesome and expensive for single-threaded processors, which I assume is the reason extensive branch prediction is used instead. On the flip side, it would seem almost trivial to fill all the delay slots in a multithreaded system with enough threads (i.e. a GPU).



So, what kind of information are you really looking for? Any specific reason or just curiosity?

Actually I'm at the point in my education where there really is no difference, grad school sure is fun. With that said though, I have a research project in a CPU architecture class in which I wish to bridge the gap in understanding of GPU/CPU architectures.


Considering that GPUs either stall a thread


It's this stall and switch behavior I'm attempting to describe. How else might one go about describing this behavior?


or predicate and run both sides when encountering branches, the branch slot doesn't gain anything.

I thought GPUs only predicated when the branches in a batch diverged?
 
It's this stall and switch behavior I'm attempting to describe. How else might one go about describing this behavior?
That's what I meant to say, that as in the case of G80, a branch prompts a stall of the current thread and switch to another.

A branch delay slot is the explicit revealing to software of the latency period after a branch where the target calculation is not complete.

Considering branch prediction (or another thread or predication for a GPU) in modern processors will fill the delay period with something, there's less incentive to reveal an implementation-specific number of delay slots, especially as pipelines are so much longer now than when the concept was first introduced.
 
That's what I meant to say, that as in the case of G80, a branch prompts a stall of the current thread and switch to another.

That's what I assumed.

A branch delay slot is the explicit revealing to software of the latency period after a branch where the target calculation is not complete.

So take, for example, my description of a dynamic rescheduler in a single-threaded processor: the only difference between that and the compiler performing the scheduling is the explicit revealing of the latency. The stall length is the same, the source of the instructions is the same, the intent is the same. So why is the explicit and public revealing of the stall length necessary for the dynamic scheduler to be considered the same technique as the static one?

Then with that in mind, adding alternate threads of execution as a source for useful instructions does not seem like a radical departure from the single threaded model.

The differences just seem so trivial, even though I'm aware the implementations are all worlds apart. And I will accept, that's just the definition of it, if that's what it is.
 
So take, for example, my description of a dynamic rescheduler in a single-threaded processor: the only difference between that and the compiler performing the scheduling is the explicit revealing of the latency. The stall length is the same, the source of the instructions is the same, the intent is the same. So why is the explicit and public revealing of the stall length necessary for the dynamic scheduler to be considered the same technique as the static one?

With basic speculation and branch prediction, the dynamic processor does not necessarily stall. It might mispredict, and it is possible that a branch will inject a pipeline bubble, but this is dependent on the implementation.
Those filled-in cycles are not known as delay slots.

A branch delay slot in an ISA is an explicit element of the architecture. That means that even with branch prediction and speculation, the chip has to go through the extra work of evaluating or at least pretending to evaluate delay slots appropriately. For a long pipeline design, this might be tens of slots long.
If a future implementation doesn't need any delay slots, or it needs more, it's considered tough luck if the ISA isn't changed.
 
With basic speculation and branch prediction, the dynamic processor does not necessarily stall. It might mispredict, and it is possible that a branch will inject a pipeline bubble, but this is dependent on the implementation.

I know that's how almost all CPUs actually do it; I'm still interested in my very hypothetical processor, where you'd try to fill those cycles with instructions scheduled in the same fashion as the compiler would for delay slots. At which point there is no need to make the delay slot number part of the ISA.

Then like I said, if you extend this to be a multithreaded system, it only makes sense to allow the use of other threads' instructions.

Those filled-in cycles are not known as delay slots.

And in my case above, what would that be known as? Because it is using neither prediction nor speculation to fill the stages, so there is no concern about backing up and starting over on the correct path of execution.


A branch delay slot in an ISA is an explicit element of the architecture.

Ok, so I'll give up and accept that being part of the ISA definition is a requirement for branch delay slots. Now, is there a name for the technique I described, "Stall, switch, and wait"? ;)
 
Branch Delay Slot: Cycles following a branch in which the branch outcome and branch target address are unknown. Without any additional form of scheduling other useful instructions, these cycles need to be filled with nops.
To be accurate, it's not the cycles but the instructions following a branch, which are executed regardless of whether the branch is taken or not. It is the compiler's task to fill these slots with instructions that can be executed before the branch is actually taken. Only when it can't fill all delay slots with useful instructions does it have to use nops.
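For illustration, here's a crude sketch of such a slot-filling pass in Python, over a made-up (opcode, dest, sources) instruction format. It only searches above the branch and only fills a single slot; a real compiler would also consider the branch target and the fall-through path.

```python
# Toy single-delay-slot filler over a made-up instruction format:
# (opcode, dest, [sources]). Purely illustrative, not any real compiler pass.

def safe_to_move(block, j, i):
    """Can block[j] be moved into the slot after the branch at block[i]?"""
    op, dest, srcs = block[j]
    if op.startswith("b"):
        return False                       # never move another branch
    for _, d2, s2 in block[j + 1:i + 1]:   # everything between it and the branch
        if dest in s2:                     # someone in between reads our result
            return False
        if d2 == dest:                     # our result would be clobbered
            return False
        if d2 in srcs:                     # our inputs change before we'd run
            return False
    return True

def fill_delay_slot(block):
    out = list(block)
    for i, (op, _, _) in enumerate(out):
        if not op.startswith("b"):
            continue                       # find the branch
        for j in range(i - 1, -1, -1):     # search upward for a safe candidate
            if safe_to_move(out, j, i):
                out.insert(i, out.pop(j))  # lands right after the branch
                return out
        out.insert(i + 1, ("nop", None, []))   # nothing safe: pad with a nop
        return out
    return out

block = [
    ("add", "r1", ["r2", "r3"]),   # independent of the branch and of 'sub'
    ("sub", "r4", ["r5", "r6"]),
    ("beq", None, ["r4", "r0"]),   # branch condition depends on r4
]
for ins in fill_delay_slot(block):
    print(ins)   # the 'add' ends up in the slot after 'beq'
```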
Compiler (Static) Scheduling: In this case the number of delay slots for the target architecture needs to be known at compile time, and the sources for the instructions are from before the branch, the target of the branch, and the fall-through from the branch. Any delay slots that can't be filled with useful instructions need to be filled with nops.
The compiler always knows the architecture. If it's an architecture with delay slots it needs to know the number of delay slots.
Hardware (Dynamic) Scheduling: In this case I assume the compiler schedules for no delay slots. In a single-threaded processor the hardware would have to schedule from the same sources as the static form. In a multithreaded processor other threads are added as a source of useful instructions.
When there are no delay slots there are several options:
- Stall the pipeline until the branch instruction has executed so we know what the next instructions are. This is done by sending 'hardware nops' into the pipeline, also called 'bubbles'.
- Predict the branch target and process the instructions speculatively. When the prediction was wrong, don't write the results of any of the speculative instructions.
- Switch-on-event multithreading: For any instruction with long latency (typically branches and memory reads), switch to another thread.

For all of these the compiler keeps the logical instruction order. I think we could categorize GPUs as switch-on-event. The threads are just shaders 'unrolled' for all the pixels/vertices in a batch.
So the dynamic form is likely troublesome and expensive for single-threaded processors, which I assume is the reason extensive branch prediction is used instead. On the flip side, it would seem almost trivial to fill all the delay slots in a multithreaded system with enough threads (i.e. a GPU).
The closest thing to your definition of hardware scheduling is Niagara, which switches threads on branches and memory reads. Hyper-Threading also shows some resemblance, although it actually schedules instructions from both threads simultaneously (also speculative instructions). A GPU, however, has no delay slots in the sense of executing the instructions right below the branch instruction before the branch has actually been taken.
I thought GPUs only predicated when the branches in a batch diverged?
Predication in the typical sense means that the code consists of the instructions from both branches, but only the correct result is selected. I could call that software predication.

But there's also a form of hardware predication, indeed when branches in a batch diverge. In this case the GPU has to execute the instructions from both branches separately and select the correct result for each pixel/vertex. While software predication always executes every path, hardware predication only executes the actual paths taken by a batch. But while software predication hardly requires any extra hardware, hardware predication requires keeping track of the branches that have been taken, managing all the extra variables per branch, fetching the right instructions, and combining the results. All implicitly instead of explicitly.
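A data-parallel sketch of the two ideas in plain Python, over a list of "lanes" (the batch size, data, and the two stand-in paths are invented; real hardware tracks masks per batch with much more machinery): software predication evaluates both sides for every element and then selects per element, while divergence handling runs each side only under the mask of lanes that actually took it, and skips a side entirely when no lane did.

```python
# Per-lane input values for one batch; batch size and data are made up.
x = [3, -1, 4, -1, 5, -9, 2, -6]

def then_side(v): return v * 2     # stand-ins for the two shader paths
def else_side(v): return -v

def software_predication(values):
    # Both sides are computed for EVERY lane, then the result is selected per lane.
    taken = [v > 0 for v in values]
    a = [then_side(v) for v in values]     # always executed
    b = [else_side(v) for v in values]     # always executed
    return [av if t else bv for t, av, bv in zip(taken, a, b)]

def divergence_handling(values):
    # Each side runs only under the mask of lanes that took it, and is skipped
    # entirely if its mask is empty (i.e. the batch did not diverge).
    mask = [v > 0 for v in values]
    out = [None] * len(values)
    if any(mask):                          # 'then' side: only active lanes
        for i, v in enumerate(values):
            if mask[i]:
                out[i] = then_side(v)
    if not all(mask):                      # 'else' side: only active lanes
        for i, v in enumerate(values):
            if not mask[i]:
                out[i] = else_side(v)
    return out

assert software_predication(x) == divergence_handling(x)
print(divergence_handling(x))
```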
 
...I'm still interested in my very hypothetical processor, where you'd try to fill those cycles with instructions scheduled in the same fashion as the compiler would for delay slots. At which point there is no need to make the delay slot number part of the ISA.
There are typically very few branch delay slots. It works well for architectures with very short pipelines (fetch, decode, execute, write) but not for superpipelined architectures. On a GHz processor, winning back one clock cycle from a 12-cycle branch delay is not worth it. It's better to just try to predict the branch.

Modern processors are not only superpipelined, but also superscalar. So with a critical pipeline length of 12 stages and the ability to execute four instructions in parallel, you'd have to find 48 instructions to put into the branch delay slots (explicit or not) to hide the branch latency. That's worst case or best case depending on your point of view, but in any case branch prediction is just far more effective. At least for single-threaded execution...

The other effective approach is to execute instructions from independent threads, or instructions within the same thread for independent data. On a CPU it's not trivial to have 12 independent threads, let alone write software for it (although in my opinion it's inevitable). For a GPU it's feasible because there's so much 'ridiculously parallel' work to do. It executes nearly independent shaders on independent pixels/vertices. In this sense I think I understand why you refer to delay slots for GPUs. The pipelines can remain full by executing more instructions from the same shader thread. If we look at a single pixel/vertex however, no new instructions are executed before the branch has been evaluated. The significant difference is that the software doesn't schedule any instructions in any delay slot, and the hardware requires extra logic to manage the whole thing (the shader 'unrolling' happens in hardware).

Am I still making sense? ;)
 
Thank you everyone for the input. I (think I) understand how a normal superscalar processor deals with branches, and my misuse of the term delay slot has led to some confusion. I was merely bouncing an idea off of everyone, and you guys seem to have straightened me out. The correct term I was searching for was switch-on-event multithreading, where in a GPU the events that cause switches are pretty much any instructions with "long" latencies, such as branches.
 
Delay slots are just one form of latency hiding. Latency arises in many different situations, and for each situation there are many ways of hiding it. The technique known as "delay slots" is one particular way of dealing with the latency between evaluating a condition and fetching/decoding the next instruction based on the outcome. Delay slots hide this latency by exposing it in the ISA and giving the compiler the opportunity to schedule useful work during the latency period.

So delay slots are one solution to the branch latency problem. There are other solutions -- switch-on-branch, speculative execution, etc. Understanding delay slots can be instructive when trying to understand the other solutions, but that doesn't mean the others are forms of delay slots; rather, they're all different solutions to the same problem.

GPUs are interesting in that they hide virtually all forms of latency using a single basic mechanism: switching to another thread. They use this to hide instruction latency (GPUs typically have no forwarding networks), branch latency (no delay slots, branch prediction, speculative execution, etc.), memory latency, and probably others. This only works because they can rely on there always being lots of independent threads available.
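A back-of-the-envelope way to see why that works, in Python (the latency and instruction counts are invented, and real GPUs interleave whole batches rather than single threads): if a result takes N cycles before a dependent instruction from the same thread can use it, and the hardware round-robins one instruction per thread per cycle, then roughly N threads in flight keep the unit busy with no forwarding network at all.

```python
# Round-robin issue from several threads. Dependent instructions within one
# thread must wait out the full (assumed) latency because there is no
# forwarding. All numbers are illustrative, not real hardware.
ALU_LATENCY = 8              # cycles before a result can feed the next instruction
INSTRUCTIONS_PER_THREAD = 20

def busy_fraction(num_threads):
    ready_at = [0] * num_threads   # when each thread may issue its next instruction
    issued = [0] * num_threads
    busy = 0
    cycle = 0
    while min(issued) < INSTRUCTIONS_PER_THREAD:
        for t in range(num_threads):               # scan for a ready thread
            if ready_at[t] <= cycle and issued[t] < INSTRUCTIONS_PER_THREAD:
                issued[t] += 1
                ready_at[t] = cycle + ALU_LATENCY  # back-to-back dependent chain
                busy += 1
                break
        cycle += 1
    return busy / cycle

for n in (1, 2, 4, 8, 16):
    print(f"{n:2d} threads -> execution unit busy {busy_fraction(n):.0%} of cycles")
```

With 1 thread the unit sits idle most of the time; at 8 threads (the assumed latency) it stays fully busy, and the same trick covers branch and memory latency.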
 
GPUs typically have no forwarding networks

This brings me to another thing I've been wondering: are GPUs likely to clock higher than CPUs in the future, due to fewer interconnect delays and longer pipelines?

Like I said, I'm taking an architecture class at the moment, and something that has jumped out at me is just how many data paths there are that run back and forth between the stages, primarily to do the forwarding. Now in the 5-stage MIPS pipeline this isn't a problem, but I could certainly see this being an issue as the pipeline gets longer.
 
This brings me to another thing I've been wondering: are GPUs likely to clock higher than CPUs in the future, due to fewer interconnect delays and longer pipelines?
No. For CPUs it's crucial to have very high single-threaded performance, and high clocks are a primary factor for this. For a GPU it's more efficient to keep the clock at moderate levels and just have lots of smaller execution units working in parallel. They could reach or exceed CPU frequencies, but then the heat production would be unmanageable. Increasing clock frequencies also costs die space because fast components are typically larger. Apparently NVIDIA found a sweet spot for G80 by doubling the shader unit frequency and focussing optimization efforts there. But that frequency is still significantly lower than CPUs' to keep heat levels manageable. So it's a delicate balance, and clock frequencies are important design decisions.
Like I said, I'm taking an architecture class at the moment, and something that has jumped out at me is just how many data paths there are that run back and forth between the stages, primarily to do the forwarding. Now in the 5-stage MIPS pipeline this isn't a problem, but I could certainly see this being an issue as the pipeline gets longer.
Eight layers of metal allows a lot of data forwarding. In a CPU you want to get the data back to the top of the pipeline as soon as possible so you can execute the next dependent instruction with minimal latency. But yes, interconnects are a big issue. One solution is to execute instructions from different threads in round-robin fashion, much like a GPU works on the elements within a batch. But it's going to take almost a decade before all applications are adequately threaded for such CPUs. Anyway, Intel is planning to re-introduce Hyper-Threading, and AMD plans to share execution units between cores. So programmers had better learn to write threaded code right now. ;)
 
This brings me to another thing I've been wondering: are GPUs likely to clock higher than CPUs in the future, due to fewer interconnect delays and longer pipelines?

Like I said, I'm taking an architecture class at the moment, and something that has jumped out at me is just how many data paths there are that run back and forth between the stages, primarily to do the forwarding. Now in the 5-stage MIPS pipeline this isn't a problem, but I could certainly see this being an issue as the pipeline gets longer.

Forwarding isn't necessarily as painful on longer pipelines just because they are longer. The number of forwarding datapaths is dependent on the number of stages where operands are accessed.
The number of stages that actually do that is limited and depends on whether there are instructions with more complex register access.

What's usually more painful for clock speeds is going superscalar, as that requires a number of connections that introduces a delay that scales quadratically with issue width.
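As a crude way to see that scaling in Python (the stage count, operand count, and the assumption that any result can feed any operand are invented for illustration): if each of W instructions issuing per cycle can receive a bypassed result on either of its 2 source operands, from any of W results produced in each of S later stages, the number of bypass paths grows like S × 2 × W².

```python
# Crude count of forwarding (bypass) paths as issue width grows.
# Assumptions (illustrative only): 3 pipeline stages still hold results that
# may need bypassing, each instruction reads 2 source operands, and any
# result can be forwarded to any operand of any instruction issuing next.
PRODUCING_STAGES = 3
SRC_OPERANDS = 2

def bypass_paths(issue_width):
    producers = PRODUCING_STAGES * issue_width   # results that may need forwarding
    consumers = SRC_OPERANDS * issue_width       # operands that may want a forwarded value
    return producers * consumers                 # quadratic in issue width

for w in (1, 2, 4, 8):
    print(f"issue width {w}: ~{bypass_paths(w)} bypass paths")
```

The connection count here grows quadratically with issue width, which is the blow-up mentioned above; the per-stage factor is why limiting the number of operand-access stages keeps forwarding tractable on longer pipelines.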

GPUs don't forward, but they do have to worry about the penalties of being so wide. The individual shader processors are not very wide, but there are a lot of units that are wired together by schedulers and read/write hardware.

That is a lot of wires to drive at a high clock. The G80 is an example where the shader units themselves can be clocked higher, but the more complex hardware that is directly affected by the width of the arrays is clocked lower.
 