Branch Delay Slot: Cycles following a branch in which the branch outcome and branch target address are not yet known. Without some form of scheduling that fills these cycles with other useful instructions, they have to be filled with nops.
To be accurate, it's not the cycles but the instructions following a branch, which are executed regardless of whether the branch is taken or not. It is the compiler's task to fill these slots with instructions that can safely be executed before the branch is actually taken. Only when it can't fill all delay slots with useful instructions does it have to use nops.
Compiler (Static) Scheduling: In this case the number of delay slots for the target architecture needs to be known at compile time, and the candidate instructions come from before the branch, from the target of the branch, and from the fall-through path of the branch. Any delay slots that can't be filled this way are filled with nops.
The compiler always knows the architecture. If it's an architecture with delay slots it needs to know the number of delay slots.
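As a minimal sketch of the static case, assuming a MIPS-like ISA with a single delay slot: an instruction from before the branch that doesn't affect the branch condition can be moved into the slot. The assembly in the comments is illustrative only, not actual compiler output.

```c
/* Hypothetical example: the add does not depend on 'cond', so a compiler
 * for a one-delay-slot ISA could move it from before the branch into the
 * slot. Illustrative MIPS-like assembly:
 *
 *   slot filled with a nop          slot filled with useful work
 *   ------------------------        ----------------------------
 *   addu $t0, $a0, $a1              beq  $a2, $zero, skip
 *   beq  $a2, $zero, skip           addu $t0, $a0, $a1   # delay slot
 *   nop                # wasted     ...
 */
int sum_or_zero(int a, int b, int cond) {
    int t = a + b;   /* independent of 'cond': a candidate for the delay slot */
    if (cond == 0)
        return 0;    /* branch target */
    return t;        /* fall-through */
}
```

The add is a safe slot candidate precisely because it executes whether or not the branch is taken, and on the taken path its result is simply ignored.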
Hardware (Dynamic) Scheduling: In this case I assume the compiler assumes no delay slots. In a single-threaded processor the hardware would have to schedule from the same sources as the static form. In a multithreaded processor, other threads are added as a source of useful instructions.
When there are no delay slots there are several options:
- Stall the pipeline until the branch instruction has executed, so we know what the next instructions are. This is done by sending 'hardware nops' into the pipeline, also called 'bubbles'.
- Predict the branch outcome and target, and process the instructions speculatively. If the prediction turns out to be wrong, don't write back the results of any of the speculative instructions.
- Switch-on-event multithreading: For any instruction with long latency (typically branches and memory reads), switch to another thread.
For all of these, the compiler keeps the logical instruction order. I think we could categorize GPUs as switch-on-event: the threads are just shaders 'unrolled' for all the pixels/vertices in a batch (a minimal model of that switching follows below).
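Here's a minimal sketch of switch-on-event issue, assuming a toy core with four hardware thread contexts and a fixed latency for 'long' events; all names and numbers are made up for illustration.

```c
#include <stdio.h>

#define NUM_THREADS 4
#define LATENCY 3   /* cycles a long-latency event keeps a thread stalled */

typedef struct {
    int pc;      /* instruction counter of this hardware thread */
    int stall;   /* remaining cycles until this thread is ready again */
} ThreadCtx;

int main(void) {
    ThreadCtx t[NUM_THREADS] = {{0, 0}};
    int current = 0;

    for (int cycle = 0; cycle < 16; cycle++) {
        /* Count down pending long-latency events on all threads. */
        for (int i = 0; i < NUM_THREADS; i++)
            if (t[i].stall > 0)
                t[i].stall--;

        /* Switch-on-event: if the current thread is stalled, pick another ready one. */
        int tried = 0;
        while (t[current].stall > 0 && tried < NUM_THREADS) {
            current = (current + 1) % NUM_THREADS;
            tried++;
        }
        if (t[current].stall > 0) {
            printf("cycle %2d: all threads stalled, bubble\n", cycle);
            continue;
        }

        printf("cycle %2d: thread %d issues instruction %d\n", cycle, current, t[current].pc);
        t[current].pc++;

        /* Pretend every 4th instruction is a branch or memory read. */
        if (t[current].pc % 4 == 0)
            t[current].stall = LATENCY;
    }
    return 0;
}
```

With enough threads the bubble case never triggers, which is exactly the GPU situation: there is always another pixel/vertex ready to issue.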
So the dynamic form is likely troublesome and expensive for single-threaded processors, which I assume is the reason that extensive branch prediction is used instead. On the flip side, it would seem almost trivial to fill all the delay slots in a multithreaded system with enough threads (i.e. a GPU).
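For comparison, here's a minimal sketch of what that branch prediction hardware does, using the classic 2-bit saturating counter scheme indexed by branch address; the table size and names are assumptions for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define TABLE_SIZE 1024

/* One 2-bit saturating counter per table entry:
 * 0,1 = predict not taken; 2,3 = predict taken. */
static uint8_t counters[TABLE_SIZE];

/* Prediction made at fetch time, before the branch has executed. */
static bool predict_taken(uint32_t branch_pc) {
    return counters[branch_pc % TABLE_SIZE] >= 2;
}

/* Called when the branch resolves; on a misprediction the speculatively
 * fetched instructions are squashed, i.e. their results are never written. */
static void train(uint32_t branch_pc, bool taken) {
    uint8_t *c = &counters[branch_pc % TABLE_SIZE];
    if (taken && *c < 3)
        (*c)++;
    else if (!taken && *c > 0)
        (*c)--;
}
```

The counter saturates, so a branch that is almost always taken keeps being predicted taken even after a single surprise in the other direction.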
The closest thing to your definition of hardware scheduling is Niagara, which switches threads on branches and memory reads. Hyper-Threading also shows some resemblance, although it actually schedules instructions from both threads simultaneously (including speculative instructions). A GPU, however, has no delay slots in the sense of executing the instructions right below the branch instruction before the branch has actually been taken.
I thought GPUs only predicated when the branches in a batch diverged?
Predication in the typical sense means that the code consists of the instructions from both branches, but only the correct result is selected. I could call that software predication.
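A minimal sketch of that software predication in C: both paths are computed unconditionally and the correct result is selected, so there is no branch left to mispredict (a compiler would typically turn the selection into a conditional move or select instruction).

```c
/* Branchy version: the hardware has to predict which path is taken. */
int abs_branchy(int x) {
    if (x >= 0)
        return x;
    return -x;
}

/* Software-predicated version: both results are computed, one is selected. */
int abs_predicated(int x) {
    int pos = x;    /* result of the 'then' path */
    int neg = -x;   /* result of the 'else' path */
    return (x >= 0) ? pos : neg;   /* selection, no control flow needed */
}
```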
But there's also a form of hardware predication, indeed when branches in a batch diverge. In this case the GPU has to execute the instructions from both branch paths separately and select the correct result for each pixel/vertex. While software predication always executes every path, hardware predication only executes the paths actually taken by a batch. But while software predication hardly requires any extra hardware, hardware predication requires keeping track of the branches that have been taken, managing all the extra variables per branch, fetching the right instructions, and combining the results. All implicitly instead of explicitly.
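Here's a minimal model of that divergent-branch handling, written in plain C as a stand-in for what the GPU does per batch; the batch size and mask handling are assumptions for illustration. Each path is executed only if some lane in the batch took it, under an active mask.

```c
#include <stdio.h>

#define BATCH 8   /* pixels/vertices executing the same shader in lockstep */

int main(void) {
    int in[BATCH] = {3, -1, 4, -1, 5, -9, 2, 6};
    int out[BATCH];
    int mask[BATCH];   /* active mask: which lanes took the 'then' path */

    /* Every lane evaluates the branch condition. */
    int any_then = 0, any_else = 0;
    for (int lane = 0; lane < BATCH; lane++) {
        mask[lane] = (in[lane] >= 0);
        any_then |= mask[lane];
        any_else |= !mask[lane];
    }

    /* 'then' path: run once for the whole batch, but only masked lanes
       write results. Skipped entirely if no lane in the batch took it. */
    if (any_then)
        for (int lane = 0; lane < BATCH; lane++)
            if (mask[lane])
                out[lane] = in[lane] * 2;

    /* 'else' path: likewise, under the inverted mask. */
    if (any_else)
        for (int lane = 0; lane < BATCH; lane++)
            if (!mask[lane])
                out[lane] = 0;

    for (int lane = 0; lane < BATCH; lane++)
        printf("%d ", out[lane]);
    printf("\n");
    return 0;
}
```

If every pixel/vertex in the batch takes the same path, only one of the two masked passes runs, which is the non-divergent fast case.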