As I noted above, IBM's PowerPC 970 fetches 8 instructions per cycle from the L1 cache into an instruction queue, from which the instructions are pulled for decoding at a rate of 8 per cycle.
[...]
Almost all the PowerPC ISA instructions, with a few exceptions, translate into exactly one IOP. Of the instructions that translate into more than one IOP, IBM distinguishes two types:
* A cracked instruction is an instruction that splits into exactly 2 IOPs.
* A millicoded instruction is an instruction that splits into more than 2 IOPs.
This difference in the way instructions are classified is not arbitrary. Rather, it ties into a very important design decision that the Power4's designers made regarding how the chip tracks instructions at various stages of execution. Before I explain this decision and the impact that it has on the 970, I should first recap how instructions normally move from a processor's front end to its execution core.
[...]
If you take a look at the middle of the large PPC 970 diagram from the beginning of the article, then you'll notice that right below the "decode, cracking, and group formation" phase I've placed a group of five boxes. These five boxes represent what IBM calls a "group", and each "group" consists of five IOPs arranged in program order according to certain rules and restrictions. It is these organized and packaged groups of five IOPs, and not single IOPs in isolation, that the 970 dispatches in-order to the six issue queues in its execution core. Once the IOPs in a group reach their proper issue queues, they can then be issued out of order to the execution units at a rate of 8 IOPs/cycle for all the queues combined. Before they reach the completion stage, however, they need to be placed back into their group so that an entire group of 5 IOPs can be completed each cycle.
I probably shouldn't go any further in discussing how these groups work without first explaining the reason for their existence. By assembling IOPs together into specially ordered groups of five for dispatch and completion, the 970 can track these groups, and not individual IOPs, through the various stages of execution. So instead of tracking 100 individual IOPs in-flight as they work their way through the 100 or so execution slots available in the execution core, the 970 need track only 20 groups. IOP grouping, then, significantly reduces the overhead associated with tracking and reordering the huge volume of instructions that can fit into the 970's "deep and wide" design.
The price the 970 pays for this reduced overhead is a loss of execution efficiency brought on by the reduced granularity of control that comes from not being able to dispatch, schedule, issue, and complete instructions on an individual basis. Let me explain.
When the 970's front end assembles an IOP group, there are certain rules it must follow. The first rule is that the group's five slots must be populated with IOPs in program order, starting with the oldest IOP in slot 0 and moving up to the newest IOP in slot 4. Another rule is that all branch instructions must go in slot 4, and slot 4 is reserved for branch instructions only. This means that if the front end can't find a branch instruction to put in slot 4, then it can dispatch one fewer instruction that cycle. Similarly, there are some situations in which the front end must insert noops into the group's slots in order to force a branch instruction into slot 4. "Noop" (pronounced "no op") is short for "no operation"--it's a kind of non-instruction instruction that means "do nothing". In other words, the front end must sometimes insert empty execution slots, or pipeline bubbles, into the instruction stream in order to make the groups comply with the rules.
The above rules aren't the only ones that must be adhered to when building groups. Another rule dictates that instructions destined for the condition register unit (CRU) can go only in slots 0 and 1. And then there are the rules dealing with cracked and millicoded instructions. From IBM's Power4 whitepaper:
"Cracked instructions flow into groups as any other instructions with one restriction. Both IOPs must be in the same group. If both IOPs cannot fit into the current group, the group is terminated and a new group is initiated. The instruction following the cracked instruction may be in the same group as the cracked instruction, assuming there is room in the group. Millicoded instructions always start a new group. The instruction following the millicoded instruction also initiates a new group."
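To make the group-formation rules a bit more concrete, here is a minimal Python sketch of a packer that follows the rules as described in this article. It is only a toy model, not IBM's actual logic. The `Instr` type, the `form_groups` function, and the instruction labels are all hypothetical. I'm also making a few simplifying assumptions: a branch always closes its group, unfilled trailing slots are shown as `-`, and a millicoded instruction with more than four IOPs simply spills into further groups of its own.

```python
from dataclasses import dataclass

GROUP_SLOTS = 5      # slots 0..4
BRANCH_SLOT = 4      # slot 4 holds branches only
LAST_CR_SLOT = 1     # CR-unit IOPs may use only slots 0 and 1
NOOP = "noop"        # padding inserted to push a branch into slot 4
EMPTY = "-"          # a slot left unfilled when the group closes


@dataclass
class Instr:
    name: str
    iops: int = 1        # 1 = ordinary, 2 = cracked, more than 2 = millicoded
    kind: str = "other"  # "branch", "cr", or "other" (labels assumed here)


def form_groups(program):
    """Pack instructions, in program order, into 5-slot dispatch groups."""
    groups, current = [], []

    def close():
        nonlocal current
        if current:
            groups.append(current + [EMPTY] * (GROUP_SLOTS - len(current)))
        current = []

    for ins in program:
        if ins.kind == "branch":
            # Pad with noops so the branch lands in slot 4, then end the group.
            current += [NOOP] * (BRANCH_SLOT - len(current))
            current.append(ins.name)
            close()
        elif ins.iops > 2:
            # Millicoded: starts a new group, and so does whatever follows it.
            close()
            for i in range(ins.iops):
                if len(current) == BRANCH_SLOT:
                    close()  # assumption: long sequences spill into new groups
                current.append(f"{ins.name}.{i}")
            close()
        else:
            # Ordinary or cracked: every IOP must fit in slots 0..3 of one
            # group; a CR-unit instruction must also stay within slots 0..1.
            last_slot = len(current) + ins.iops - 1
            limit = LAST_CR_SLOT if ins.kind == "cr" else BRANCH_SLOT - 1
            if last_slot > limit:
                close()
            if ins.iops == 1:
                current.append(ins.name)
            else:
                current += [f"{ins.name}.{i}" for i in range(ins.iops)]
    close()
    return groups


# Placeholder labels, not real PowerPC mnemonics.
program = [
    Instr("A"), Instr("B"),
    Instr("CR", kind="cr"),      # forces a new group: slot 2 is not allowed
    Instr("CRK", iops=2),        # cracked: both IOPs share a group
    Instr("BR", kind="branch"),  # padded into slot 4
    Instr("MIL", iops=3),        # millicoded: gets a group to itself
    Instr("C"),
]
for group in form_groups(program):
    print(group)
```

In this example, seven instructions turn into four groups, and only one of those groups is full. That's the cost of the grouping rules in miniature.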
And that's not all! A group has to have the following resources available before it can even dispatch to the core. If just one of the following resources is too tied up to accommodate the group or any of its instructions, then the entire group has to wait until that resource is freed up before it can dispatch.
* Group Completion Table entry: The GCT is the 970's equivalent of a reorder buffer. The GCT has 20 entries for keeping track of 20 active groups as the groups' constituent instructions make their way through the ~100 execution slots available in the execution core's pipelines. Regardless of how few instructions are actually in the execution core at a given moment, if those instructions are grouped so that all 20 GCT entries happen to be full then no new groups can be dispatched.
* Issue Queue slot: If there aren't enough slots available in the appropriate issue queues to accommodate all of a group's instructions, then the group must wait to dispatch. (In a moment I'll elaborate on what I mean by "appropriate issue queues".)
* Rename Registers: There need to be enough register rename resources available so that any instruction which requires register renaming can issue when it's dispatched to its issue queue.
Again, when it comes to the above restrictions, one bad instruction can spoil the whole bunch.
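The all-or-nothing nature of these three checks can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a description of the real dispatch logic: the `CoreState` type, the `can_dispatch` function, the queue names, and the idea of a single shared pool of rename registers are all simplifications I've made up for the example. Only the 20-entry GCT figure comes from the text.

```python
from dataclasses import dataclass

GCT_ENTRIES = 20  # from the text: at most 20 groups in flight


@dataclass
class CoreState:
    gct_used: int           # occupied Group Completion Table entries
    free_queue_slots: dict  # issue queue name -> free slots (names are made up)
    free_rename_regs: int   # assumption: one shared pool of rename registers


def can_dispatch(group, core):
    """group is a list of (queue_name, needs_rename) pairs, one per IOP."""
    # 1. The group needs a free GCT entry.
    if core.gct_used >= GCT_ENTRIES:
        return False
    # 2. Every issue queue the group targets needs room for all of its IOPs.
    demand = {}
    for queue, _ in group:
        demand[queue] = demand.get(queue, 0) + 1
    if any(core.free_queue_slots.get(q, 0) < n for q, n in demand.items()):
        return False
    # 3. Enough rename resources for every IOP that needs one.
    renames = sum(1 for _, needs_rename in group if needs_rename)
    return renames <= core.free_rename_regs


group = [("fxu", True), ("fxu", True), ("lsu", True), ("fpu", False)]
core = CoreState(gct_used=19,
                 free_queue_slots={"fxu": 2, "lsu": 1, "fpu": 1},
                 free_rename_regs=3)
print(can_dispatch(group, core))   # every resource is available

core.free_queue_slots["lsu"] = 0   # a single full queue...
print(can_dispatch(group, core))   # ...stalls the whole group
```

Note that the function never returns a partial answer. Either the whole group goes or none of it does, which is exactly why one blocked IOP holds up its four companions.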
Because of its use of groups the 970's dispatch bandwidth is sensitive to a whole complex host of factors, not the least of which is a sort of "internal fragmentation" of the group completion table that could potentially arise and needlessly choke dispatch bandwidth if too many of the groups in the GCT are partially or mostly empty.
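A quick back-of-the-envelope calculation shows how this fragmentation could play out. The Python sketch below uses the figures from the text (20 GCT entries, five slots per group). The `gct_utilization` function and the sample fill levels are hypothetical and purely for illustration. They are not measurements of any real workload.

```python
GCT_ENTRIES = 20  # groups the GCT can track (from the text)
GROUP_SLOTS = 5   # IOP slots per group (from the text)


def gct_utilization(iops_per_group):
    """Fraction of the 100 trackable IOP slots that hold real IOPs."""
    if len(iops_per_group) > GCT_ENTRIES:
        raise ValueError("the GCT tracks at most 20 groups")
    if any(not 0 <= n <= GROUP_SLOTS for n in iops_per_group):
        raise ValueError("a group holds between 0 and 5 IOPs")
    return sum(iops_per_group) / (GCT_ENTRIES * GROUP_SLOTS)


# Best case: 20 completely full groups.
print(gct_utilization([5] * 20))

# Hypothetical bad case: the GCT is full, but each group holds only 2 IOPs.
print(gct_utilization([2] * 20))
```

In the second case the GCT is just as full as in the first, so no new groups can dispatch, yet the machine is tracking only 40 IOPs instead of 100.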
[...]
You can't really get a full picture of what the 970 offers until you examine its execution core and issue queues. The 970 offers twelve (depending on how you count them) execution units for doing the actual grunt work of executing instructions, and though twelve is a relatively large number, a simple enumeration of execution resources doesn't tell you nearly as much as an examination of how those resources are organized.
[...]