I thought I'd respond here instead of in the 3D Hardware forum, since this is so obviously off-topic there, but it's a good, interesting discussion nonetheless.
I snipped a lot, since it was already a long post. The points I haven't commented on (i.e. snipped) are ones I basically agree with (which were most of them).
DaveH said: Gubbi wrote:
CISC has *zero* code size advantage over RISC. Compare ARM thumb or MIPS16 to x86 and you'll find the latter losing
Well sure, but condensed ISAs are hardly what one means when one says RISC. Of course Thumb and MIPS16 deserve the term "RISC" ISAs, because they are variations on "classic RISC" ISAs (and SuperH because it is so similar to Thumb and MIPS16), and incorporate many of the design insights of the RISC revolution.
<snipped more good points on code size of various uarchs and their implications>
My original intention was to argue that code size does not matter in a high-performance CPU, which I completely failed (forgot) to do. In problem domains where code size does matter, there are better uarchs than x86.
As you stated yourself, Dave, some of the things you give up when moving to a more condensed instruction format are registers (fewer bits for addressing registers means fewer of them) and usually a more restrictive 2-address format instead of a 3- or 4-address format. This also means that you'll end up spending more instructions on shuffling/spilling data. So while your total code size goes down, the actual number of instructions goes up.
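To make the address-format point concrete, here's an illustrative sketch (mnemonics and register names are made up, not any real ISA): computing c = a + b while keeping a alive costs an extra copy under a 2-address format:

```
; 3-address format: one instruction, both sources survive
add  r3, r1, r2      ; r3 = r1 + r2

; 2-address format: the destination doubles as a source,
; so preserving r1 costs an extra move
mov  r3, r1          ; r3 = r1
add  r3, r2          ; r3 = r3 + r2
```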
One of the reasons the text segment (i.e. the program code) is traditionally larger on RISCs is that the larger number of registers allows a greater degree of loop unrolling; in this case the bloated code size results in a net gain in performance.
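As a minimal sketch of why more registers enable more unrolling (function name and unroll factor are mine, purely illustrative): each independent partial sum below wants its own register, so a larger register file lets the compiler unroll further without spilling.

```c
#include <stddef.h>

/* Sum an array with 4-way unrolling: four independent
   accumulator chains, hence four extra live registers. */
double sum_unrolled(const double *a, size_t n) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {   /* unrolled body */
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)             /* remainder loop */
        s0 += a[i];
    return (s0 + s1) + (s2 + s3);
}
```

The unrolled body is four times the code of the plain loop, but the four accumulators break the serial dependence on a single sum, which is exactly the "bloat for performance" trade mentioned above.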
DaveH said:<snipped die sizes for P4 Will., 21264B and 21264C>
Just for reference: the 21264B was a "dumb" shrink (only the transistors were made with 0.18um rules; the metal layers were still 0.25um or something, not sure here). The 21264C was re-laid-out with 0.18um design rules. It's true that it's on a Cu SOI process, but SOI does not reduce die size.
DaveH said:There was a famous study carried out at DEC where they pitted their own VAX 8700 against the MIPS M2000; the chips were chosen because they were built on an extremely similar process.
Are you thinking of John Mashey's usenet posting? (more Mashey)
DaveH said:Quote:
Finally: The compiler advancements that will benefit EPIC (VLIW) will also benefit every single other architecture out there.
EPIC is infinitely more dependent on good compilers for high performance than CISC or RISC, and particularly out-of-order implementations of CISC or RISC. Moreover, the other general-purpose architectures don't have features like full predication, branch hints (with poison bits to preserve correctness), or memory reference speculation. Plus their smaller visible register set limits how aggressive the compiler can be in terms of software pipelining or trace scheduling.
Quote:
The only thing EPIC has going for it is the large register file
Totally wrong. For one thing, simply giving a classic RISC 128 GPRs without significantly changing the rest of the ISA would barely improve performance at all. (After all, OoO RISCs get most of the benefit of a large visible register set by having a similarly large renaming register set.) For another, among all the other bits I mentioned above, you're somehow forgetting the little bit about the explicit parallelism...
Let us recap the features of EPIC:
1.) Many registers organized as a rotating stack
2a.) Explicit parallelism (fixed scheduling)
2b.) Instruction bundles (non-power of 2 instruction size)
3.) Full predication (ie. conditional execution)
4.) Memory speculation
Pros and cons of each:
ad 1:
PROS: Loads of registers are good for code that has many active variables, mostly big vector codes. The rotating register stack (together with special setup instructions) can collapse an unrolled loop into a single instance, which is obviously good for locality/cache usage.
CONS: Context switching is expensive; saving and restoring 2x128 registers means moving a couple of KB of architectural state.
ad 2:
PROS: 2a: scheduling is determined by the template bits (i.e. no transistors wasted on scheduling logic), which is only true if the number of execution units matches the worst-case bundle instruction distribution (as it does in Itanium 2, but not in Merced). 2b: A non-power-of-2 instruction size (41 bits) gives more room for addressing operands while not wasting the bits a power-of-2 (64-bit) encoding would leave unused (three 41-bit slots plus a 5-bit template fill a 128-bit bundle exactly: 3x41 + 5 = 128).
CONS: 2a: in a future OOO or SMT implementation, dynamic scheduling is going to be needed anyway, which makes the scheduling template bits redundant.
ad 3:
PROS: Reduces the number of branches in IF-THEN-ELSE constructs (and similar). Facilitates eager execution (executing both sides of a branch).
CONS: Not really any.
General usefulness: Not that big. For short blocks in IF-THEN-ELSE constructs, a simple CMOV (conditional move) offers similar capabilities in 90% of all cases (although with worse latency). For bigger blocks, eager execution quickly loses out to a good branch predictor (well, any predictor doing better than 50% wins as block size -> inf.).
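A small C sketch of what CMOV-style selection buys over a branch (function names are mine, purely illustrative): both versions compute a maximum, but compilers typically turn the second into a compare-plus-conditional-move, so there is no branch to mispredict.

```c
/* Branchy form: likely compiles to a conditional branch,
   which costs a pipeline flush when mispredicted. */
int max_branchy(int a, int b) {
    if (a > b)
        return a;
    return b;
}

/* Branchless form: compilers typically emit a CMOV here
   (or a predicated move on IA-64) - both values are
   evaluated and one is selected, no control flow needed. */
int max_branchless(int a, int b) {
    return (a > b) ? a : b;
}
```

This is exactly the "short block" case; once the THEN/ELSE bodies grow beyond a few operations, doing both sides unconditionally stops paying off against a decent predictor.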
ad 4:
PROS: Reduces latency by initiating loads before the data is needed.
CONS: Not really any.
General usefulness: _very_ limited. For predictable access patterns it is no better than prefetching. For unpredictable ones the state space can grow very quickly (i.e. speculating loads down a binary decision tree gives the processor 2^n outstanding speculated loads, where n is how many levels ahead you speculate).
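A C sketch of why the tree case blows up (types and function names are mine, purely illustrative): each load address depends on the previous comparison, so the loads form a serial chain, and covering n levels of latency speculatively means loading every node at depth n.

```c
#include <stddef.h>

typedef struct Node {
    int key;
    struct Node *left, *right;
} Node;

/* Each iteration's load address depends on the previous
   compare - the chain cannot be hoisted or prefetched. */
const Node *descend(const Node *t, int key) {
    while (t != NULL && t->key != key)
        t = (key < t->key) ? t->left : t->right;
    return t;
}

/* Number of candidate loads n levels ahead: 2^n. */
unsigned long speculative_frontier(unsigned n) {
    return 1UL << n;
}
```

Ten levels of look-ahead already means 1024 speculative loads in flight, which is why (IMO) this only pays off in narrow cases.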
General comment:
EPIC was conceived in the late 80s and early 90s, when OOO schedulers looked to have n^2 complexity as a function of in-flight instructions. The instruction format can be called compressed VLIW (i.e. the instruction bundles don't carry an instruction for every execution unit in an execution cluster). However, it isn't true VLIW, because you can't use a=b, b=a style operations to swap registers (i.e. ops interlock, regardless of being declared explicitly parallel).
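A minimal C sketch of that semantic difference (function names are mine, purely illustrative): in a true VLIW, all ops in one word read the register state from before the word executes, while EPIC's ops still take effect in program order even inside a bundle.

```c
/* True-VLIW semantics: both reads see the old values,
   so { a=b || b=a } in one word is a register swap. */
void vliw_parallel_swap(int *a, int *b) {
    int old_a = *a, old_b = *b;  /* all reads happen "first" */
    *a = old_b;
    *b = old_a;
}

/* EPIC/sequential semantics: the ops interlock, so a=b
   completes first and b=a then sees the new a - both
   registers end up holding b's original value. */
void epic_sequential(int *a, int *b) {
    *a = *b;
    *b = *a;
}
```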
However, time has shown that OOO scheduling is not nearly the hurdle it was perceived to be back then. If it were, one would think the instruction decoder in the P4 would put scheduling information into the trace cache together with the instructions - but it doesn't, and the P4 still manages to have around 120 instructions in flight at any one time.
The P4 is in many ways a sign of things to come, also for EPIC, IMHO.
It exploits instruction-level parallelism (ILP) as well as thread-level parallelism (TLP). There is a gray area in between, which you could call micro-thread-level parallelism, that isn't covered today (i.e. two different iterations of a *big* loop, or two different, independent function calls). But as the number of in-flight instructions increases and the cost of thread instantiation/switching decreases, this will be covered as well.
*gubbi puts on Dr. Who scarf
This means that all high-performance (and high-throughput) processors will eventually be OOO, and this includes EPIC. In an OOO EPIC implementation the scheduling template bits are redundant. The many registers will make context (thread) switching more expensive, much like register windows do in SPARC today, to the point where compilers will generate code that only uses a 32 FP + 32 int register subset in throughput-sensitive applications (but may still use the rotating register stack in high-performance single-thread situations). Speculative loads will hardly ever be used, because the load/store resources are better spent in another context (doing real work). All IMO (grain of salt... etc.).
DaveH said:(Although ironically, Power4 does a crude form of on-chip RISC->"semi-VLIW" encoding, so that it can reap some of the control benefits of a bundled instruction ISA. Remind you of anything?)
I'm fairly certain that IBM chose to bundle instructions in order to increase the number of instructions they can have in flight for a fixed number of items their OOO engine has to track. Conversely, you can think of it as reducing the complexity of the OOO engine for any given number of in-flight instructions.
AMD does something similar in the Athlon: instructions that use a memory operand are sent down the pipes as a single macro-op - there is no point decoding them into separate micro-ops (like the P3 does) and having the OOO scheduler try to extract parallelism out of two operations that are inherently sequential in nature.
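For illustration (the micro-op spellings are made up, not actual decoder output), here is roughly what that means for a single x86 instruction with a memory operand:

```
add  eax, [esi]          ; one x86 instruction: load, then add

; P3-style decode: two micro-ops, tracked separately by the
; OOO engine even though the add can never start before the load
;   uop1: tmp <- load [esi]
;   uop2: eax <- eax + tmp

; Athlon-style decode: one macro-op carrying both the load and
; the ALU operation through the scheduler together
```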
Cheers
Gubbi