Discussion of general purpose processor architecture cont.

Gubbi

I thought I'd respond here instead of in the 3D Hardware forum, since it's so obviously off topic, but it's a good, interesting discussion nonetheless.

I snipped a lot, since it was already a long post. The points I haven't commented on (snipped) I basically agree with (which was most of them).

DaveH said:
Gubbi wrote:
CISC has *zero* code size advantage over RISC. Compare ARM thumb or MIPS16 to x86 and you'll find the latter losing

Well sure, but condensed ISAs are hardly what one means when one says RISC. Of course Thumb and MIPS16 deserve to be called RISC ISAs, because they are variations on "classic RISC" ISAs (and SuperH because it is so similar to Thumb and MIPS16), and they incorporate many of the design insights of the RISC revolution.

<snipped more good points on code size of various uarchs and their implications>

My original intention was to argue that code size does not matter in a high performance CPU, which I completely failed (forgot) to do. In problem domains where code size matters there are better uarchs than x86.

As you stated yourself Dave, some of the things you give up when moving to a more condensed instruction format are a reduced number of registers (fewer bits to address registers) and usually a more restrictive 2-address format instead of a 3- or 4-address format. This also means that you'll end up spending more instructions on shuffling/spilling data. So while your total code size goes down, the actual number of instructions goes up.

One of the reasons why the text segment (ie. the program code) of programs is traditionally larger on RISCs has as much to do with the larger number of registers allowing a greater degree of loop unrolling; in this case the bloated code size results in a net gain in performance.
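
To make the unrolling point concrete, here's a minimal C sketch (the function names and the factor-of-4 unroll are purely illustrative): the unrolled version is bigger, but its four independent partial sums all want to stay live at the same time, which a machine with 32 FP registers can afford and an 8-entry register set cannot without spilling.

Code:
/* Plain loop: one accumulator, so every iteration depends on the previous one. */
double dot(const double *a, const double *b, int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}

/* Unrolled by four: more code, but four independent partial sums that a
 * register-rich machine can keep live at once.  To keep the sketch short
 * it assumes n is a multiple of 4. */
double dot_unrolled4(const double *a, const double *b, int n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    for (int i = 0; i < n; i += 4) {
        s0 += a[i + 0] * b[i + 0];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}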

DaveH said:
<snipped die sizes for P4 Will., 21264B and 21264C>

Just for reference: the 21264B was a "dumb" shrink (only the transistors were made with 0.18um, the metal layers were still 0.25um or something, not sure here). The 21264C was re-laid out with 0.18um design rules. It's true that it's on a Cu SOI process, but SOI does not reduce die size.


DaveH said:
There was a famous study carried out at DEC where they pitted their own VAX 8700 against the MIPS M2000; the chips were chosen because they were built on an extremely similar process.

Are you thinking of John Mashey's usenet posting? (more Mashey)

DaveH said:
Quote:
Finally: The compiler advancements that will benefit EPIC (VLIW) will also benefit every single other architecture out there.

EPIC is infinitely more dependent on good compilers for high performance than CISC or RISC, and particularly out-of-order implementations of CISC or RISC. Moreover, the other general-purpose architectures don't have features like full predication, branch hints (with poison bits to preserve correctness), or memory reference speculation. Plus their smaller visible register set limits how aggressive the compiler can be in terms of software pipelining or trace scheduling.

Quote:
The only thing EPIC has going for it is the large register file

Totally wrong. For one thing, simply giving a classic RISC 128 GPRs without significantly changing the rest of the ISA would barely improve performance at all. (After all, OoO RISCs get most of the benefit of a large visible register set by having a similarly large renaming register set.) For another, among all the other bits I mentioned above, you're somehow forgetting the little bit about the explicit parallelism...

Let us recap the features of EPIC:

1.) Many registers organized as a rotating stack
2a.) Explicit parallelism (fixed scheduling)
2b.) Instruction bundles (non-power of 2 instruction size)
3.) Full predication (ie. conditional execution)
4.) Memory speculation

Pros and cons of each:
ad 1:
PROS: Loads of registers are good for codes which have many active variables, mostly big vector codes. The rotating register stack (together with special setup instructions) can collapse unrolled (software-pipelined) loops into a single instance, obviously good for locality/cache usage.

CONS: Context switching is expensive: saving and restoring 128 64-bit integer plus 128 82-bit floating point registers is more than 2KB of architectural state.

ad 2:
PROS: 2a: Scheduling is determined by the template bits (ie. no die area wasted on scheduling logic); this only holds if the number of execution units matches the worst-case bundle instruction distribution (as it does in Itanium 2, but not in Merced). 2b: The non-power-of-2 instruction size (41 bits) gives more room for addressing operands while not wasting the 23 bits (64-41) that a fixed 64-bit encoding would; three 41-bit instructions plus a 5-bit template fill a 128-bit bundle exactly.

CONS: 2b: In the case of a future OOO or SMT implementation, dynamic scheduling is going to be needed anyway, so the scheduling template bits become redundant.

ad 3:
PROS: Reduces the number of branches in IF-THEN-ELSE constructs (and similar). Facilitates eager execution (executing both sides of a branch).

CONS: Not really any.

General usefulness: Not that big. For short blocks in IF-THEN-ELSE constructs, a simple CMOVE (conditional move) offers similar capabilities in 90% of all cases (although with worse latency). For bigger blocks, eager execution quickly loses out to a good branch predictor (well, any predictor doing better than 50% wins as block size -> inf.)
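
To illustrate with a minimal C sketch (function names invented): the second version leaves no control dependency, so a compiler can turn it into a CMOVE, or a predicated move on IA64, instead of a predicted branch.

Code:
/* Branchy form: compiles to a compare plus a conditional branch that must
 * be predicted. */
int clamp_branchy(int x, int limit)
{
    if (x > limit)
        x = limit;
    return x;
}

/* Branch-free form: the THEN block is a single assignment, so a conditional
 * move (or a predicated move) can replace the branch entirely, trading the
 * mispredict risk for a data dependency. */
int clamp_branchless(int x, int limit)
{
    return (x > limit) ? limit : x;
}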

ad 4:
PROS: Reduces latency by initiating loads before the data is needed.

CONS: Not really any

General usefulness: _very_ limited; it can be used to initiate loads prior to use. However, for predictable access patterns this is no better than prefetching, and for unpredictable ones the state space can grow very quickly (ie. speculating loads within a binary decision tree will cause the processor to have 2^n outstanding speculated loads, where n is how far down the tree you have progressed).
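
The binary-decision-tree case in plain C (the data structure is just an illustration, and this only shows the access pattern, not IA64's actual ld.s/chk.s mechanics):

Code:
struct node {
    int key;
    struct node *left, *right;
};

/* Walking a binary search tree: each level consumes one of two possible
 * children.  A compiler (or hardware) that speculatively loads *both*
 * children at every level to hide load latency doubles the number of
 * outstanding speculative loads per level -- after n levels that is up to
 * 2^n loads, nearly all of them for paths that are never taken. */
const struct node *tree_find(const struct node *t, int key)
{
    while (t != NULL && t->key != key)
        t = (key < t->key) ? t->left : t->right;
    return t;
}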

General comment:

EPIC was conceived in the late 80s and early 90s, when OOO schedulers looked to have n^2 complexity as a function of in-flight instructions. The instruction format can be called compressed VLIW (ie. the instruction bundles don't have instructions for all execution units in an instruction execution cluster). However, it isn't VLIW because you can't do a=b, b=a kinds of operations to swap registers (ie. ops interlock, regardless of being declared explicitly parallel).

However, time has shown that OOO scheduling is not nearly the hurdle it was perceived to be earlier. If it was, one would think that the instruction decoder in the P4 would put scheduling information into the trace cache together with the instructions, but it doesn't, and yet the P4 manages to have 120 instructions in flight at any one time.
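
For reference, the quadratic term people worried about comes from the wakeup broadcast. A toy sketch in C (window size and field names are made up, and it has nothing to do with the P4's real structures):

Code:
#define WINDOW 128              /* instructions waiting in the scheduler */

struct entry {
    int valid;
    int src[2];                 /* register tags of the two source operands */
    int ready[2];               /* which operands have arrived              */
};

/* One cycle of naive wakeup: every result tag produced this cycle is
 * compared against every source of every waiting entry.  With a window of
 * W entries and O(W) results per cycle, that is the O(W^2) comparator
 * network that made wide OOO look so expensive back then. */
void wakeup(struct entry window[WINDOW], const int done_tags[], int ndone)
{
    for (int i = 0; i < WINDOW; i++) {
        if (!window[i].valid)
            continue;
        for (int d = 0; d < ndone; d++)
            for (int s = 0; s < 2; s++)
                if (window[i].src[s] == done_tags[d])
                    window[i].ready[s] = 1;
    }
}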

The P4 is in many ways a sign of things to come, also for EPIC, IMHO.

It exploits instruction level parallelism, ILP, as well as thread level parallelism, TLP. There is a gray area in between that you could call micro-thread level parallelism, which isn't covered today (ie. two different iterations of a *big* loop, or two different, independent, function calls). But as the number of in-flight instructions increases and the cost of thread instantiation/switching decreases, this will be covered as well.

*gubbi puts on Dr. Who scarf

This means that all high performance (and high throughput) processors will eventually be OOO, and this includes EPIC. In an OOO EPIC implementation the scheduling template bits are redundant. The many registers will make context (thread) switching more expensive, much like register windows in SPARC do today, to the point where compilers will generate code that only uses a 32 fp + 32 int register subset in throughput-sensitive applications (but may still use the rotating register stack for high performance single-thread situations). Speculative loads will hardly ever be used, because the load/store resources will be better spent in another context (doing real work). All IMO (grain of salt... etc.).

DaveH said:
(Although ironically, Power4 does a crude form of on-chip RISC->"semi-VLIW" encoding, so that it can reap some of the control benefits of a bundled instruction ISA. Remind you of anything?)

I'm fairly certain that IBM chose to bundle instructions in order to increase the number of instructions they can have in flight given a fixed number of items their OOO engine can track. Conversely, you can think of it as reducing the complexity of the OOO engine for any given number of in-flight instructions.

AMD does something similar in the Athlon: instructions that use a memory operand are sent down the pipes as a single macro-op; there is no point decoding them into micro-ops (like the P3 does) only to have the OOO scheduler try to extract parallelism out of two operations that are inherently sequential in nature.

Cheers
Gubbi
 

Gubbi said:
There is a gray area in between that you could call micro-thread level parallelism, which isn't covered today (ie. two different iterations of a *big* loop, or two different, independent, function calls). But as the number of in-flight instructions increases and the cost of thread instantiation/switching decreases, this will be covered as well.

In 2000 NEC said they might commercialize the MP98 in 2003, so this might not take all that long. The advantage of this kind of SMT is that the transformation of ILP to TLP is much more efficient than the other way around.

Even with the best compilers in the world, the only way forward for EPIC is to execute more and more speculative instructions for an ever-diminishing number of actually useful instructions. It's the same diminishing-returns problem of superscalar architectures, reframed in a different setting.
 
McKinley still can't do every possible combination of instruction bundles.

There might be a few gains by adding the necessary execution units.

Once the control logic for EPIC implementations stops being such a pain, they can concentrate on really jacking up the clock rate.
 
Gubbi-

Thanks for a great reply! I'll follow the same policy: if I don't comment on something it's because I agree.

Gubbi said:
My original intention was to argue that code size does not matter in a high performance CPU, which I completely failed (forgot) to do. In problem domains where code size matters there are better uarchs than x86.

As you stated yourself Dave, some of the things you give up when moving to a more condensed instruction format are a reduced number of registers (fewer bits to address registers) and usually a more restrictive 2-address format instead of a 3- or 4-address format. This also means that you'll end up spending more instructions on shuffling/spilling data. So while your total code size goes down, the actual number of instructions goes up.

One of the reasons why the text segment (ie. the program code) of programs is traditionally larger on RISCs has as much to do with the larger number of registers allowing a greater degree of loop unrolling; in this case the bloated code size results in a net gain in performance.

In other words, in general there is very often a tradeoff between code size and core performance (by which I mean performance ignoring memory stalls). A great point--if you hadn't forgotten to put it in your original reply, I probably wouldn't have gone off on you! ;)

Obviously, for modern general-purpose computing, no one could care less about program code footprint--which is definitely not the case for many embedded systems, nor was it the case for general-purpose systems in the heyday of CISC. On the other hand, bloated code can still exact a performance penalty by clogging up the memory hierarchy, as today's general-purpose computers are increasingly sensitive to memory hierarchy efficiency for good performance.

To explore this a bit further, let's look at some of the sources of different code size between a traditional RISC ISA and x86:

1. 32 GPRs; 3-operand instructions: on the one hand, these make individual instructions larger: 15 bits for the register fields in a typical instruction as opposed to 6. On the other, they save a lot of instructions in the form of register-register copies, and register thrashing when 8 GPRs are not enough (see the small C sketch after this list). Overall, they probably result in a code size decrease (although I'm not sure on that point), and definitely in a performance increase.

1.1 Loop unrolling, an optimization enabled to a much greater degree by the larger register set: loop unrolling increases the code's memory footprint, but decreases the code path length. It will presumably make I-cache locality a bit worse, but then again if you're in a tight code loop, the impact of instruction fetch on your memory performance is probably about nil. The impact of data loads, on the other hand, might be considerably greater, and longer basic blocks might allow more efficient data loads; so loop unrolling is probably a win from a memory stall perspective as well as from a control stall perspective.

2. Decoupled memory and arithmetic operations: makes pipelining much, much easier, for a variety of reasons, but none more so than simplifying exception handling. It has such a large effect in enabling today's super- (and "hyper-") pipelined speed demons that it's almost pointless to talk about the corresponding code bloat. Of course, it's also a moot point when discussing x86 code from a modern compiler, which will not generate combined memory/arithmetic instructions. In talking about CISC vs. RISC as general paradigms, however, this is a huge win for RISC.

3. Fixed instruction size: makes instruction decoding much simpler; on the other hand, definitely leads to larger code. (It is obvious that, everything else being equal, some flexibility in instruction size can only lead to denser code.) And that's larger code in terms of memory footprint, fetch bandwidth, and I-cache footprint (unless, like P4, you are storing decoded instructions). It was this aspect I was thinking about when I made my comment that CISC's smaller code size could have performance benefits when instruction fetch bandwidth becomes important.

Now, this isn't to say that it isn't an overall win to move to a fixed instruction size as opposed to the complicated mess of logic it takes to decode an x86 instruction stream. However, it's still true that, once you've already paid the x86 tax by sticking your decoders in, there is the mitigating benefit of higher code density.
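
To put a concrete (and entirely contrived) C example behind point 1: in the function below roughly ten values are live across every loop iteration, which fits comfortably in 32 GPRs, while an 8-GPR machine has to spill some of them to the stack -- and every spill and reload is an extra instruction that eats into the denser encoding's size advantage.

Code:
/* Six accumulators plus two pointers, the index and the bound stay live
 * across every iteration (assumes n >= 1).  No problem with 32 GPRs; with
 * 8 GPRs some of these values end up on the stack. */
void running_stats(const int *x, const int *y, int n,
                   long *sum_x, long *sum_y, long *sum_xx,
                   long *sum_yy, long *sum_xy, int *max_x)
{
    long sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
    int mx = x[0];
    for (int i = 0; i < n; i++) {
        sx  += x[i];
        sy  += y[i];
        sxx += (long)x[i] * x[i];
        syy += (long)y[i] * y[i];
        sxy += (long)x[i] * y[i];
        if (x[i] > mx)
            mx = x[i];
    }
    *sum_x = sx; *sum_y = sy; *sum_xx = sxx;
    *sum_yy = syy; *sum_xy = sxy; *max_x = mx;
}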

Let us recap the features of EPIC:

1.) Many registers organized as a rotating stack
2a.) Explicit parallelism (fixed scheduling)
2b.) Instruction bundles (non-power of 2 instruction size)
3.) Full predication (ie. conditional execution)
4.) Memory speculation

...

CONS: 2b: In the case of a future OOO or SMT implementation, dynamic scheduling is going to be needed anyway, so the scheduling template bits become redundant.

I agree IPF will eventually go OOO/SMT in the long-term. In the medium-term, however, it is likely to go to coarse-grained multithreading first. (CMT keeps several context states "saved" on the die at one time (via duplicated register files, etc.), allowing free context-switches on exceptions, etc.)

And I'm not sure I agree that the instruction bundles will become redundant with the coming of an OOO implementation. Rather I would guess that scheduling will occur at bundle granularity instead of instruction granularity. Like you said about Power4:

I'm fairly certain that IBM chose to bundle instructions in order to increase the number of instructions they can have in flight given a fixed number of items their OOO engine can track. Conversely, you can think of it as reducing the complexity of the OOO engine for any given number of in-flight instructions.

I'd guess a similar idea for IPF, except with the bundles formed at compile-time rather than during decoding.

EPIC was conceived in the late 80s and early 90s, when OOO schedulers looked to have n^2 complexity as a function of in-flight instructions... However, time has shown that OOO scheduling is not nearly the hurdle it was perceived to be earlier. If it was, one would think that the instruction decoder in the P4 would put scheduling information into the trace cache together with the instructions, but it doesn't, and yet the P4 manages to have 120 instructions in flight at any one time.

I thought scheduling was still O(n^2). And I'm not entirely sure your P4 argument holds water: the P4 devotes 6 or 7 stages (depending on what goes on in the "allocate" stage) to register renaming and scheduling. That obviously represents a huge amount of logic complexity. Certainly scheduling scales quite steeply with the size of the reordering window (if not indeed quadratically). It would seem that dynamic scheduling at instruction granularity will run out of steam soon. Dynamically scheduling bundles of independent instructions would appear to be the way around this problem, certainly according to IBM and pretty certainly according to Intel as well. (Paul DeMone at RealWorldTech has implied that CMT and bundle-granularity OOO are the eventual future for IPF, and obliquely implied that this conclusion is based on conversations with the ex-Alpha team working on such an implementation.)

Speaking of Paul, he has a new article discussing some of these issues. Among other things it includes some good counterarguments to the notion that x86 really is approaching process-normalized parity with RISC that you haven't brought up: namely that low-volume server-oriented MPUs have fewer clock bins in a given process, and those are more conservatively qualified, compared to high-volume consumer MPUs.
 
Dave H said:
I thought scheduling was still O(n^2).
Well, it depends on what you're measuring ;)

Area, power or delay. I was thinking of delay, but you're still right. After reading this it is clear that there's an n^2 term that stems from wire delay. It is small for now, but since wire delay will become dominant in future processes, you of course have a point.


Dave H said:
And I'm not entirely sure your P4 argument holds water: the P4 devotes 6 or 7 stages (depending on what goes on in the "allocate" stage) to register renaming and scheduling. That obviously represents a huge amount of logic complexity.

True, but to achieve high levels of performance you'll need a lot of physical registers regardless. The alternative to register renaming is to have these registers explicitly addressable, like the IA-64 does. If you look at the Itanium pipeline you'll find that it spends 2 stages accessing registers. The I2, however, runs at half the speed of the P4 (in a similar process) and thus uses an equivalent amount of time (in nanoseconds) to access its register file.

What is left is the three stages used for injecting and scheduling instructions, time Intel apparently thought could be made up by having OOO execution.

Dave H said:
Certainly scheduling scales quite steeply with the size of the reordering window (if not indeed quadratically). It would seem that dynamic scheduling at instruction granularity will run out of steam soon.

The paper above indicates that issue width has a larger effect on delay. I think the penalty for OOO execution is more or less fixed in relation to the other problems you get from going very wide and deep (pipelining is another, cheaper way of exploiting ILP than going superscalar). You'll need to check dependencies regardless of scheduling methodology (and that check has similar time complexity).

Cheers
Gubbi
 
Gubbi said:
Dave H said:
And I'm not entirely sure your P4 argument holds water: the P4 devotes 6 or 7 stages (depending on what goes on in the "allocate" stage) to register renaming and scheduling. That obviously represents a huge amount of logic complexity.

True, but to achieve high levels of performance you'll need a lot of physical registers regardless. The alternative to register renaming is to have these registers explicitly addressable, like the IA-64 does. If you look at the Itanium pipeline you'll find that it spends 2 stages accessing registers. The I2, however, runs at half the speed of the P4 (in a similar process) and thus uses an equivalent amount of time (in nanoseconds) to access its register file.

What is left is the three stages used for injecting and scheduling instructions, time Intel apparently thought could be made up by having OOO execution.

A very good point. Of course, let's not forget that the I2 also gets more IPC for its troubles! :) (Whether it's more IPC than a comparable heavily-OOO RISC is, however, debatable. Although it's probably fair to say: no for SPECint, yes for SPECfp.)

I should make it clear that I'm not saying naive OOO is running out of steam on recent cores like Netburst, EV6/7 or K7/8. Just that we're at the point where future all-new cores are going to have to look beyond just naive OOO to find enough parallelism to keep up.

Of course that's hardly a controversial statement, particularly as just about every general-purpose processor family has SMT on its roadmap. (Except IPF and AMD64. And IPF is not only already going after other sources of parallelism via EPIC, but is probably moving toward OOO and maybe even SMT at some point beyond the end of the roadmap; and AMD is just being tight-lipped about K9, but surely it addresses the issue.)
 
The problem with bundling independent instructions and then scheduling bundles is that (AFAICS) dependency checks must still occur at the instruction level (a bundle-to-bundle check means all instructions in bundle 1 must be checked against all instructions in bundle 2).

I'm not an expert at all, but it makes more sense to me to track queues (or "bundles", whatever) of serially dependent instructions:

Queues would be constructed such that the head of each queue must be executed before any other instructions in the queue can run. This means dependency checking and scheduling for N queues is equivalent to dependency checking and scheduling for the N queue head instructions.

If you queue statically (~ IA64 bundles), say with 4 instructions per queue, you get 4x the number of instructions in flight versus a no-queue OOOE of similar complexity.
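
Something like this, in C just to make the bookkeeping explicit (all sizes and names are made up):

Code:
#define NQUEUES    32
#define QUEUE_LEN   4

struct insn {
    int dst;                    /* register written */
    int src[2];                 /* registers read   */
};

/* Each queue holds a chain of serially dependent instructions, so only the
 * head can possibly be ready.  The scheduler tracks NQUEUES heads instead
 * of NQUEUES*QUEUE_LEN individual instructions, while still keeping
 * NQUEUES*QUEUE_LEN instructions in flight. */
struct queue {
    struct insn slot[QUEUE_LEN];
    int head;                   /* index of the next instruction to issue */
    int len;
};

/* One issue pass: dependency checking only ever touches the queue heads. */
void issue_ready(struct queue q[NQUEUES],
                 int (*operands_ready)(const struct insn *))
{
    for (int i = 0; i < NQUEUES; i++) {
        if (q[i].head < q[i].len && operands_ready(&q[i].slot[q[i].head])) {
            /* ...send q[i].slot[q[i].head] to an execution unit here... */
            q[i].head++;
        }
    }
}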
 
Skimmed the paper (will read more later)...

I had an idea for a very parallel processing architecture (loosely based on the grid architecture) whereby you maintain an in-order (fixed length) instruction queue per FU. Connect say 16 FUs with a network of some kind, and then, instead of encoding register inputs/outputs, you encode source/destination FUs. I'm still thinking about the details, but I'll post a bit more about it when I start making good sense to myself :).

Serge
 
psurge said:
Skimmed the paper (will read more later)...

I had an idea for a very parallel processing architecture (loosely based on the grid architecture) whereby you maintain an in-order (fixed length) instruction queue per FU. Connect say 16 FUs with a network of some kind, and then, instead of encoding register inputs/outputs, you encode source/destination FUs. I'm still thinking about the details, but I'll post a bit more about it when I start making good sense to myself :).

Serge
That sounds a bit like the old "data flow machine" approach.
 
It is more like TTAs, and even more like the OpenRISC 2000 proposal at opencores.org.

The problem with classical dataflow is that it is only suited to a limited set of applications. I've always thought condensed graphs would make a nice concept to build a hardware architecture around.

Andy Glew had some nice things to say about these kinds of architectures, and about new instruction sets in general (reading between the lines, I don't think he likes VLIW for a scalable ISA either). I don't quite understand the problems he sees with them: the fan-in/out problem can be limited with pipelining and routing instead of using naive crossbars (the register set serves the same purpose really), and I don't understand the control flow convergence problem. He also disagrees with Dave about decoupled memory and arithmetic operations, BTW ;)
 
MfA said:
Andy Glew had some nice things to say about these kinds of architectures, and about new instruction sets in general (reading between the lines, I don't think he likes VLIW for a scalable ISA either).

He does appear to be an idea factory. I think he would like to see the VLIW concept, bundling independent instructions together, generalized. His idea of representing the dependency matrix explicitly lends itself to a micro-thread architecture, doesn't it?

Imagine a bunch of small chunks of code, each with high levels of intra-chunk dependency but with no inter-chunk dependencies, issued at the same time. Dependency checks would then only be performed between chunks when they're done executing.

The individual instructions inside each chunk could then be executed either:

1.) With the assumption that all instructions are heavily dependent (ie. don't check) -> instructions are executed sequentially on a scalar core.
2.) Normally, as in a normal narrow superscalar.

Use SIMD to exploit data-level parallelism.
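
A toy C illustration of what I mean by chunks (the functions are invented, and nothing here maps onto a real ISA): each function below is one chunk with a tight internal dependency chain, and the only cross-chunk dependency is the final add, so the two chunks could be handed to separate micro-threads and only checked against each other when both have retired.

Code:
/* Chunk A: a serial dependency chain -- fine to run in order on one core. */
static long chunk_a(const int *x, int n)
{
    long acc = 0;
    for (int i = 0; i < n; i++)
        acc = acc * 31 + x[i];
    return acc;
}

/* Chunk B: another serial chain, completely independent of chunk A. */
static long chunk_b(const int *y, int n)
{
    long acc = 1;
    for (int i = 0; i < n; i++)
        acc ^= (acc << 3) ^ y[i];
    return acc;
}

long combine(const int *x, const int *y, int n)
{
    /* The only inter-chunk dependency is this final combination. */
    return chunk_a(x, n) + chunk_b(y, n);
}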

My head hurts

Cheers
Gubbi
 