25 chips "that shook the world"

The parallelism in EPIC is explicitly pointed out by the code through the template and stop bits, and by how the ISA defines valid instruction packets as not being rife with dependences.
Implicit parallelism is derived from how x86 chips analyse the instructions they load and check for the dependences that IA-64 would have spelled out.

Dependencies aren't spelled out by IA-64; instruction bundles contain instructions that explicitly have no dependencies (and thus can be issued in one cycle without any checking). However, IPF still uses a scoreboard to track register dependencies (and you cannot do VLIW-style a=b, b=a swaps inside a bundle).

Most of what seemed like good ideas when IPF was conceived are millstones around its neck today:
1. The rotating integer register file adds an adder in the critical register access path. Low-latency register access is imperative for IPF, so the result is a low operating frequency (and lots of power used).
2. The instruction templates, while allowing for simple issue logic, dictate a plethora of execution units, with a resulting *massive* result forwarding mux; the result is a low operating frequency (and lots of power used).
3. The ALAT allows for speculative loads, but the explicit clean-up needed means that it is rarely used. Meanwhile, Intel x86 from Core 2 onwards has had a reordering load/store unit, where speculative loads under outstanding stores are supported, which in effect means that every load is the equivalent of an ALAT load.
4. Predication of instructions to avoid branches. Later findings showed that conditional moves cover 90% of the cases where predication makes sense (see the sketch below this list). The remaining 10% is eroded by ever-improving branch predictors. Branch predictors are fucking awesome because they break data dependencies (i.e. apparent data-dependent latency is obliterated because branches are handled at the front of the pipeline, unlike write-back, which is at the very end of the pipeline). And in this day and age, where almost all CPUs are bound by power, eager execution of nested if-then-else constructs, using exponential amounts of energy compared to work, is plainly just a bad idea.
5. The current implementations are in-order, with the high sensitivity to cache access latency that implies. The cache system of current IPFs is fantastic, with multiple super-fast accesses, but the consequence is lots of power spent and a lower operating frequency.
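To make point 4 concrete, here's a minimal C sketch (my own example, with made-up function names, not taken from any compiler output) of the kind of branch that a conditional move covers just as well as predication would:

[code]
#include <stdio.h>

/* Branchy version: the hardware has to predict the comparison. */
static int max_branchy(int a, int b)
{
    if (a > b)
        return a;
    return b;
}

/* Branch-free version: on x86 a simple ternary like this usually compiles
 * to a CMOV, covering most of the cases where IA-64-style predication
 * would otherwise be used. */
static int max_cmov(int a, int b)
{
    int take_a = (a > b);          /* 0 or 1 */
    return take_a ? a : b;         /* typically emitted as cmovg */
}

int main(void)
{
    printf("%d %d\n", max_branchy(3, 7), max_cmov(3, 7));
    return 0;
}
[/code]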

The only really useful thing in IPF is that it packs non-power-of-two sized instructions into its instruction bundles, allowing for variable-sized instructions (i.e. 64-bit immediates).

I'd like to see a high-speed OOO IPF implementation. The rotating register file adder would be renamed away; with a speculating load/store unit the ALAT could be completely ignored; and the higher latency tolerance of an OOO execution engine would allow more slack in cache access, lowering the amount of power spent there, as well as allowing for higher operating frequencies.

IPF/EPIC is an architecture that shook the world because it killed MIPS, Alpha and PA-RISC purely by politics and speculation long before any implementation existed. My bet is that Intel will dump it (on HP) within 2-3 years.

Cheers
 
Dependencies aren't spelled out by IA-64; instruction bundles contain instructions that explicitly have no dependencies (and thus can be issued in one cycle without any checking).
That is what I said.

However, IPF still uses a scoreboard to track register dependencies (and you cannot do VLIW-style a=b, b=a swaps inside a bundle).
The scoreboard does facilitate some latency tolerance for long-latency or variable-latency events like cache misses or multi-cycle instructions. Without it, a cache miss would need to stall execution immediately (which would negate that cache subsystem to a large extent), because how else would the chip know whether an instruction in the EXE stage needed the result or not?

Most of what seemed like good ideas when IPF was conceived are millstones around its neck today:
1. The rotating integer register file adds an adder in the critical register access path. Low-latency register access is imperative for IPF, so the result is a low operating frequency (and lots of power used).
Is this particularly significant? How many bits does the adder need, and why wouldn't pipelining allow clocks to scale despite the extra work for the add and rotate?

I'd like to see a high-speed OOO IPF implementation. The rotating register file adder would be renamed away; with a speculating load/store unit the ALAT could be completely ignored; and the higher latency tolerance of an OOO execution engine would allow more slack in cache access, lowering the amount of power spent there, as well as allowing for higher operating frequencies.
What about handling predicates? Some kind of value prediction or retry?

What about an ISA that has two instructions (or just another x86 prefix if we went that route): bundle_start and bundle_stop, which would indicate the lack of dependences in between?
 
That is what I said.

I know, but you also said:

Implicit parallelism is derived from how x86 chips analyse the instructions they load and check for the dependences that IA-64 would have spelled out.

I just thought I would clarify that it isn't dependencies that are spelled out, it's the lack of dependencies (i.e. bundles contain independent instructions); there's a semantic difference even if it looks like I'm splitting words.

The scoreboard does facilitate some latency tolerance for long-latency or variable-latency events like cache misses or multi-cycle instructions. Without it, a cache miss would need to stall execution immediately (which would negate that cache subsystem to a large extent), because how else would the chip know whether an instruction in the EXE stage needed the result or not?

The instruction format of IPF/EPIC can be considered compressed VLIW because: 1) a bundle only contains independent instructions, and 2) instruction issue is defined at the bundle level (by the template bits). The scoreboard is needed because, as you point out, you need to guard against RAW hazards of variable-latency instructions (mostly memory ops). One of the quirks of IPF is that while a bundle only contains independent instructions by definition, dependencies inside a bundle are still scoreboarded (serialized) and thus you cannot do VLIW-style single-cycle swaps.
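As a toy illustration of the scoreboard idea (names and structure are mine, not Itanium's actual logic), a per-register pending flag is enough to show why a consumer of an outstanding load stalls on the RAW hazard while independent instructions keep going:

[code]
#include <stdbool.h>
#include <stdio.h>

#define NUM_REGS 32

/* true = the register's result has not been written back yet */
static bool pending[NUM_REGS];

static void issue_load(int dest) { pending[dest] = true;  }
static void writeback(int dest)  { pending[dest] = false; }

/* An instruction must stall if either source is still pending. */
static bool must_stall(int src1, int src2)
{
    return pending[src1] || pending[src2];
}

int main(void)
{
    issue_load(4);                                            /* r4 = load, miss in flight */
    printf("add r5,r4,r1 stalls: %d\n", must_stall(4, 1));    /* 1: waits for r4   */
    printf("add r6,r2,r3 stalls: %d\n", must_stall(2, 3));    /* 0: independent    */
    writeback(4);                                             /* data returns      */
    printf("add r5,r4,r1 stalls: %d\n", must_stall(4, 1));    /* 0: can issue now  */
    return 0;
}
[/code]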

Is this particularly significant? How many bits does the adder need, and why wouldn't pipelining allow clocks to scale despite the extra work for the add and rotate?

It directly adds latency in the register access critical path. Given the dependence on low-latency access, this is a major handicap, IMO.

What about handling predicates? Some kind of value prediction or retry?

Handling predicates in a ROB is simple: at the retire stage, if the predicate is false, don't commit the result.
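A toy sketch of that retire rule, with structure and field names of my own choosing (it assumes the predicate value is known by retire time):

[code]
#include <stdbool.h>
#include <stdio.h>

struct rob_entry {
    int  dest;       /* architected register number      */
    long value;      /* computed result                  */
    bool pred;       /* predicate value, known at retire */
};

static long arch_regs[32];

static void retire(const struct rob_entry *e)
{
    if (e->pred)                        /* predicate true: commit the result */
        arch_regs[e->dest] = e->value;
    /* predicate false: drop it, architected state is untouched */
}

int main(void)
{
    struct rob_entry a = { .dest = 3, .value = 42, .pred = true  };
    struct rob_entry b = { .dest = 3, .value = 99, .pred = false };
    retire(&a);
    retire(&b);
    printf("r3 = %ld\n", arch_regs[3]); /* prints 42: the squashed write never lands */
    return 0;
}
[/code]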

What about an ISA that has two instructions (or just another x86 prefix if we went that route): bundle_start and bundle_stop, which would indicate the lack of dependences in between?

The explicit-parallelism stuff only saves work at instruction issue, as far as I can tell. The usefulness is thus very limited; in an OOO design it is completely gone, since the ROB sorts the dependencies out.

I could imagine a bundle that only holds dependent instructions. Such a bundle would have completely sequential execution, and thus you don't need to spend energy finding parallelism; there is none. I know there has been research into such micro-thread/fibre architectures; you'd still need a way to check inter-fibre dependencies (and merge state).

Cheers
 
I know, but you also said:
With regards to non-EPIC x86 chips.

It directly adds latency in the register access critical path. Given the dependence on low-latency access, this is a major handicap, IMO.
Do what Intel did and add a pipeline stage for it. The penalty is the extra cycle added to a branch mispredict. In Itanium's case, it goes from 9 to 10 cycles, which is still 3-5 cycles shorter than on faster-clocked x86 chips.

The explicit-parallelism stuff only saves work at instruction issue, as far as I can tell. The usefulness is thus very limited; in an OOO design it is completely gone, since the ROB sorts the dependencies out.
Registers are renamed as part of the process of issuing instructions. The ROB stores the renamed identifiers the issue hardware generated. The ROB wouldn't even be needed if that renaming didn't happen, so it does nothing to affect the usefulness of explicit parallelism.

The explicitly parallel code removes a swath of decoder and scheduling hardware that usually scales quadratically or worse in complexity based on issue width and the width of the instruction window.
Whether Itanium is performant or not, a look at the die plot of a Nehalem core should show just how much can be saved by doing so.
 
As for the Pentium Pro, it lopped off a few 16-bit registers, making the mixed 32-bit and 16-bit code in Windows 95 run like molasses.
That is way overstated. I've tried to dig up info about this since they came out, and I've run a PPro in Win 95/98 myself. It is not slow at all, really, and as far as I can find out it performs about as well as a Pentium Classic at the same clock in that OS.

Also, the PPro platform initially had major issues with PCI bandwidth that were remedied with newer chipsets and BIOS improvements. Some of the testing with the initial chipsets can show major performance losses if there are any DOS video tests. PCI bandwidth was down around the single digit MB/s level.
 
The explicitly parallel code removes a swath of decoder and scheduling hardware that usually scales quadratically or worse in complexity based on issue width and the width of the instruction window.
Whether Itanium is performant or not, a look at the die plot of a Nehalem core should show just how much can be saved by doing so.

Fetch is made easier by the 128-bit aligned bundle boundary, and decode is made easier because of the fixed 41-bit instruction slots within each bundle. But this is no different from a sane architecture that uses, say, 32-bit instructions; fetch+decode is just as straightforward.
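For reference, here's a small sketch of what that bundle cracking looks like, assuming the documented IA-64 layout of 5 template bits plus three 41-bit slots in a 128-bit bundle. The field names are mine, and it leans on the GCC/Clang __int128 extension for the 128-bit container:

[code]
#include <stdint.h>
#include <stdio.h>

struct bundle {
    uint64_t lo, hi;    /* the 128-bit bundle as two 64-bit halves */
};

/* Extract 'len' bits starting at bit position 'pos' of the 128-bit value. */
static uint64_t bits(const struct bundle *b, int pos, int len)
{
    unsigned __int128 v = ((unsigned __int128)b->hi << 64) | b->lo;
    return (uint64_t)((v >> pos) & (((unsigned __int128)1 << len) - 1));
}

int main(void)
{
    struct bundle b = { 0x0123456789abcdefULL, 0xfedcba9876543210ULL };

    uint64_t template = bits(&b,  0,  5);   /* picks unit types and stop bits */
    uint64_t slot0    = bits(&b,  5, 41);
    uint64_t slot1    = bits(&b, 46, 41);
    uint64_t slot2    = bits(&b, 87, 41);

    printf("template=%llu slots=%011llx %011llx %011llx\n",
           (unsigned long long)template, (unsigned long long)slot0,
           (unsigned long long)slot1,    (unsigned long long)slot2);
    return 0;
}
[/code]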

Compared to the quagmire that is superscalar x86 decode, yeah, it's a huge win.

It's true that instruction issue is made super simple/small.

But what you're really doing is moving complexity from the front of the pipeline to the back. To be able to issue multiple bundles almost every cycle, you need a plethora of execution units to cover the many permutations of the various bundle templates. The consequence is a massive result forwarding mux, which scales quadratically with the number of execution units. Size is not the problem here; critical path timing is.

Compare that to a ROB-centric architecture, where the size and power of the ROB scale with n log n (EDIT: timing scales with log n), where n is the number of instructions in the instruction window, i.e. width times depth, secondary size-related effects notwithstanding. The number of result buses equals the retire width.



Cheers
 
Fetch is made easier by the 128-bit aligned bundle boundary, and decode is made easier because of the fixed 41-bit instruction slots within each bundle. But this is no different from a sane architecture that uses, say, 32-bit instructions; fetch+decode is just as straightforward.
Dependence checking is also simplified.
In a superscalar architecture, each operand must be checked for a dependence with any other simultaneously issued instruction to avoid an incorrect issue.
That scales quadratically with issue width.
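A quick sketch of that intra-group check, just to show where the roughly width-squared comparator count comes from (illustrative code, not any real front end's logic):

[code]
#include <stdio.h>

struct insn { int dest, src1, src2; };

/* Returns how many instructions of the group can issue together before the
 * first intra-group RAW hazard; every instruction is compared against every
 * earlier one, hence roughly W*W/2 comparisons per source for a W-wide group. */
static int issueable(const struct insn *grp, int w)
{
    for (int i = 1; i < w; i++)
        for (int j = 0; j < i; j++)
            if (grp[i].src1 == grp[j].dest || grp[i].src2 == grp[j].dest)
                return i;               /* hazard: split the group here */
    return w;
}

int main(void)
{
    struct insn grp[4] = {
        { 1, 2, 3 },    /* r1 = r2 + r3 */
        { 4, 5, 6 },    /* r4 = r5 + r6 */
        { 7, 1, 4 },    /* r7 = r1 + r4  -- needs both earlier results */
        { 8, 9, 9 },
    };
    printf("can issue %d of 4 together\n", issueable(grp, 4));  /* prints 2 */
    return 0;
}
[/code]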

An ROB architecture adds a renaming phase, which means that on top of the dependence checking there must be a rename of the destination register and, for all source operands, a lookup of the most recently renamed registers for a given instruction's identifiers.

Many schemes use a CAM, which is not an easy thing to just toss in the critical path from a complexity and power perspective.

But what you're really doing is moving complexity from the front of the pipeline to the back. To be able to issue multiple bundles almost every cycle, you need a plethora of execution units to cover the many permutations of the various bundle templates.

The consequence is a massive result forwarding mux, which scales quadratically with the number of execution units. Size is not the problem here; critical path timing is.
It would be quadratic for every set of units capable of sourcing each other, and only if we expect every bypass to occur in a cycle. Some level of clustering is expected based on unit type. Things like a move from an FP unit to an INT unit are not expected to happen with the same latency as an INT-to-INT bypass.
Itanium's initial idea for going beyond a 2-banger was that the design would be clustered, such that bypass between clusters would incur additional latency, rather than blow up the network.

Given compiler inconsistency in extracting ILP with 6 instructions, this doesn't seem to have panned out.
From a hardware perspective, this is fine. One can have wide superscalar issue with non-uniform forwarding. Alpha did it.

It is not so simple to do the same for the front end and still call it superscalar issue.

As far as the cost on a non-x86 goes, the ISU (which does all that scheduling and dependence checking) on POWER7 is close to a quarter of the core logic in a single core, so it is not cheap.

Compare that to a ROB-centric architecture, where the size and power of the ROB scale with n log n (EDIT: timing scales with log n), where n is the number of instructions in the instruction window, i.e. width times depth, secondary size-related effects notwithstanding. The number of result buses equals the retire width.
The ROB is not the primary concern with wide OOE chips. The back-end work is wide, but relatively "dumb" in both schemes.
It's the front end and issue logic where costs diverge.
For a given level of width, an explicit encoding of dependences is going to be cheaper to implement in terms of hardware.
 
Dependence checking is also simplified.
In a superscalar architecture, each operand must be checked for a dependence with any other simultaneously issued instruction to avoid an incorrect issue.
That scales quadratically with issue width.

An ROB architecture adds a renaming phase, which means that on top of the dependence checking there must be a rename of the destination register and, for all source operands, a lookup of the most recently renamed registers for a given instruction's identifiers.

Many schemes use a CAM, which is not an easy thing to just toss in the critical path from a complexity and power perspective.

In the kind of ROBs you find in PPro derivatives (all the way up to Core i7s) and Athlons, you don't have any explicit dependency checking. After decode, instructions get their values from the active register file, which either holds a value or a data-not-ready state if the value of the register is to be computed by a not-yet-executed instruction.

The destination registers of the decoded instructions are marked as data-not-ready in the active register file, and the instructions are then inserted into the ROB. Instructions that got all their values from the active register file are ready to execute at once. Instructions that got data-not-ready will have to sit and listen until the values they need show up on the result buses. When all register values are captured (hence the name Data Capture Scheduler), the ROB entry is ready to be scheduled for execution.

After execution, instructions broadcast their result on the result buses. The results get picked up in the ROB and eventually written to the active register file. The original PPro didn't even have a write port in the active register file for each result bus (it had one write port!); the low number of registers meant that most of the results lived entirely in the ROB (i.e. mostly data-not-readys in the active register file).

When an instruction retires, the result of the instruction is written to the architected register file. In Athlons, the number of result buses equals the retire width (with arbitration required if more results are produced).

The ROB itself is a dense SRAM structure with loads of comparators.

Most of these structures scale well with N (number of instructions in flight), and more or less sensibly with width.
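Here is a rough sketch of that active-register-file behaviour; the structure and field names are mine, and it only models the value / data-not-ready / tag states, nothing else:

[code]
#include <stdbool.h>
#include <stdio.h>

#define NUM_REGS 16

struct active_reg {
    bool ready;     /* true: value is valid; false: data-not-ready  */
    int  tag;       /* ROB slot of the producer, valid when !ready  */
    long value;     /* register value, valid when ready             */
};

static struct active_reg arf[NUM_REGS];

/* Decode-time source read: capture either the value or the tag to wait for. */
static void read_source(int reg, bool *got_value, long *value, int *tag)
{
    *got_value = arf[reg].ready;
    *value     = arf[reg].value;
    *tag       = arf[reg].tag;
}

/* Decode-time destination handling: mark the register data-not-ready. */
static void rename_dest(int reg, int rob_slot)
{
    arf[reg].ready = false;
    arf[reg].tag   = rob_slot;
}

/* Result-bus writeback: only the newest producer updates the register. */
static void writeback(int reg, int rob_slot, long value)
{
    if (!arf[reg].ready && arf[reg].tag == rob_slot) {
        arf[reg].ready = true;
        arf[reg].value = value;
    }
}

int main(void)
{
    arf[2] = (struct active_reg){ .ready = true, .value = 10 };

    rename_dest(1, 7);                  /* ROB slot 7: r1 = long-latency load    */

    bool ok; long v; int t;
    read_source(1, &ok, &v, &t);        /* a consumer of r1 captures tag 7       */
    printf("r1 ready=%d tag=%d\n", ok, t);

    writeback(1, 7, 42);                /* the load completes, broadcasts slot 7 */
    read_source(1, &ok, &v, &t);
    printf("r1 ready=%d value=%ld\n", ok, v);
    return 0;
}
[/code]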

It would be quadratic for every set of units capable of sourcing each other, and only if we expect every bypass to occur in a cycle. Some level of clustering is expected based on unit type. Things like a move from an FP unit to an INT unit are not expected to happen with the same latency as an INT-to-INT bypass.

It is still exactly quadratic going from one bundle to two: the fraction of integer/floating-point resources is the same in a worst-case bundle combination scenario, so doubling the overall execution unit count doubles the integer execution unit count, and thus inflates the result forwarding mux.
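As a back-of-the-envelope illustration of that growth (the unit counts and the assumption of two bypassed source inputs per unit are placeholders, not Itanium's actual numbers):

[code]
#include <stdio.h>

int main(void)
{
    /* Full bypass: every unit's result can feed every unit's two source inputs,
     * so doubling the unit count quadruples the number of forwarding paths. */
    for (int units = 2; units <= 16; units *= 2) {
        int bypass_paths = units * units * 2;
        printf("%2d execution units -> %4d bypass paths\n", units, bypass_paths);
    }
    return 0;
}
[/code]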

Cheers
 
Instructions that got data-not-ready will have to sit and listen until the values they need show up on the result buses. When all register values are captured (hence the name Data Capture Scheduler), the ROB entry is ready to be scheduled for execution.
How do you think they know which broadcast on the result bus to listen to? Those don't come back in order.
In the case of the AMD architectures, instructions that find that there is a pending result on an architectural register record the instruction tag for the result bus that will match the output of the producing instruction.
Every destination register is renamed.
Any source register must determine if it can draw from the architectural registers or if it needs a result tag.

Since this is superscalar, renames must be done in parallel, which means additional work is done in the cycle to catch the case where instructions in the same issue group write to the same destination, and the source operands must be checked to see whether they must be taken from register state or from the result bus, based on the tag they get from what has been renamed.

The ROB itself is a dense SRAM structure with loads of comparators.
The ROB itself is not the problem.
It's the process of Unordering that makes the Reorder Buffer necessary that is the source of trouble.

In the P4, the ROB barely existed beyond tracking instruction order for the sake of mispredicts and exceptions. The RAT did the rest.

It is still exactly quadratic going from one bundle to two: the fraction of integer/floating-point resources is the same in a worst-case bundle combination scenario, so doubling the overall execution unit count doubles the integer execution unit count, and thus inflates the result forwarding mux.
Resource oversubscription is not unheard of. If necessary, the chip will stall.
There is no requirement that the chip serve every possible instruction combination within one cycle.
Merced definitely did not. McKinley supported far more, but some combinations of bundle types still cause stalls.

If Intel ever went with a clustered design, there would have been another source of additional latency and scheduling headaches.
 
How do you think they know which broadcast on the result bus to listen to? Those don't come back in order.
In the case of the AMD architectures, instructions that find that there is a pending result on an architectural register record the instruction tag for the result bus that will match the output of the producing instruction.
Every destination register is renamed.
Any source register must determine if it can draw from the architectural registers or if it needs a result tag.

Renaming is simple: the tag given to the destination register is simply the ROB slot number the instruction is going to occupy, which gives a straightforward round-robin renaming scheme.

In modern x86 CPUs, each ROB entry listens to *all* the result buses. This, of course, is a source of complexity: ROB size times width (number of result buses).
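A toy version of that wakeup loop, sized arbitrarily, just to show where the ROB-size-times-result-buses comparators come from (one source per entry, for brevity; names are mine):

[code]
#include <stdbool.h>
#include <stdio.h>

#define ROB_SIZE     8
#define RESULT_BUSES 3

struct rob_entry {
    bool valid;
    bool src_ready;
    int  src_tag;       /* ROB slot of the producer we are waiting for */
    long src_value;
};

struct result { bool valid; int tag; long value; };

static struct rob_entry rob[ROB_SIZE];

/* One cycle of wakeup: ROB_SIZE x RESULT_BUSES tag comparisons. */
static void wakeup(const struct result bus[RESULT_BUSES])
{
    for (int e = 0; e < ROB_SIZE; e++)
        for (int b = 0; b < RESULT_BUSES; b++)
            if (rob[e].valid && !rob[e].src_ready &&
                bus[b].valid && bus[b].tag == rob[e].src_tag) {
                rob[e].src_ready = true;
                rob[e].src_value = bus[b].value;
            }
}

int main(void)
{
    rob[5] = (struct rob_entry){ .valid = true, .src_tag = 2 };

    struct result bus[RESULT_BUSES] = {
        { .valid = true, .tag = 2, .value = 123 },  /* matches entry 5 */
        { 0 }, { 0 },
    };
    wakeup(bus);
    printf("entry 5 ready=%d value=%ld\n", rob[5].src_ready, rob[5].src_value);
    return 0;
}
[/code]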

The wider IBM OOO POWER CPUs (POWER4, 5 and 7) alleviate this by organizing the ROB in columns and rows, where each column has restrictions as to what type of instructions it can hold; thus floating-point results are only broadcast to two columns in the case of POWER7, and conditionals to just one. While this causes some slots to be empty, it allows for a bigger ROB.

Cheers
 
Have to admit, I fumbled at the "write your own program" bit; I'd need to dig out my copy of Programming the 6502. So I just watched in awe as I set it off on its default little program.
 
Was it any good? Mine was with 8085 and it sucked. Big time. :mad:

I was 11 or 12 back then, and I was able to learn it and program my first big application, which took 55KB of C64 memory.
So yes, quite easy :D.
Oh, and the reason for learning assembler was the C64 BASIC limitation of using just half of the available memory for a program, so that Chemistry program I was writing couldn't be finished in only 32KB (or so) of memory.
 