Was AMD claiming the only additional area related to x86 was specifically in the decoders?
That's what the slide said: the overhead of x86 was 4%.
Some elements aren't as relevant to Bobcat, such as the A64 devoting more L1 storage to the predecode bits.
The instruction border bits were stored in the ECC bits of the I$; the A64 only supported parity checking on the I$, since damage to read-only instructions is harmless: just reload them from memory. You can argue they could have saved a few bits in the instruction cache without this, but it allowed reuse of the same SRAM macro as the one used for the D$.
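To make the border-bit idea concrete, here is a minimal sketch of what predecode marking does: one bit per cache-line byte records whether an instruction starts there, so the fetch stage can pick instruction boundaries without re-scanning the variable-length stream. The function name, line size and format are illustrative, not the A64's actual scheme.

```python
# Hypothetical sketch of instruction-border predecode bits: one bit per byte
# of a cache line marks whether an x86 instruction begins at that byte.

def mark_borders(line_size, insn_lengths, first_start=0):
    """Given the lengths of the instructions beginning in a cache line,
    return one start-marker bit per byte of the line."""
    bits = [0] * line_size
    pos = first_start
    for length in insn_lengths:
        if pos >= line_size:
            break
        bits[pos] = 1      # an instruction begins at this byte
        pos += length      # variable length: next border is 'length' bytes on
    return bits

# Three instructions of lengths 1, 3 and 2 in an 8-byte chunk:
print(mark_borders(8, [1, 3, 2]))  # [1, 1, 0, 0, 1, 0, 0, 0]
```

The payoff is that these bits are computed once, on cache fill, instead of on every fetch of the same line.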
Then there's the longer pipeline, with 2-3 stages for picking and lane selection.
I'll grant you an extra pipe stage for picking instructions after scan.
The instruction grouping/lanes were a consequence of the ROB structure used. By grouping instructions in threes, AMD used less hardware to track them, ending up with a more compact (and thus faster) ROB that had larger capacity than Intel's counterpart. POWER4/5 and the PPC970 derivative also had instruction groups, and much more restrictive ones at that.
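A minimal sketch of the bookkeeping trade-off, assuming (as described above) instructions tracked in groups of three: the ROB holds one entry per group rather than per instruction, so allocation and retirement logic touch a third as many entries. All class and field names here are illustrative.

```python
# Sketch of group-based ROB tracking: one entry per group of three
# instructions, retired together when all members have completed.

GROUP_SIZE = 3

def form_groups(insns):
    """Pack a decoded instruction stream into groups of up to three."""
    return [insns[i:i + GROUP_SIZE] for i in range(0, len(insns), GROUP_SIZE)]

class GroupROB:
    def __init__(self, entries):
        self.entries = entries        # capacity in groups, not instructions
        self.rob = []
    def allocate(self, group):
        if len(self.rob) >= self.entries:
            return False              # ROB full: stall the front end
        self.rob.append({"insns": group, "done": [False] * len(group)})
        return True
    def retire(self):
        """Retire the oldest group only when every instruction in it is done."""
        if self.rob and all(self.rob[0]["done"]):
            return self.rob.pop(0)
        return None

# 24 instructions occupy 8 group entries instead of 24 individual ones.
rob = GroupROB(entries=8)
for g in form_groups([f"op{i}" for i in range(24)]):
    rob.allocate(g)
```

The cost, as with POWER's groups, is some lost flexibility: a group retires as a unit, so one slow instruction holds up its neighbors.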
Pipeline length is a function of the target operating frequency and power consumption design point. Athlon 64 had a 12-stage pipeline, POWER4/5 had 14 (and twice the schedule-to-execute latency to boot!). Cortex A15 has a 15-stage pipeline, as does AMD's Bobcat.
Then there's the internal cracking into micro-ops, which requires full tracking for each op and then folding back into the ROB.
Most common instructions map one-to-one to internal ops. Some map to multiple ops but can still be decoded in a single cycle; a few require a microcode fallback. Microcode execution is rare because it is slow, and it is slow in part because little hardware is devoted to it.
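The three decode tiers above can be sketched as a lookup with a fallback. The table contents and the fast-path limit are purely illustrative, not any real CPU's decode rules.

```python
# Illustrative decode table: most instructions map to one internal op, a few
# crack into two in the fast decoders, and the rest fall back to microcode.

FAST_DECODE = {          # instruction -> number of internal ops
    "add r, r": 1,
    "mov r, m": 1,
    "add r, m": 2,       # load + add, still decoded in a single cycle
    "push r":   2,       # store + stack-pointer update
}
MICROCODE_LIMIT = 3      # assumed fast-path limit for this sketch

def decode(insn):
    uops = FAST_DECODE.get(insn)
    if uops is None or uops > MICROCODE_LIMIT:
        return "microcode"           # rare, slow fallback path
    return uops

print(decode("add r, m"))   # 2
print(decode("rep movsb"))  # microcode
```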
Micro-op fusion is not specific to x86; RISCs would benefit too (e.g. POWER's compute-predicate + branch). Packing multiple ops into a single ROB entry increases the virtual size of the ROB and improves execution efficiency (and power).
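As a toy illustration of the fusion idea, here is a sketch that packs a compare and the dependent conditional branch that follows it into a single tracking entry, so the same instruction stream occupies fewer ROB slots. The op names are generic placeholders.

```python
# Sketch of op fusion: pair each 'cmp' with an immediately following 'jcc'
# into one ROB entry, shrinking the number of entries the stream consumes.

def fuse(ops):
    fused, i = [], 0
    while i < len(ops):
        if ops[i] == "cmp" and i + 1 < len(ops) and ops[i + 1] == "jcc":
            fused.append(("cmp", "jcc"))   # one entry tracks both ops
            i += 2
        else:
            fused.append((ops[i],))
            i += 1
    return fused

stream = ["load", "cmp", "jcc", "add", "cmp", "jcc"]
print(len(fuse(stream)))  # 4 entries instead of 6
```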
There's the need to track flags throughout the engine, and stuff like the support of the x87 FP pipeline.
Flags are renamed by every instruction that modifies them, in the register rename stage. They add 6 bits to the result buses throughout the chip (newer CPUs split them up to avoid false dependencies on unused flags). x87 is indeed a headache, and an abomination that won't die.
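A minimal sketch of the renaming scheme just described: every flag-writing instruction gets a fresh tag, and splitting the flags into groups (here carry versus the rest, as an assumed example) means a reader depends only on the last writer of the bits it actually uses. All names are illustrative.

```python
# Sketch of flag renaming with split flag groups to avoid false dependencies.

class FlagRenamer:
    def __init__(self):
        self.next_tag = 0
        self.current = {"C": None, "OSZAP": None}   # assumed two-way split
    def rename_writer(self, groups_written):
        """Allocate a new tag for each flag group this instruction writes."""
        tag = self.next_tag
        self.next_tag += 1
        for g in groups_written:
            self.current[g] = tag
        return tag
    def source_tag(self, group):
        """A flag reader depends only on the last writer of its group."""
        return self.current[group]

r = FlagRenamer()
r.rename_writer(["C", "OSZAP"])  # e.g. ADD writes all flags -> tag 0
r.rename_writer(["OSZAP"])       # e.g. INC writes all but carry -> tag 1
print(r.source_tag("C"))         # 0: a later ADC doesn't wait on the INC
print(r.source_tag("OSZAP"))     # 1
```

Without the split, the ADC's carry input would falsely serialize behind the INC even though INC never touches carry.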
There has been a lot of impressive work over the decades to make x86 look like it doesn't have disadvantages.
Agree entirely.
Many techniques were developed to work around shortcomings in the ISA. SRAM-based ROBs were pioneered in the PPro to support a large ROB (this design is still amazing to me: the register file has only three ports, and almost all register values live in the ROB). Unaligned memory access (supported since forever) seems like a small deal, but isn't. Large store-to-load forwarding queues compensate for the frequent register spilling, effectively extending the size of the register file. Speculative loads, again introduced to work around all the false RAW hazards caused by register spills, incidentally gave a massive jump to wide superscalar implementations.
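The store-to-load forwarding point can be sketched as follows: a spilled register is stored and then quickly reloaded, and the reload is satisfied straight out of the store buffer instead of waiting for the cache. This is simplified to whole-word, exact-address matches; all names are illustrative.

```python
# Sketch of a store-to-load forwarding queue.

class StoreBuffer:
    def __init__(self):
        self.pending = []                  # in-order, not yet written to the D$
    def store(self, addr, value):
        self.pending.append((addr, value))
    def load(self, addr, memory):
        # Search youngest-first: the most recent matching store wins.
        for a, v in reversed(self.pending):
            if a == addr:
                return v                   # forwarded; the cache is never touched
        return memory.get(addr, 0)         # no match: go to the cache

mem = {0x100: 7}
sb = StoreBuffer()
sb.store(0x100, 42)         # spill a register to the stack
print(sb.load(0x100, mem))  # 42, forwarded from the store buffer
print(sb.load(0x104, mem))  # 0, from memory
```

In effect the buffer behaves like extra rename storage for values that only briefly left the register file.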
Today you wouldn't think of building a high-end CPU design without these features (well, Intel is the only one with truly speculative loads; the new IBM POWER8 might feature them).
The success of x86 is thanks to Intel and AMD engineers overcoming the challenges posed by the ISA, and to pure luck, because the ISA is actually fairly efficient.
The two-operand instruction scheme was a performance bottleneck until OOO execution became a reality; then it became a boon, because you save the encoding bits for one operand per instruction on average. The addressing modes are *very* useful, and simple compared to the VAX and M68K (in particular the M68020 and onwards). The instruction format doesn't have the long sequential dependency chains found in the M68020. I'm not saying the prefix system is elegant, but it is easier to make fast than other CISCy schemes.
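A small sketch of why renaming defuses the destructive two-operand form: `add eax, ebx` overwrites the architectural eax, but the old value survives in its previous physical register for any earlier in-flight reader, so no extra copy instruction is needed. The renamer below is a toy with illustrative names.

```python
# Sketch of register renaming making a destructive destination harmless.

class Renamer:
    def __init__(self):
        self.free = list(range(100))   # free physical registers
        self.map = {}                  # architectural -> physical
    def read(self, arch):
        return self.map[arch]
    def write(self, arch):
        p = self.free.pop(0)           # destination gets a fresh register
        self.map[arch] = p
        return p

rn = Renamer()
rn.write("eax"); rn.write("ebx")          # initial mappings: p0, p1
src = (rn.read("eax"), rn.read("ebx"))    # 'add eax, ebx' reads p0 and p1 ...
dst = rn.write("eax")                     # ... and writes a fresh p2
print(src, dst)  # (0, 1) 2: the old eax (p0) stays live for earlier readers
```

Before OOO, the same `add` genuinely destroyed a source value, which is why the two-operand form hurt back then.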
Cheers