ShootMyMonkey said: There was a post from an AMD architect almost a year ago on comp.arch that posed a theoretical CMP chip made up of 486-like cores (not including cache) in a die about the same size as an Opteron core. This was back around the time that Sun's Niagara was brand new, so that was the hot topic.
I'm not so sure about the empirical figure of 0.5 IPC for a 486, but other than that, everything seems to make perfect sense.
Mitch Alsup said: Let us postulate a fair comparison, and since I happen to be composing this, let's use data I am familiar with. Disclaimer: all data herein is illustrative.
The core size of an Athlon or Opteron is about 12 times the size of
the data cache (or instruction cache) of Athlon or Opteron. I happen
to know that one can build a 486-like processor* in less area
than the data cache of Athlon, and that this 486-like core could run
between 75% and 85% of the frequency of Opteron.
[*] 7 stage pipeline, 1-wide, in-order, x86 to SSE3 instruction set.
Let us pretend Opteron is a 1.0 IPC machine, and that the 486-like processor is a 0.5 IPC machine. (At this point you see that we have spent the last 15 years in microprocessor development getting that last factor of 2 and it has cost us around 12X in silicon real estate...)
            CPUs   IPC/CPU   Frequency   IPC*Freq   IPC*Freq*CPU
Opteron        1       1.0     2.4 GHz        2.4            2.4
486-like      12       0.5     2.0 GHz        1.0           12.0
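
The last column of the table is just CPUs times IPC times frequency. A minimal sketch of that arithmetic follows; the dictionary layout and the `aggregate_throughput` name are illustrative assumptions, and the numbers are the illustrative figures from the table, not measurements.

```python
# Illustrative figures copied from the table above (not measured data).
# The structure and names here are hypothetical, chosen for this sketch.
configs = {
    "Opteron":  {"cpus": 1,  "ipc": 1.0, "ghz": 2.4},
    "486-like": {"cpus": 12, "ipc": 0.5, "ghz": 2.0},
}


def aggregate_throughput(cfg):
    """Billions of instructions per second over all cores: CPUs * IPC * GHz."""
    return cfg["cpus"] * cfg["ipc"] * cfg["ghz"]


throughput = {name: aggregate_throughput(cfg) for name, cfg in configs.items()}

# Ratio of the 12-core array to the single big core in the same area.
advantage = throughput["486-like"] / throughput["Opteron"]

for name, value in throughput.items():
    print(f"{name}: {value:.1f}")
print(f"advantage: {advantage:.1f}x")
```

This assumes every core stays busy, which is the thread-level-parallelism caveat the post raises at the end.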
If you really want to get into the game of large thread count MPs;
smaller, slower, less complicated in-order blocking cores deliver
more performance per area and more performance per Watt than any
of the current SMT/CMP hype.
Let's look at why:
                        Best Case   Typical Case   Worst Case
Uniprocessor
  DRAM access time*:    42 ns       58 ns          120+ ns
  CPU cycles @ 2.0 GHz  84          116            240
MultiProcessor
  DRAM access time*:    103 ns**    103 ns         500 ns
  CPU cycles @ 2.0 GHz  206         206            1000
[*] as seen in the execution pipeline
[**] best case is coherence bound not memory access time bound.
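
The cycle counts in the table follow directly from the access times: a 2.0 GHz clock ticks twice per nanosecond. A small sketch of the conversion, with a hypothetical helper name and the table's illustrative latencies:

```python
def latency_in_cycles(latency_ns, clock_ghz=2.0):
    """A clock of f GHz ticks f times per nanosecond, so cycles = ns * GHz."""
    return latency_ns * clock_ghz


# Illustrative DRAM access times from the table above, in nanoseconds.
# The worst-case uniprocessor entry is "120+" in the table; 120 is its floor.
uniprocessor_ns = {"best": 42, "typical": 58, "worst": 120}
multiprocessor_ns = {"best": 103, "typical": 103, "worst": 500}

for label, table in (("uniprocessor", uniprocessor_ns),
                     ("multiprocessor", multiprocessor_ns)):
    for case, ns in table.items():
        print(f"{label:14s} {case:8s} {ns:4d} ns = {latency_in_cycles(ns):.0f} cycles")
```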
One needs a very large L2 cache to usefully ameliorate these kinds
of main memory latencies: the miss rate has to come down to something
on the order of a fraction of 1%.
L2 Cache miss rates on commercial workloads: 64 GBytes of main memory, 1 TByte commercial data base, thousands of disks in multiple RAID channels, current Data Base software....
L2 Size   Miss Rate   L2 miss CPI cost
1 MB      5%+         10.3
2 MB      4%-ish       8.2
4 MB      3%-ish       6.2
8 MB      2%-ish       4.1
So the fancy OoO core goes limping along at 0.2 IPC while the itty bitty
486-like core goes limping along at 0.17 IPC. And you get 12 of them!
So the measly 5X advantage above becomes a 10X advantage in the face of bad cache behavior.
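
The CPI costs in the miss-rate table match miss rate times the 206-cycle multiprocessor latency, and the per-core figures quoted above are roughly what you get by adding that cost to each core's base CPI. The sketch below makes that arithmetic explicit. It rests on assumptions the post does not state: that the miss rate is per instruction, that both cores pay the same 206-cycle penalty, and that the 8 MB (2%-ish) row is the one behind the quoted figures. The function names are hypothetical.

```python
# Assumption: typical multiprocessor DRAM latency at 2.0 GHz, from the table.
MISS_PENALTY_CYCLES = 206


def miss_cpi_cost(miss_rate, penalty=MISS_PENALTY_CYCLES):
    """Extra cycles per instruction spent waiting on L2 misses.

    Assumes miss_rate is misses per instruction (an inference, not stated).
    """
    return miss_rate * penalty


def effective_ipc(base_ipc, miss_rate):
    """IPC once the miss cost is added to the core's base CPI (1 / base_ipc)."""
    return 1.0 / (1.0 / base_ipc + miss_cpi_cost(miss_rate))


# Reproduce the CPI-cost column for the 5%, 4%, 3% and 2% rows.
for rate in (0.05, 0.04, 0.03, 0.02):
    print(f"miss rate {rate:.0%}: CPI cost {miss_cpi_cost(rate):.1f}")

# Assumed 2% miss rate (the 8 MB row) for both cores.
ooo_ipc = effective_ipc(1.0, 0.02)      # big out-of-order core
small_ipc = effective_ipc(0.5, 0.02)    # one 486-like core

# Per-clock advantage of 12 small cores over one big core.
per_clock_advantage = 12 * small_ipc / ooo_ipc
print(f"OoO: {ooo_ipc:.2f} IPC, 486-like: {small_ipc:.2f} IPC each")
print(f"12 small cores vs 1 big core: {per_clock_advantage:.1f}x")
```

Under these assumptions the small core comes out slightly below the 0.17 quoted in the post, which is consistent with the figures being rounded illustrations. The ratio ignores the 2.0 GHz versus 2.4 GHz clock difference, as the post's 10X figure appears to.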
Now if I were to postulate sharing the FP/MMX/SSE units between two
486-like cores, I can get 15 of them in the same footprint as the
Opteron core.
I can also postulate what the modern instruction set additions have done to
processor area: Leave out MMX/SSE and the 486-like size drops to 1/18 of an Opteron core.
The problem at this instant in time is that very few benchmarks have
enough thread level parallelism to enable a company such as Intel or
AMD to embark on such a (radical) path.
Mitch
#include <std.disclaimer>
I think with stuff like this you need the right balance for TLP, CMP and ILP and an efficient, usable programming model...