With all this talk of multi-core...

ShootMyMonkey said:
There was a post from an AMD architect almost a year ago on comp.arch that posed a theoretical CMP chip made up of 486-like cores (not including cache) in a die about the same size as an Opteron core. This was back around the time that Sun's Niagara was brand new, so that was the hot topic.
Mitch Alsup said:
Let us postulate a fair comparison, and since I happen to be composing this, let's use data I am familiar with. Disclaimer: all data herein is illustrative.

The core size of an Athlon or Opteron is about 12 times the size of
the data cache (or instruction cache) of Athlon or Opteron. I happen
to know that one can build a 486-like processor* in less area than
the data cache of Athlon, and that this 486-like core could run
between 75% and 85% of the frequency of Opteron.

[*] 7 stage pipeline, 1-wide, in-order, x86 to SSE3 instruction set.

Let us pretend Opteron is a 1.0 IPC machine, and that the 486-like processor is a 0.5 IPC machine. (At this point you see that we have spent the last 15 years in microprocessor development getting that last factor of 2, and it has cost us around 12X in silicon real estate...)

             CPUs   IPC/CPU   Frequency   IPC*Freq   IPC*Freq*CPU
Opteron        1      1.0      2.4 GHz      2.4          2.4
486-like      12      0.5      2.0 GHz      1.0         12.0
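
The same arithmetic, as a minimal Python sketch using only the illustrative figures above:

  # Aggregate throughput = cores x IPC per core x clock (instructions per ns).
  def aggregate(cores, ipc_per_core, freq_ghz):
      return cores * ipc_per_core * freq_ghz

  opteron   = aggregate(cores=1,  ipc_per_core=1.0, freq_ghz=2.4)   # 2.4
  small_486 = aggregate(cores=12, ipc_per_core=0.5, freq_ghz=2.0)   # 12.0
  print(small_486 / opteron)                                        # 5.0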

If you really want to get into the game of large thread count MPs,
smaller, slower, less complicated in-order blocking cores deliver
more performance per area and more performance per Watt than any
of the current SMT/CMP hype.

Let's look at why:

Uniprocessor            Best Case   Typical Case   Worst Case
DRAM access time*:        42 ns        58 ns        120+ ns
CPU cycles @ 2.0 GHz        84          116           240

Multiprocessor
DRAM access time*:       103 ns**     103 ns         500 ns
CPU cycles @ 2.0 GHz       206          206          1000

[*] as seen in the execution pipeline
[**] best case is coherence bound not memory access time bound.
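
The cycle rows are just the DRAM latency multiplied by the 2.0 GHz clock; a quick sketch:

  # Convert DRAM latency (ns) to CPU cycles at a 2.0 GHz clock.
  def to_cycles(latency_ns, freq_ghz=2.0):
      return round(latency_ns * freq_ghz)

  print([to_cycles(ns) for ns in (42, 58, 120)])    # uniprocessor:   [84, 116, 240]
  print([to_cycles(ns) for ns in (103, 103, 500)])  # multiprocessor: [206, 206, 1000]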

One needs a very large L2 cache to usefully ameliorate these kinds
of main memory latencies: something that gets the miss rate down to a fraction of 1%.
L2 cache miss rates on commercial workloads: 64 GBytes of main memory, 1 TByte commercial database, thousands of disks in multiple RAID channels, current database software...

L2 size   Miss Rate   L2 miss CPI cost
1 MB        5%+            10.3
2 MB        4%-ish          8.2
4 MB        3%-ish          6.2
8 MB        2%-ish          4.1
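
That CPI-cost column lines up with miss rate times the ~206-cycle multiprocessor latency from the table above; a minimal sketch of that assumption:

  # CPI added by L2 misses ~= miss rate x miss penalty (cycles).
  MISS_PENALTY_CYCLES = 206   # 103 ns DRAM latency at 2.0 GHz, as seen by the pipeline

  for l2_mb, miss_rate in [(1, 0.05), (2, 0.04), (4, 0.03), (8, 0.02)]:
      print(f"{l2_mb} MB L2: +{miss_rate * MISS_PENALTY_CYCLES:.1f} CPI")
  # -> 10.3, 8.2, 6.2, 4.1, matching the table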

So the fancy OoO core goes limping along at 0.2 IPC while the itty bitty
486-like core goes limping along at 0.17 IPC. And you get 12 of them!
So the measly 5X advantage above becomes a 10X advantage in the face of bad cache behavior.
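
Those numbers follow from folding the stall cost into each core's base CPI; a sketch assuming the 8 MB L2 row (+4.1 CPI) and comparing raw per-core IPC, ignoring the modest clock difference:

  # Effective IPC once memory-stall CPI is added to the base CPI.
  def effective_ipc(base_ipc, stall_cpi):
      return 1.0 / (1.0 / base_ipc + stall_cpi)

  ooo_ipc   = effective_ipc(1.0, 4.1)    # ~0.20  (the big OoO core)
  small_ipc = effective_ipc(0.5, 4.1)    # ~0.16  (the 486-like core)

  print(12 * small_ipc / ooo_ipc)        # ~10, up from the 5X above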

Now if I were to postulate sharing the FP/MMX/SSE units between two
486-like cores, I can get 15 of them in the same footprint as the
Opteron core.

I can also postulate what the modern instruction set additions have done to
processor area: leave out MMX/SSE and the 486-like core drops to 1/18 the size of an Opteron core.

The problem at this instant in time is that very few benchmarks have
enough thread level parallelism to enable a company such as Intel or
AMD to embark on such a (radical) path.

Mitch
#include <std.disclaimer>
I'm not so sure about the empirical figure of 0.5 IPC for a 486, but other than that, everything seems to make perfect sense.

I think with stuff like this you need the right balance for TLP, CMP and ILP and an efficient, usable programming model...
 
I think with stuff like this you need the right balance for TLP, CMP and ILP and an efficient, usable programming model...
No doubt. That's partly why Niagara made perfect sense for the server world where tons of requests will come in at any given time. There's plenty more TLP than you have cores to handle it. But yeah, they also dealt with the ILP problem by relying on 4-way SMT per core and just filling in more TLP. Bandwidth-hungry sucker it is, but that's fine for where it's going.

I almost find it funny, though, that with all the talks that Sun gave about "throughput computing," they almost seemed to be referring to Niagara being used in a thin client + mainframe type of setting. Things coming full circle, I guess.
 
Jaws said:
Well, Xenos is not the X360 CPU but it's the GPU! ;)

Well.. :oops: I should have caught that - thanks for the correction..

Megadrive1988 said:
Well, the next step, IMO, is to have a unified *processor* architecture that can do all the things that a CPU and a GPU would do. Of course, on the back end of the rendering pipeline, and in other places, you still have some specialised cores / functional units, but only where it makes sense. Much of the rest of the transistor budget goes into unified computing cores and caches/eDRAM (hopefully SRAM or a new type of ultra-low-latency memory). What I am describing sounds a lot like Intel's Platform 2015, and is the next step beyond the current CELL architecture.

This is exactly what I'm thinking - reconfigurable cores to allow for different types of processing, doing away with all this ultra-specific GPU hardware. While I do believe this is the route we'll see in the future, I don't expect it to come easily, especially when you factor in economics. With a system of many cores coming from a specific semiconductor company, where would that leave the likes of ATI, Nvidia, and others, who thrive on people upgrading their graphics cards on a regular basis? Will their market get absorbed into AMD's and Intel's area of expertise, or will they have to work together to get a unified system? There would be no need for them in the sense we think of GPUs today. I believe this, NOT the technology, is what will hold back the adoption of this new paradigm.

Anyway, the speculation about many 486-class cores on one die is interesting. This is a step in the direction I was originally thinking, ultimately.
 
I think nVidia and Sony have opinions on this matter. nVidia's really talking up their 'synergy' with Sony... Cell was to be a one-processor solution... in the next iteration or two I guess we might well see a Cell with nVidia graphics content alongside generic processing, providing a fully scalable architecture that'll turn its hand to whatever you want.
 