CPUs...why not more execution units over cores?

scificube

Regular

I understand why TLP is of value and is the way of the future.

I have a few questions though...

With the ever-increasing need for more computational power, why wouldn't chip makers simply leverage ILP more and add more execution units instead of adding more cores?

The way I see it...it's hard to find 16 ways to split up a task, and it's probably much easier to find 16 instructions in a task to execute, so why not leverage this?

Modern CPUs already have OoOe and instruction windows so why not toss in a few more SIMD units before you run off to add another core in order to increase performance?

TLP could still be leveraged while this is being done, but in a more practical way. There are tasks that can logically be split up into independent tasks working towards a final result. This is where TLP should be leveraged, and the amount of TLP can reflect the demands of such tasks. Say the tasks we know of can logically be split into 4-6 independent tasks on average...then that's how many cores a CPU should have. Algorithmic analysis and brainstorming could find a reasonable balance here. Since the tasks meant to leverage multiple cores are embarrassingly parallel, a lot of pain and suffering is avoided...a lot of the "paradigm shift" would then not need to happen.

For other tasks where more computational power is still needed, but the task itself really isn't embarrassingly parallelizable, the solution to me would seem to be simply having a larger instruction window on your cores along with more execution units, and leveraging ILP via OoOe. This saves the headache of trying to parallelize rigid serial code, and one could probably get better performance if for no other reason than that the overhead "glue" needed to run naturally serial tasks in a parallel fashion is removed.

A CPU of this breed would seem the best overall solution, so I'm a little lost as to why I've not heard of such a thing on anyone's drawing board. Perhaps I've not looked well enough, but I don't see what I described on anyone's roadmap. I'm talking about a CPU that is not quite so massively parallel at the thread level and much more parallel at the instruction level. My idea is pretty simple. Get more power by increasing the number of execution units. Make the chips that much faster by adding more cores for concurrent processing on these souped-up cores. The number of cores you need should meet the demand for, or usefulness of, parallelism.

My guesses are:

The CPU would be too big.
I'm wrong to think more ILP can be extracted...or I'm wrong to think it could be extracted in this manner.
A balance between TLP and ILP is too hard to find...or TLP is simply unavoidable due to the above or reasons I lack the vision to see.

The CPU would be too big...

I gather it could get quite large, but this is still relative to me. In comparison to a chip with more cores and fewer execution units per core, would it be all that much bigger, if at all, in the future?

I would think SIMD units etc. are much smaller than entire cores, and that more of them, along with more transistors dedicated to a larger instruction window, might make for a core that is larger than what we see today but a smaller multi-core chip overall.

When I hear comments that suggest 40+ cores in a CPU of the future, it makes me shiver to think how difficult it would be to take advantage of such a part. It would seem to me the struggle would be to find tasks that could map to that many cores, or to find ways to map tasks that really can't take advantage of such TLP, because there is no other choice if you wish to have more performance or not waste any potential you can find. I see this as what would spur on the great "paradigm shift"...I see a lot of pain and suffering that could be avoided with a much smaller paradigm shift, taking advantage of parallelism when it is obvious and at least somewhat easy to do.

I guess I can stop blathering on now and ask whether my thoughts are psycho or sane. (I do want an answer...)

Well I guess I should tie this into "console" talk in some way...

The CBEA seems to be going the multi-core route full force, whereas Xenon isn't really going the more-SIMD-units route but rather the "more capable" SIMD route desktop CPUs have been following with iterations of SSE, 3DNow!, AltiVec/VMX, etc.

They both seem to welcome the move to parallelism, and at the level they try to leverage it I don't think it's beyond programmers to take advantage of it.

However, some questions I would like to pose to some of the resident developers here (of course anyone is free to comment):

Even without OoOe, would you prefer CPUs with all the SIMD units in a single core over these multi-core designs? (assuming this single core can issue 1 instruction per cycle per execution unit)

I assume there is at least some value in concurrent processing due to parallelism, so where would you draw the balance? (thinking about physics, graphics, procedural stuff, etc.) (balance--how many cores...how many execution units per core)

If I were developing for the PS4 or Xbox720, I think I would prefer a multi-core chip where the cores are multi-threaded, have OoOe, and have more execution units per core, over a chip that is uber-massively parallel, still providing more execution units than the former, but, being so parallel, perhaps lacking OoOe or something else.

Which way would you guys like to see it go?
 
What you seem to be suggesting is a full data-flow approach (see this); so far only prototypes have been built (Monsoon - MIT, EM4 - Japan, and some machine at Manchester Uni), but these indicate problems with the idea. To summarize, these are: difficult I/O, complex scheduling policies, poor use of local stores such as registers, and potentially issues with (would you believe it) too much concurrency.

The main issue with trying to extract ILP from sequential code is that of resources. With only a finite number of 'quick' registers on the chip and slow main memory, you have to provide enough registers for each instruction to be able to execute without regard for the others executing. Obviously you will have hazards which you must deal with (this can be done in your re-ordering process and ALU selection), but programs do not really contain that much parallelism when written without regard for it (FP programs also tend to contain far less of it than integer programs). Widening the execution window also has consequences for branches and jumps, which again makes it a lot trickier to divide instructions up so you can operate on many concurrently.
 
Thanks! I'll give that a read. (gonna take a while :))

I knew there had to be something going on, I just couldn't figure out what, and I hadn't seen an example yet that outlined the pitfalls.
 
Extracting ILP out of code that has good natural parallelism in the first place is quite expensive in terms of hardware unless you go for VLIW; doubling the number of execution units will make your processor core about 4x larger (and correspondingly hotter) due to all the register file ports and the dependency checking/bypassing circuitry. For code that doesn't have good parallelism, but instead relies on long sequences of dependent operations, throwing tons of execution units at it won't help performance at all. Branching also tends to throw a spanner in the works (it is estimated that average x86 code has about 1 branch for every 6 instructions).
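
To make the dependent-operations point concrete, here is a contrived toy sketch of my own (not from any real codebase): a reduction where every addition needs the previous result, so extra execution units simply sit idle:
Code:
/* one long dependency chain: iteration i cannot start until iteration
   i-1 has finished (and the compiler may not reassociate the FP adds,
   since FP addition isn't associative) */
float chain_sum( const float *x, int n )
  {
  float acc = 0.0f;
  for(int i = 0; i<n; i++)
    acc += x[i];
  return acc;
  }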

VLIW in turn has its own set of problems: it forces a strict in-order execution model, the design of the register file often ends up cutting down clock speed (so that doubling the number of ALUs halves the clock speed), etc.

Large instruction windows don't really help all that much; they tend to burn a lot of power and don't seem to help performance very much beyond a certain level; Pentium4 has a 126-instruction window, while Athlon64 has just 72, and we all know how Pentium4 and Athlon64 compare in terms of ILP.

Even though both P4 and A64 can theoretically execute 3 instructions per clock, in practice it is rare for either architecture to exceed ~1.2 instructions/clock sustained for anything other than carefully hand-crafted code. I seem to remember reading once that AMD engineers estimated the performance benefit of adding a 4th ALU to the Athlon core to be on the order of ~1%.
 
Edge said:
A good answer would be that compiler technology can't keep up!
I would actually argue that it is more a matter of available programming languages rather than compilers per se; practically all languages I have seen, except perhaps Fortran, are specified in a way that is quite hostile to high-parallelism code, making it hard for programmer and compiler alike to extract high ILP, even when the underlying processor could support it.
 
The CPU would be too big.
I'm wrong to think more ILP can be extracted...or I'm wrong to think it could be extracted in this manner.
I remember some key engineers at AMD and Intel both saying things along the lines that bringing IPC up by a factor of 2 from the earliest pipelined x86 CPUs (namely the 80486) cost something like 8-12x the transistors.

Also, from a pure code standpoint, it's not very often that you'll find clumps of instructions as large as 16 or even 8 that are totally independent. It is true that compilers have something to do with this, but part of it is just the extent to which software design involves breaking things down into small tasks. On the other hand, it is very possible to get that with more threads running. And in theory, you can make a CPU with more execution resources and give it several-way SMT to use otherwise-unused pipes. At that point, what's the big difference between this monolithic CPU and a multicore design with shared cache (at least from a software point of view)? From a hardware point of view, the multicore design will probably have better yields overall since you can still sell it with only 14 or 15 working cores instead of 16 -- since the cores are separated more discretely in a multi-core design, this is easier to do, whereas a single-core design will probably have a lot of intertwined interconnects and more mixed-up layouts.
 
scificube said:
There's been a lot of great insight in this thread. Just want to say thanks...especially for not laughing.

About 16-8 window sizes - Anaconda was designed to deal with this via multiple micro-threads (which are similar to what executes on an SPU). I suggest reading up on this architecture if you wish to learn more.
 
arjan de lumens said:
I would actually argue that it is more a matter of available programming languages rather than compilers per se; practically all languages I have seen, except perhaps Fortran, are specified in a way that is quite hostile to high-parallelism code, making it hard for programmer and compiler alike to extract high ILP, even when the underlying processor could support it.

I'm a Programming Languages student; I just want to throw my two cents in and say I concur. The languages most people use really suck for extracting parallelism. Part of the problem is that imperative programming is very difficult to reason about, whereas functional or declarative code can be reasoned about much more effectively and, believe it or not, lends itself to optimization more readily. It would also help to have languages with data parallelism and concurrency baked into the semantics of the language. Not many languages do that.

My 1.5 cents.
 
scificube said:
The way I see it...it's hard to find 16 ways to split up a task, and it's probably much easier to find 16 instructions in a task to execute, so why not leverage this?
Actually, I would think it is quite the opposite. Trying to extract parallelism out of the centre of an inner loop, for example, may be extremely difficult.
 
Simon F said:
Actually, I would think it is quite the opposite. Trying to extract parallelism out of the centre of an inner loop, for example, may be extremely difficult.

actually, inner loops are relatively easily parallelizable through unrolling/unfolding*.


the first rule of the in-order, super-scalar coder**:

thou shalt unroll/unfold thy innermost loops (and spice them up with prefetches) to improve thy superscalarity.


* a term i use to distinguish between full unrolling and partial unrolling - i.e. making a loop of N into a loop of N/2 would be unfolding.

** last seen on the desktop about the time of p5/mmx.
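
a minimal sketch of the rule in action (names and numbers are mine, purely illustrative) - a loop of N unfolded into a loop of N/4, spiced up with a prefetch:
Code:
/* 4x unfolded; assumes n is a multiple of 4 */
void scale4( float *dst, const float *src, float k, int n )
  {
  for(int i = 0; i<n; i += 4)
    {
    __builtin_prefetch(&src[i + 16]);  /* gcc hint: fetch a few iterations ahead */
    dst[i]   = src[i]   * k;  /* these four multiplies are independent, */
    dst[i+1] = src[i+1] * k;  /* so an in-order superscalar core can    */
    dst[i+2] = src[i+2] * k;  /* issue them back to back instead of     */
    dst[i+3] = src[i+3] * k;  /* paying loop overhead per element       */
    }
  }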
 
So the x86 instruction set (and current programming languages?) would need to be pretty much thrown away to effectively add more execution units?

IIRC, Intel's next-gen processor is set to have 4 execution units, and AMD's roadmap lists plans to follow that, but with no real time frame for when it will happen, just sometime before K9/K10 (I forget which one is supposed to be AMD's all-new core, but around the time they start doing quad-core they have plans to revamp the individual cores).
 
arjan de lumens said:
Extracting ILP out of code that has good natural parallelism in the first place is quite expensive in terms of hardware unless you go for VLIW;

This is a bit incorrect. While VLIW may seem easier, it often isn't, and it just results in lots of additional hardware that sits around unused in most cases.

doubling the number of execution units will make your processor core about 4x larger (and correspondingly hotter) due to all the register file ports and the dependency checking/bypassing circuitry.

These register file issues affect all ISAs, VLIW or not. In addition, VLIW hardware has the same issue with the bypass circuitry. The one potential advantage for VLIW is dependency checking, but this only holds if you enforce strict grouping throughout the pipeline, and you still have all the memory-op dependency checking issues.

Large instruction windows don't really help all that much; they tend to burn a lot of power and don't seem to help performance very much beyond a certain level; Pentium4 has a 126-instruction window, while Athlon64 has just 72, and we all know how Pentium4 and Athlon64 compare in terms of ILP.

If you are going to compare P4 and K8, the better point of comparison is probably around the Northwood timeframe; Prescott is pretty broken. Instruction windows aren't really an issue of ILP but of execution width x pipeline depth. K8 has a smaller execution window because it has a shorter pipeline.

Aaron Spink
speaking for myself inc.
 
Kryton said:
About 16-8 window sizes - Anaconda was designed to deal with this via multiple micro-threads (which are similar to what executes on an SPU). I suggest reading up on this architecture if you wish to learn more.

Or look up the multiscalar research out of Wisconsin. There is a pretty good dissertation by Scott Breach that has plenty of data, but it might be a bit unwieldy at ~600 pages.

Aaron Spink
speaking for myself inc.
 
aaronspink said:
Or look up the multiscalar research out of Wisconsin. There is a pretty good dissertation by Scott Breach that has plenty of data, but it might be a bit unwieldy at ~600 pages.

Aaron Spink
speaking for myself inc.


600 pages! My god! I hope there's like 2 words on each page :)

As always thanks of course ;)
 
Fox5 said:
So the x86 instruction set (and current programming languages?) would need to be pretty much thrown away to effectively add more execution units?
The main problems with x86, compared to RISC/VLIW, are instruction decoding and register count. The instruction decoding can be solved with trace caches, and the small register count can largely be helped with register renaming (to break antidependencies that otherwise occur when you try to use a register for more than one purpose) and store-to-load forwarding (which reduces the relative cost of having to swap stuff out to the stack all the time). AFAICS, RISC and VLIW haven't really provided any big breakthroughs that x86 hasn't been able to follow up on; if you want to really leave x86 behind in the dust in terms of instruction-level parallelism, you will need something much more radical than just another PowerPC or Itanium.
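
To illustrate the antidependency point, here is a toy fragment of my own (purely illustrative):
Code:
/* reusing 't' creates a write-after-read (anti)dependency: with too few
   registers, the second 't = ...' must wait until the first half is done
   reading 't'; renaming gives it a fresh physical register, so the two
   halves can execute independently */
void two_sums( float a, float b, float c, float d, float *x, float *y )
  {
  float t;
  t = a + b;
  *x = t * 2.0f;
  t = c + d;
  *y = t * 2.0f;
  }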

As for programming languages, let's see if any of you programming gurus out there can spot why the following piece of C code cannot be parallelized/vectorized:
Code:
void vec_add( float *a, const float *b, const float *c )
  {
  for(int i = 0; i<100; i++)
    a[i] = b[i] + c[i];
  }
 
arjan de lumens said:
As for programming languages, let's see if any of you programming gurus out there can spot why the following piece of C code cannot be parallelized/vectorized:
Code:
void vec_add( float *a, const float *b, const float *c )
  {
  for(int i = 0; i<100; i++)
    a[i] = b[i] + c[i];
  }

http://msdn2.microsoft.com/en-us/library/5ft82fed.aspx

If you don't flag the params 'restrict', the compiler has to assume that the arrays could overlap (pointer aliasing). Hence auto-parallelizing the loop is unsafe.

(I thought we already went over this a few months ago. ;))
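
For reference, a sketch of the fixed-up version using C99's restrict qualifier (untested, but this is the idea):
Code:
/* 'restrict' promises the compiler that a, b and c never overlap,
   so it is free to vectorize the loop */
void vec_add( float * restrict a, const float * restrict b, const float * restrict c )
  {
  for(int i = 0; i<100; i++)
    a[i] = b[i] + c[i];
  }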
 