CPUs...why not more execution units over cores?
I understand why TLP is of value and is the way of the future.
I have a few questions though...
With the ever-increasing need for more computational power, why wouldn't chip makers simply leverage ILP more and add more execution units instead of adding more cores?
The way I see it, it's hard to find 16 ways to split up a task, but it's probably much easier to find 16 independent instructions in a task to execute, so why not leverage that?
Modern CPUs already have OoOe and instruction windows, so why not toss in a few more SIMD units before running off to add another core in order to increase performance?
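To make the ILP point concrete, here's a toy C++ sketch of the kind of independent work I mean. Nothing about it is tied to any real chip; the four accumulators are just an illustration of instruction-level parallelism that a wide out-of-order core could exploit without any threading at all:

```cpp
#include <cstddef>

// Toy example: a dot product written with four independent accumulator
// chains. Nothing here is threaded; an out-of-order core with enough
// FP execution units and a big enough instruction window can keep all
// four chains in flight at once -- ILP doing the work that would
// otherwise have to be exposed as threads.
float dot(const float* a, const float* b, std::size_t n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        // These four multiply-adds don't depend on each other,
        // only on their own accumulator, so they can issue in parallel.
        s0 += a[i + 0] * b[i + 0];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    for (; i < n; ++i)   // leftover elements
        s0 += a[i] * b[i];
    return (s0 + s1) + (s2 + s3);
}
```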
TLP could still be leveraged while this is being done, but in a more practical way. There are tasks that can logically be split up into independent sub-tasks working towards a final result. This is where TLP should be leveraged, and the amount of TLP can reflect the demands of such tasks. Say the tasks we know of can logically be split up into 4-6 independent tasks on average... then that's how many cores a CPU should have. Algorithmic analysis and brainstorming could find a reasonable balance here. Since the tasks meant to leverage multiple cores are embarrassingly parallel, a lot of pain and suffering is avoided... a lot of the "paradigm shift" would then not need to happen.
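For the embarrassingly parallel case I have in mind, the split really is as simple as it sounds. A rough modern-C++ sketch (the choice of four worker threads is only there to match the 4-6 cores I mentioned, not a recommendation):

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Toy example of the "obvious" TLP case: scaling a big array. Each
// thread gets its own independent slice, so there is no communication
// between them until the final join -- the sort of split 4-6 cores
// could be sized for.
void scale_all(float* data, std::size_t n, float k, unsigned num_cores = 4)
{
    std::vector<std::thread> workers;
    std::size_t chunk = n / num_cores;
    for (unsigned t = 0; t < num_cores; ++t) {
        std::size_t begin = t * chunk;
        std::size_t end   = (t == num_cores - 1) ? n : begin + chunk;
        workers.emplace_back([=] {
            for (std::size_t i = begin; i < end; ++i)
                data[i] *= k;
        });
    }
    for (auto& w : workers)   // the only "glue": wait for every slice
        w.join();
}
```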
For other tasks where more computational power is still needed, but the task itself really isn't embarrassingly parallelizable, it would seem to me the solution is simply to have a larger instruction window on your cores along with more execution units, and to leverage ILP via OoOe. This saves you the headache of trying to parallelize rigid serial code, and, the code being what it is, you could probably get better performance anyway, if for no other reason than that the overhead "glue" needed to run a naturally serial task in a parallel fashion is removed.
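And here's the sort of rigid serial code I mean: a simple exponential moving average where every iteration depends on the previous one. This is only a toy illustration, but threading it would mean restructuring the whole algorithm, while a wider core with a deeper window can still overlap the independent work from upcoming iterations:

```cpp
#include <cstddef>

// Toy example of a naturally serial task: an exponential moving
// average. y[i] depends on y[i-1], so the loop cannot simply be cut
// into independent slices for separate cores. An out-of-order core
// can still overlap the independent parts (the loads of x[i] and the
// a * x[i] multiplies from later iterations) while the dependent
// chain drains -- the kind of ILP I'd rather spend transistors on here.
void ema(const float* x, float* y, std::size_t n, float a)
{
    float prev = (n > 0) ? x[0] : 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        prev = a * x[i] + (1.0f - a) * prev;  // loop-carried dependence
        y[i] = prev;
    }
}
```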
A CPU of this breed would seem the best overall solution, so I'm a little lost as to why I've not heard of such a thing on anyone's drawing board. Perhaps I've not looked well enough, but I don't see what I've described on anyone's roadmap. I'm talking about a CPU that is not quite so massively parallel at the thread level and much more parallel at the instruction level. My idea is pretty simple: get more power by increasing the number of execution units, then make the chip that much faster by adding more cores for concurrent processing on these souped-up cores. The number of cores should meet the demand for, or usefulness of, parallelism.
My guesses are:
The CPU would be too big.
I'm wrong to think more ILP can be extracted...or I'm wrong to think it could be extracted in this manner.
A balance between TLP and ILP is too hard to find...or TLP is simply unavoidable due to the above or reasons I lack the vision to see.
The CPU would be too big...
I gather it could get quite large, but this is still relative to me. In comparison to a chip with more cores and fewer execution units per core, would it really be that much bigger, if at all, in the future?
I would think SIMD units etc. are much smaller than entire cores, and that more of them, along with more transistors dedicated to a larger instruction window, might make for a core that is larger than what we see today but a smaller multi-core chip overall.
When I hear comments that suggest 40+ cores in a CPU of the future, it makes me shiver to think how difficult it would be to take advantage of such a part. It would seem to me the struggle would be to find tasks that could map to that many cores, or to find ways to map tasks that really can't take advantage of that much TLP, because there is no other choice if you wish to have more performance or to not waste any potential you can find. I see this as what would spur on the great "paradigm shift"... I see a lot of pain and suffering that could be avoided with a much smaller paradigm shift, taking advantage of parallelism when it is obvious and at least somewhat easy to do.
I guess I can stop blathering on now and ask whether my thoughts are psycho or sane. (I do want an answer...)
Well I guess I should tie this into "console" talk in some way...
The CBEA seems to be going the multi-core route full force, whereas Xenon isn't really going the "more SIMD units" route but rather the "more capable SIMD" route that desktop CPUs have been taking with iterations of SSE, 3DNow!, AltiVec/VMX, etc.
They both seem to welcome the move to parallelism, and at the level they try to leverage it, I don't think it's beyond programmers to take advantage of it.
However, here are some questions I would like to pose to the resident developers here (of course anyone is free to comment):
Even without OoOe, would you prefer a CPU with all the SIMD units in a single core over these multi-core designs? (Assuming this single core can issue 1 instruction per cycle per execution unit.)
I assume there is at least some value in concurrent processing due to parallelism, so where would you draw the balance? (Thinking about physics, graphics, procedural stuff, etc. By balance I mean how many cores, and how many execution units per core.)
If I were developing for the PS4 or Xbox 720, I think I would prefer a multi-core chip where the cores are multi-threaded, have OoOe, and have more execution units per core, over a chip that is uber massively parallel and provides even more execution units in total but, being so parallel, may have to drop OoOe or something else.
Which way would you guys like to see it go?