Asymmetric multicore CPUs

Why not a bigger push towards asymmetric multicore? Symmetric made sense when engineering constraints dictated multi-chip instead of multi-core, though that model obviously breaks down at the larger scale when you look at the motherboard as a whole and see dedicated graphics, memory (in the past), I/O, and sound processors, etc. (and, in the more distant past, math coprocessors).

But looking at "just" the main CPU chip, why are the multicore solutions coming to market all symmetric? I don't think CELL is by any means the ideal solution for desktop CPUs, but perhaps a better model is the GPU (though it is obviously highly optimized for the tasks it does).

What are the desired attributes of a good desktop CPU solution? Great single-thread performance, good multitasking performance, good computational "crunching" performance when needed. A big single-core chip does the first great, the second not so great, and the third is arguably variable depending on what the application looks like. A symmetric multi-core chip does the first well, the second great, and the third perhaps not so great, again variable. So why not budget transistors to achieve all three goals in a balanced manner?

What I see looks something like this:
- general purpose OS core, designed to handle OS switching, management, scheduling tasks etc. well
- general purpose application core, designed to have very good single thread (or perhaps dual thread) integer and FP performance (i.e., today's single core Athlon etc.)
- a bank of CELL-like math processors that can be utilized when an application needs lots of horsepower, small and relatively simple in design, but high potential

This blueprint could easily be manipulated to provide the ideal desktop solution by having two of the general purpose application cores and a handful of the math processors, and provide the ideal workstation solution by having a single application core and a swarm of coprocessors (i.e., equal die space for both solutions).

Thoughts?
 
We may get to that... or not. The main issue with any kind of multiple cores / multi-threading is that it's quite complicated to make your code parallel throughout. This has been an issue for ages; huge amounts of money have been spent on research into smart compilers, parallelization tools, etc., and still no generally applicable solution has been found.

As you may have noted, coding for Cell is less than trivial. I'm sure that multi-threading will eventually pick up, it's just that I'm not certain how soon that will happen.
 
Asymmetric multi-core is more difficult to get good performance from than symmetric multi-core, which is in turn more difficult than a single core.

Scheduling is more complex, software is more complex, and workload analysis becomes much more complex.

Cell is nothing new when it comes to asymmetric multi-processing. There have been other multi-chip implementations that grouped a general-purpose core with multiple specialized processors.

The fact we don't hear about them much indicates how well they did.


Multi-core didn't take off until it became feasible to put multiple cores on the same die, which meant that transistor densities had to be higher than they were in the past and that the gains in single-core performance became increasingly expensive to achieve.

We've only recently reached that point when it comes to silicon. We didn't have room on a chip for banks of anything until recently, and it didn't make performance sense until recently. It's like asking why they didn't use NASCAR engines back in the 1900s; it just wasn't going to happen.

Software-wise, we are not much better off when it comes to easy development for such a platform.

edit:

SMP also shows the most gains in general performance up to ~4-way in the bulk of systems. Until chips expand past the point that they can fit more than just 4 cores, the need to appeal to the broadest market wins out.
 
We may get to that... or not. The main issue with any kind of multiple cores / multi-threading is that it's quite complicated to make your code parallel throughout. This has been an issue for ages; huge amounts of money have been spent on research into smart compilers, parallelization tools, etc., and still no generally applicable solution has been found.
I would expect that, which is why perhaps I should have clarified up front that the general purpose "application" CPU should get the majority of transistors, and be a pretty stout CPU for the majority of applications.
 
Asymmetric multi-core is more difficult to get good performance from than symmetric multi-core, which is in turn more difficult than a single core.

Scheduling is more complex, software is more complex, and workload analysis becomes much more complex.
What I was describing was a sort of asymmetric dual-core chip with afterburners. Only the OS programmers need worry about the smaller general purpose core, as its sole purpose in life would be to handle OS tasks. The vast majority of application programmers would only need to worry about the one (or perhaps two, symmetric) general purpose application cores, which I would envision being very similar to today's architectures trying to squeeze out single-thread IPC, with dual thread (or maybe even dual core) an afterthought.

Only the "serious" programmers/applications need worry about the afterburner suite of coprocessors, which are there when you need serious performance and have the time/money to optimize the code to get every last bit out.

I wouldn't think for a minute that typical desktop programs should start worrying about threading for a dozen-plus cores. But the odd game or two, the CAD application, Photoshop, some video editing programs... maybe there's a market there.

Here's a very simplistic breakdown of what I was describing:
Desktop version die space: 10% OS, 70% application, 20% coprocessors.
Workstation version die space: 10% OS, 40% application, 50% coprocessors.
 
Only the "serious" programmers/applications need worry about the afterburner suite of coprocessors, which are there when you need serious performance and have the time/money to optimize the code to get every last bit out.

We're already kind of on the way to that sort of architecture today. It's called multi-core general purpose CPU + a big honking GPU. :)
 
Multicore isn't so much about symmetric or asymmetric; it's about solving the communications problem.

The only real additional constraint Asymmetric processors bring is having to target the code to a particular processor type at compile time.

The cost of communication (latency) fundamentally limits the granularity of the parallelism: it's not worth computing something somewhere else if it takes you more work to ship the data there and back and synchronise with the results than it does to compute the result locally.
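
To put toy numbers on that break-even point (every figure below is invented purely for illustration, not measured from any real chip):

[code]
/* Toy break-even model for offloading work to a coprocessor.
   All numbers are made up for illustration. */
#include <stdio.h>

int main(void)
{
    double local_ns_per_elem  = 4.0;    /* cost to compute one element locally      */
    double remote_ns_per_elem = 0.5;    /* the coprocessor is 8x faster per element */
    double round_trip_ns      = 2000.0; /* ship the data over, synchronise, ship it back */

    /* Offloading only wins when n * (local - remote) > round_trip,
       i.e. the per-element savings outweigh the fixed communication cost. */
    double break_even = round_trip_ns / (local_ns_per_elem - remote_ns_per_elem);
    printf("offloading pays off above ~%.0f elements\n", break_even);
    return 0;
}
[/code]

Below that granularity the "faster" remote core is a net loss, which is exactly the latency limit described above.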
 
We're already kind of on the way to that sort of architecture today. It's called multi-core general purpose CPU + a big honking GPU. :)
That's an interesting thought. :) I'd forgotten about the GPGPU movement and the processing power it brings to the table. Which makes me further go hmm... is the AMD/ATi Fusion concept a lot closer to what I described than I previously thought? Would AMD/ATi make the shaders of the GPU elements available for general purpose code even on this integrated chip, in the event that an application didn't need 3D graphics at all but could use the extra math oomph?

Hmm...
 
Multicore isn't so much about symmetric or asymmetric; it's about solving the communications problem.

The only real additional constraint Asymmetric processors bring is having to target the code to a particular processor type at compile time.

The cost of communication (latency) fundamentally limits the granularity of the parallelism: it's not worth computing something somewhere else if it takes you more work to ship the data there and back and synchronise with the results than it does to compute the result locally.

I guess what I was thinking was that if 80-90% of applications can't make good use of parallelism, whether due to inherently serial algorithms or communication problems, and the other 10-20% are inherently well suited for parallel processing despite communication bottlenecks, then wouldn't the "ideal" CPU look something like that software distribution? A very powerful single core representing the majority of the die, and a suite of small, powerful, and admittedly more difficult to utilize cores (i.e., both small and powerful meaning harder to program for) taking up the remainder of the die?

When communication is a problem, don't sweat it... take advantage of the single GPCPU.
 
But isn't Cell asymmetric and symmetric at the same time? The PPE is asymmetric with respect to the SPEs, but the SPEs are all symmetric with each other.
 
I guess what I was thinking was that if 80-90% of applications can't make good use of parallelism, whether due to inherently serial algorithms or communication problems, and the other 10-20% are inherently well suited for parallel processing despite communication bottlenecks, then wouldn't the "ideal" CPU look something like that software distribution?

The ideal F-1 race car looks very different from the ideal coal truck. Mashing the two together doesn't make something that is better at both.

With respect to handling single-threaded performance, a lot of design features increase complexity in a quadratic manner, or offer sub-linear performance gains for the number of transistors used.

A highly parallel processor that would be absolutely terrible at any non-parallel load can skip all that, and use the savings to increase peak execution.

It's not quite a linear vs. quadratic scaling problem, but the overall issue is present. Being good at one kind of task within the constraints of silicon's limits in manufacturability and inherent inflexibility means giving up on being good at the other.

A very powerful single core representing the majority of the die, and a suite of small, powerful, and admittedly more difficult to utilize cores (i.e., both small and powerful meaning harder to program for) taking up the remainder of the die?

They have to be used. Every small core that isn't used is an irrevocable waste of silicon. It's not just that the core goes idle: it's that the chip gave up some percentage of its potential performance for a workload it never meets.

It's a problem where the chip is 10x faster at .01% of the workloads the chip will ever face, and 5% slower at 99.9% of everything else it does.
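
To put rough numbers on that (treating those percentages as fractions of total runtime, which is my own simplification): relative runtime ≈ 0.0001/10 + 0.9999 × 1.05 ≈ 1.05, so despite the 10x win on the rare case, the chip finishes the overall mix about 5% later than the plain design would.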

The reason why asymmetric (or nearly so) multicores are in the works for pretty much all chip manufacturers is that transistor budgets are expanding to the point where it's more like a chip could be made to be 10x faster and only lose perhaps 1% or something less on ~20% of its workload.

Thanks to other silicon scaling limits, the gains in general performance are so reduced that throwing more transistors into one core would lead to excessive heat or propagation delays, making the asymmetric route potentially free with respect to the loss in general performance.
 
The ideal F-1 race car looks very different from the ideal coal truck. Mashing the two together doesn't make something that is better at both.
I get that, but continuing with the car analogy (popular one, isn't it?), isn't the "desktop CPU" more akin to a passenger car than to either an F1 car or a dump truck? And in that context, a passenger car is neither a good F1 car nor a good dump truck, but it's arguably faster than a dump truck and hauls a hell of a lot more than an F1 car. And while there is a lot of variability inside the class of "family vehicles", they all strive to compromise between the extremes by taking useful concepts from each and presenting a "not the best of either, but decent at a lot of things" approach to the family man. And that seems to work pretty well, doesn't it?

And it seems to me that the more recent CPU evolutions have been either towards the dump truck or F1 extremes. We have large, fast, very high-IPC single cores, and we have small, streamlined, more finicky but potentially powerful suites of cores. I'm only thinking that, as the family car does for the family man, so should a desktop CPU do for the desktop user... be pretty decent at a lot of things, have some power on tap for the applications that can make use of it, and perhaps (like vehicles) be modular or flexible enough to allow different models to target different segments of the "desktop" marketplace (i.e., see above, single vs. dual application core, how many math processing elements, etc.).

Being good at one kind of task within the constraints of silicon's limits in manufacturability and inherent inflexibility means giving up on being good at the other.
Exactly. So why contemplate a design of EITHER a hundred in-order, powerful, and finicky processor cores, OR one or two large, high-IPC, brute-force general purpose cores? Why not a decent GP core and a handful of helpers? Maximum application coverage for your transistor budget. No, it might not be the way to spend transistors to get peak performance for a particular app, but it might give the highest average performance over a few dozen useful apps.

They have to be used. Every small core that isn't used is an irrevocable waste of silicon. It's not just that the core goes idle: it's that the chip gave up some percentage of its potential performance for a workload it never meets.
And the second core in a dual core is often unused. As are many of CELL's processing elements, often. As are many, many transistors in a GPU for most rendering apps. But all are approaches to get the best performance across the board given a transistor budget. All I'm suggesting is perhaps another approach.

It's a problem where the chip is 10x faster at .01% of the workloads the chip will ever face, and 5% slower at 99.9% of everything else it does.
Yeah, but the 99.9% are typically office applications that don't need a horsepower king to begin with. It's the demanding applications that you notice being slower, and it's those that are worth sacrificing a bit of speed on Excel to accelerate.
 
I get that, but continuing with the car analogy (popular one, isn't it?), isn't the "desktop CPU" more akin to a passenger car than to either an F1 car or a dump truck? And in that context, a passenger car is neither a good F1 car nor a good dump truck, but it's arguably faster than a dump truck and hauls a hell of a lot more than an F1 car. And while there is a lot of variability inside the class of "family vehicles", they all strive to compromise between the extremes by taking useful concepts from each and presenting a "not the best of either, but decent at a lot of things" approach to the family man. And that seems to work pretty well, doesn't it?

Asymmetric processors are closer to the analogy of a car that has 4 trunk compartments, and 6 seats.

3 of the trunk compartments can only hold groceries, 4 of the seats can only hold kids,
1 seat can only handle a person who can touch his nose with his tongue, and one headlight only works when it is raining.

Unless you can expand the car so it can have all these extra potentially useless parts and still have room for general use, it is a waste.

The same goes for chips. Until it is sure that general performance won't completely crap out on tasks everyone is already doing, it is not worthwhile.

Even once transistor budgets increase to the point that the extras can fit, manufacturers will be cautious outside of specialized markets.

And it seems to me that the more recent CPU evolutions have been either towards the dump truck or F1 extremes. We have large, fast, very high-IPC single cores, and we have small, streamlined, more finicky but potentially powerful suites of cores. I'm only thinking that, as the family car does for the family man, so should a desktop CPU do for the desktop user... be pretty decent at a lot of things, have some power on tap for the applications that can make use of it, and perhaps (like vehicles) be modular or flexible enough to allow different models to target different segments of the "desktop" marketplace (i.e., see above, single vs. dual application core, how many math processing elements, etc.).

All the major chip manufacturers that aren't already doing this are planning to do this in about 2 process nodes.
The transistor budgets will be much larger, and the gains from SMP will have mostly been used up by quad-cores.

Until the more easily attained sources of improvement are used up, there is no incentive to go beyond that.

Exactly. So why contemplate a design of EITHER a hundred in-order, powerful, and finicky processor cores, OR one or two large, high-IPC, brute-force general purpose cores? Why not a decent GP core and a handful of helpers? Maximum application coverage for your transistor budget. No, it might not be the way to spend transistors to get peak performance for a particular app, but it might give the highest average performance over a few dozen useful apps.
CPUs are expected to get high performance out of all their apps.
The number of apps capable of running on those helper cores at the time of release will be 0. There will not be many for years.
If by that time the killer app for that helper core uses an unsupported codec, the core will go unused.

It is a safer bet to wait until transistor budgets increase to the point that you can put in a group of cores with superior IPC and still have room left over for helper cores.

And the second core in a dual core is often unused. As are many of CELL's processing elements, often. As are many, many transistors in a GPU for most rendering apps. But all are approaches to get the best performance across the board given a transistor budget. All I'm suggesting is perhaps another approach.
It's not too hard for an OS scheduler to pass a thread to the second core.
It's a little more difficult if the second core is a video codec processor and the second thread is for a word processor.

Specialized helper cores are very much an all or nothing affair. Either they are used, or they are not. Multiple general cores can be made to muddle through almost anything without the software needing to know much beyond spawning another thread.
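
For what it's worth, "spawning another thread" really is about all the software has to do on an SMP system. A minimal sketch (plain C with pthreads, nothing specific to any particular chip):

[code]
/* Minimal SMP sketch: the program just spawns a second thread and lets
   the OS scheduler decide which core runs it. */
#include <pthread.h>
#include <stdio.h>

static void *crunch(void *arg)
{
    long *sum = arg;
    for (long i = 0; i < 1000000; i++)   /* some independent chunk of work */
        *sum += i;
    return NULL;
}

int main(void)
{
    long a = 0, b = 0;
    pthread_t t;

    pthread_create(&t, NULL, crunch, &a); /* scheduler may place this on any core */
    crunch(&b);                           /* main thread keeps crunching too      */
    pthread_join(t, NULL);

    printf("%ld\n", a + b);
    return 0;
}
[/code]

The program never has to know how many cores there are or what they look like; with an asymmetric layout, that is exactly the knowledge it would suddenly need.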

SMP allows for concurrency without too much worry about the architecture. Asymmetric multi-processing forces software to explicitly program in something that is aware of the core layout, or it will fail to take advantage of the extra cores.
Since no model of chip will ever have the entire market, and there will be variations in configuration (which will mean entire cores will go unused), software developers will balk.

Cell works fine in the PS3 because that's the only processor that will ever be used in that console. It will be much more difficult to make the case for something like Cell in the chaotic desktop market, where almost nobody recompiles or redesigns their software.
It won't happen in big-tin servers or mainframes for a long time either, since there's no guarantee that they even have the source code for the apps they'd need to recompile.
Cell stands a chance in HPC because those users are used to rewriting their software anyway.

Yeah, but the 99.9% are typically office applications that don't need a horsepower king to begin with. It's the demanding applications that you notice being slower, and it's those that are worth sacrificing a bit of speed on Excel to accelerate.

There is little incentive for the installed base to accept reduced performance on their apps. They'll just wait for a product that can give them better performance on their current software and better performance on other tasks.
 
Asymmetric processors are closer to the analogy of a car that has 4 trunk compartments, and 6 seats.
OK, I can buy that. I still can't help but think that the vast majority of desktop computers could use a seat that can only hold a child... the OS. Isn't that an ideal situation where you know one "application" in detail that will always be running and can design a core that is slim and trim and perfect for handling OS tasks?

CPUs are expected to get high performance out of all their apps.
The number of apps capable of running on those helper cores at the time of release will be 0.
Yeah, but not everything supported MMX to begin with either... guess I didn't see much difference.

It's not too hard for an OS scheduler to pass a thread to the second core. It's a little more difficult if the second core is a video codec processor and the second thread is for a word processor.
Passing off the second thread only helps if the application spawns multiple useful threads (not all do) or you are running two applications (not always the case). So again, you're talking about transistors spent that will sometimes have zero utilization. Seems like an asymmetric core is different only in degree: how many transistors it requires (less than a full second core), how often it could be utilized (less often than a full second core), and how much potential performance it offers (dunno, maybe equal to or better than a second full core). As far as being video specific or word processor specific, I think that might be taking the idea too far. Floating point performance seems to be the one area that is universally lacking when high-demand applications meet general purpose cores... so small in-order FPU crunchers seem more appropriate.

CELL seems to have a good start on the "right type" of coprocessors, but I don't think its overall architecture (memory handling, for example) or its general purpose core is well suited to the desktop environment.

Specialized helper cores are very much an all or nothing affair. Either they are used, or they are not. Multiple general cores can be made to muddle through almost anything without the software needing to know much beyond spawning another thread.
That would seem to be a big problem! With the GPGPU mentioned earlier, it makes one wonder if a suite of standardized coprocessors made available through a DX-like API isn't heading in the right direction. In that case, we're really talking about where the most appropriate place is for those transistors to reside... on the CPU die, the GPU when it has available processing power to spare, or a separate card (i.e., PhysX etc.).

SMP allows for concurrency without too much worry about the architecture. Asymmetric multi-processing forces software to explicitly program in something that is aware of the core layout, or it will fail to take advantage of the extra cores. Since no model of chip will ever have the entire market, and there will be variations in configuration (which will mean entire cores will go unused), software developers will balk.
I agree that standardization would be needed, and could be a problem initially as AMD and Intel pushed their own solutions, but all that is required is what already exists at the level of GPUs: a standardized API with hardware compliance towards that API. Transistors could be allocated differently to do one task or another better, but all the developers would need to worry about is the existence of the cores and the API used to take advantage of them.

How is that situation much different than what game developers currently face? There are variations in hardware, and while developers could (and some do) target specific hardware paths for optimization, they certainly don't have to.

No doubt though, this idea won't work as a novelty chip. The approach would have to be pretty uniform across both Intel and AMD, from top to bottom of the product lineups, just as having a "3D capable" graphics chip in some form is essentially mandatory no matter the segment of the market a computer is targeting. I admit that it took a while to get to that point with GPUs, especially considering the API struggles. Perhaps that in itself is a big enough stumbling block in the CPU space.

There is little incentive for the installed base to accept reduced performance on their apps. They'll just wait for a product that can give them better performance on their current software and better performance on other tasks.
Well, from a PR perspective you certainly couldn't release an asymmetric chip that uses the same process node, die space, and clock frequency as the chip before it and has lower performance on many existing apps. The time to do it is when a new architecture is being released with a larger die, and perhaps a smaller process and higher clock, as you alluded to. When you can show improvements across the board on existing apps and have the potential for much improved performance on the apps that are most demanding, that seems a viable PR position.

Of course, we've seen new GPU generations that weren't much faster than the previous generation on existing apps, so that isn't so unusual.
 
OK, I can buy that. I still can't help but think that the vast majority of desktop computers could use a seat that can only hold a child... the OS. Isn't that an ideal situation where you know one "application" in detail that will always be running and can design a core that is slim and trim and perfect for handling OS tasks?
Which OS? What combination of libraries?
I'm not up on OS code, but it's likely a very varied mix of parallel and serial sections, so specializing in anything would probably miss the other portions.
The OS usually tries not to monopolize the core it's on anyway, so the performance gains would be focused on something that doesn't take up much processor time in the first place.

Yeah, but not everything supported MMX to begin with either... guess I didn't see much difference.
I don't think MMX ever really took off. I don't remember much uptake even after SSE arrived, and SSE is only getting general use now because it's taking over from x87 floating point.
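
Part of why uptake is slow is that the extension has to be targeted explicitly (or the code recompiled with a compiler that knows about it). A trivial SSE snippet, assuming an x86 compiler with <xmmintrin.h>:

[code]
/* Using SSE means writing (or compiling) code specifically for it --
   this does four float additions in a single instruction. */
#include <xmmintrin.h>
#include <stdio.h>

int main(void)
{
    float a[4] = {1, 2, 3, 4}, b[4] = {10, 20, 30, 40}, r[4];

    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(r, _mm_add_ps(va, vb));

    printf("%g %g %g %g\n", r[0], r[1], r[2], r[3]);
    return 0;
}
[/code]

Code built without that in mind simply keeps using the old x87 path.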

Passing off the second thread only helps if the application spawns multiple useful threads (not all do) or you are running two applications (not always the case). So again, you're talking about transistors spent that will sometimes have zero utilization.
Having multiple threads or two apps isn't all that rare.
1 OS + 1 app = two right there

Seems like an asymmetric core is different only in degree: how many transistors it requires (less than a full second core), how often it could be utilized (less often than a full second core), and how much potential performance it offers (dunno, maybe equal to or better than a second full core). As far as being video specific or word processor specific, I think that might be taking the idea too far. Floating point performance seems to be the one area that is universally lacking when high-demand applications meet general purpose cores... so small in-order FPU crunchers seem more appropriate.
There may be future cores that can play that role, once they can fit without compromising the general-purpose ones. Until that happens, there won't be much push for it, or rather there will be a stronger push against it.

In the desktop environment, there aren't too many tasks that need those FPU cores, and those that do often benefit from GPUs or other specialized hardware that work just as well off-chip.
The limited market would hurt volumes.

CELL seems to have a good start on the "right type" of coprocessors, but I don't think its overall architecture (memory handling, for example) or its general purpose core is well suited to the desktop environment.
It also may be that Cell was at least a process node or two too early for the desktop market.


I agree that standardization would be needed, and could be a problem initially as AMD and Intel pushed their own solutions, but all that is required is what already exists at the level of GPUs: a standardized API with hardware compliance towards that API. Transistors could be allocated differently to do one task or another better, but all the developers would need to worry about is the existence of the cores and the API used to take advantage of them.

Intel and AMD don't do APIs; they do (low-level) architectures. We're at DX10, while the x86 ISA's most recent revamp, to x86-64, was the first significant change to the core architecture since the Pentium Pro.

How is that situation much different than what game developers currently face? There are variations in hardware, and while developers could (and some do) target specific hardware paths for optimization, they certainly don't have to.
Most game developers simply don't target a given CPU, or the ones with the resources rely on specially compiled paths with a fallback, because doing otherwise would basically keep them out of half of their market.
There's just no way to win, since there aren't as many ways to reduce settings as there are for graphics cards.

I admit that it took a while to get to that point with GPUs, especially considering the API struggles. Perhaps that in itself is a big enough stumbling block in the CPU space.
GPUs have gone through a lot of changes. They've gained new functions and dumped a lot of old ones. They can add extensions or they can deprecate them.
For CPUs, backwards compatibility can't be broken so easily, so every change must be made with the distant future and distant past in mind.
A missed guess can cost a company hundreds of millions worth of market share, so manufacturers are cautious.
 
Having multiple threads or two apps isn't all that rare.
1 OS + 1 app = two right there
Wait, you just said that having a dedicated core for the OS isn't useful because the OS is designed to not utilize CPU resources often. But you're saying dual symmetric always has potential for improved performance because the OS counts as a thread. Well, I don't see where you can have it both ways. If dual cores always offer improvement because the OS can occupy one core and leave the other for an application, yet the OS is designed to not require nearly the full potential of a core's resources, doesn't that scream "smaller, less powerful core for the OS, freeing up transistors to use elsewhere?" And I'm not saying "optimized" in the sense that the hardware is especially optimized at a low level, just that the cache allocation, registers, int/fp performance, etc. (and in general, the number of transistors required) are reasonably matched to what an OS would need. That's all.

There may be future cores that can play that role, once they can fit without compromising the general-purpose ones. Until that happens, there won't be much push for it, or rather there will be a stronger push against it.
I guess I just don't understand that argument. No matter how many transistors we can cram on a chip, FP cores will always take away from potential small incremental gains in the general purpose core... whether those transistors could have been used for more cache, a better BPU, or whatever. Unless you're saying that you can't really get any better single-threaded IPC after you reach a certain transistor count, a couple of hundred million or so, after which point you have to go multi-core in some form or fashion.

I'm definitely not that up on computer science... is that the case? Are you simply saying that we haven't had enough transistors available until roughly now to get near that theoretical max IPC per core? If so, then all your previous comments suddenly make a lot more sense. The whole "low-hanging fruit first" idea?

Intel and AMD don't do APIs; they do (low-level) architectures.
Nor do ATi or nVidia, but that arrangement seems to have worked itself out quite well. While not free from growing pains, I don't see any fundamental reason why the same couldn't happen for CPUs.

Most game developers simply don't target a given CPU, or the ones with the resources rely on specially compiled paths with a fallback, because doing otherwise would basically keep them out of half of their market.
Oh, I was talking about optimizing for specific GPU hardware, not CPU. Some do, but it certainly isn't required. No reason why CPU hardware should be any different.

For CPUs, backwards compatibility can't be broken so easily, so every change must be made with the distant future and distant past in mind.
A missed guess can cost a company hundreds of millions worth of market share, so manufacturers are cautious.
I would assume that the GP core would have to retain backward compatibility. Taking up the majority of die space, it only makes sense, so the majority of the chip is still available for legacy apps and (following above comments) should still be faster than previous versions. The smaller cores need not be so restrained. It also gives a possible solution for those saying backwards compatibility constrains potential performance for present and future apps.

Interesting ideas though, and I appreciate the feedback. I'm no computer engineering guru by any means so it helps to have a dose of reality thrown in to an exploration of ideas.
 
Wait, you just said that having a dedicated core for the OS isn't useful because the OS is designed to not utilize CPU resources often. But you're saying dual symmetric always has potential for improved performance because the OS counts as a thread.
I didn't say it wasn't useful at all. I said the gains wouldn't be as great as the loss in general performance from excluding other tasks from the core the OS is on. With SMP, general performance isn't penalized; the performance gain is icing on the cake.

Well, I don't see where you can have it both ways. If dual cores always offer improvement because the OS can occupy one core and leave the other for an application, yet the OS is designed to not require nearly the full potential of a core's resources, doesn't that scream "smaller, less powerful core for the OS, freeing up transistors to use elsewhere?"
With SMP, it is possible for threads to context switch and share a core very easily. Asymmetric chips can't do that without making sure the threads that are ready to work actually do well on that core.
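
To illustrate what "making sure the threads do well on that core" ends up meaning in practice, here's a Linux-specific sketch with sched_setaffinity (the core number is arbitrary, purely for the example):

[code]
/* On an asymmetric chip, something -- the OS or the application -- has to
   place threads explicitly. Linux-specific sketch; "core 1" is arbitrary. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(1, &mask);                          /* "this thread belongs on core 1" */

    if (sched_setaffinity(0, sizeof(mask), &mask) != 0)  /* 0 = calling thread */
        perror("sched_setaffinity");
    else
        printf("pinned to core 1\n");
    return 0;
}
[/code]

On a symmetric chip none of that is necessary; any ready thread can run on any idle core.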

And I'm not saying "optimized" in the sense that the hardware is especially optimized at a low level, just that the cache allocation, registers, int/fp performance, etc. (and in general, the number of transistors required) are reasonably matched to what an OS would need. That's all.
For the OS, it's more of a "get it done as fast as possible and switch to the next task" kind of thing.

I guess I just don't understand that argument. No matter how many transistors we can cram on a chip, FP cores will always take away from potential small incremental gains in the general purpose core... whether those transistors could have been used for more cache, a better BPU, or whatever. Unless you're saying that you can't really get any better single-threaded IPC after you reach a certain transistor count, a couple of hundred million or so, after which point you have to go multi-core in some form or fashion.

That is pretty much what I'm saying. Signal propagation delays and heat concerns basically strangle improvements in single-core performance, and the problem gets much worse as geometries scale. It's very possible that adding more units to a single core will force it to clock lower or throttle to avoid overheating.

SMP offers the best performance gains in general use up to ~4 cores. After that, the cost of small specialized cores in addition to the general cores for general performance is minimal, while the upsides are great.

I'm definitely not that up on computer science... is that the case? Are you simply saying that we haven't had enough transistors available until roughly now to get near that theoretical max IPC per core? If so, then all your previous comments suddenly make a lot more sense. The whole "low-hanging fruit first" idea?
The K8 is 3-wide, but averages 1 instruction or less per cycle on most code it runs on.
Conroe does better, but considering that it is 4 (sometimes 5)-wide, it isn't really all that better from a utilization standpoint.
They are significantly better than the chips that came before them, but their successors are not likely to maintain similar performance growth.

Some studies seem to indicate the ceiling is somewhere around 2 IPC as the best we can practically reach in general integer workloads if a core were designed for all-out performance.

We've reached the point where we can do a lot for IPC, but also reached the point in silicon scaling that we can no longer discount the heat and signal costs that result from using a lot of transistors at high speed.

Low-hanging fruit in single-core performance is mostly gone. There will be incremental gains at every process node, but nothing as great as going superscalar and OoO did back in the day.

Those incremental gains can be achieved without doubling the number of transistors per core. Actually, it's basically required for maintaining any performance gain at all that the cores not double in size with transistor densities.

Nor do ATi or nVidia, but that arrangement seems to have worked itself out quite well. While not free from growing pains, I don't see any fundamental reason why the same couldn't happen for CPUs.
ATI and Nvidia can hide behind graphics drivers that abstract most of the inner workings from the outside. CTM and CUDA might change that, but even though they are thin driver layers, the spec is not as thorough as it is for a CPU.

I wonder if exposing low-level details for GPUs might come back to haunt them later, when a legacy software base won't let them revamp their designs like they've done before.

Oh, I was talking about optimizing for specific GPU hardware, not CPU. Some do, but it certainly isn't required. No reason why CPU hardware should be any different.
Because when a game falters or crashes because it was optimized for a different CPU, the user doesn't blame AMD or Intel. There are no performance optimizations or slider bars for anisotropy or AA when it comes to running driver code and game engines.
If a game is meant for more than half the market, great care must be taken that any optimizations are mostly transparent, because users aren't into recompiling.

I would assume that the GP core would have to retain backward compatibility. Taking up the majority of die space, it only makes sense, so the majority of the chip is still available for legacy apps and (following above comments) should still be faster than previous versions. The smaller cores need not be so restrained. It also gives a possible solution for those saying backwards compatibility constrains potential performance for present and future apps.
That is true. It just won't be for about another process generation or so on the desktop that the GP core(s) can have improved performance and still have room left over for specialized cores.

The desktop market is kind of strange, since it can't demand the insane margins of niche processors, and CPUs are sold as discrete components, so system sales don't go back to the manufacturer (IBM and SUN rely on full service revenue to offset design costs).

It's not exactly a low-cost segment like embedded chips, which can go for pennies worth of profit, but it's too commoditized (and too expensive to design and manufacture) to appeal to just one segment and be successful.

Things may change as market pressures and physical limits are encountered, but there's just a lot of baggage attached to that market that influences a lot of what happens around it.
 
That is pretty much what I'm saying. Signal propagation delays and heat concerns basically strangle improvements in single-core performance, and the problem gets much worse as geometries scale. It's very possible that adding more units to a single core will force it to clock lower or throttle to avoid overheating.

SMP offers the best performance gains in general use up to ~4 cores. After that, the cost of small specialized cores in addition to the general cores for general performance is minimal, while the upsides are great.
I think that is what I didn't understand up front. I guess I assumed that the low-hanging fruit was gathered several years back and that the scaling issues you speak of were prevalent then, not just surfacing now. Seems my thinking needed a shift in time more so than concept.
 
For the OS, it's more of a "get it done as fast as possible and switch to the next task" kind of thing.
I just had another "hmm" thought. I don't doubt that what you describe is reality, but is that because it is the best way to construct and manage an OS or is that an artifact of having to share a single core for decades? I can't help but wonder... in a fantasy world where all CPUs had a smallish dedicated OS core included, what would be the ideal OS architecture to take advantage of that, how would that differ from today's OSes, and what nifty things would that allow for (or prohibit)?
 
I just had another "hmm" thought. I don't doubt that what you describe is reality, but is that because it is the best way to construct and manage an OS or is that an artifact of having to share a single core for decades? I can't help but wonder... in a fantasy world where all CPUs had a smallish dedicated OS core included, what would be the ideal OS architecture to take advantage of that, how would that differ from today's OSes, and what nifty things would that allow for (or prohibit)?

I know large systems distribute system management tasks to local control nodes, but a desktop is quite a bit smaller than that.

Since the OS is charged with bringing everything in a system together, splitting the OS up can have the effect of increasing communications overhead to the point that it slows everything else down.
 