[Beyond3D Article] Intel presentation reveals the future of the CPU-GPU war

I would more or less agree with regard to today's software, but with more cores available, future software may attempt to do things we don't bother with today in the consumer space, computer-vision-related tasks in particular.

If anything like that takes hold, it would be counterproductive to forgo extra cores in favor of GPU-like elements, especially since the majority of the populace isn't interested in high-performance graphics; a just-good-enough solution provided by the many-core CPU would likely suffice.


I can see your point. The only question is: are the future applications that would require such resources unsuited to using the more GPU-like elements? If we are comparing different combinations of multiple cores (heterogeneous or not), the issues of concurrency and dependencies remain in both scenarios.
 
What do you mean by logically equivalent?
Functionally, not physically: e.g. a software-based cache instead of a physical cache. So programmer effort is required, but it can be encapsulated. It's provisional, in this case, on whether a "cache"-based approach to on-die memory usage is appropriate to the algorithm.
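
To make that concrete, here's a rough sketch of what a software-managed "cache" over an explicitly addressed local store could look like; everything in it (LINE_BYTES, N_LINES, dma_fetch and so on) is invented purely for illustration, not any particular chip's API:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch: a direct-mapped, software-managed "cache" living in
 * an explicitly addressed local store, standing in for a hardware cache. */

#define LINE_BYTES 128
#define N_LINES    64

static uint8_t  local_store[N_LINES][LINE_BYTES]; /* on-die memory          */
static uint64_t line_tag[N_LINES];                /* which line is resident */
static int      line_valid[N_LINES];

/* Stand-in for a DMA engine pulling one line from external memory. */
static void dma_fetch(void *dst, const uint8_t *external_mem, uint64_t addr)
{
    memcpy(dst, external_mem + addr, LINE_BYTES);
}

/* Return a pointer to the cached copy of the line containing 'addr',
 * fetching it into the local store on a miss. The policy (line size,
 * replacement, write-back) is entirely a software decision. */
uint8_t *sw_cache_lookup(const uint8_t *external_mem, uint64_t addr)
{
    uint64_t line = addr / LINE_BYTES;
    unsigned slot = (unsigned)(line % N_LINES);

    if (!line_valid[slot] || line_tag[slot] != line) {
        dma_fetch(local_store[slot], external_mem, line * LINE_BYTES);
        line_tag[slot]   = line;
        line_valid[slot] = 1;
    }
    return local_store[slot] + (addr % LINE_BYTES);
}
```

The point being that all of that can be hidden behind a library call, so the "cache" exists logically even though the hardware only provides raw on-die memory.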

Jawed
 
Indeed. Now the big question is, what *is* that ISA? My guess is it's x86 with some rather aggressive (to say the least!) extensions that are VLIW-like. And then (at least part of) Gesher would also support those same extensions in 2010, but perhaps with another implementation.

Deep down, I hope they don't go x86, or at least don't go fully x86. There's just so much cruft involved that simply doesn't need to be there, given an architecture so different that backwards compatibility is almost a non-issue.

If they go for the super SSE vector instruction approach, it would be hard (impossible?) to capture the full range of GPU vector instructions while keeping x86 semantics.

How does one encode a MADD with only two operands?

If a later x86 core does sport such extensions, it would be a larger change to x86 than the addition of FP or SSE.
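
To make the two-operand point concrete, here's a minimal sketch using plain SSE intrinsics (no fused multiply-add assumed): the underlying instructions are destructive, so d = a*b + c needs a register copy plus two operations, and a GPU-style MADD with three distinct sources and a separate destination simply has no encoding in that format.

```c
#include <xmmintrin.h>

/* d = a*b + c under two-operand, destructive SSE semantics: the multiply
 * clobbers one of its sources, so a copy is needed first, and the
 * multiply-add costs two instructions instead of one MADD. */
__m128 madd_sse(__m128 a, __m128 b, __m128 c)
{
    __m128 t = a;           /* movaps: preserve 'a'            */
    t = _mm_mul_ps(t, b);   /* mulps:  t = t * b (t clobbered) */
    t = _mm_add_ps(t, c);   /* addps:  t = t + c (t clobbered) */
    return t;
}
```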
 
Umh... I see what you mean, but I don't think it's a good analogy: Cell is vastly different from that thing, and even if a SW layer can make it a bit more similar, it's still a completely different architecture.
 
Umh... I see what you mean, but I don't think it's a good analogy: Cell is vastly different from that thing, and even if a SW layer can make it a bit more similar, it's still a completely different architecture.
OK, fair enough. To me that thing looks like a "lazy", "let's just chuck a cache and snooping at this problem" approach.

I think comparing and contrasting Cell and GPUs is more interesting (for the kinds of high-throughput workloads that they are good at, noting that they're each good at different subsets).

This thing is just a placeholder. Cell solves throughput, data-sharing, communication, on-die bandwidth, latency problems, whilst sitting between GPUs and CPUs on that "war" axis.

Jawed
 
OK, fair enough. To me that thing looks like a "lazy", "let's just chuck a cache and snooping at this problem" approach.

I think comparing and contrasting Cell and GPUs is more interesting (for the kinds of high-throughput workloads that they are good at, noting that they're each good at different subsets).

This thing is just a placeholder. Cell solves throughput, data-sharing, communication, on-die bandwidth, latency problems, whilst sitting between GPUs and CPUs on that "war" axis.

Jawed

I don't know about solved.
IBM's Roadrunner project ties each of its Cell processors to a matching Opteron.

It may be that for large systems, there are certain problems Cell does not solve yet.

There are a number of ways to emulate part of Cell's functionality, and just by going highly multicore a good part of its advantages over x86 chips is gone.

Without knowing how memory and communications are allocated and controlled, we don't know if it's all just a cache+snoop solution.
 
I can see your point. The only question is: are the future applications that would require such resources unsuited to using the more GPU-like elements? If we are comparing different combinations of multiple cores (heterogeneous or not), the issues of concurrency and dependencies remain in both scenarios.

I'd like to add that I'm not considering latency-hiding threads and a high ALU density to be GPU-like elements; rather, I consider most of the items from the William Mark and Henry Moreton link a little ways up to be the GPU elements.

Then with that in mind I have my doubts that much outside of graphics is going to make use of the GPU elements. That's not to say it isn't possible, hasn't been done, or won't be done; I just see the general concept of throughput computing being more important to most apps, and we're seeing a blurred line because GPUs are pretty decent throughput processors in their own right and are available right now. Through our GPU-tinted glasses we look at these upcoming throughput CPUs and see a similarity that really isn't there.

Though I think it's odd that Intel isn't building a chip to leverage what should be a thread/ALU advantage in their favor, but instead seems to be focusing more on improving thread synchronization performance, which in itself seems odd, because if synchronization is going to impact performance, the app is not likely to scale well either. Eh, go figure! :)
 
I don't know about solved.
Nothing is ever truly solved, is it?

IBM's Roadrunner project ties each of its Cell processors to a matching Opteron.
So how's this Intel core concept different? Where's the middle ground?

It may be that for large systems, there are certain problems Cell does not solve yet.
It's not a static architecture. What we have right now is v1.0, targeted at PS3 and not at HPC. We should start hearing about the DP version of Cell soon and what changes it brings (such as being able to address more physical memory, a definite problem with the current Cell for HPC)...

There are a number of ways to emulate part of Cell's functionality, and just by going highly multicore a good part of its advantages over x86 chips is gone.
Well it'll certainly be interesting to see how a different design competes - what you talk about sounds like v0.1 of Cell. They're years behind.

Without knowing how memory and communications are allocated and controlled, we don't know if it's all just a cache+snoop solution.
Cell already "provides" all those bullet points - "multi-way cache" just looks very glib.

In the here and now, for example, a comparison of Cell and GPUs (and x86) for folding@home is more relevant than crystal-balling Intel stuff that's years away. Although one of the problems with the F@H implementation on PS3 is that it's derived from the GPU version. Regardless, hopefully soon there'll be some papers on the subject.

Jawed
 
So how's this Intel core concept different? Where's the middle ground?
It depends on whether those cores operate within the same memory space. The SPEs explicitly do not. The PPE for the most part is tasked with the random crud of system management. For large systems and complex problems, there is a lot more random crud, which is probably why the Opteron is there.

It seems prohibitive to task SPEs with too much of that, so while in peak numbers Cell is likely capable of handling more crud than it does, something about the internal organization makes it seem more worthwhile to have an Opteron do it.

It's not a static architecture. What we have right now is v1.0, targeted at PS3 and not at HPC. We should start hearing about the DP version of Cell soon and what changes it brings (such as being able to address more physical memory, a definite problem with the current Cell for HPC)...

Well it'll certainly be interesting to see how a different design competes - what you talk about sounds like v0.1 of Cell. They're years behind.
Years behind in design, but with Intel's engineering pool, only temporarily behind in implementation.

In the here and now, for example, a comparison of Cell and GPUs (and x86) for folding@home is more relevant than crystal-balling Intel stuff that's years away.

Although one of the problems with the F@H implementation on PS3 is that it's derived from the GPU version. Regardless, hopefully soon there'll be some papers on the subject.

Jawed

It's precisely because the work units Cell works on are not equivalent to the x86 ones that I'm curious.

I'm not saying Intel's got a winner, but there is potential for some interesting design choices that could fall under the vague statements and block diagrams.
 
Looks like the Xbox 360 CPU with more cores and a flexible cache.
I think I have seen something close to it somewhere in the console technology forum...
 
Intel already has an ISA that would work: their x86 cores' micro-ops. And they are parallelizable. I think they will use that, and more or less emulate x86 compatibility through firmware.

Simply take one of each element of a current core, add a vector unit and attach a bus that allows them to run such a micro-op stream over multiple cores.

Ok, I know it isn't as easy as that, but it's the simplest direction for them to take.
 
Intel already has an ISA that would work: their x86 cores' micro-ops. And they are parallelizable. I think they will use that, and more or less emulate x86 compatibility through firmware.

Simply take one of each element of a current core, add a vector unit and attach a bus that allows them to run such a micro-op stream over multiple cores.

Ok, I know it isn't as easy as that, but it's the simplest direction for them to take.
I doubt they would use that. I would think micro-ops are not an "ISA" because they don't have to be encoded in a specific way; they are just internal opcodes for a specific CPU microarchitecture. I'd be surprised if the micro-ops in a P4 are the same as the micro-ops in, say, Core 2.
 
As far as abstraction goes, why couldn't Intel design a chip that allows for both x86 instructions (which need to pass through a decoder) and native micro-op execution? Current 3D APIs like DirectX and OpenGL aren't going away anytime soon. Developers don't access the metal directly, the software driver does, so the driver would be the level that decides whether an instruction stream passes through the decoder or not. Once past the decoder, instructions are put into a trace cache so they don't have to be decoded again, much like the Pentium 4 did. Several x86 decoders could be thrown into the fixed-function area and used as necessary by the actual cores. From a transistor-budget standpoint, the savings here are huge and allow scaling to a large number of cores/shaders while maintaining full x86 and GPU functionality.

In addition, there is no reason why the chip couldn't have a fully functional OoOE execution core. The x86 decoders are responsible for deciding which instruction streams go to which units for execution, so they'd be the place to put branch detection logic. Native micro-op instruction streams have it even easier: they can bypass everything by indicating how branch-heavy the instruction stream is in some header information.

This design also allows for dynamic allocation of work between CPU- and GPU-style loads. This is great from an architectural standpoint as one chip design can be made to scale from consumer to high-end server chips. Manufacturing defects in the x86 decoders or the OoOE core would simply make it a consumer-grade chip aimed at graphics.

The only downside is the necessity of large trace caches to keep things running smoothly and to keep the few x86 decoders from being overburdened. It'd likely be best to implement a small, nearly instant-access L1 trace cache (< 3 cycles) and a larger, dedicated L2 trace cache (~10 cycles) per core. L1 data caches are dedicated per core, but L2 data caches are shared throughout the entire chip. A large L3 cache (32 MB eDRAM, separate die) would act as both a frame buffer and a shared unified cache. Code portability is one of the reasons Intel desperately needs to have x86 compatibility. However, the native micro-op design can change as long as it remains abstracted by various APIs. Only the applications directly compiled for the native micro-op ISA would be of concern.

Another solution would be to use some sort of code-morphing software a la Transmeta instead of dedicated hardware decoders. With such a large number of cores and an even larger number of concurrently running threads, dedicating one thread to the purpose of code-morphing an instruction stream for another thread isn't that big of a burden.
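
As a purely hypothetical sketch of how that might be organized: translated blocks get cached by their x86 address, so only a miss has to go down the (slow) morphing path, which is what makes spending a spare thread on it plausible. None of these names correspond to anything real.

```c
#include <stdint.h>
#include <stddef.h>

/* Sketch of the code-morphing idea: translated blocks of x86 code are
 * cached by guest address, and translation only happens on a miss,
 * where it could be handed off to a dedicated "morphing" thread. */

typedef void (*native_block_fn)(void);  /* a morphed block of native micro-ops */

#define TC_SLOTS 4096

static struct {
    uint64_t        guest_pc;           /* x86 address this entry translates */
    native_block_fn native;             /* entry point of the morphed code   */
} tcache[TC_SLOTS];

static void dummy_block(void) { }       /* placeholder for generated code */

/* Stand-in for the Transmeta-style translator; in the scheme described
 * above this work would run on its own helper thread. */
static native_block_fn morph_x86_block(uint64_t guest_pc)
{
    (void)guest_pc;
    return dummy_block;
}

/* Executor-side lookup: reuse a cached translation if we have one,
 * otherwise morph the block and remember it. */
native_block_fn lookup_or_morph(uint64_t guest_pc)
{
    size_t slot = (size_t)(guest_pc >> 2) % TC_SLOTS;  /* direct-mapped hash */

    if (tcache[slot].native == NULL || tcache[slot].guest_pc != guest_pc) {
        tcache[slot].native   = morph_x86_block(guest_pc);
        tcache[slot].guest_pc = guest_pc;
    }
    return tcache[slot].native;
}
```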
 
I would more or less agree with regard to today's software, but with more cores available, future software may attempt to do things we don't bother with today in the consumer space, computer-vision-related tasks in particular.

If anything like that takes hold, it would be counterproductive to forgo extra cores in favor of GPU-like elements, especially since the majority of the populace isn't interested in high-performance graphics; a just-good-enough solution provided by the many-core CPU would likely suffice.

The problem is that there will always be parts of pretty much any application that are highly sequential. If you put in a great effort to parallelize your code, you will probably still be left with a sizeable chunk that's not parallel. If you manage to parallelize 75% of your code, that remaining 25% of sequential code means you'll see rapidly diminishing returns going beyond quad-core. There will probably be a market for, say, 16-core CPUs in the server space eventually, but for a PC I doubt it.
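
That's just Amdahl's law: speedup = 1 / ((1 - p) + p/n) for parallel fraction p on n cores. A quick sketch of the arithmetic for a 75% parallel fraction (the numbers are only illustrative):

```c
#include <stdio.h>

/* Amdahl's law: speedup on n cores when a fraction p of the work is
 * parallelizable and the remaining (1 - p) stays sequential. */
static double amdahl(double p, int n)
{
    return 1.0 / ((1.0 - p) + p / n);
}

int main(void)
{
    const double p = 0.75;                  /* 75% parallel, 25% sequential */
    const int cores[] = { 1, 2, 4, 8, 16, 64 };

    for (size_t i = 0; i < sizeof cores / sizeof cores[0]; i++)
        printf("%2d cores: %.2fx\n", cores[i], amdahl(p, cores[i]));
    /* Prints roughly 1.00x, 1.60x, 2.29x, 2.91x, 3.37x, 3.82x:
     * the hard ceiling is 1/(1-p) = 4x no matter how many cores. */
    return 0;
}
```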
 
There are a number of ways to emulate part of Cell's functionality, and just by going highly multicore a good part of its advantages over x86 chips is gone.

Cell has certain advantages which won't really appear until we get highly multicore parts - Cell was designed from the beginning to scale to large numbers of cores.

The memory system on Cell is really quite clever as it allows the SPEs full speed access to their local stores without any contention.

The part Intel describes has to keep all its memories coherent, making things a lot more complex and adding latency everywhere. The more cores are added, the more coherence traffic there will be.
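
For what it's worth, the typical SPE pattern looks something like the sketch below: explicit DMA into the local store, double-buffered so transfers overlap compute, with local-store accesses never contending with other cores. The mfc_* calls are the Cell SDK ones as I remember them, and CHUNK/process() are made up for illustration, so treat it as a sketch rather than working code.

```c
#include <stdint.h>
#include <spu_mfcio.h>   /* Cell SDK, SPU side: mfc_get() and tag status calls */

/* Each SPE DMAs blocks of work from main memory into its private 256KB
 * local store and processes them there, double-buffering so the DMA for
 * block i+1 overlaps the compute on block i. Only the explicit DMAs touch
 * the shared memory system. */

#define CHUNK 16384   /* 16KB, the maximum size of a single DMA transfer */

static volatile uint8_t buf[2][CHUNK] __attribute__((aligned(128)));

static uint32_t checksum;

static void process(volatile uint8_t *data, uint32_t n)  /* stand-in "work" */
{
    for (uint32_t k = 0; k < n; k++)
        checksum += data[k];
}

void consume_stream(uint64_t ea, uint32_t nblocks)
{
    uint32_t cur = 0;

    /* Kick off the first transfer (tag 0 for buffer 0, tag 1 for buffer 1). */
    mfc_get(buf[cur], ea, CHUNK, cur, 0, 0);

    for (uint32_t i = 0; i < nblocks; i++) {
        uint32_t nxt = cur ^ 1;

        if (i + 1 < nblocks)                   /* prefetch the next block */
            mfc_get(buf[nxt], ea + (uint64_t)(i + 1) * CHUNK, CHUNK, nxt, 0, 0);

        mfc_write_tag_mask(1u << cur);         /* wait for the current block */
        mfc_read_tag_status_all();

        process(buf[cur], CHUNK);              /* compute out of local store */
        cur = nxt;
    }
}
```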
 
Cell has certain advantages which won't really appear until we get highly multicore parts - Cell was designed from the beginning to scale to large numbers of cores.

The memory system on Cell is really quite clever as it allows the SPEs full speed access to their local stores without any contention.

The part Intel describes has to keep all its memories coherent, making things a lot more complex and adding latency everywhere. The more cores are added, the more coherence traffic there will be.

The slides indicate there is some kind of non-standard cache resource allocation going on.
What little has been said about the possible design doesn't rule out an implementation different from standard caches. I think it hints that there will be changes to how memory and coherency are handled.
 
Slide Viewer

Can we get rid of the oh-so spiffy graphic viewer?
It adds nothing, and means you can't really keep more than one slide visible at a time or read the article with the slide open for reference.
 
Can we get rid of the oh-so spiffy graphic viewer?
It adds nothing, and means you can't really keep more than one slide visible at a time or read the article with the slide open for reference.

Can't even see anything here (Firefox 1.5), I have to open the images in new tabs.
 