Intel: 32 cores by 2010

http://www.tgdaily.com/2006/07/10/intel_32_core_processor/

While I'm sure this massive parallelism will be for server systems for a few years before it makes its way to home PC's, I still have to say that I didn't expect it quite so soon. The configuration looks pretty beefy, and the architecture will support 4 threads per core:
The first Keifer chip will be manufactured in 32 nm and use eight processing nodes with four cores each. Every node will have direct access to one 3 MB on-die last level cache (LLC) and 512 kB L2 cache. There will be a total of 8 x 3 MB LLC slices that are connected by a ring architecture and represent a total 24 MB of cache.
I do, however, have to take issue with the following snippet:
And surprisingly, Intel does not consider AMD's Opteron and successors as Kevet's and Keifer's benchmark. The documents seen by TG Daily aim Keifer at Sun's "Niagara" architecture, which is currently available in the "Ultra Sparc T1" processor. The T1 was launched last year with great fanfare as 1.2 GHz 8-core processor with 3 MB L2 cache and capable of handling a total 32 threads at a peak power of just 72 watts.
Well, duh! That shouldn't be remotely surprising.

Back on topic, though, I expect that game software is going to have a lot of catching up to do in the next few years. PPU's? Who needs 'em!

Edit: By the way, there's a more in-depth article linked from within the article above.
 
On the tech side this is very cool. But then reality hits: as Chalnoth said, developers have a long way to catch up.

They are talking about 32 cores (128 threads) with 24 MB of LLC and 16 MB of L2 cache in the 2009/2010 timeframe -- that is going to require significant changes in how programmers approach such a beast.
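For reference, those totals follow from the quoted configuration: 8 nodes x 4 cores = 32 cores, each running 4 threads for 32 x 4 = 128 threads; 8 x 3 MB = 24 MB of LLC; and 32 x 512 kB = 16 MB of L2.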

I wonder what sort of system memory configuration they are looking at to feed such a beast.
 
Acert93 said:
On the tech side this is very cool. But then reality hits: as Chalnoth said, developers have a long way to catch up.
Well, I suspect that unless multiprocessing becomes a tech fad by then, these CPU's are instead going to be marketed to environments that are already highly parallel. Note that they are aiming this processor against an 8-way, 32-thread Sun processor.

Still, any time that we see Intel put out a server-level tech (barring IA64), it always seems like a desktop version isn't far behind. And once game developers have started down the path of adding parallelism beyond 2 CPU's, the number of CPU's becomes largely irrelevant, and you just throw more processing into those parts of the code that are easily parallelizable. So it is conceivable that there will be games that will have enough physics and AI processing to make use of such a processor.

And it would be an Ageia killer :)
 
So the die is cast, silicon is over? I know there were dire predictions and everything, but this is it, nothing in the ol' bag of tricks? Also, to what degree do you guys suppose middleware/compilers will be thrown at the parallel problem in lieu of man-hours? Judging from some devs' initial dabbling, I'm cautiously pessimistic.
 
Otto Dafe said:
So the die is cast, silicon is over? I know there were dire predictions and everything, but this is it, nothing in the ol' bag of tricks?
Well, very soon, yes. One way that things could still improve significantly on silicon-based processors is a move to asynchronous (clockless) processing. A few such designs have already been developed for the integrated DSP market, and the benefits in processing power and power consumption are tremendous. But the writing is indeed on the wall. It won't be long before the only way to improve performance will be to add more silicon.

Also, to what degree do you guys suppose middleware/compilers will be thrown at the parallel problem in lieu of man-hours? Judging from some devs' initial dabbling, I'm cautiously pessimistic.
Well, the problems associated with utilizing multiple processors are mostly algorithmic in nature. You can't just parallelize any old process. Improving compilers and middleware to attempt to do so won't ever buy more than a few percent performance improvement over single-threaded processing.

The real answer is just that developers throw more processing power at tasks that are easily parallelizable. For games, for example, this means more detailed physics and better AI.
 
Otto Dafe said:
Also, to what degree do you guys suppose middleware/compilers will be thrown at the parallel problem in lieu of man-hours? Judging from some devs' initial dabbling, I'm cautiously pessimistic.
Experience (long experience) from the supercomputer world, where massively parallel architectures have been ubiquitous for decades, is that automatic parallelisation by the compiler is far from trivial. My personal experience of using APO (on large SGI Origin machines) is that it spots the easy stuff, e.g. highly data-parallel loops. Anything remotely challenging and it just gives up and calls for the programmer to help it out.

It has its uses, e.g. it can perform automatic data-flow analysis to spot data dependencies, something which can be tedious (and error-prone) to do manually, especially if the loop is large. But stick a simple subroutine or function call in the loop and it takes the safe option and leaves the loop serial; you have to dive in and work out for yourself whether it's safe to parallelise it.
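To make the "easy stuff" concrete, here is a minimal sketch of the two cases (using an OpenMP pragma as a stand-in for what an auto-paralleliser does; the function names are made up for illustration):

Code:
#include <cmath>
#include <cstddef>
#include <iostream>
#include <vector>

// Trivially data-parallel: no iteration touches any other iteration's data,
// so the loop can be split across cores mechanically.
void scale(std::vector<float>& v, float k)
{
    #pragma omp parallel for
    for (int i = 0; i < (int)v.size(); ++i)
        v[i] *= k;
}

// Stand-in for a call the compiler cannot see into (e.g. another translation unit).
void update(float& x) { x = std::sqrt(x); }

// Same shape of loop, but with an opaque call: unless the compiler can prove
// update() has no side effects, it takes the safe option and leaves it serial.
void update_all(std::vector<float>& v)
{
    for (std::size_t i = 0; i < v.size(); ++i)
        update(v[i]);
}

int main()
{
    std::vector<float> v(1000, 4.0f);
    scale(v, 2.0f);
    update_all(v);
    std::cout << v[0] << '\n';   // sqrt(2 * 4) ~ 2.83
}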

So basically ... compilers will get you only so far. The hard stuff will still have to be done manually, and it's hard, so it'll be time consuming. If you've got a data-parallel algorithm, or can adapt your algorithm to make it data-parallel, good for you. Otherwise, be prepared to break out the paracetamol.

Overall I'm somewhat sceptical that the rise of multi-core desktop CPUs is suddenly going to change this situation. It's not like the mainframe vendors over the past three decades haven't had a) super-bright people working for them, and b) a massive financial incentive to crack this problem.
 
Program-wide auto-parallelization requires the compiler to be able to carefully track all types of data dependencies across the program as a whole. Barring extremely expensive and difficult whole-program analysis, this is not possible in any of the more popular languages out there (C++, C#, Java) except for loops of trivial complexity; also, such large-scale analysis is generally not possible at all in the presence of late binding/dynamic linking.

Auto-parallelization of C/C++ code has been very much the holy grail of compiler design for several decades, but we are only barely closer to that goal today than we were 20 years ago, despite billions of dollars spent on compiler development - in general, it is probably folly to believe that there will ever be good automatic solutions for this class of languages. Every now and then, research groups in various places claim to have solved the problem, but these "solutions" usually solve only a narrow part of the problem as a whole.

That still leaves the question of what a "good" language for auto-parallelized programming would look and feel like. Functional languages like Haskell do make it a hell of a lot easier for the compiler to reason systematically about data dependencies than imperative languages like C++/Java, but they generally have other features that are likely to be quite unpleasant for the average programmer (many popular design patterns, such as Model-View-Controller, are essentially impossible in Haskell, for example), as well as non-parallelism-related performance issues such as the need to make a copy of a data structure every time you wish to modify it.

In the absence of effective auto-parallelization, there will continue to be a lot of people falling back to manual parallelization. This, of course, is quite difficult and amazingly error-prone, with the added bonus that practically all parallelism-related bugs are by nature heisenbugs and as such take more or less forever to debug properly.
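As a toy illustration of the sort of heisenbug meant here (a minimal sketch, with arbitrary numbers and names): two threads bumping a shared counter with no synchronisation.

Code:
#include <iostream>
#include <thread>
// #include <atomic>           // the fix would be std::atomic<int> counter{0};

int counter = 0;                // shared, unsynchronised

void work()
{
    for (int i = 0; i < 1000000; ++i)
        ++counter;              // read-modify-write race between the two threads
}

int main()
{
    std::thread a(work), b(work);
    a.join();
    b.join();
    // Should print 2000000; in practice the value varies from run to run,
    // and the race tends to vanish as soon as you add logging or single-step it.
    std::cout << counter << '\n';
}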
 
arjan de lumens said:
In the absence of effective auto-parallelization, there will continue to be a lot of people falling back to manual parallelization. This, of course, is quite difficult and amazingly error-prone, with the added bonus that practically all parallelism-related bugs are by nature heisenbugs and as such take more or less forever to debug properly.
No, it is not. It is merely that efficient use is limited to a subset of all available algorithms. The massive explosion of 3D graphics acceleration proved this long ago. This means that software developers are going to switch to making use of different sorts of algorithms in their software, just as game developers changed how they did graphics for PC games after the advent of 3D accelerators.
 
Chalnoth said:
No, it is not. It is merely that efficient use is limited to a subset of all available algorithms. The massive explosion of 3D graphics acceleration proved this long ago. This means that software developers are going to switch to making use of different sorts of algorithms in their software, just as game developers changed how they did graphics for PC games after the advent of 3D accelerators.
The kind of parallelism that GPUs provide can hardly be characterised as "manual" (unless you refer to CPU<->GPU functional parallelism, which is maintained by the GPU's driver anyway). Even now, the vertex/pixel/geometry shader programming model is full of intentional limitations in order to maintain correctness in the face of massive parallelism.
 
arjan de lumens said:
The kind of parallelism that GPUs provide can hardly be characterised as "manual" (unless you refer to CPU<->GPU functional parallelism, which is maintained by the GPU's driver anyway).
Of course it is. It's manually enforced by the programming paradigm.

Even now, the vertex/pixel/geometry shader programming model is full of intentional limitations in order to maintain correctness in the face of massive parallelism.
And in the same way you obtain parallelism on multi-processor setups: by placing limitations upon how your code behaves. You do this by selecting those algorithms which are very amenable to parallelization.
 
Chalnoth said:
Of course it is. It's manually enforced by the programming paradigm.
That's a somewhat odd use of the word "manually". While the CHOICE of programming paradigm itself is something you do manually, the actual mechanisms that enforce the paradigm are entirely automatic in the GPU case. You may run into situations where limitations of the paradigm block certain algorithms or design patterns, but this blocking is no more "manual" than that of a fence blocking you from walking into a garden.
And in the same way you obtain parallelism on multi-processor setups: by placing limitations upon how your code behaves. You do this by selecting those algorithms which are very amenable to parallelization.
The question here is one of how those limitations are actually enforced - whether they are enforced in the actual programming model or through programmer discipline only. My assertion is that the latter approach is a recipe for disaster and as such something one should strive to avoid.
 
arjan de lumens said:
The question here is one of how those limitations are actually enforced - whether they are enforced in the actual programming model or through programmer discipline only. My assertion is that the latter approach is a recipe for disaster and as such something one should strive to avoid.
Most of these things are going to be specialized algorithms offloaded to third-party libraries, or built into the engine. Examples are physics API's and a class-based skeleton AI interface set by the game engine.
 
Chalnoth said:
Most of these things are going to be specialized algorithms offloaded to third-party libraries, or built into the engine. Examples are physics API's and a class-based skeleton AI interface set by the game engine.
This still needs at least to some extent a common framework for the middleware packages to operate within. For physics/AI, due to the need for feeding results back to the game engine, you will want an asynchronous API, which leads to either polling (inefficient) or interrupts/callbacks (which produce some of the same correctness problems as full multithreading).

For the game engine itself, there is the issue that it appears to be rather common for devs to want to modify the engine to add functionality they need for their particular game; for this to work in a highly parallel setting, both the ones who write the original engine and those who augment it later need to make sure that it is kept parallelism-safe.
 
Chalnoth said:
Most of these things are going to be specialized algorithms offloaded to third-party libraries, or built into the engine. Examples are physics API's and a class-based skeleton AI interface set by the game engine.

This is more what I meant by "middleware", not a general-purpose "Threadalizer" library or anything, just that it makes specific solutions like these much more compelling. Still, an app with a single thread of core logic managing a bunch of MT modules doesn't really get you off the hook: you're still in asynchronous land, and you still have to think about just about every access to any resource.

arjan de lumens said:
In the absence of effective auto-parallelization, there will continue to be a lot of people falling back to manual parallelization. This, of course, is quite difficult and amazingly error-prone, with the added bonus that practically all parallelism-related bugs are by nature heisenbugs and as such take more or less forever to debug properly.

^^^This, I think, is a key point: introducing a source of non-deterministic (mis)behavior to a complex system is a dubious proposition, even with potentially great performance rewards. It follows, of course, that errors may occur between modules and functional blocks that are in themselves correct... It's interesting to me because previously the HW explosion was the programmer's best friend: every year it was more cycles for more abstraction, more memory to waste, more precision for sloppy numerical methods. But now it's like the tail's wagging the dog again :???:
 
arjan de lumens said:
This still needs at least to some extent a common framework for the middleware packages to operate within. For physics/AI, due to the need for feeding results back to the game engine, you will want an asynchronous API, which leads to either polling (inefficient) or interrupts/callbacks (which produce some of the same correctness problems as full multithreading).
But you just place a simplified form of the communication into the way the software is structured. For example, one simple method would be to have a flag in the calling thread's memory that the physics thread(s) set to true once they complete. If the calling thread gets done with everything it can do before reading in the physics data, and this flag is still false, it goes to sleep. The physics threads are programmed to always wake the calling thread on completion.

With simple multithreaded setups like this built into the game engine and/or API's, there's really no reason to worry about thread communication issues.
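For what it's worth, a minimal sketch of that flag-plus-wakeup idea using a condition variable (the standard way to get the "go to sleep until woken" behaviour without busy-waiting); all names here are invented:

Code:
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <thread>

std::mutex              m;
std::condition_variable cv;
bool physics_done = false;        // the "memory value" the physics thread sets

void physics_thread()
{
    // ... run the physics step ...
    {
        std::lock_guard<std::mutex> lock(m);
        physics_done = true;      // publish completion
    }
    cv.notify_one();              // wake the calling thread if it went to sleep
}

void calling_thread()
{
    // ... do everything that doesn't need the physics results ...
    std::unique_lock<std::mutex> lock(m);
    cv.wait(lock, [] { return physics_done; });   // sleeps only while the flag is still false
    // ... now safe to read the physics data ...
    std::cout << "physics results consumed\n";
}

int main()
{
    std::thread p(physics_thread), c(calling_thread);
    p.join();
    c.join();
}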

For the game engine itself, there is the issue that it appears to be rather common for devs to want to modify the engine to add functionality that they need for their particular game; for this to work in a highly-parallel setting, both the ones who write the original engine and those who augment it later need to make sure that it is kept parallellism-safe.
Well, of course, but the solution to this is simple: have an object-oriented game engine that is designed to be extended without changing the basic structure of programming flow. As long as the game engine is built to be fairly well multi-threaded for those parts of it that are easy to parallelize, as well as having a robust thread-safe communication structure built in, there's just no reason for the programmer who is modifying the engine to ever screw up the multithreading.

An example of a thread-safe communication structure might be one used for AI: for each AI actor in the game, you would have a double-buffered message queue. At the start of the AI processing, all message queues would perform a buffer swap. After this buffer swap, the AI actors would all go their own way (probably bundled into groups of threads), and sometimes would send messages to other AI actors for processing in the next AI loop.

Anyway, the idea is that all any programmer modifying the engine would ever do is program single-threaded code, but placed within limitations set by the engine to keep things thread-safe.
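A bare-bones sketch of the double-buffered per-actor message queue described above (purely illustrative; the class and method names are invented, and a real engine would want something less naive than strings):

Code:
#include <iostream>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Messages posted to an actor during tick N only become visible to it in
// tick N+1, so within a tick nobody ever reads a buffer that is being written.
class ActorMailbox
{
public:
    // Called by other actors while the AI threads are running (tick N).
    void post(const std::string& msg)
    {
        std::lock_guard<std::mutex> lock(write_mutex_);   // several actors may post concurrently
        writing_->push_back(msg);
    }

    // Called once per actor at the start of the AI pass, before any actor
    // threads are launched, so no locking is needed here.
    void swap_buffers()
    {
        std::swap(reading_, writing_);
        writing_->clear();
    }

    // Last tick's messages: only the owning actor reads this during the tick,
    // and nobody writes to it, so it needs no locking either.
    const std::vector<std::string>& messages() const { return *reading_; }

private:
    std::mutex                write_mutex_;
    std::vector<std::string>  a_, b_;
    std::vector<std::string>* reading_ = &a_;
    std::vector<std::string>* writing_ = &b_;
};

int main()
{
    ActorMailbox guard_ai;
    guard_ai.post("player spotted at (10, 4)");   // sent during tick N
    guard_ai.swap_buffers();                      // start of tick N+1
    for (const std::string& msg : guard_ai.messages())
        std::cout << "guard received: " << msg << '\n';
}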
 
Otto Dafe said:
This is more what I meant by "middleware", not a general-purpose "Threadalizer" library or anything, just that it makes specific solutions like these much more compelling. Still, an app with a single thread of core logic managing a bunch of MT modules doesn't really get you off the hook: you're still in asynchronous land, and you still have to think about just about every access to any resource.
So you manage it just like you manage every difficult problem in programming: you break it down into smaller problems. The answer here is building a simplified framework that performs adequately for the task at hand, and is multi-threaded. This framework should be such that the programmer is only peripherally aware of any sort of multi-threading going on, with the entire train of thought focused on the processing for one element, whatever that element may be (see pixel/vertex shaders as an example).

The framework would be designed with appropriate communication capabilities and limitations to keep it thread-safe, shield the multithreading from the programmer who implements the framework, and perform well in a multithreaded environment.
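As a toy illustration of that idea (everything below is invented for the example): the programmer writes an ordinary single-threaded per-element function, and the framework decides how to carve the work up across cores, much like a shader is written for one pixel or vertex at a time.

Code:
#include <algorithm>
#include <cstddef>
#include <functional>
#include <iostream>
#include <thread>
#include <vector>

// Framework side: split [0, n) into one contiguous chunk per hardware thread.
// The contract is that fn(i) only touches element i's data -- that restriction
// is what lets the user's code stay single-threaded in spirit.
void for_each_element(std::size_t n, const std::function<void(std::size_t)>& fn)
{
    unsigned workers = std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::thread> pool;
    for (unsigned w = 0; w < workers; ++w)
    {
        std::size_t begin = n * w / workers;
        std::size_t end   = n * (w + 1) / workers;
        pool.emplace_back([begin, end, &fn] {
            for (std::size_t i = begin; i < end; ++i)
                fn(i);
        });
    }
    for (std::thread& t : pool)
        t.join();
}

// User side: plain per-element logic, no threads in sight.
int main()
{
    std::vector<float> health(10000, 100.0f);
    for_each_element(health.size(), [&](std::size_t i) {
        health[i] -= 1.0f;            // touches only element i
    });
    std::cout << health[0] << '\n';   // 99
}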
 
Chalnoth said:
But you just place a simplified form of the communication into the way the software is structured. For example, one simple method would be to have a flag in the calling thread's memory that the physics thread(s) set to true once they complete. If the calling thread gets done with everything it can do before reading in the physics data, and this flag is still false, it goes to sleep. The physics threads are programmed to always wake the calling thread on completion.
This does get a bit complicated by the fact that you may well have multiple outstanding physics calculations (?), and you wish to be able to process the results one by one as they appear rather than just waiting for all of them - although that can be solved by replacing the true/false flag with a queue, which is still not unreasonably complex.
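If it's useful, that flag-to-queue upgrade might look something like the following minimal sketch (names and payload invented): physics workers push results as they finish, and the game thread pops them one by one, sleeping only when the queue is empty.

Code:
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>

struct PhysicsResult { int body_id; /* ... whatever the engine needs ... */ };

class ResultQueue
{
public:
    // Called by physics worker threads as each calculation finishes.
    void push(PhysicsResult r)
    {
        {
            std::lock_guard<std::mutex> lock(m_);
            q_.push(r);
        }
        cv_.notify_one();
    }

    // Called by the game thread: returns the next result as soon as one is
    // available, sleeping if none are ready yet.
    PhysicsResult pop()
    {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return !q_.empty(); });
        PhysicsResult r = q_.front();
        q_.pop();
        return r;
    }

private:
    std::mutex                m_;
    std::condition_variable   cv_;
    std::queue<PhysicsResult> q_;
};

int main()
{
    ResultQueue results;
    std::thread worker([&] { results.push({42}); });
    PhysicsResult r = results.pop();   // blocks until the worker has pushed
    std::cout << "got result for body " << r.body_id << '\n';
    worker.join();
}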
Well, of course, but the solution to this is simple: have an object-oriented game engine that is designed to be extended without changing the basic structure of programming flow. As long as the game engine is built to be fairly well multi-threaded for those parts of it that are easy to parallelize, as well as having a robust thread-safe communication structure built in, there's just no reason for the programmer who is modifying the engine to ever screw up the multithreading.

An example of a thread-safe communication structure might be one used for AI: for each AI actor in the game, you would have a double-buffered message queue. At the start of the AI processing, all message queues would perform a buffer swap. After this buffer swap, the AI actors would all go their own way (probably bundled into groups of threads), and sometimes would send messages to other AI actors for processing in the next AI loop.

Anyway, the idea is that all any programmer modifying the engine would ever do is program single-threaded code, but placed within limitations set by the engine to keep things thread-safe.
For such an AI system, you would still need to impose some restrictions on the classes used. For example: would there exist data fields or structures that the AI agent can access, but which are not private to the individual AI agent itself? If such fields/structures exist, one would need to take care to ensure that they are never modified during the parallel AI evaluation, neither by the AI agents nor by any code outside them. While this is not a very difficult constraint to obey, the programmer does need to be aware that such a constraint is present and that violating it WILL result in a big fat heisenbug (in particular since languages in the C++/Java family do not provide the necessary constructs to enforce such a constraint at the language level).
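One cheap, if incomplete, way to make that constraint visible in C++ (a sketch of my own, with invented names) is to hand each agent the shared state through a const reference only: the compiler then rejects accidental writes during the parallel pass, though nothing stops a determined const_cast, so it remains partly a matter of discipline.

Code:
#include <cstddef>
#include <iostream>
#include <vector>

// Shared, engine-owned data that every AI agent may read during the AI pass.
struct SharedWorldState
{
    std::vector<float> actor_positions;
};

class AIAgent
{
public:
    explicit AIAgent(std::size_t id) : id_(id) {}

    // The const reference documents (and partially enforces) the rule:
    // during the parallel AI pass this shared data is read-only.
    void think(const SharedWorldState& world)
    {
        nearest_ = world.actor_positions.empty() ? -1.0f : world.actor_positions[0];
        // world.actor_positions.push_back(0.0f);   // would not compile here
    }

private:
    std::size_t id_;
    float       nearest_ = -1.0f;   // private, per-agent state: free to modify
};

int main()
{
    SharedWorldState world{{1.0f, 2.0f, 3.0f}};
    AIAgent agent(0);
    agent.think(world);   // in a real engine this call would run on a worker thread
    std::cout << "agent 0 done\n";
}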
 
Chalnoth said:
So you manage it just like you manage every difficult problem in programming: you break it down into smaller problems. The answer here is building a simplified framework that performs adequately for the task at hand, and is multi-threaded. This framework should be such that the programmer is only peripherally aware of any sort of multi-threading going on, with the entire train of thought focused on the processing for one element, whatever that element may be (see pixel/vertex shaders as an example).

The framework would be designed with appropriate communication capabilities and limitations to keep it thread-safe, shield the multithreading from the programmer who implements the framework, and perform well in a multithreaded environment.

Yeah, absolutely, and I don't mean to suggest that there are insurmountable problems here, or that it's not fun having new architectural decisions to make, but more that there is a degree of non-trivial complexity added and, as usual, no magic bullet. One aspect of this complexity is that some black boxes may not get to be quite as opaque, and consideration of the pre/post conditions of a method invocation needs to be much less casual than what is practiced now.
 