Intel: 32 cores by 2010

arjan de lumens said:
This does get complicated a bit by the fact that you may well have multiple outstanding physics calculations (?), and you wish to be able to process the results one by one as they appear rather than just waiting for all of them - although that can be solved by replacing the true/false flag with a queue, which is still not unreasonably complex.
Well, calculating the results one by one is obviously something which you would have to give up.

For such an AI system, you would still need to impose some restrictions on the classes used. For example: would there exist data fields or structures that the AI agent can access, but which are not private to the individual AI agent itself? If such fields/structures exist, one would need to take care to ensure that they are never modified during the parallel-AI evaluation, neither by the AI agent nor by any code outside the AI agents. While this is not a very difficult constraint to obey, the programmer does need to be aware of the fact that such a constraint is present and that violating it WILL result in a big fat heisenbug (in particular since e.g. the C++/Java family languages do not provide the necessary constructs to enforce such a constraint at the language level).
Improper use of any code library can result in bad bugs. But it can be managed somewhat by adopting the code convention that other AI objects' information is only read through a virtual parent class.
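A minimal sketch of that convention (all names hypothetical): other agents observe an AI object only through the const accessors of a shared base class, while the mutable state stays private to the derived class.

```cpp
// Read-only view: other agents may observe an AI object only through
// these const accessors, never by touching its internals directly.
class AIAgentView {
public:
    virtual ~AIAgentView() = default;
    virtual double positionX() const = 0;
    virtual double positionY() const = 0;
};

class SoldierAgent : public AIAgentView {
public:
    SoldierAgent(double x, double y) : x_(x), y_(y) {}
    double positionX() const override { return x_; }
    double positionY() const override { return y_; }
    // Mutation is confined to code that runs outside the
    // parallel evaluation phase.
    void moveTo(double x, double y) { x_ = x; y_ = y; }
private:
    double x_, y_;  // private: other classes cannot poke these directly
};
```

This is only a convention plus access control; as noted above, C++ cannot enforce at the language level that nobody mutates the object mid-evaluation.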
 
demonic said:
I think by 2010, clock speeds will be totally irrelevant!

Of course it is relevant. Do you want to use a 100 core CPU running at 1Hz? ;)
 
Chalnoth said:
Well, calculating the results one by one is obviously something which you would have to give up.
Why? A queue such as I indicated should be plenty to maintain parallelism for this situation - of course, for such a queue to work maximally efficiently, you need to take into account (in the queue or the main thread itself) that the calculations will finish in a different order than they were issued in.

What I want here is to obtain functional parallelism between the actual physics library itself and the decision logic in the main thread that receives the physics results.
Improper use of any code library can result in bad bugs. But it can be managed somewhat by adopting the code convention that other AI objects' information is only read through a virtual parent class.
Such a convention addresses some type safety problems, but does not, AFAICS, address any concurrency safety problems. You still need to make sure through other means that when an AI object (during AI evaluation) reads a piece of information from another AI object, that the other object does not have any opportunity to modify that piece of information in-place during the AI evaluation phase. Otherwise you get an annoying race condition.

There are also, of course, matters such as data structures outside the AI agents themselves that the AI agents may need to access but should not be able to modify (such as information about the landscape they are operating in; note also that this landscape information itself needs to be constrained so that it is not modified during AI evaluation), and even worse, static fields in the AI agent class itself (which is just evil).

Issues like these tend to pile up over time, requiring more and more constraints that need to be kept track of, across more and more different people's code. It doesn't take all that many constraints before you find yourself intentionally avoiding parallelism opportunities (or introducing coding practices that sacrifice parallelization opportunities in order to retain maintainability/verifiability) because they lead to cognitive overload. Or perhaps you pursue those opportunities anyway; in that case, if you do run into cognitive overload, the codebase will fill up with enough heisenbugs to render it essentially unsalvageable.
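For the read-only landscape case, one cheap partial guard in C++ is to hand agents the world only as a const reference, so the compiler rejects accidental writes during evaluation; it is only partial, since other aliases to the same data may still mutate it. A sketch with hypothetical names:

```cpp
#include <vector>

// Shared world data that agents may consult but must not modify
// during the parallel evaluation phase.
struct Landscape {
    std::vector<int> heightmap;
    int heightAt(int i) const { return heightmap[i]; }
};

struct Agent {
    int pos = 0;
    // Evaluation sees the world read-only; a write through `world`
    // here would be a compile error.
    int planMove(const Landscape& world) const {
        int here = world.heightAt(pos);      // current tile height
        int next = world.heightAt(pos + 1);  // neighbouring tile height
        return (next < here) ? pos + 1 : pos;  // prefer the lower tile
    }
};
```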
 
arjan de lumens said:
Such a convention addresses some type safety problems, but does not, AFAICS, address any concurrency safety problems. You still need to make sure through other means that when an AI object (during AI evaluation) reads a piece of information from another AI object, that the other object does not have any opportunity to modify that piece of information in-place during the AI evaluation phase. Otherwise you get an annoying race condition.
But doing the reading through parent class function calls can solve that issue just fine. It's all about double buffering of the data that can be read. Simply defining the actual variables used here as private would prevent direct reading/writing by derived classes. This adds a somewhat clumsy layer for the programmer to make use of for custom properties that need to be communicated, but it would prevent errors.
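A sketch of that double-buffering scheme (hypothetical names): every readable field lives in two copies; during the parallel evaluation phase all reads hit the stable front copy and all writes go to the back copy, and a single-threaded swap between phases publishes the updates.

```cpp
// State that other agents are allowed to observe.
struct AgentState {
    double health = 100.0;
    double x = 0.0, y = 0.0;
};

class BufferedAgent {
public:
    // Readers (other agents) see only the stable front buffer.
    const AgentState& read() const { return buf_[front_]; }
    // Writers modify the back buffer, invisible to concurrent readers.
    AgentState& write() { return buf_[1 - front_]; }
    // Called once, single-threaded, between evaluation phases.
    void swap() {
        front_ = 1 - front_;
        buf_[1 - front_] = buf_[front_];  // re-seed back buffer with current state
    }
private:
    AgentState buf_[2];
    int front_ = 0;
};
```

Keeping buf_ and front_ private, with access funneled through the base-class-style accessors, is exactly the "clumsy layer" described above: inconvenient, but it makes the race structurally impossible within the convention.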

Obviously static class members would need to be avoided or used only very carefully. But no matter which way you slice it, discipline in multithreaded code is something that programmers are going to have to deal with, no matter what. Many-core processors are coming, and to make use of them, we need software that is both writable and capable of making use of more cores. And there are certainly ways of getting it done.

I'm sure there will be a number of companies that will be stupid about multithreading their code. They'll either learn quickly, or be weeded out. There will be other companies that will be smart about it, and their software will run fast, offer better features (since more processing will be exposed to the consumer), and be less buggy.
 
Java provides a lot of debugged concurrency primitive classes for programmers to use to take the pain out of a lot of concurrent programming: atomic updaters, concurrent hash tables, sets, queues, multiple lock types, condition variables, countdown latches, barriers, et al. It is somewhat easier to use than, say, OpenMP, but still, the existence of such utilities does not guarantee that programmers will use them properly. And they do nothing to assist with autoparallelization. The Java Memory Model does help somewhat with SMP optimizations; take a look at Azul Systems for example, which can scale properly written Java apps over a 48-core CPU / 378-core system.

Realistically, if you want autoparallelization and fewer race-condition bugs, you need a language that offers a domain-specific subset language modeled around something like the pi-calculus. You could probably do this with C++: much like Haskell has monads to introduce stateful, imperative programming inside a lazy, pure functional language, C++ could gain a new pi-calculus subset syntax which the compiler would know about.

If you think about it, we have this situation today in GPU programming. You write a C++ program. You include in it a subset of code written in a domain-specific language, HLSL, which can be autoparallelized by the driver/GPU into thousands of threads. Now, for Intel's 32 cores, we need an extension of C++ that will allow the C++ compiler to do the same thing the GPU does, for parts of your application written using the new concurrent subset.

Google today runs a highly efficient cluster of hundreds of thousands of commodity Linux boxes, with a software API layer on top called Map/Reduce, along with GFS (Google File System). C++ programmers at Google write their algorithms in map/reduce style, and the system scheduler takes care of the rest. And frankly, I think many people will be highly surprised by the types of algorithms that fit map/reduce semantics, including ones blithely asserted to be inherently serial.
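To make the semantics concrete, here is a toy, single-process word-count sketch in map/reduce style (hypothetical names); in a real system the map and reduce invocations would be distributed across many machines by the scheduler.

```cpp
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Map phase: turn one input document into a stream of (key, value) pairs.
std::vector<std::pair<std::string, int>> mapPhase(const std::string& doc) {
    std::vector<std::pair<std::string, int>> emitted;
    std::istringstream in(doc);
    std::string word;
    while (in >> word) emitted.push_back({word, 1});  // emit (word, 1)
    return emitted;
}

// Shuffle groups pairs by key; reduce sums the values for each key.
std::map<std::string, int> shuffleAndReduce(
        const std::vector<std::pair<std::string, int>>& emitted) {
    std::map<std::string, int> counts;
    for (const auto& kv : emitted) counts[kv.first] += kv.second;
    return counts;
}
```

Because each map call is independent and each reduce only folds values for one key, the runtime is free to run as many of them in parallel as there are cores or machines.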
 
Chalnoth said:
But doing the reading through parent class function calls can solve that issue just fine. It's all about double buffering of the data that can be read. Simply defining the actual variables used here as private would prevent direct reading/writing by derived classes. This adds a somewhat clumsy layer for the programmer to make use of for custom properties that need to be communicated, but it would prevent errors.
Well, OK. Not elegant, but it can prevent at least some misuse of the library.
Obviously static class members would need to be avoided or used only very carefully. But no matter which way you slice it, discipline in multithreaded code is something that programmers are going to have to deal with, no matter what. Many-core processors are coming, and to make use of them, we need software that is both writable and capable of making use of more cores. And there are certainly ways of getting it done.
Severe demands on programmer discipline are usually an indication that the language you are using does not provide the appropriate abstractions for the problem at hand - in this sense, multithreading in C++/Java resembles structured programming in Assembly or object-orientation in straight C. It is entirely possible for a programmer who has enough time, skill, determination and discipline to do it and do it really well indeed (there are undoubtedly people who can put 32 cores to good use from C++), but a code monkey with appropriate tools will get the same task done much more quickly and cheaply.

I do not know for certain what an appropriate language for multithreaded programming will look like, but something like Haskell or Fortress looks like a much more promising place to start from than any of the C derivatives (even though Fortress still lacks a usable implementation and Haskell in its current form leaves something to be desired performance-wise) - in particular if you want to obtain automated functional parallelism (as opposed to just pure data parallelism, which is more or less a "solved" problem these days).
 
pcchen said:
Of course it is relevant. Do you want to use a 100 core CPU running at 1Hz? ;)

Hehe, I meant it in the way the consumer sees it, i.e. "my P4 is 3GHz, so why am I losing clock speed by 'upgrading' to a Conroe?"

So with 32 cores, I doubt we will get past 5GHz.
 
arjan de lumens said:
I do not know for certain what an appropriate language for multithreaded programming will look like, but something like Haskell or Fortress looks like a much more promising place to start from than any of the C derivatives (even though Fortress still lacks a usable implementation and Haskell in its current form leaves something to be desired performance-wise) - in particular if you want to obtain automated functional parallelism (as opposed to just pure data parallelism, which is more or less a "solved" problem these days).

Take a look at Occam or Erlang. Occam is built on CSP (Communicating Sequential Processes), a mathematical formalism of concurrency closely related to the pi-calculus, and Erlang is a sort of hybrid, being functional like the lambda-calculus derivatives but supporting pi-calculus-style concurrency features.

Erlang isn't just a toy (the name doubles as "Ericsson Language"); it is used by telecom companies and network equipment manufacturers to run services in the operator network. So when you make a call on T-Mobile, for example, you're exercising Erlang.
 