Hi [maven],
Late to the discussion, and some of the thoughts below have already been touched on by other responses in this thread. Anyhow...
Scheduling turns out to be the easy part. Like you said, the best strategy is to allocate one OS thread per hardware thread, and use OS thread affinities to pin each one to a different hardware thread so the OS scheduler can't migrate them away.
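For concreteness, here's a minimal sketch of that strategy on Linux, assuming glibc's pthread_setaffinity_np extension (Windows would use SetThreadAffinityMask instead):

```cpp
// One OS thread per hardware thread, each pinned to its own core.
// pthread_setaffinity_np is a Linux/glibc extension; build with -pthread.
#include <pthread.h>
#include <sched.h>
#include <cstdio>
#include <thread>
#include <vector>

int main() {
    unsigned n = std::thread::hardware_concurrency();  // # hardware threads
    std::vector<std::thread> workers;

    for (unsigned i = 0; i < n; ++i) {
        workers.emplace_back([i] {
            std::printf("worker %u on CPU %d\n", i, sched_getcpu());
        });
        // Pin worker i to hardware thread i so the kernel won't migrate it.
        // (A real engine would pin before handing the thread any work.)
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(i, &set);
        pthread_setaffinity_np(workers[i].native_handle(), sizeof(set), &set);
    }
    for (auto &t : workers) t.join();
}
```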
The hard part is keeping all of those threads supplied with useful work at all times in order to maximize overall performance. There are many ad-hoc solutions in different applications. For example, one of the game engines I know of uses two fixed-function threads (rendering and gameplay) plus a pool of threads that can uniformly handle arbitrary tasks handed off to them (physics, animation, shadowing, and so on); a sketch of that pool follows below.
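Here's a hedged sketch of that pool design (nothing engine-specific, just the shape of it): a fixed set of workers pulling uniform tasks off a shared queue.

```cpp
// Minimal task pool: N workers drain a shared queue of arbitrary closures.
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class TaskPool {
public:
    explicit TaskPool(unsigned n) {
        for (unsigned i = 0; i < n; ++i)
            workers_.emplace_back([this] { run(); });
    }
    ~TaskPool() {
        { std::lock_guard<std::mutex> lk(m_); done_ = true; }
        cv_.notify_all();
        for (auto &t : workers_) t.join();
    }
    void submit(std::function<void()> task) {
        { std::lock_guard<std::mutex> lk(m_); tasks_.push(std::move(task)); }
        cv_.notify_one();
    }
private:
    void run() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lk(m_);
                cv_.wait(lk, [this] { return done_ || !tasks_.empty(); });
                if (done_ && tasks_.empty()) return;  // drain, then exit
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task();  // e.g. a physics step, an animation blend, a shadow pass
        }
    }
    std::vector<std::thread> workers_;
    std::queue<std::function<void()>> tasks_;
    std::mutex m_;
    std::condition_variable cv_;
    bool done_ = false;
};
```

Usage is just `TaskPool pool(6); pool.submit([]{ /* physics task */ });` while the two fixed-function threads run alongside. The weakness, as noted above, is keeping that queue from going empty.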
But there are two very well-studied general solutions for dividing work up into tiny units with well-defined dependencies.
The most general one is aimed at purely functional languages: Haskell, and the pure subsets of ML and OCaml. In these languages, computations have no side effects; each returns a result that depends only on its inputs. Because of that, computations can be executed in any order, including simultaneously, without changing the observable results of the program.
In terms of low-level implementation, a compiler for such a language generates code so that each individual computation (basically, a subterm of an expression) is stored in a "thunk": a data structure holding the computation's code and its inputs, whose value can then be computed on demand. The best overview of this scheme can be found here.
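To make the data structure concrete, here's a rough C++ analogy (a compiler like GHC generates this at the code level, with far less overhead than a library type; this is only meant to illustrate the idea):

```cpp
// A thunk: the computation's code and inputs, captured in a closure,
// evaluated at most once and only on demand. Purity is what makes this
// safe: the answer is the same no matter when (or on which thread) it runs.
#include <functional>
#include <iostream>
#include <optional>

template <typename T>
class Thunk {
public:
    explicit Thunk(std::function<T()> code) : code_(std::move(code)) {}

    const T &force() {            // compute on first demand, then cache
        if (!value_) value_ = code_();
        return *value_;
    }
private:
    std::function<T()> code_;     // the suspended computation
    std::optional<T> value_;      // memoized result
};

int main() {
    Thunk<int> t([] { std::cout << "evaluating...\n"; return 6 * 7; });
    std::cout << t.force() << "\n";  // evaluates the body: prints 42
    std::cout << t.force() << "\n";  // cached: prints 42, no re-evaluation
}
```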
The other solution for dividing work up into small tasks is better suited for C-family languages. It's less general, but will probably see large-scale deployment in future applications sooner than the more general work above (check out some Google results).
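I won't pin down which system that is, but fork/join task parallelism with work stealing (popularized by Cilk, and later by Intel's Threading Building Blocks) matches the description. A hedged sketch of the programming model, using std::async to stand in for the spawn/sync pair:

```cpp
// Divide-and-conquer summation: each level "spawns" one half as a separate
// task and computes the other half itself, then "syncs" on the spawned
// task. The dependencies are explicit, so a runtime is free to steal work.
#include <cstddef>
#include <future>
#include <iostream>
#include <vector>

long long sum(const int *data, std::size_t n) {
    if (n <= (1 << 16)) {                  // small task: just run it
        long long s = 0;
        for (std::size_t i = 0; i < n; ++i) s += data[i];
        return s;
    }
    auto left = std::async(std::launch::async, sum, data, n / 2);  // spawn
    long long right = sum(data + n / 2, n - n / 2);
    return left.get() + right;             // sync
}

int main() {
    std::vector<int> v(1 << 20, 1);
    std::cout << sum(v.data(), v.size()) << "\n";  // prints 1048576
}
```

A real work-stealing runtime keeps a fixed set of worker threads and steals tasks between their deques instead of creating an OS thread per spawn, which is the part that makes very fine-grained tasks affordable.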
I think future programming languages will move pervasively in these two directions in order to scale to very large multicore CPUs over the next 10-15 years.
This is one of the more interesting threads I've read here. Thanks for bringing it up, [maven]!