Efficient scheduling for single- and multi-core?

[maven]

With all the multi-core systems coming out (hello my nice little Mac Pro ;)), it is becoming increasingly difficult to distribute your workload in a sensible manner. I plan on writing a small, multi-platform scheduler that takes care of those things for me, and I'll make it open source (as I'd like to use it in my other open source endeavours as well). Note that one of the most important features is that it should be near-"lossless" on a single-core system, so I don't feel bad about creating thousands of "work-units" to get something done. Enter tish - tiny scheduler. No code is written yet, but I wanted to discuss / get some feedback on some of the ideas I've had.

Essentially, tish is supposed to allocate an OS thread for each available core and, via processor affinity, force it to always be scheduled on the same one. tish then dishes out work-units to all the available worker-threads, taking into account things like input / output regions (of memory) so that data in the L2 cache can be reused when the thread gets scheduled on a core that has access to the same cache (not necessarily the same core, as L2 cache can be shared).
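
For illustration, the per-core pinning might look something like this on Linux (a minimal sketch; worker_loop and the overall structure are placeholders of mine, not tish code):

// Minimal sketch (Linux, g++): one worker thread per core, pinned via
// pthread_setaffinity_np(). worker_loop() is a hypothetical placeholder.
#include <pthread.h>
#include <sched.h>
#include <thread>
#include <vector>

static void worker_loop(unsigned core) {
    (void)core;  // would pull work-units destined for this core's queue
}

int main() {
    unsigned cores = std::thread::hardware_concurrency();
    if (cores == 0) cores = 1;  // hardware_concurrency() may report 0
    std::vector<std::thread> workers;
    for (unsigned c = 0; c < cores; ++c) {
        workers.emplace_back(worker_loop, c);
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(c, &set);  // force this worker onto core c
        pthread_setaffinity_np(workers.back().native_handle(),
                               sizeof(set), &set);
    }
    for (std::thread& w : workers) w.join();
}
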
So work-units would need to specify:
  • their input data / arguments,
  • what sort of output they're producing (e.g. a single number or a stream),
  • possibly some sort of load vector (with components for integer, fpu, mmx, sse, "bandwidth") that can be used to optimize the scheduling of differing workloads for hyperthreading-like execution unit sharing,
  • dependencies on other results / work-units (which could be referenced as inputs, maybe),
  • and of course the code that is supposed to be executed.
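
A rough sketch of what such a work-unit descriptor might look like (every name here is an invention of mine, since no tish code exists yet):

// Hypothetical work-unit descriptor; field names and the fixed array
// sizes are placeholders, not an actual tish API.
#include <cstddef>
#include <cstdint>

struct region {          // a span of memory the unit reads or writes, used
    void*  base;         // for L2-cache-aware placement by the scheduler
    size_t bytes;
};

struct load_vector {     // rough execution-unit demands, for pairing units
    uint8_t integer, fpu, simd, bandwidth;  // on hyperthreaded cores
};

struct work_unit {
    region      inputs[4];          // data / arguments consumed
    region      output;             // single number, buffer, or stream head
    load_vector load;
    work_unit*  deps[4];            // results that must exist before we run
    void      (*run)(work_unit*);   // the code to be executed
};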

I'm not quite sure yet what the best way is to specify this graph, or how to make things like "inputs" and "outputs" generic enough without being too cumbersome. Similarly, some investigation will have to be done to figure out how to make the communication between the scheduler and its worker threads as lockless as possible. Also (at least when running on a desktop system), multiple applications using tish should ideally all share the same scheduler and worker threads, but that's not a high priority.

Implementation-wise, I think the problem would be well suited to an OO-approach (i.e. C++), but portability-wise I would probably prefer sticking to basic C (as otherwise all clients would have to be either C++ or use some sort of wrapper, neither of which is particularly appealing).

In any case, I would welcome some feedback / ideas / criticism, and in the best case, collaborators...
 
OK, I'm not an expert, but here is my opinion. Take it how you will.

IMO this sort of thing will eventually be done at the compiler level, breaking the code up into small 'units of work' that are not too dependent on one another. The overhead of tracking and scheduling as many units as you suggest would be quite considerable at runtime.
However, I only really see this starting to happen once we start seeing massively multicore systems (i.e. 32+ cores). At such a time it wouldn't be a problem to devote a single core to scheduling the others, but right now, I'm not sure.

I also see problems from a code perspective. You are looking at some very complicated dependency graphs just for determining what needs to have finished before a task runs, let alone the kind of optimisation you speak of.

I'm not sure targeting C is a good idea. These days you'd better have a damn good reason to use C over C++, not to mention you already need a good reason to use C++ over C#/Java.

I'd also wonder what sort of syntax you'd be looking at. C doesn't lend itself to this sort of thing all too well. And if you even think of mentioning function pointers I will have to start torturing small fluffy animals... (I've spent the last 4 days tracking down a once-in-5-hours crash in some code we licensed; it came down to a function pointer being overwritten by an array access with an uninitialised variable. A 2-character fix. Oh, and it wouldn't occur in debug compiles. Example 523 of why I avoid unmanaged code unless I have to.)


Furthermore, you can already get surprising performance boosts by breaking your application up into larger chunks.
Splitting the 'collect objects to be rendered' and 'call API to render' procedures into separate threads in my own code increased performance from 30 to 55 fps in one extreme case (rendering 10k cubes) with 2 threads, with CPU usage going from 50% to 93%.
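
A minimal sketch of that kind of split (placeholder names and a trivial stand-in for the actual work; the real code obviously does more):

// Two-thread pipeline: one thread collects the objects to render into a
// back buffer while the other submits the previous frame's buffer.
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

std::vector<int> back_buf, front_buf;   // "objects to be rendered"
std::mutex m;
std::condition_variable cv;
bool frame_ready = false, done = false;

void collect() {                        // producer: gather the next frame
    for (int frame = 0; frame < 100; ++frame) {
        std::vector<int> objs(10000, frame);  // stand-in for scene traversal
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [] { return !frame_ready; });
        back_buf.swap(objs);
        frame_ready = true;
        cv.notify_one();
    }
    std::unique_lock<std::mutex> lk(m);
    cv.wait(lk, [] { return !frame_ready; });
    done = true;
    cv.notify_one();
}

void render() {                         // consumer: "call API to render"
    for (;;) {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [] { return frame_ready || done; });
        if (!frame_ready && done) return;
        front_buf.swap(back_buf);
        frame_ready = false;
        cv.notify_one();
        lk.unlock();
        // ... issue the draw calls for front_buf here ...
    }
}

int main() {
    std::thread c(collect), r(render);
    c.join();
    r.join();
}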

Overall I think a compiler will do a much better job of determining how to allocate resources (in this case processor cores) than the programmer. Not to mention maintenance...!


Don't get me wrong, I'm all for easier multicore; I just wonder whether a better focus would be such an automated system. Maybe take something fairly simple like .NET IL and try to break it into small blocks and a dependency graph. Looking at my own code, there are examples where this could happen all over the place.
 
OK, I'm not an expert, but here is my opinion. Take it how you will.

IMO this sort of thing will eventually be done at the compiler level, breaking the code up into small 'units of work' that are not too dependent on one another.

Isn't that something we have heard many times since the dawn of (S)MP - but are still waiting for?
 
You are looking at some very complicated dependency graphs just for determining what needs to have finished before a task runs
What's wrong with "if all inputs are available we can run"? Making sure you don't dead/live lock is the responsibility of the developer, not the scheduler.
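
A minimal sketch of that rule (all names invented; there's no real code yet): each unit counts its outstanding inputs, and whoever finishes decrements the counters of its dependents, releasing any that hit zero.

// Dependency-counting release: a unit becomes runnable exactly when its
// last input has been produced.
#include <atomic>
#include <vector>

struct unit {
    std::atomic<int>   pending;      // inputs not yet produced
    std::vector<unit*> dependents;   // units waiting on our output
};

void on_finished(unit& u, std::vector<unit*>& ready_queue) {
    for (unit* d : u.dependents)
        if (d->pending.fetch_sub(1) == 1)  // we were the last missing input
            ready_queue.push_back(d);      // d can now be scheduled
}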
 
I appreciate the response. :)

Personally I think 32 cores are a lot closer than most people think. To actually use these well, we'll need to come up with some new programming languages IMO, but until these arrive, I don't think relying on clever compilers will be enough, especially as C / C++ makes dependency / aliasing analysis very difficult.
Coarse parallelism is fine for 2-4 cores, but beyond that is where the real difficulty lies.

The whole managed / unmanaged code discussion is fine, but unless there is a lightweight CLR that is available (and hopefully preinstalled, as well as functionally identical) everywhere, I'll have to stick to the lowest common denominator, in particular for low-level libraries.
And yes, C is probably a bad choice for what I want to do, so I'll probably code some sort of prototype and see how far I can address that with syntactic sugar (macros); otherwise (and at the moment more likely) I might try C++, which of course means any users of the library will have to use that as well.

In any case, I've posted the above to the ArsTechnica forums as well, and there have been some good pointers, in particular to Microsoft's CCR (Concurrency and Coordination Runtime) for .NET 2.0, which is rather close to what I want to achieve (with the focus on coordination, not necessarily concurrency - the Channel 9 interview is very good), albeit much more powerful (and a bit more complicated).
 
[maven];841489 said:
To actually use these well, we'll need to come up with some new programming languages IMO, but until these arrive, I don't think relying on clever compilers will be enough, especially as C / C++ makes dependency / aliasing analysis very difficult.
There are already a few starting to pop up... RapidMind and PeakStream, to name two. One could argue that MATLAB has had a parallel syntax for quite some time.

I can certainly see compilers getting smarter, but I do not believe they can solve the entire problem. There is no silver bullet for parallelizing scalar code. For example, there are efficient ways to do a parallel sort (e.g. bitonic sort), but they cannot be automatically derived from scalar sorts (e.g. quicksort); they are different algorithms and must be designed with different constraints.

Auto-vectorization is certainly a useful tool, but it will not solve the fundamental problem. Languages that force you into the proper coding pattern are a good start. Parallel standard libraries will also help to some extent (see Intel's Threading Building Blocks). In the end, people will need to start writing parallel code that is fully scalable to N processors.
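
As an illustration of that pattern (a hand-rolled sketch with invented names, not TBB's actual API): a data-parallel map where N is just a parameter, so the same code runs on 1 or 64 cores.

// Data-parallel map over an array, split into one contiguous chunk per
// thread; scales with the thread count passed in.
#include <algorithm>
#include <thread>
#include <vector>

template <class F>
void parallel_map(std::vector<float>& data, F f, unsigned n_threads) {
    std::vector<std::thread> pool;
    size_t chunk = (data.size() + n_threads - 1) / n_threads;
    for (unsigned t = 0; t < n_threads; ++t) {
        size_t lo = std::min(data.size(), t * chunk);
        size_t hi = std::min(data.size(), lo + chunk);
        pool.emplace_back([&data, f, lo, hi] {
            for (size_t i = lo; i < hi; ++i) data[i] = f(data[i]);
        });
    }
    for (std::thread& th : pool) th.join();
}

int main() {
    std::vector<float> v(1 << 20, 1.0f);
    unsigned n = std::max(1u, std::thread::hardware_concurrency());
    parallel_map(v, [](float x) { return x * 2.0f; }, n);
}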
 
As far as languages for parallelism are concerned, I have a few times seen purely functional languages hailed as a silver bullet, mostly because such languages make it very easy for automated tools to analyze data dependencies (unlike in C/C++/C#/Java, aliasing is a complete non-issue). However, these languages are not particularly easy to learn or use (global variables and other persistent state = :runaway: ), and as such their popularity is limited. It also doesn't help that most of them are still mainly research projects.
 
As far as languages for parallelism are concerned, I have a few times seen purely functional languages hailed as a silver bullet, mostly because such languages make it very easy for automated tools to analyze data dependencies (unlike in C/C++/C#/Java, aliasing is a complete non-issue).
Functional languages can help, certainly, but they aren't the final answer either. Data parallelism is the computational model that scales perfectly to any number of processors (subject to the input size, of course), and data-parallel algorithms are fundamentally different from sequential ones.

Writing things in a data-parallel way is certainly much different from the previous models, but I wouldn't say that it is ridiculously hard or anything. People will learn, and research will continue to chip away at traditionally scalar solutions to hard problems. I suspect that new classes of difficult problems will surface that resist being parallelized and thus, in the long run, are relatively difficult to solve in a reasonable amount of time. Still, there has actually been a lot of research in parallel algorithms already, and more will come as parallel computing becomes the norm.
 
Forgive my ignorance here, but if I write a program in, say, Delphi, can I use your tish to make it take advantage of multicore?
 
I think the right approach for general-purpose parallelism (i.e. on generic code, not a special algorithm) is a message-passing object model. One could probably incorporate hardware into the CPU to facilitate such message passing and the associated queues with high performance. Then the scheduler can analyse this graph of data passing, look for nicely streaming/parallel areas, and allocate CPUs appropriately. You probably won't be able to extract meaningful parallelism out of most individual programs, but from their collective interactions, I'm sure you can find something.

A compiler that takes advantage of this would be one that operates on an OOP language and automatically determines which objects are of sufficient complexity to warrant becoming their own message-passing unit. This is sort of like, but in the opposite direction of, how a compiler determines what code to inline. The message passing itself would behave like any other calling convention. The particular syntax of class and object creation would probably have to be modified so that the programmer provides additional information to help with this determination (e.g. indications of persistence, complexity, dependency, explicit declaration, etc.). Profile-guided optimisation and (for managed code) things like LLVM would be of great assistance here as well.
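
A minimal sketch of such a message-passing unit (everything here is invented for illustration; a real implementation would want lock-free queues, and hardware support as suggested above):

// Each object owns a mailbox and a thread; the only way to interact with
// it is to post a message, which it handles one at a time.
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>

class mailbox_object {
    std::queue<std::function<void()>> mbox;  // pending messages
    std::mutex m;
    std::condition_variable cv;
    bool stopping = false;
    std::thread worker{[this] { run(); }};   // started last, after the queue

    void run() {
        for (;;) {
            std::function<void()> msg;
            {
                std::unique_lock<std::mutex> lk(m);
                cv.wait(lk, [this] { return stopping || !mbox.empty(); });
                if (mbox.empty()) return;    // stopped and fully drained
                msg = std::move(mbox.front());
                mbox.pop();
            }
            msg();                           // the "calling convention"
        }
    }

public:
    void post(std::function<void()> msg) {
        { std::lock_guard<std::mutex> lk(m); mbox.push(std::move(msg)); }
        cv.notify_one();
    }
    ~mailbox_object() {
        { std::lock_guard<std::mutex> lk(m); stopping = true; }
        cv.notify_one();
        worker.join();
    }
};

int main() {
    mailbox_object obj;
    obj.post([] { /* handle a message */ });
}   // destructor drains the mailbox, then joins the worker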
 
Functional languages can help, certainly, but they aren't the final answer either. Data parallelism is the computational model that scales perfectly to any number of processors (subject to the input size, of course), and data-parallel algorithms are fundamentally different from sequential ones.
Much of the attractiveness of functional languages in the parallelism case is that they appear to make it possible to auto-extract functional/control-flow parallelism; with a properly designed array comprehension mechanism (the one in Fortress looks promising, the one in Haskell does not), it should be fairly easy to express heavily data-parallel algorithms and extract large amounts of data parallelism as well.

I think the right approach for general purpose parallism (i.e. on generic code, not a special argorithm) is a message passing object model.
There exist languages built around this idea; the most well-known is probably Erlang (which is used widely in the telecom industry; unlike most other languages, it appears to be fairly easy in Erlang to write extremely heavily multithreaded programs (10000+ threads) and still have them actually work in a reasonably efficient and correct manner).
 
Parallelizing a tracer in Bluebottle

There is an OS, Bluebottle, based on Active Oberon. It was relatively easy to parallelize my realtime voxel raytracer to take advantage of multiple cores. Naturally, nothing is more suited to parallelization than this sort of algorithm. So far, the box it runs on has a single two-core chip; it is anticipated that next year a 16- or 32-core box will be obtainable, at which point a really nice demo will be unleashed.
 
Much of the attractiveness of functional languages in the parallelism case is that they appear to make it possible to auto-extract functional/control-flow parallelism; with a properly designed array comprehension mechanism (the one in Fortress looks promising, the one in Haskell does not), it should be fairly easy to express heavily data-parallel algorithms and extract large amounts of data parallelism as well.
Much of the repulsion lies in the fact that functional programming hides the true nature of the code (hardware is inherently imperative). It's easier to extract parallelism from functional code yet it's harder to implement parallel algorithms in functional programming languages.
 
Much of the repulsion lies in the fact that functional programming hides the true nature of the code (hardware is inherently imperative). It's easier to extract parallelism from functional code yet it's harder to implement parallel algorithms in functional programming languages.

By demanding full knowledge about the "true nature" of how a system works, you place heavy constraints on how the system can be implemented. The computer industry is full of examples of how hiding the "true nature" of something brings along major benefits:
  • Virtual memory
  • OS virtualization
  • Operating systems implementing file systems, hiding the true nature of physical disk accesses from you.
  • Modern superscalar OoO CPUs with trace caches and whatnot, where the sequential nature of assembly language quite effectively hides from the programmer how the CPU actually operates internally.
  • APIs like OpenGL and Direct3D, which to a large extent hide from the user the implementation differences between GPU designs from e.g. Nvidia and PowerVR.
 
Functional languages are useful for extracting parallelism because a function by definition has no side effects, so dependency analysis is vastly simplified. Any two functions not dependent on each other can be scheduled in parallel. Of course, sometimes (often, really) side effects are wanted or needed.

Pascal already makes a distinction between functions (no side effects) and procedures (possible side effects). A similar distinction could be introduced to C/C++. In fact I find it puzzling that such a mechanism hasn't been introduced yet.

Cheers
 
Pascal already makes a distinction between functions (no side effects) and procedures (possible side effects).
No, Pascal makes a distinction between having a return value (function) or not (procedure), but there are no assumptions about side effects. The C equivalent is whether the return type is void.
 
Functional languages are useful for extracting parallelism because a function by definition has no side effects, so dependency analysis is vastly simplified. Any two functions not dependent on each other can be scheduled in parallel. Of course, sometimes (often, really) side effects are wanted or needed.
Cheers

While this is theoretically true, the practicality is limited.
In the end, serial algorithms are serial, whether written in a functional language or an imperative one. And there is much less implicit parallelism than a lot of people try to claim.

What is true about functional languages is that, because functions are first-class constructs (currying also helps), there is for historic reasons a tendency to express solutions in terms of constructs like Map and Fold, which are inherently parallel. But nothing stops you from adding the same constructs to a C-like language; getting people to use them is another matter.
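
For what it's worth, C++ already carries these as library constructs, they're just not what people reach for first (std::transform is Map, std::accumulate is Fold):

// Map and Fold in a C-like language; the map step has no ordering
// dependency between elements, so it's the inherently parallel one.
#include <algorithm>
#include <numeric>
#include <vector>

int main() {
    std::vector<int> xs = {1, 2, 3, 4};
    std::vector<int> ys(xs.size());
    std::transform(xs.begin(), xs.end(), ys.begin(),
                   [](int x) { return x * x; });         // map
    int sum = std::accumulate(ys.begin(), ys.end(), 0);  // fold: 30
    (void)sum;
}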

IMO all of the "useable" functional languages support a number of imperative constructs anyway; avoiding all side effects is hard.
 
Gubbi - well, GCC already has 2 such distinctions for C/C++. Look up function attributes in GCC and check out:

__attribute__ ((pure))
__attribute__ ((const))


(I don't think the compiler actually does anything to verify that the specified attributes have anything to do with reality, though.)
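
For example, usage looks like this (the attribute syntax is real GCC; the functions are made up): "const" promises the result depends on the arguments alone, "pure" additionally allows reads of global memory, and both let the compiler merge or hoist calls.

// GCC function attributes in use; hypothetical example functions.
__attribute__((const)) int square(int x) { return x * x; }

__attribute__((pure)) int lookup(const int* table, int i) {
    return table[i];   // reads memory, but has no side effects
}

int sum_squares(int n) {
    int s = 0;
    for (int i = 0; i < n; ++i)
        s += square(7);   // a const call the compiler may hoist out of the loop
    return s;
}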
 
The biggest obstacle I've encountered with parallel design IS that trade-off between your original desire to have near "loss-less" performance on single-core architectures vs. how much performance gain you achieve on parallel platforms.

Every step of the way will be riddled with the choice of how much you give to multi-core vs. how much you take from single-core. Parallel efficiency almost always translates to additional overhead that is detrimental or of no value to the single/linear side of things.

You might also want to have a look at OpenMP, which is a totally different direction but still has some good work on parallel design fundamentals.

Language choice... well, I've always designed ground-up, specific parallel code in C, but for an API that is re-usable, you'll definitely want C# or C++. It'll be the only way you'll keep your sanity if you wish to allow generic usage across more than a single application.
 
Much of the repulsion lies in the fact that functional programming hides the true nature of the code (hardware is inherently imperative). It's easier to extract parallelism from functional code yet it's harder to implement parallel algorithms in functional programming languages.

Depends on which language and which version of Haskell. For Concurrent Clean, Erlang, or Concurrent Haskell, it's easy. In fact, with Haskell + STM, it's *easier* to implement parallel algorithms. If you don't like STM, then monads are available for mutexes and shared data. You can also decide whether you want lightweight threads or OS threads.

Besides, "imperative" style programming is still available when you want it. You just use Monads. For example, Haskell has a Byte array Monad that uses inplace update. The idea in Haskell and other functional languages with imperative ability is that you keep your functional and imperative pieces separate, so optimizations can apply differently to each.

This is not really any different from today's DX "functional" shaders, where one programs in a paradigm that is effectively side-effect free (you can't modify memory state until the "end").
 