Hi [maven],
Late to the discussion, and some of the thoughts below have already been touched on by other responses in this thread. Anyhow...
Scheduling turns out to be the easy part. Like you said, the best strategy is to allocate one OS thread per hardware thread, and use OS thread affinities to pin each one to a different hardware thread so the OS scheduler can't migrate them away.
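For concreteness, here's a minimal sketch of that strategy on Linux, assuming glibc's pthread_setaffinity_np extension (Windows would use SetThreadAffinityMask instead):

```cpp
// One OS thread per hardware thread, each pinned to its own core.
// pthread_setaffinity_np is a Linux/glibc extension; build with -pthread.
#include <pthread.h>
#include <sched.h>
#include <cstdio>
#include <thread>
#include <vector>

int main() {
    unsigned n = std::thread::hardware_concurrency();  // # hardware threads
    std::vector<std::thread> workers;

    for (unsigned i = 0; i < n; ++i) {
        workers.emplace_back([i] {
            std::printf("worker %u on CPU %d\n", i, sched_getcpu());
        });
        // Pin worker i to hardware thread i so the kernel won't migrate it.
        // (A real engine would pin before handing the thread any work.)
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(i, &set);
        pthread_setaffinity_np(workers[i].native_handle(), sizeof(set), &set);
    }
    for (auto &t : workers) t.join();
}
```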
The hard part is keeping all of those threads supplied with useful work at all times in order to maximize overall performance. There are many ad-hoc solutions in different applications. For example, one of the game engines I know of uses two fixed-function threads (rendering and gameplay) plus a pool of threads that can uniformly handle arbitrary tasks handed off to them (physics, animation, shadowing, and so on); a sketch of that pool follows below.
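Here's a hedged sketch of that pool design (nothing engine-specific, just the shape of it): a fixed set of workers pulling uniform tasks off a shared queue.

```cpp
// Minimal task pool: N workers drain a shared queue of arbitrary closures.
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class TaskPool {
public:
    explicit TaskPool(unsigned n) {
        for (unsigned i = 0; i < n; ++i)
            workers_.emplace_back([this] { run(); });
    }
    ~TaskPool() {
        { std::lock_guard<std::mutex> lk(m_); done_ = true; }
        cv_.notify_all();
        for (auto &t : workers_) t.join();
    }
    void submit(std::function<void()> task) {
        { std::lock_guard<std::mutex> lk(m_); tasks_.push(std::move(task)); }
        cv_.notify_one();
    }
private:
    void run() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lk(m_);
                cv_.wait(lk, [this] { return done_ || !tasks_.empty(); });
                if (done_ && tasks_.empty()) return;  // drain, then exit
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task();  // e.g. a physics step, an animation blend, a shadow pass
        }
    }
    std::vector<std::thread> workers_;
    std::queue<std::function<void()>> tasks_;
    std::mutex m_;
    std::condition_variable cv_;
    bool done_ = false;
};
```

Usage is just `TaskPool pool(6); pool.submit([]{ /* physics task */ });` while the two fixed-function threads run alongside. The weakness, as noted above, is keeping that queue from going empty.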
But there are two very well-studied general solutions for dividing work up into tiny units with well-defined dependencies.
The most general one is aimed at purely functional languages: Haskell, and the pure subsets of ML and OCaml. In these languages, computations have no side effects; each returns a result that depends only on its inputs. Because of that, computations can be executed in any order, including simultaneously, without changing the observable results of the program.
In terms of low-level implementation, a compiler for such a language generates code so that each individual computation (basically, a subterm of an expression) is stored in a "thunk": a data structure holding the computation's code and its inputs, whose value can then be computed on demand. The best overview of this scheme can be found here.
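To make the data structure concrete, here's a rough C++ analogy (a compiler like GHC generates this at the code level, with far less overhead than a library type; this is only meant to illustrate the idea):

```cpp
// A thunk: the computation's code and inputs, captured in a closure,
// evaluated at most once and only on demand. Purity is what makes this
// safe: the answer is the same no matter when (or on which thread) it runs.
#include <functional>
#include <iostream>
#include <optional>

template <typename T>
class Thunk {
public:
    explicit Thunk(std::function<T()> code) : code_(std::move(code)) {}

    const T &force() {            // compute on first demand, then cache
        if (!value_) value_ = code_();
        return *value_;
    }
private:
    std::function<T()> code_;     // the suspended computation
    std::optional<T> value_;      // memoized result
};

int main() {
    Thunk<int> t([] { std::cout << "evaluating...\n"; return 6 * 7; });
    std::cout << t.force() << "\n";  // evaluates the body: prints 42
    std::cout << t.force() << "\n";  // cached: prints 42, no re-evaluation
}
```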
The other solution for dividing work up into small tasks is better suited for C-family languages. It's less general, but will probably see large-scale deployment in future applications sooner than the more general work above (check out some Google results).
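I won't pin down which system that is, but fork/join task parallelism with work stealing (popularized by Cilk, and later by Intel's Threading Building Blocks) matches the description. A hedged sketch of the programming model, using std::async to stand in for the spawn/sync pair:

```cpp
// Divide-and-conquer summation: each level "spawns" one half as a separate
// task and computes the other half itself, then "syncs" on the spawned
// task. The dependencies are explicit, so a runtime is free to steal work.
#include <cstddef>
#include <future>
#include <iostream>
#include <vector>

long long sum(const int *data, std::size_t n) {
    if (n <= (1 << 16)) {                  // small task: just run it
        long long s = 0;
        for (std::size_t i = 0; i < n; ++i) s += data[i];
        return s;
    }
    auto left = std::async(std::launch::async, sum, data, n / 2);  // spawn
    long long right = sum(data + n / 2, n - n / 2);
    return left.get() + right;             // sync
}

int main() {
    std::vector<int> v(1 << 20, 1);
    std::cout << sum(v.data(), v.size()) << "\n";  // prints 1048576
}
```

A real work-stealing runtime keeps a fixed set of worker threads and steals tasks between their deques instead of creating an OS thread per spawn, which is the part that makes very fine-grained tasks affordable.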
I think future programming languages will move pervasively in these two directions in order to scale to very large multicore CPUs over the next 10-15 years.
This is one of the more interesting threads I've read here. Thanks for bringing it up, [maven]!