Efficient scheduling for single- and multi-core?

Yes, JITs work really well for UI apps, which is why Smalltalk was great for UIs (but killed by stupid Smalltalk vendors)
 
The Java HotSpot JIT Compiler does global register allocation via graph coloring.

There is no need to view the situation as AOT versus JIT. Hybrid systems are all around: static compilation, followed by JIT profile collection, followed by reoptimization during idle CPU time.

HotSpot takes its name from the fact that it doesn't optimize everything; instead, it uses heuristics to decide which parts of the code are "hot" and how much work it can devote to optimizing them, based on predicting how long that work will take.
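
A toy sketch of the counter-based idea (the structure and threshold here are invented for illustration, not HotSpot's actual mechanism):

```python
# Invented sketch of counter-based hot-code detection (not HotSpot's real
# mechanism): everything starts out interpreted, and only methods whose
# invocation count crosses a threshold get handed to the optimizer.

COMPILE_THRESHOLD = 10_000   # made-up tier-up threshold

class Runtime:
    def __init__(self):
        self.counters = {}   # method name -> invocation count
        self.compiled = {}   # method name -> "optimized" callable

    def call(self, name, func, *args):
        if name in self.compiled:
            return self.compiled[name](*args)    # fast path
        self.counters[name] = self.counters.get(name, 0) + 1
        if self.counters[name] >= COMPILE_THRESHOLD:
            self.compiled[name] = self.optimize(func)
        return func(*args)                       # "interpreted" path

    def optimize(self, func):
        # Stand-in for the JIT; a real one would weigh expected payoff
        # against the predicted cost of optimizing.
        return func

rt = Runtime()
for i in range(20_000):
    rt.call("square", lambda x: x * x, i)
print("compiled:", list(rt.compiled))            # ['square']
```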

I know about optimizing hot code, but that's not really what I'm talking about.
I haven't changed the focus of my discussion from my first reply in the thread.
It's the combination of algorithmic reformulation, architecture-specific optimization, parallelism extraction, and interpretation of an implicit program definition discussed in the post I first responded to that I don't see happening in a JIT.

It's just too much to put on the JIT interpreter/compiler/environment, while incremental gains from less pie-in-the-sky ideas like HotSpot and other techniques can match it with a far lower likelihood of catastrophic mistakes. The algorithmic reformulation alone would probably wreck a lot of hot-spot optimizations by producing an entirely different execution trace and invalidating much of the collected profile data.

The implicit program definition would take more time to unravel than necessary. It would be better to precompute it long before run-time and update a table of known good implementations at regular intervals, or just have one representative system run in optimization mode and share its results.
 
I'm not aware of any general algorithm reformulation systems. The idea seems, IMHO, to run up against Hilbert's Tenth Problem/the Halting Theorem.

I am aware of theorem provers and discovery systems, which are extremely limited, and formal programming systems (like Isabelle), but nothing that could take a high-level constraint specification of, say, a Sort function and derive sorting algorithms (e.g. sort X = for all x_i, x_j in X, i<j => x_i <= x_j).
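
For what it's worth, that spec is easy to state as an executable check; what doesn't exist is anything that goes from the check to an algorithm other than brute force. A sketch (the permutation-search "derivation" below is just for illustration):

```python
from itertools import permutations

def satisfies_sort_spec(xs, ys):
    """The declarative spec: ys is ordered (for all i < j, ys[i] <= ys[j]),
    plus the permutation condition the example spec leaves implicit."""
    ordered = all(ys[i] <= ys[j]
                  for i in range(len(ys)) for j in range(i + 1, len(ys)))
    return ordered and sorted(xs) == sorted(ys)

def derive_sort_by_search(xs):
    # "Deriving" an algorithm from the spec by exhaustive search: already
    # factorial in the input size, and it yields no reusable algorithm.
    return next(list(p) for p in permutations(xs)
                if satisfies_sort_spec(xs, list(p)))

print(derive_sort_by_search([3, 1, 2]))   # [1, 2, 3]
```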

The idea in general is undecidable, and I'm apt to conclude it is of limited usefulness except perhaps for experimental searches for new and interesting algorithms à la genetic programming.

What's more interesting to me is the limited problem of changing functional code into parallel code, or changing strict pure functional code into code with destructive update (e.g. given QuickSort in Haskell, produce the in-place version). These are tractable problems, the goal of which is to produce a system of programming that is amenable to parallelism and memory coherency without having to write it explicitly. The reason it's tractable is that the developer is forced to write code in a manner that makes it tractable (pure functional). The problem for imperative destructive-update languages is far less tractable.
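
To make that concrete, here is a sketch of the two forms (in Python rather than Haskell, purely for illustration); the transformation being discussed is getting from the first to the second automatically:

```python
# Pure version: no side effects, allocates new lists; the two recursive
# calls are independent, so they could be evaluated in parallel.
def quicksort_pure(xs):
    if len(xs) <= 1:
        return xs
    pivot, rest = xs[0], xs[1:]
    return (quicksort_pure([x for x in rest if x < pivot])
            + [pivot]
            + quicksort_pure([x for x in rest if x >= pivot]))

# Destructive-update version: partitions a shared array in place
# (Lomuto partition), trading purity for memory behaviour.
def quicksort_inplace(xs, lo=0, hi=None):
    if hi is None:
        hi = len(xs) - 1
    if lo >= hi:
        return
    pivot, i = xs[hi], lo
    for j in range(lo, hi):
        if xs[j] < pivot:
            xs[i], xs[j] = xs[j], xs[i]
            i += 1
    xs[i], xs[hi] = xs[hi], xs[i]
    quicksort_inplace(xs, lo, i - 1)
    quicksort_inplace(xs, i + 1, hi)

data = [5, 2, 8, 1, 9]
print(quicksort_pure(data))   # [1, 2, 5, 8, 9]
quicksort_inplace(data)
print(data)                   # [1, 2, 5, 8, 9]
```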

Less interesting to me is the idea of writing some preconditions and an assertion about the final result, and letting the compiler figure out the algorithm. Constraint-based systems usually devolve into pseudo-implicit programming, as people manipulate the constraints to produce more deterministic results.

The only successful declarative constraint based systems I know of are:
1) SQL/OQL
2) Regular expressions/XPath/etc
3) spreadsheets


Most expert systems, be it Prolog backwards chaining or Rete-based forward chaining (JESS/CLIPS/etc) seem to devolve into imperative style once they grow beyond a certain complexity.
 
I'm not aware of any general algorithm reformulation systems. The idea seems, IMHO, to run up against Hilbert's Tenth Problem/the Halting Theorem.

I was thinking of a constrained (stupidly so) brute-force method of algorithm formulation that, given certain conditions, should eventually produce a solution isomorphic to the exemplar, for a limited selection of algorithms (I think this should work for functions that do not maintain state between calls and can be expected, or are purposely designed, to halt). I think it would presume some kind of functional language or environment that cleans up all the nasty imperative side effects.

Since the full set of a given system's states before and after a function call is finite, it's susceptible to plain brute-force number crunching.

What such a system would have to be given is a pre-packaged problem set:
1) an exemplar of the algorithm it must reformulate
2) a limit on the algorithm's input size (i.e. "find a new algorithm that works on a given input/packet size or smaller")
a) This input packet must include the input data, various system variables, and anything else that might change how the instructions execute (it would be huge, but finite). It corresponds to one entry in a list covering the full input range possible within the bounds of a given system (huger).
3) a limit on the size of the reformulated solution (i.e. "find a new algorithm that is no bigger than N instructions")
4) a limit on the run-time (halt within X cycles on all inputs or be considered invalid), since usability limits will probably disallow any solution that takes 10,000 years
5) an output trace containing the exemplar algorithm's results for every entry in the input list (also huge)

The formulator would include a simulator of the target architecture, and it could just produce every combination of instructions in the ISA up to the size limit, run each candidate on the inputs, and compare its output to that of the exemplar.
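
As a toy sketch of that generate-and-test loop (a made-up three-instruction ISA and plain integer inputs, nothing like the full system-state enumeration described above):

```python
from itertools import product

# Toy "formulator": a made-up accumulator ISA, an exemplar function,
# bounds on program length, input range and run-time, and exhaustive
# generate-and-test against the exemplar's output trace.

ISA = {
    "INC":    lambda acc: acc + 1,
    "DOUBLE": lambda acc: acc * 2,
    "NEG":    lambda acc: -acc,
}

def run(program, acc, cycle_limit=16):
    for step, op in enumerate(program):
        if step >= cycle_limit:          # constraint 4: halt within X cycles
            return None
        acc = ISA[op](acc)
    return acc

def formulate(exemplar, inputs, max_len):
    trace = [exemplar(x) for x in inputs]            # constraint 5: output trace
    for length in range(max_len + 1):                # constraint 3: size limit
        for program in product(ISA, repeat=length):  # every instruction combo
            if [run(program, x) for x in inputs] == trace:
                return program
    return None

# Exemplar: f(x) = 2x + 2, checked on every input in a small bounded range
# (constraint 2: limited input size).
print(formulate(lambda x: 2 * x + 2, range(-8, 9), max_len=3))
# One valid answer: ('INC', 'DOUBLE')
```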

If there are N instructions in the ISA and the size limit of the solution is X, then there are something like N + N^2 + ... + N^X combinations. It's massive, but finite.

The size of the input list would correspond to every combination of the binary representation of system state + input packet: 2^P entries, where P is the number of bits in that representation.

The number of cycles that would be run would be at maximum the time/cycle constraint given.

Checking the output against the exemplar's output trace would be linear in the size of the output list, though that would in the worst case be equal to the size of the input list. It could probably happen in parallel, or nearly so, with the completion of a test run.

So finding a solution would, in the worst case, take time proportional to
TimeLimit * #InstructionCombos * InputListSize

One factor is a fixed value, one is a sum of exponential terms, one is exponential, and they're multiplied together. So it would probably take a while.

On the one hand, this solution is very naive, it doesn't cull out nonsense combinations and it doesn't reuse results. On the other, it is unbelievably parallel, so I guess we'll finally have an excuse to have all those cores after all. ;)

Ways of culling out pointless execution or reusing results would cut down on the vast amount of waste, but they could interfere with the perfect independence of each test case. Smarter use of induction to find repetitive outputs or to prove scalability to all input sizes would also compress the problem size.

Hopefully, whenever a tester finds a valid answer, it can share the result with everyone else, since once you have an answer, you'd know that there is no better solution for a given architecture within the size limit. If we just plugged the formulator program into a formulator program, I suppose we'd be able to find the ultimate formulator for a limited set of functions. ;)

Most expert systems, be it Prolog backwards chaining or Rete-based forward chaining (JESS/CLIPS/etc) seem to devolve into imperative style once they grow beyond a certain complexity.

Prolog, that's what I had in school.
 
What you're suggesting is similar to GNU SuperOpt, but beyond just exponential running times, it won't work because the fastest algorithm is not always the shortest, nor is the best algorithm always the best on all inputs.

Fast Fourier multiplication is longer than Karatsuba's algorithm, but is significantly faster for asymptotically bigger problems. Quicksort has a worst-case running time that is worse than HeapSort or MergeSort, but its average-case running time is equivalent or faster on most inputs. Fibonacci heaps will lose against other priority queue implementations in Dijkstra's algorithm on small graphs, but win on large graphs. So your inputs must contain all possible equivalence classes of inputs for the problem (relevant sizes, relevant best/worst/average cases; a statistically valid distribution of the inputs you expect the algorithm to handle).
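
This is also why practical libraries end up with hybrid algorithms rather than a single "best" one; a sketch with an arbitrary cutoff (real implementations tune the crossover point empirically):

```python
import heapq
import random

CUTOFF = 32   # arbitrary crossover point between the two algorithms

def insertion_sort(xs, lo, hi):
    # O(n^2), but tiny constants: wins on small ranges.
    for i in range(lo + 1, hi + 1):
        key, j = xs[i], i - 1
        while j >= lo and xs[j] > key:
            xs[j + 1] = xs[j]
            j -= 1
        xs[j + 1] = key

def hybrid_sort(xs, lo=0, hi=None):
    # O(n log n) merge sort for big ranges, insertion sort below the cutoff.
    if hi is None:
        hi = len(xs) - 1
    if hi - lo + 1 <= CUTOFF:
        insertion_sort(xs, lo, hi)
        return
    mid = (lo + hi) // 2
    hybrid_sort(xs, lo, mid)
    hybrid_sort(xs, mid + 1, hi)
    xs[lo:hi + 1] = list(heapq.merge(xs[lo:mid + 1], xs[mid + 1:hi + 1]))

data = [random.randrange(1000) for _ in range(200)]
expected = sorted(data)
hybrid_sort(data)
assert data == expected
```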

Then there is the problem of test-driven approaches. The generate-and-test approach handwaves over the "test" part, which is proving algorithm correctness. It's not as simple as comparing output to an exemplar, because agreement on a finite set of test inputs can't establish that the two algorithms are isomorphic (produce the same output for every possible input).

Moreover, tossing algorithms out that take longer than M cycles or N instructions on a given input will not lead to global optima. That's because many outstanding algorithms that perform amazingly on some inputs will have pathologically bad worst case running time on others. Evaluating the best algorithm will require statistical analysis or amortized run time analysis on a set of representative input sequences that you think model your problem space.

Finding optimal algorithms via brute force is undecidable; it is exactly equivalent to the halting problem. Just feed in an exemplar that searches for the first violation of Goldbach's conjecture, or better yet, feed in the complement of the brute-force search optimizer. If such an algorithm worked, it could settle undecidable math problems by "optimizing" a brute-force search for counterexamples.

It will only work in very limited cases, and not because of its slowness, but because of metamathematical barriers.
 
The formulator would include a simulator of the target architecture, and it could just produce every combination of instructions in the ISA up to the size limit, run each candidate on the inputs, and compare its output to that of the exemplar.
This technique is called "superoptimization"; research is being done on the subject, but for the time being, the practical limit on the length of program one can superoptimize appears to be somewhere around 3 to 4 instructions.

A size limit of "X" instructions and an ISA of "N" instructions does not produce very helpful constraints; when you take into account that instructions have arguments, N is perhaps somewhere in the range 10^5 to 10^6 for x86 (wild-ass guess; not considering instructions with large immediates, just instructions with register/memory arguments) and X needs to be on the order of a few dozen for any algorithm that is supposed to be more than just a peephole optimization, at which point we are talking about computation times comparable to the timeframe for the Heat Death of the Universe.
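
Putting rough numbers on that (using the wild-ass guesses above):

```python
# Back-of-the-envelope count of candidate programs for those figures:
# N effective instruction encodings, sequences of up to X of them.
N = 10**5   # low end of the guess for x86 opcode/operand combinations
X = 30      # "a few dozen" instructions

candidates = sum(N**length for length in range(1, X + 1))
print(f"about 10^{len(str(candidates)) - 1} candidate programs")
# Roughly 10^150. Even at 10^18 candidates tested per second that is
# ~10^124 years, against ~10^100 years or so in common estimates of the
# time to the Heat Death of the Universe.
```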
 
I could use some help. There are two things where I'd like to know whether I'm on the right track. The overall idea is to create a setup that makes it as easy as possible to parallelize processes on arbitrary multi-core hardware. And I assume they have some local storage.
....
Part of what you said can be done automatically by the WSClock algorithm for virtual memory. This algorithm keeps each process's working set of pages in memory ;)
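
For reference, a simplified sketch of WSClock, ignoring dirty pages and write-back scheduling (the window size and the fallback policy here are my own simplifications):

```python
import time

# Simplified WSClock sketch: resident pages sit on a circular "clock"; on a
# fault the hand sweeps, recently referenced pages get a second chance, and
# the first unreferenced page whose last use falls outside the working-set
# window is evicted.

TAU = 5.0  # working-set window in seconds (arbitrary for the sketch)

class Frame:
    def __init__(self, page):
        self.page = page
        self.referenced = True
        self.last_use = time.monotonic()

class WSClock:
    def __init__(self, nframes):
        self.nframes = nframes
        self.frames = []
        self.hand = 0

    def touch(self, page):
        for f in self.frames:
            if f.page == page:
                f.referenced = True           # page hit
                f.last_use = time.monotonic()
                return
        self._fault(page)

    def _fault(self, page):
        if len(self.frames) < self.nframes:   # free frame available
            self.frames.append(Frame(page))
            return
        for _ in range(2 * self.nframes):     # at most two sweeps of the hand
            f = self.frames[self.hand]
            if f.referenced:
                f.referenced = False          # in the working set: spare it
            elif time.monotonic() - f.last_use > TAU:
                self.frames[self.hand] = Frame(page)   # aged out: evict
                return
            self.hand = (self.hand + 1) % self.nframes
        # Nothing aged out of the working set: fall back to evicting the
        # least recently used frame (real WSClock also weighs dirty pages).
        victim = min(range(self.nframes), key=lambda i: self.frames[i].last_use)
        self.frames[victim] = Frame(page)

mem = WSClock(nframes=3)
for page in [1, 2, 3, 1, 4, 2]:
    mem.touch(page)
print([f.page for f in mem.frames])   # resident pages after the faults
```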
 
A size limit of "X" instructions and an ISA of "N" instructions does not produce very helpful constraints; when you take into account that instructions have arguments, N is perhaps somewhere in the range 10^5 to 10^6 for x86 (wild-ass guess; not considering instructions with large immediates, just instructions with register/memory arguments) and X needs to be on the order of a few dozen for any algorithm that is supposed to be more than just a peephole optimization, at which point we are talking about computation times comparable to the timeframe for the Heat Death of the Universe.

I was aware of how massively the problem size explodes, which is why I used those smileys. I blame the internet for not making clear my half-serious tone.

In the setup I laid out, if the function worked on only one register, it would have 2^64 values, not including the minimum representation of processor flags and system state, which would add to the exponent.

At best, I hoped it would take several dozen years on a massive compute cluster to do something like a simple sort of a five-element list, something quite a bit less complex than a full program.

DemoCoder said:
What you're suggesting is similar to GNU SuperOpt, but beyond just exponential running times, it won't work because the fastest algorithm is not always the shortest, nor is the best algorithm always the best on all inputs.
In theory, one could derive a good idea of the best times to use a given algorithm from the data gathered from the simulator and the input/output traces. If I modified the very crude method I described so that it made finer distinctions in some of the rejection criteria, pathological cases could be factored into the decision making.
This meta-data could be used to index the pool of algorithms, though I won't hazard a guess at how massively huge the problem of computing it from such a data set would be.

Assuming all this could be done, and only hypothetically, heuristics could be engineered for a system that would run through the small meta-tags and pick from the list of algorithms that had already been computed.
 
There is a lot of work on using genetic algorithms to program gate arrays. The first one I heard of (about 10 years ago) used that to make a gate array that would (IIRC) produce a square wave when the input was high, and a low signal otherwise. It was very interesting in that it used fewer gates than the human solution, and that there were some gates that seemed not to have any function but would break the device when removed.

But all solutions that work with algorithms (that I know of) require a strict definition of the output up front.
 
Yes, I remember that gate array example; the "non-functional" gates were taking advantage of some other physical effect, like EM interference/leakage. While it is an amazing example, GA/GP are no panacea either. They have situations where they produce very good solutions, just like simulated annealing and other meta-algorithms, but the quality of the result often depends on the space being searched. SA works well when optimal and nearly optimal solutions have bounded neighborhoods and exhibit nice gradients that can be hill-climbed; GA works well when the fitness landscape has nice gradients and global optima that are either unique or very large in comparison to local maxima. But both GA and SA have pathologically bad outcomes on some problems, and GA has the added problem of picking the best fitness function, which is itself a concept with a search space.
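
A minimal simulated-annealing sketch, just to illustrate the gradient/temperature point (the landscape and cooling schedule are arbitrary):

```python
import math
import random

# Minimal SA sketch on a bumpy 1-D landscape: at high temperature worse
# moves are accepted often enough to escape local minima; as the
# temperature drops, the search settles into whatever basin it is in.
# On landscapes without usable gradients this degenerates into a random
# walk, which is the pathological case mentioned above.

def energy(x):
    return 0.1 * x * x + math.sin(3 * x)   # shallow bowl with many local minima

def anneal(steps=20000, temp=5.0, cooling=0.9995):
    x = random.uniform(-10, 10)
    best = x
    for _ in range(steps):
        candidate = x + random.gauss(0, 0.5)
        delta = energy(candidate) - energy(x)
        # Always accept improvements; accept worse moves with a probability
        # that shrinks as the temperature falls.
        if delta < 0 or random.random() < math.exp(-delta / temp):
            x = candidate
            if energy(x) < energy(best):
                best = x
        temp *= cooling
    return best

x = anneal()
print(f"found x = {x:.3f}, energy = {energy(x):.3f}")
```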

It would be nice to live in a world where tackling very hard problems had an easy answer. Hopefully, the total computer power of civilization continues to increase exponentially for a long time, so that we can find solutions to problems in reasonable time without waiting for evolutionary searches to take 4 billion years. :)

Hopefully we'll continue to prove more and more hard problems to be in QP/ZQP/BQP.
 
Yes. But in the meantime, it is a lot more work to define all the possible outputs up front, and it requires much more processing power to come up with a good solution, than using a "not so bad" algorithm in the first place, or even just using the brute-force method. So the main advantage of a JIT is the statistics and the easy optimizations, like unrolling or inlining.

Then again, if you have the cycles to spare, it's pretty good.
 
Btw, an update about my emulator:

I removed all the standard stuff, like a flat address space and independent processes. After all, if it were as easy as that, we wouldn't need a better solution.

So, at the moment I'm experimenting with kernels as processes, that allow you to "register" ways to modify streams. Like kernel modules. Ask the OS for a function to be performed on a stream, and it will attach the module and create the streams required.

And the ops are becoming streamlined, so while it's a VM, it executes with minimal overhead. Of course, it still needs scheduling, memory management and all the other stuff. But it looks promising, in theory. We'll see how it evolves.
 
Btw, an update about my emulator:

I removed all the standard stuff, like a flat address space and independent processes. After all, if it were as easy as that, we wouldn't need a better solution.
Can I ask why you would want to abandon a flat address space? What would you use instead, and what gains do you expect from a more complex address space?

So, at the moment I'm experimenting with kernels as processes, that allow you to "register" ways to modify streams. Like kernel modules. Ask the OS for a function to be performed on a stream, and it will attach the module and create the streams required.
How low-level are the functions that need to be registered with the OS? It seems like a lot to have every function calling the OS.

What about the other parts of your design, like the memory mapping of everything?
 
Hey Maven - are you still with us? Haven't seen your posts in this thread recently - even though you started it. Could you summarize what (if anything :smile: ) you have found out from all of the replies you have received?
 
I'm still here... :)
The conversation has moved in a direction which I find interesting, but not terribly useful in the short- to medium-term timeframe. If you're working on Windows only, the CCR looks like a good thing (although I'm not too hot on the rather ambiguous operator-overloading syntax). As for me, I've been working a bit on prototypes in C (a bit of justification here), but there always comes a point where things get a bit messy and I start over. I have the general way in which I want it to work kinda figured out, but there's still plenty of thinking left to do (e.g. you submitted some interdependent jobs, so how do you wait for these to finish if they are able to create new work-units on their own that weren't part of the initial job).
Another thing I'm not yet clear about is the usual C-dilemma of who owns (in memory management terms) each data-structure, but I'll worry about that when I get there.
For the first iteration, I'll keep things as simple as possible (simple round-robin scheduling of "ready" work-units to as many worker threads as there are cores, a single global lock for each thread's run-queue and job data structure, ...), but I'll probably only get around to that in December...
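
Roughly what I have in mind for that first iteration, sketched in Python with made-up names (the real thing will be C):

```python
import queue
import threading

# Sketch of the "first iteration": ready work-units are dealt out
# round-robin to one queue per worker thread, a work-unit may submit new
# work-units as it runs, and the job counts as finished when its
# outstanding-unit count reaches zero.

NUM_WORKERS = 4  # stand-in for "as many worker threads as there are cores"

class Job:
    def __init__(self):
        self.lock = threading.Lock()      # single lock per job data structure
        self.outstanding = 0
        self.done = threading.Event()

class Scheduler:
    def __init__(self, num_workers=NUM_WORKERS):
        self.queues = [queue.Queue() for _ in range(num_workers)]
        self.next_q = 0
        self.workers = [threading.Thread(target=self._run, args=(q,), daemon=True)
                        for q in self.queues]
        for w in self.workers:
            w.start()

    def submit(self, job, work_unit):
        with job.lock:
            job.outstanding += 1
        self.queues[self.next_q].put((job, work_unit))   # round-robin dispatch
        self.next_q = (self.next_q + 1) % len(self.queues)

    def _run(self, q):
        while True:
            job, work_unit = q.get()
            work_unit(self, job)           # may call submit() with new units
            with job.lock:
                job.outstanding -= 1
                if job.outstanding == 0:
                    job.done.set()

# Usage: a unit that spawns two follow-up units before finishing.
def child(scheduler, job):
    print("child unit ran")

def parent(scheduler, job):
    scheduler.submit(job, child)
    scheduler.submit(job, child)

job = Job()
sched = Scheduler()
sched.submit(job, parent)
job.done.wait()      # waits even for units created after the initial submit
```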
 
I just wanted to point out that if you check out any good book on high-level synthesis ( you can find some papers here for example http://www.ida.liu.se/~cadlab/synthesis.html) you will find a lot of information about scheduling which may be useful.

This information is oriented towards hardware design, but the interesting thing about hardware is that it is inherently parallel - so although the info may not be directly applicable to multi-threading you might get some ideas from it.
 
[maven] said:
I'm still here... :)
The conversation has moved in direction which I find interesting but not terribly useful in the short- to medium-term timeframe.
Yeah, the responses here are mostly OT but interesting ones :)

This implementation is not because of functional programming, but related to non-strict (i.e. lazy) languages. You can have lazy functional or lazy imperative, or strict functional and strict imperative. [...]
Yeah, thunks of the sort described in the link I gave are the key ingredient. Whether the language is lazy, lenient, or eager and sequential or concurrent is just a question of who reduces the thunks and when.
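
A thunk is simple enough to sketch (Python here, with names of my own choosing):

```python
import threading

# Sketch of a thunk: a suspended computation plus a memo slot. Which party
# forces it (the consumer, a speculative worker, another core) and when is
# exactly the policy knob: lazy, lenient, or eager.

class Thunk:
    def __init__(self, compute):
        self._compute = compute
        self._done = False
        self._value = None
        self._lock = threading.Lock()

    def force(self):
        with self._lock:              # safe if several cores race to reduce it
            if not self._done:
                self._value = self._compute()
                self._compute = None  # drop the closure once evaluated
                self._done = True
            return self._value

t = Thunk(lambda: sum(range(10**6)))   # nothing computed yet

# Lazy/sequential policy: the consumer forces it on first use.
print(t.force())

# Eager/concurrent policy: hand the same thunk to a worker ahead of time;
# the memo slot makes repeated forcing harmless.
threading.Thread(target=t.force).start()
```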

DemoCoder said:
Missing from the discussion is Pi-Calculus based languages. Functional languages as well as LISP dialects (even imperative ones) derive from Lambda Calculus. Lambda Calculus has no concept of concurrency or communication. [...]
I see this as a feature of lambda calculus rather than a shortcoming. If your language is functional, then the result of any function for a given set of parameters is the same, regardless of the order in which you evaluate its components, so concurrency is implicit, and invisible.

Pi-Calculus is another one of those attempts to identify something that looks important to computing, and design custom language constructs (or just theoretical constructs) to make them explicit. This approach adds unnecessary complication IMO.
 
Can I ask why you would want to abandon a flat address space? What would you use instead, and what gains do you expect from a more complex address space?
Well, you could easily do page switching to main memory in software (it would still be very fast), but that is only useful when you run multiple processes, each in its own large virtual memory space, which still gives lots of overhead and memory contention.

How low-level are the functions that need to be registered with the OS? It seems like a lot to have every function calling the OS.

What about the other parts of your design, like the memory mapping of everything?
When everything isn't separated into its own protected memory space, everything needs to be relocated, unless you make everything relative. And you have to hook everything together, need the right entry points, etc. So, memory-mapping everything and assembling it on load seems the best solution. Precompile to virtual assembly, like with shaders, and eventually use p-code. Like a JIT, really.

And they wouldn't really be small functions as such, more like processes. It's like a very simple nano-kernel that handles I/O and memory, and schedules a set of "processes" according to their workload and priority. That's it. Load (assemble) them, register their function, and you can use it until you turn the power off, or unload it.

So, you get a first "process" that gathers data from memory and/or I/O, and has one or more output streams that each require a specific function. The system will see if that function is available somewhere, and if it isn't, ask a processor that has a low workload to load it and accept the stream. And at the end of the line, it goes back to memory or I/O.
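
Roughly, as a sketch with invented names (the real thing runs inside the VM, not Python):

```python
import queue
import threading

# Sketch: modules register a stream function with the nano-kernel, and a
# request for that function attaches the module to the least-loaded worker
# and wires up an output stream for the next stage.

class NanoKernel:
    def __init__(self, num_workers=4):
        self.registry = {}              # function name -> module
        self.load = [0] * num_workers   # crude per-worker workload

    def register(self, name, module):
        self.registry[name] = module    # the "kernel module" for streams

    def request(self, name, input_stream):
        module = self.registry[name]    # must have been registered earlier
        worker = min(range(len(self.load)), key=lambda i: self.load[i])
        self.load[worker] += 1
        output_stream = queue.Queue()   # the stream handed to the next stage
        threading.Thread(target=self._run, daemon=True,
                         args=(module, input_stream, output_stream, worker)).start()
        return output_stream

    def _run(self, module, inp, out, worker):
        while True:
            item = inp.get()
            if item is None:            # end-of-stream marker
                out.put(None)
                break
            out.put(module(item))
        self.load[worker] -= 1

# A gathering stage feeds a stream; a registered module transforms it.
kernel = NanoKernel()
kernel.register("scale", lambda x: x * 10)

source = queue.Queue()
result = kernel.request("scale", source)
for value in [1, 2, 3, None]:           # None terminates the stream
    source.put(value)
while (item := result.get()) is not None:
    print(item)                         # 10, 20, 30
```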

But, first I'll have to finish it and see if it works.
 
Tiny update: I think it works, but mostly in my head. I've been swamped with other work, real life and all. And I tend to lose interest when I think I have it figured out.

A working emulator would be nice, but it'll be done when it's done.

Oh, and it requires some very alien thinking to be able to write your application. But that would be the whole point in the first place. For example, you don't have absolute variables or data structures. You gather, demand action, manage, and pull the plug when the time is up.

The best technical analogy I can come up with would be a whole lot of emulated microcontrollers. In that sense, it's back to the beginning, but they all have a very fast network interface. And the interesting part is that it is a lot like managing a project team or company: the actual work done is negligible, all the effort is in reducing the overhead and getting it done in the first place. As expected.


It is really very much like automating the management of a large corporate digital network, with the work done at the nodes (PCs, servers, routers, printers, keyboards, etc.), and all of it distributed. It looks very much like a *nix network, with software distribution as required for Windows computers.


At the moment, I'm trying to come up with an analogy of a "basic process", like a worker. What are its needs, and how can we streamline the workflow with as little intercommunication (like meetings) as possible? Is it better to make all the modules as independent as possible, or is it better to take strict control?

I like the modules to be as independent as possible, and they have the cycles to spare. But while message (job) queues offer the best outlook, they are much more prone to unexpected behaviour, and you need a strict way to enforce completion (deadlines).

I like it a lot.

:D
 