Efficient scheduling for single- and multi-core?

Yes, like that.

Sure, we want to stream jobs, but we also want to be able to process something that requires external data (that resides in another object). We have to wait for that.

Unfortunately, the notion of "promises" has turned out not to work too well in practice. The problem is, if you ask for the promise's value before it is ready, you get effectively no benefit of latency hiding and simply devolve into a synchronous function call.

What then ends up happening is people try the following:

1) They keep calling polling methods to check on status -- ugh! major overhead (see the sketch below).
2) They try to move the fetch of the promise's value as far down in the code as possible, putting other non-dependent code before it. Sound familiar? Remember how, on NVidia vertex shaders, you have to manually separate vertex texture fetches with lots of math instructions to hide the latency?
3) They end up implementing a priority queue that they put promises into, where values that become ready are moved to the top for dequeuing.
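
Here's a minimal Java sketch of that failure mode, assuming a plain Future as the "promise"; the task, the delay, and the names are all invented for illustration:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Why fetching a promise too early devolves into a synchronous call.
public class PromisePitfall {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();

        Future<Integer> promise = pool.submit(() -> {
            Thread.sleep(100);              // simulate a slow remote fetch
            return 42;
        });

        // Anti-pattern 1: spin on polling methods -- pure overhead.
        while (!promise.isDone()) {
            Thread.onSpinWait();
        }

        // Anti-pattern 2: fetch right after issuing the request. With no
        // independent work scheduled in between, get() blocks, and the
        // "asynchronous" call is just a synchronous one in disguise.
        int value = promise.get();
        System.out.println(value);

        pool.shutdown();
    }
}
```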

In the end, they fall back to implementing a message-passing model and send their own asynchronous result messages back. They restructure their code so that pattern matching against the correct incoming messages triggers the use of the value.
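
And a rough sketch of that message-passing restructuring, again with invented message types and queue layout -- the point being that a value is only used once its result message has actually arrived:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// React to asynchronous result messages instead of blocking on a promise.
public class MessageLoop {
    interface Msg {}
    record ResultMsg(int requestId, int value) implements Msg {}
    record QuitMsg() implements Msg {}

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<Msg> inbox = new LinkedBlockingQueue<>();

        // Some other component eventually posts the asynchronous result.
        new Thread(() -> {
            try {
                Thread.sleep(100);                // simulated remote latency
                inbox.put(new ResultMsg(1, 42));
                inbox.put(new QuitMsg());
            } catch (InterruptedException ignored) { }
        }).start();

        // The dispatch loop: match incoming messages, and only use a value
        // once its message has arrived -- no polling, no blocking fetch.
        while (true) {
            Msg m = inbox.take();
            if (m instanceof ResultMsg r) {
                System.out.println("request " + r.requestId() + " -> " + r.value());
            } else if (m instanceof QuitMsg) {
                break;
            }
        }
    }
}
```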

I'm just telling you that RPC is syntactic sugar for lazy distributed programming. It's a nice, convenient lie to the programmer, hiding the true cost of distributed interactions and influencing him to write code in a manner which is neither compatible with asynchronous messaging nor able to hide latency.
 
Well, I would advocate using pipes *ONLY*. And I think I have a pretty good idea about how it could be done, eventually.

But on the other hand, I'm a programmer. And I want things as easy as the next one. So I'm trying to come up with something that would work with minimal changes to the current model, if at all possible.

But, I'll think it over and come back on this.


In the meantime: I would be HIGHLY interested in your proposal.
 
I don’t think the hard part is breaking code into self-contained chunks. Both streaming and threaded models work nicely for most code. The challenge is that once you have these chunks, you need to identify the “critical path” and schedule tasks accordingly. It’s very easy to accelerate things that have little to no effect on the overall execution time. This is why GPUs have so many things built into them to make execution deterministic. Everything from memory consumption to the number of clock cycles for a shader can be determined at compile time. This info allows things to be scheduled properly.
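
A toy sketch of the critical-path step itself, on an invented task graph (topologically ordered here, so a single forward pass suffices): the longest cost-weighted path bounds overall execution time, so tasks on it should be scheduled first.

```java
import java.util.*;

// Earliest-finish-time pass over a tiny, invented task DAG.
public class CriticalPath {
    public static void main(String[] args) {
        int[] cost = {3, 2, 5, 4, 1};               // per-task cost
        // deps[i] lists tasks that must finish before task i starts
        int[][] deps = { {}, {0}, {0}, {1, 2}, {3} };

        long[] finish = new long[cost.length];      // earliest finish times
        for (int i = 0; i < cost.length; i++) {
            long ready = 0;
            for (int d : deps[i]) ready = Math.max(ready, finish[d]);
            finish[i] = ready + cost[i];
        }
        System.out.println("critical path length: "
                + Arrays.stream(finish).max().getAsLong());
    }
}
```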

The best solution in the CPU world is to have a hardware-specific JIT compiler and a language structured with this in mind. Branches and dependent reads complicate the job of the compiler/scheduler, but in the CPU world these things are important. If two apps are running simultaneously, they should pretty much be combined into one highly parallel app and compiled together when possible. In many ways it’s like a flashback to the days of real-time computing, except with JIT thrown in to add a huge layer of abstraction between the hardware and software.
 
Most likely, AMD and Intel will try to make things as easy and transparent as possible for the average programmer, and so will go one step beyond their current abstraction model and supply hardware-operated virtual PC machines, on which you can install an OS of your liking without having to bother about the underlying hardware.

That is in contrast to consoles, where the programmers like to program close to the metal.

So, I don't think PCs will become faster at single-task execution at all, or multi-threaded for that matter, but will try and embed the OS kernel. The Linux one, not the Windows one. There is no gain to be had there.

Microsoft isn't going to be happy when the quad-cores hit the market. They have probably been having nightmares about that for quite some time now. Linux will simply kill them on raw performance. Although it would make their OS model a lot simpler by allowing all the VM stuff Windows already does to happen in hardware.
 

I don't understand how the OS and hardware became so separate in the first place. You really can't make sweeping changes to one without affecting the other. Thus things don't change all that much.
 
Ok. I think I can make a demo with an "object" model of a streaming architecture that might work. But I'm pretty sure that the way the compiler is going to compile it will make it effectively useless. I'll think about it a bit more.
 
The best solution in the cpu world is to have a hardware specific JIT compiler and a language structured with this in mind. Branches and dependent reads complicate the job of the compiler / scheduler, but in the CPU world these things are important. If two apps are running simultaneously they should pretty much be combined into one highly parallel app then compiled together when possible. In many ways it’s like a flash back to the days of real time computing except with JIT thrown in to add a huge layer of abstraction between the hardware and software.

I don't see how the JIT is necessary. If the apps are compiled in such a way that a rough outline of their internal objects and communication pathways is visible to the OS (i.e. as a message-passing network), this is enough. For example, if two apps use an object that is known to be stateless and thread-safe, then the OS can transparently combine those objects (make it like a service) and remove the redundancy. Moreover, I think using a JIT is a bad idea, because there are many cases where you need to do low-level, instruction-specific optimisation (e.g. SSE stuff). Overall, the OS scheduler doesn't need instruction-level granularity -- that's the realm of the instruction scheduler in the CPU -- and object-level/message-passing-unit visibility is enough.
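
A toy sketch of that combining step, with the OS played by a simple shared registry; the service name and factory are invented:

```java
import java.util.concurrent.ConcurrentHashMap;

// If an object is known to be stateless and thread-safe, every "app" can
// be handed the same instance instead of keeping redundant copies.
public class ServiceRegistry {
    private static final ConcurrentHashMap<String, Object> SHARED =
            new ConcurrentHashMap<>();

    // Both apps asking for "png-decoder" receive the same shared object.
    static Object lookup(String name) {
        return SHARED.computeIfAbsent(name, n -> createService(n));
    }

    private static Object createService(String name) {
        return new Object();   // stand-in for a real stateless, thread-safe service
    }

    public static void main(String[] args) {
        System.out.println(lookup("png-decoder") == lookup("png-decoder")); // true
    }
}
```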
 

The JIT compiler is there so developers can work independently of specific hardware, allowing forward compatibility to be achieved. When processors go multicore, these cores will not always be symmetric. The basic idea is that there is a standard language and library. From there, processor manufacturers can choose to accelerate either certain library functions or general performance. The JIT compiler has the freedom to replace code or library calls with hardware-specific versions. The number of possible hardware configurations grows as we add more processors to the system. It then becomes unreasonable to perform all these different compilations at design time.

There is a fairly large overhead in doing JIT, but the benefit is that it can allow extremely complex and exotic processor designs to remain compatible with existing programs. You could add several different types of processors to the system, each designed for more specific tasks. The language itself should express higher-level concepts or tasks to the compiler, giving it more freedom to make decisions on how to achieve the final results. Similar to SQL, where you specify what you want, not how you want to create it.

For example, if you want numbers satisfying a certain mathematical equation, instead of performing the steps of the equation procedurally for each number you want, you state the equation declaratively, then ask for x numbers that satisfy it. This gives the compiler the freedom to decide the order and manner in which to tackle said equation x times. Because the compiler sees this larger problem as a whole, it can recognize calculations that can be reused, as well as any parallelism that can be exploited. Because it’s JIT compiled, it can perform these optimizations with very specific knowledge of the machine it’s executing on. Another option for the compiler is to compile on install.
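
As a rough sketch of the "state the condition, then ask for x solutions" idea, today's Java streams already express something similar declaratively; the predicate and search range here are invented:

```java
import java.util.stream.IntStream;

// State the condition, ask for x numbers satisfying it; leave the "how"
// (order, fusion, parallelism via .parallel()) to the runtime.
public class DeclarativeSolve {
    public static void main(String[] args) {
        int x = 5;
        // "Equation": n^2 = 1 (mod 24). We state the condition...
        IntStream.rangeClosed(1, 1_000)
                 .filter(n -> (n * n) % 24 == 1)
                 .limit(x)       // ...and ask for x numbers that satisfy it.
                 .forEach(System.out::println);
    }
}
```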
 

I'd be careful about counting on the compiler to do all that. What you have outlined is a JIT auto-parallelizing, aggressively optimizing, algorithm-formulating compiler. I'd also worry about doing anything that is at the very least NP-complete just-in-time.

There are aggressively optimizing compilers, though most of the highly aggressive settings are used very carefully. Auto-parallelizing compilers haven't exactly wowed anybody, and I'm sure there'd be a Nobel for the guy who thinks up a compiler that can reformulate algorithm implementations on the fly.

I can't remember which logical language I worked on in school that had the implicit definition of results that you are outlining. It's stupid that I can't remember the name of something so incredibly slow. Granted, the environment it ran in basically just generated all possible values until they met the logical conditions, so it's not what I'd call particularly robust.
 
lol, the future is cool, isn't it? I really think this is the direction things are heading in the next 5 to 10 years. When you look at the roadmaps with dozens of cores on them, it really seems that these features will be needed. The speed of the compiler at these tasks has a lot to do with the languages not being structured with these things in mind. You will basically have a VM managed by Intel or AMD; who knows how M$ will fit into this.
 

Never mind the future, what about the past?

Multi-processor computers are hardly new, and some very rich organisations have been both selling them and using them for decades now. The financial imperative both to develop compilers capable of seamless parallelisation and to design languages amenable to auto-parallelisation has been very large and has been around for decades. So both the financial means and the motivation have been in place, but no one has cracked the problem.

Now looking forward, we have roadmaps with dozens of cores -- in the vast majority of PCs on the planet, most of these cores will sit idle whilst one runs Word or Excel. Where is the impetus for change coming from? What I'm looking for is for you to identify an organisation which has more money to spend on compiler development than, e.g., IBM, and which has a financial incentive to spend many tens of billions of dollars designing new languages and new compilers that will provide tangible financial benefits to a meaningful fraction of its customers.

In short: you seem to be arguing that the situation going forward is radically different from the situation looking back (at the whole industry, not just the PC space), and that for this reason all the previous barriers will melt away. I don't agree with you.
 
To be able to experiment, I've written a simple multicore emulator that emulates a number of simple processors, each with some local memory and multiple interconnect mechanisms. The numbers and amounts are selectable, limited only by available RAM. It includes an assembler and disassembler et al.

But, I'm not happy with it yet. I just made up a bunch of addressing modes and opcodes and handled them all individually, in code. It's a mess. I'm going to put some logic in the instructions and add a real decoding stage.
 
Never mind the future, what about the past?

Multi-processor computers are hardly new, and some very rich organisations have been both selling them and using them for decades now.
...
In short: you seem to be arguing that the situation going forward is radically different from the situtation looking back (at the whole industry, not just PC space), and that for this reason all the previous barriers will melt away. I don't agree with you.

I reject that notion on the basis of experience. There were other, more pressing and financially lucrative places to have put your money. With perfect wisdom, would you have invested in Netscape or Amdahl in 1996?

I experimented with a number of language add-ons in the early 90's, creating a coding structure that allowed methods to be bound to execution threads or to run in the calling thread's context, to run synchronously or asynchronously, to be dispatched separately from execution, to be allowed to execute within a class of threads, and a myriad of other combinations. At the same time, I was involved with a very smart gentleman up in Berkeley and another smarty pants down in San Antonio, creating a RAD-like environment for signal analysis. The mechanism for communication between the modules would be recognizable to anyone who's been following the Stream Processor thread.

I'm doing Java performance now, and the reason my career took that path was that there was a tad more money in the world for application server development than there was for non-destructive evaluation, and the people I met and respected and worked with were dealing more with webapp and java-related issues than the very interesting problems (to me) of the way to break code into easily distributed, independent execution pieces.

I have every confidence that, as the problem becomes enough of a pain point, enterprising individuals will find unique and interesting solutions. I hope that relieves any anxiety you may have regarding the utility that future processors may provide. :D

-Dave
 
I'd be careful about counting on the compiler to do all that. What you have outlined is a JIT auto-parallelizing, aggressively optimizing, algorithm-formulating compiler. I'd also worry about doing anything that is at the very least NP-complete just-in-time.

NP-complete is irrelevant. Many compiler algorithms are NP-complete (graph-coloring register allocation, hazard-avoidance scheduling); that doesn't preclude their use, it just means one has to use a heuristic to get "good enough" solutions rather than an exponential algorithm to get globally "optimal" solutions. Whatever is NP-complete in a JIT is NP-complete in an AOT/static compiler. However, what is solvable on a JIT is not always solvable on a static compiler. For example, systems with late binding and metaclasses can't be optimized statically very well without 100% code-coverage unit tests and a profile-feedback compiler.
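
For instance, here's a toy greedy (heuristic) coloring of an interference graph -- the usual "good enough rather than optimal" answer to the NP-complete allocation problem; the graph and register numbering are invented:

```java
import java.util.*;

// Greedy coloring: assign each variable the lowest-numbered register not
// already used by an interfering (simultaneously live) neighbor.
public class GreedyColor {
    public static void main(String[] args) {
        // adjacency: variables that are live at the same time interfere
        int[][] adj = { {1, 2}, {0, 2}, {0, 1, 3}, {2} };
        int[] color = new int[adj.length];
        Arrays.fill(color, -1);

        for (int v = 0; v < adj.length; v++) {
            BitSet used = new BitSet();
            for (int n : adj[v]) if (color[n] >= 0) used.set(color[n]);
            color[v] = used.nextClearBit(0);   // lowest register not in use
        }
        System.out.println(Arrays.toString(color));  // e.g. [0, 1, 2, 0]
    }
}
```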

I wouldn't worry about NP-completeness; I'd worry about undecidability. An algorithm-formulating compiler is isomorphic to a formal proof system, since at the very least the compiler would need to prove the algorithm correct -- not least that the algorithm halts on the inputs it is supposed to halt on.


Auto-parallelizing compilers haven't exactly wowed anybody, and I'm sure there'd be a Nobel for the guy who thinks up a compiler that can reformulate algorithm implementations on the fly.

I'd actually call DX9/OGL GLSL device drivers auto-parallelizing compilers in a sense, since they take functional code and run it with extreme parallelism. (Shaders take non-mutable inputs, produce a single output, and cannot mutate any state shared with other shaders during a "pass"/transaction. In fact, the inability to write to memory except to return an I/O result at "the end" maps 1:1 onto pure functional code, which IMHO implies the best HLL for shader writing is a pure functional one. The C-based ones have all kinds of "gotchas" precisely because they can't mutate external memory during execution.)

This just demonstrates that functional code is easy to extract data parallelism from. I gather that when people say "auto-parallelizing" they really mean "take imperative, strict, non-unique, destructive-update languages and automatically extract parallelism".
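
A small illustration of why pure, shader-style code parallelizes trivially: each element is computed from immutable input with no shared writes, so the runtime is free to farm iterations out across cores. Sketch in Java, with an invented per-element function:

```java
import java.util.stream.IntStream;

// A pure "shader" applied per element: no shared mutable state, so
// .parallel() is safe and the scheduling is entirely the runtime's call.
public class PureMap {
    public static void main(String[] args) {
        int[] out = IntStream.range(0, 1 << 20)
                             .parallel()                  // safe: no shared writes
                             .map(i -> (i * 31 + 7) % 251) // pure per-element function
                             .toArray();
        System.out.println(out[12345]);
    }
}
```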

I can't remember which logical language I worked on in school that had the implicit definition of results that you are outlining. It's stupid that I can't remember the name of something so incredibly slow. Granted, the environment it ran in basically just generated all possible values until they met the logical conditions, so it's not what I'd call particularly robust.

Well, there are two types of systems in this area: constraint-based programming languages and expert systems with forward or backward chaining. Plus there are declarative query languages like SQL. Not all are slow, but all have a restricted model of computation. (SQL = relational algebra.)
 
I have every confidence that, as the problem becomes enough of a pain point, enterprising individuals will find unique and interesting solutions. I hope that relieves any anxiety you may have regarding the utility that future processors may provide. :D

Oh it's not anxiety, I'm perfectly sanguine about these things :) Those extra cores will come in very useful for the sort of stuff I do, and if you're right I'll no longer have to expend months and months and thousands of lines of code hiding latency in my message-passing parallel codes. Just press a button and off I go to ten thousand cores! I like your future!

(Though I am still curious what benefit Janice the secretary and the billions like her are going to get out of an Intel Core Hexadecanium.)
 
NP-complete is irrelevant. Many compiler algorithms are NP-complete (graph-coloring register allocation, hazard-avoidance scheduling); that doesn't preclude their use, it just means one has to use a heuristic to get "good enough" solutions rather than an exponential algorithm to get globally "optimal" solutions. Whatever is NP-complete in a JIT is NP-complete in an AOT/static compiler. However, what is solvable on a JIT is not always solvable on a static compiler. For example, systems with late binding and metaclasses can't be optimized statically very well without 100% code-coverage unit tests and a profile-feedback compiler.

It's the just-in-time aspect that makes me leery of a compiler that has that much on its plate. I'm unaware of any compiler that does all three of the things outlined under such a time constraint. Making the solutions satisficing isn't going to make juggling three goals any easier.

*edit*
On second thought, I should clarify my thought process. You are correct that NP-completeness isn't necessarily the determining point. My reason for using NP-completeness as a point was that the scheduler's time to solution would not scale linearly.
I think in the case of a scheduling load that is proportional to the number of processes or processors, the compiler+scheduler+implementor's computational load would take up an increasing proportion of computational resources as the system or application is made more parallel.

Unless all of the diverse goals for this JIT system environment can be made to scale at most linearly with respect to the number of cores, or the overhead remains small enough to be masked by other factors, the JIT nature threatens to make initial latencies, or changes made for the sake of optimization, too costly.

I agree with your point on isomorphism, and I wonder if the system can prove this in non-trivial cases in linear-scaling time/computation.
**

I don't believe the performance gains exist in most cases, and they could probably be nearly matched by less aggressive implementations, only without the very real chance of such a complex JIT compiler having a bug that seriously louses everything up.

Occasionally optimizing the code based on collected real-time profiling can probably capture most of the performance gains without needing another core or cores running constantly to do it. Perhaps every hour or so in some environments, or even less frequently.

Developers in a more commoditized market could probably run their apps on a representative platform and just ship the execution and optimization profile with the product. They could improve performance with a profile patch, if need be.

*another edit*
Instead of having the JIT system/virtual machine actively optimizing, the optimization phase could be decoupled from everyday processing. It would still get profiling information, but it wouldn't try to optimize or reimplement algorithms by recomputing the load.

Rather, the decoupled optimizer could work through trace data in idle or prescheduled times and create a meta-data packet that would be run on something much more simple in the run-time environment.

It is more likely that a fixed profiler/implementor result script can be run with scaling that is a lot more favorable.
**
 
I could use some help. There are two things about which I would like to know if I'm on the right track. The overall idea is to create a setup that makes it as easy as possible to parallelize processes on arbitrary multicore hardware. And I assume the cores have some local storage.

The first one is the memory setup, including segmentation/paging. I think it would make sense to have a flat, 48-bit address space. And I would divide that into domains (32 bit), segments (16 bit), and blocks (8 bit), consisting of individual pages that are all 256 * 4 bytes in size (everything in 32-bit words).

Does that actually make sense? And if it doesn't, what would be a better way to do it?

The next one is that I would want to memory-map everything, from hard disks down to registers, and have a single page (256 * 32 bit) contain everything that is needed for the process to run. EVERYTHING would be relative, and there is no scope outside the process. You would have to ask the OS (process 0). And that would have to be a nano-kernel.

That single, main page for each process would contain everything it needs to keep track of: the memory-mapped registers (including the program counter, stack pointer, base addresses, segment sizes, instruction counter, pointer(s) to its page/block/segment tables, etc.), and all the other things we can think of. The idea is to make a model that is as independent as possible.

For the processor, I would keep a list of which software registers of which process are mapped to which physical registers, perform all operations relative to the base pointers of the individual processes (taking paging et al into account), and perform the actual page/block/segment switching.

How does that sound?


Btw, as I've gone this way already, it would be best if it mimicked the way hardware does it as closely as possible, while creating a model that is fast and easy to run as a JIT.

Oh, btw, I think it would be best if the actual instructions looked more like r.a += [r.b + 15]++ than regular assembly.

Ah, and what about addressing modes? Do we want to support them all, or follow the RISC idea?



Edit: I forgot to describe the memory handling model:

A single process only sees its own address space, so addresses 0..255 would be mapped as registers, and it would have a local space (with its own page tables) that is at most 256 KB in size. That would be totally local memory, so you cannot map anything else there, although the OS can swap it (or parts of it) to main memory if needed.

Beyond that, you have segments (64 MB each). They are used to map random resources into and aren't really paged, although each DMA transfer would be in multiples of 1 KB pages, as would all other communications. Although messages should probably be handled independently.

And beyond that, you have the virtual domains, each 32-bit address by 32-bit data in size (16 GB). You can map everything else there. So a hard disk would have windows of at most 16 GB to access its data.

And we would want to limit the number of active pages, blocks, segments and domains as much as possible, to a maximum of 16..64 of each (depending on the size of individual entries) for each process, so it can all fit in a single page table. Although we should make it possible to have multiples of those.
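
One possible decoding of that layout, as a hedged sketch: the field widths below are just one reading of the numbers in these posts (word-addressed, with 256-word/1 KB pages, 256 KB blocks, 64 MB segments, and 16 GB domains), not a spec:

```java
// Decompose a 48-bit word address into domain/segment/block/page/word
// fields. Widths assumed: 16-bit domain, then 8 bits each for segment
// within domain, block within segment, page within block, and 32-bit
// word within page. All of this is an illustrative assumption.
public class AddressDecode {
    record Decoded(int domain, int segment, int block, int page, int word) {}

    static Decoded decode(long addr48) {
        return new Decoded(
            (int) ((addr48 >>> 32) & 0xFFFF), // domain: top 16 bits
            (int) ((addr48 >>> 24) & 0xFF),   // segment within domain
            (int) ((addr48 >>> 16) & 0xFF),   // block within segment
            (int) ((addr48 >>> 8)  & 0xFF),   // page within block
            (int) ( addr48         & 0xFF));  // 32-bit word within page
    }

    public static void main(String[] args) {
        System.out.println(decode(0x000102030405L));
        // -> Decoded[domain=1, segment=2, block=3, page=4, word=5]
    }
}
```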
 
On second thought, it might be better to limit all of that. We don't really want any process but the nano-kernel to be able to access all that. But I don't like limits, as they become obsolete rather fast in IT.

Anyway, what would be good limits for distributed processes? And what would they be for the kernel? Do we want a single model for both, or simply a VM for the processes?
 
It's the just-in-time aspect that makes me leery of a compiler that has that much on its plate. I'm unaware of any compiler that does all three of the things outlined under such a time constraint.

The Java HotSpot JIT Compiler does global register allocation via graph coloring.

There is no need to view the situation as AOT vs. JIT. Hybrid systems are all around: static compilation, followed by JIT profile collection, followed by reoptimization with idle CPU.

HotSpot takes its name from the fact that it doesn't optimize everything; instead, it uses heuristics to decide which are the "hot" parts of the code, and how much work it can devote to optimization based on predicting how long it will take. HotSpot has two modes, 'client' and 'server'. Client mode uses a different set of algorithms and tries to minimize pauses, interruptions, and startup time for real-time UI feedback. Server mode uses more aggressive optimizations where throughput matters more than latency (e.g. graph-coloring vs. linear-scan register allocation).

In Java 6 a new hybrid JIT is being introduced. Code can be interpreted, then compiled via the C1 compiler (optimized for startup and latency), then compiled via the C2 "server" compiler (as the app runs longer). Think of this as a three-gear automatic transmission.

IBM's Java VM goes further. It has four gears: interpreted, trivially compiled (baseline compiler, no optimizations, just translation of IR to asm), compiled, and aggressively compiled (think of the last three as cc -g, -O1, -O2). Add to this, they have AOT (ahead-of-time/static compilation) as well.

Some research JITs have included the ability to store profile data between runs too, so that something which was once AOT compiled, can be switched to aggressive JIT optimizations on restart without recollecting profile info.
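
A small way to watch those gear changes on a stock HotSpot VM: -XX:+PrintCompilation is a real flag, but its output format varies between releases, so treat the program below as illustrative:

```java
// Run with:  java -XX:+PrintCompilation HotLoop
// As work() gets hot it moves from interpreted, to C1-compiled, to
// C2-compiled (under tiered compilation), visible in the printed log.
public class HotLoop {
    static long sum;

    static long work(int n) {
        long s = 0;
        for (int i = 0; i < n; i++) s += i ^ (i << 1);
        return s;
    }

    public static void main(String[] args) {
        for (int iter = 0; iter < 100_000; iter++) {
            sum += work(1_000);   // hot method: candidate for recompilation
        }
        System.out.println(sum);
    }
}
```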

As I mentioned in another post, modern OO programming languages and modern component-oriented programming styles interfere with static compilation à la C. Polymorphism is the enemy of inlining and global data-flow analysis (e.g. cross-method register allocation), and so are other forms of late binding (COM, DLL/.so, etc.). Since C's linker is as dumb as they come, and the C object format does not store IR-related information, optimizing against pre-compiled third-party libraries is hard as well. Also, since most operating systems do not have a binary format (e.g. ELF) which stores a high-level IR representation along with the executable code, interprocedural analysis is hampered as well.

For example, if you dynamically load a DLL/SO and call a method, there is no way for the compiler to statically determine a) which DLL was loaded and b) which methods the called method in that DLL might go on to call!
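
A minimal illustration of that event horizon: no static compiler can know which class this loads, so it cannot inline or analyze past the call, while a JIT sees the actual class at run time. (The Runnable cast and command-line argument are an invented example.)

```java
// The class name arrives at run time, so static analysis stops here.
public class LateBinding {
    public static void main(String[] args) throws Exception {
        Class<?> loadedClass = Class.forName(args[0]);   // decided at run time
        Runnable r = (Runnable) loadedClass
                .getDeclaredConstructor().newInstance();
        r.run();   // call target unknown until the class is actually loaded
    }
}
```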

JIT systems fix the growing dynamism problem of OO languages and component-oriented programming by storing enough information to re-link and recompile methods at run time, and perform whole-program analysis across previous event horizons like dynamically loaded third party libraries.

Moreover, with JITs you get more flexible garbage collection algorithms, like generational copying, which can vastly improve memory hierarchy performance, as well as run-time lock elimination based on biased locking and lock coarsening, and runtime escape analysis.

You wouldn't want a JIT in an embedded scenario (like a console) due to unpredictability, but for most of the other apps in the world they make sense, especially as CPUs diversify their architectures. JITs are a win for the same reason that OOOE is a win vs. static Itanium architectures: runtime information is critical to some applications.

The ideal is to have a continuum from aggressive static compilation to profile-directed JIT recompilation when the user wants it. Static compilers with feedback don't necessarily cut the mustard when you have a long-running app that you don't want to reboot.

(BTW, another benefit is that JIT systems allow *runtime* patching of code, as well as debugging of heavily optimized code through deoptimization. I can attach to a running process and patch a method with a fixed version. Great for non-stop interactive development.)
 
As an encore on JIT: how much time would that JIT compiler have to optimize something like Word? More than 90% of all the CPU cycles, for sure.
 