Future console CPUs: will they go back to OoOE, and other questions.

It will happen, but mainly through increasing processor frequency, which is not rising as fast as one would wish because chip technology is hitting physical limits that have to be circumvented. A single core is better than a dual core any day, if you can get the same performance out of it as a dual core with the same number of transistors. Sadly, you can't. The reason multi-core designs are proliferating is the limits to single-core performance; limits which are expanding slowly, but limits that are still there nonetheless.
The point I was addressing was the idea that because of chips like those from the Terascale project, Intel was going to give up on OOE.
That's not going to happen. That single-threaded performance will no longer grow as rapidly as it once did doesn't mean that new designs are just going to give up on all the capabilities that led up to this point.
On average, OOE (not even the ultra-speculative stuff of the latest gen) gives about a 50% boost in performance.
Adding 40 more cores to an 80 simple core chip is not going to lead to the same boost thanks to diminishing returns.

I am not saying multi-tasking performance has to be bad, but if you are that concerned about latency in response, then you should not multi-task because that is the biggest latency problem right there - worse than almost everything else.
I got off on a tangent; the latency I was addressing was the wall-clock time of completion for a given stream of instructions.

High-throughput designs trade off on this latency for the advantage of having more streams running simultaneously. As long as the portion of code that runs serially is kept small, the overall penalty to performance scaling is small.

However, in problems that are not perfectly parallel, the serial component begins to dominate sooner. What is troubling with respect to going so far back as to rely on a pool of in-order scalar pipelines is that in all but the most ideal workloads, the limited extraction of even easy ILP means that an even larger portion of the code gets lumped into that serial component of Amdahl's equation than is necessary.
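For reference, the equation in question here is Amdahl's law, with p the fraction of the work that can actually run in parallel and N the number of cores:

```latex
S(N) = \frac{1}{(1 - p) + p/N}
```

Everything the feeble cores push out of p and into the (1 - p) term caps the speedup, no matter how large N gets.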

Either a single core takes forever to finish a stream, or it will enlist the aid of other cores. This is not without cost: inter-core communication will never be faster than result forwarding, register storage, and cache access within a core.
The more cores there are, the more sacrifices that must be made to adequately link and coordinate them.
It's not a win to say parallel execution does better on a problem if only because the single-threaded performance has gone down the crapper.

If a problem is not very parallel (and a vast number are not), the serial portion (which is bigger than it needs to be because the cores are so feeble) becomes the limiter of performance almost immediately. If a problem is not embarrassingly parallel (and most problems are not), then past a certain point the sea of cores winds up waiting for some critical stream to complete, wasting time and energy.

No number of cores can change that, but designers can help minimize how much code is forced to run serially with more robust cores. Even a small number of wider OO cores could catch many of these cases.


There are a number of real-time OSes out there, and none of the true RTOSes are multi-tasking OSes, for exactly this reason. With multi-tasking, say, video decoding, sound, and disk I/O while running an application at the same time, you can and do get video or sound stuttering. You aren't likely to get that if you run the time-critical stuff (the video and audio) on SPEs, for example, and for this type of use it is efficient, since you utilise the SPEs fully while you are streaming and can use those SPEs for something else once you stop streaming.
It also helps that most real-time OSes are used for environments that don't really need great audio or video quality.
Real-time is about stringent maintenance of deterministic behavior and fast response.

In most situations, the constraints don't need to be that harsh, which is good because tight constraints mean reduced performance.

Sound or video stuttering in most user environments can be handled by maybe four cores, so long as the IO junk can be handed to one of them.

Most heavy parallel environments don't need to be real-time either, and for large systems it would be incredibly serializing to force determinism.

Let's note that multiprocessing of any kind is a far greater source of non-deterministic behavior than wider issue, OOE, or context switching.

You can of course reduce the timeslices to very short intervals to get better response; however, if you do this, your performance drops. You can also have time-critical code interrupting other time-critical code, and non-reentrant library code you may need to use that requires spin locks or semaphores, which causes even more latency.
Locks and semaphores don't care how many cores there are. True, it helps if there's at least one, and it makes a difference if there are more than one, but they are software constructs. They don't know anything about the silicon that runs them.

I suppose you could have a core for every thread in the environment, and 99% of the time those cores are unavailable and doing nothing.
It's just better to aim for a reasonable solution, where the cost of context switching is minimal but hardware utilization is high.
Why not have a context switch and get over it?
With SMT or SoEMT, the cost is either not there or it happens in less than 20 cycles.

It's all about diminishing returns. It's why there aren't 1000-way associative caches. We could do it, but the difference from 18-way is almost nothing.
 
Locks and semaphores don't care how many cores there are. True, it helps if there's at least one, and it makes a difference if there are more than one, but they are software constructs. They don't know anything about the silicon that runs them.

i believe SPM's point was that the operational latency of a construct like a spin-lock is fairly directly dependent on how many cores are trying to acquire it. and i tend to agree with him on this: increasing the number of cores may increase your overall throughput, but it has low chances of not affecting your thread-wise latencies as overall contention increases.
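for concreteness, this is the kind of construct in question - a minimal test-and-set spin lock (a generic sketch, not anybody's engine code). every waiter spins on the same flag, so both the acquisition latency and the coherence traffic on that cache line grow with the number of contending cores:

```cpp
#include <atomic>

// Minimal test-and-set spin lock. Every waiter hammers the same flag, so the
// more cores contend, the longer each one waits and the more traffic the
// lock's cache line generates.
class SpinLock {
    std::atomic_flag flag = ATOMIC_FLAG_INIT;
public:
    void lock() {
        while (flag.test_and_set(std::memory_order_acquire)) {
            // spin until the current holder releases
        }
    }
    void unlock() {
        flag.clear(std::memory_order_release);
    }
};
```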

It's all about diminishing returns. It's why there aren't 1000-way associative caches. We could do it, but the difference from 18-way is almost nothing.

you're neglecting the fully-associative caches. currently, though, i'm left with the impression that real low-latency caches tend to step back from high/full-associativity. but somebody with a closer look at various architectures may correct me here.
 
It's all about diminishing returns. It's why there aren't 1000-way associative caches. We could do it, but the difference from 18-way is almost nothing.
Actually, IIRC, there have been some CPUs with fully-associative caches but they might not have had quite that many cache lines.
 
i believe SPM's point was that the operational latency of a construct like a spin-lock is fairly directly dependent on how many cores are trying to acquire it.
How often are you going to need spin locks for synchronization in a game engine though? It ain't an OS.
 
Is that with or without the -j command line option on (GNU's) make?

Even with a parallel build, it's still very slow at this kind of workload. A compile that might take 15 minutes on an Opteron is more like an hour on Niagara. Heavily serialized tasks are even worse, like a database bcp that takes 5-6 hours. But then, what do you expect from a scalar 1.2 GHz core?

It makes a great webserver, though, where you have plenty of threads (at least one per request) and typically low ILP anyway. However, it would be a mistake to look at an architecture like this, or Intel's Terascale, and think it's the way of the future on the desktop or in game consoles.
 
Terascale is Intel's proof of concept for the idea of putting many small cores on a die. It's not the replacement for OOO according to the rest of their roadmaps, which include 2-4 heavy duty OOO cores surrounded by a bunch of smaller specialized ones.

There are markets that could use something like Terascale, but it's definitely not the one where Intel makes most of its money.
Yes, because single-thread performance now means: doing all kinds of different workloads at the same time, from a single instruction stream. That will change.

In the near future, you will need to make a separate task for each type of specialized workload, and in 5-10 years from now you will probably have to make a single task for each pipe. That's the only way you can improve performance all around.

The first use of modern OoOE was in the single-piped FPU of the IBM System/360 Model 91.
OoO exists because the semantics of the Von Neumann computer architecture require that each instruction be done sequentially, even if the actual data flow of a problem is not serial.

A must come before B, which comes before C, even when there is no reason for that order beyond the fact that instructions have to be written down in some order.
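A trivial, generic illustration of that (nothing ISA-specific about it): the program order below says a, then b, then c, but only the data flow actually constrains execution.

```cpp
// Program order says a, then b, then c: only the data flow constrains it.
int example(const int* p, int x, int y) {
    int a = *p;        // may stall for many cycles on a cache miss
    int b = a + 1;     // genuinely depends on a, so it has to wait
    int c = x * y;     // depends on neither; an OoO core issues it while the
                       // load is outstanding, an in-order scalar core just stalls
    return b + c;
}
```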

OoO still exists because a 50% performance gain is nothing to sneeze at.
There are a lot of problems in which splitting execution into streams or threads does not lead to a better than 50% gain, or leads to a net loss.
Yes, good point. But that's only a problem with deep pipelines and/or when you feed multiple, dependent pipes through the same instruction decoder. If your micro-core only consists of a single pipe, it (might) make more sense to spend the transistors in making the execute stage a single clock. Broaden it, instead of deepening it.

There will always be a demand for single-threaded performance, regardless of how well concurrency takes off. There will be those compute-intensive critical tasks where latency is very important, and large pools of simple cores do not have the turnaround time of a single strong one.
Yes, that's true. But the single thread of today will be split into multiple, workload-dependent tasks tomorrow. It's the only way!

Communication within a single core is on the order of 1-2 cycles. Something as simple as a L2 cache access takes something like 12 cycles on an x86 core. The communications latency for a fabric supporting many tiny cores will not be better than a monolithic core's cache latency, and depending on the network, can be much worse.

So we know there will be tasks that require a certain amount of result forwarding that will either take 1-2 cycles on a monolithic core or something several (or six, ten, or twenty) times longer getting routed through a switch network.
Yes, but you should think in streams and micro-threads, not individual, dependent instructions.

GPUs have an embarrassingly parallel workload to work on. That makes them an exceptionally poor example for splitting anything that isn't similarly embarrassingly parallel.
That doesn't matter if you look at a single set of staggered pipes, that all stream their data from one specialized unit to the next.
 
How often are you going to need spin locks for synchronization in a game engine though? It ain't an OS.

FWIW I think most game engines have more resemblance to OSes than any other piece of software.
The core of a game engine is about sequencing and scheduling; this is especially true on the gameplay side. In fact I've always been somewhat surprised that most game engines don't steal more from the world of real-time OSes.

If you're going to run across different cores you're going to have to deal with contention for shared resources (most frequently memory), and what people refer to as lock-free data structures are just spin locks on small code fragments with a low likelihood of collision.
 
If we are just going to repeat tired old arguments I'll just repeat my truism: most of the compute-intensive work in games boils down to simulation ... and reality is massively parallel.

ERP, having code getting write locks on data in a non-deterministic manner just seems ... wrong to me. As far as I can see, in between I/O refreshes (i.e. the vast majority of the time in a game engine) your code should be deterministic. Spin locks to me seem more of a symptom of a problem than a solution.
 
i believe SPM's point was that the operational latency of a construct like a spin-lock is fairly directly dependent on how many cores are trying to acquire it. and i tend to agree with him on this: increasing the number of cores may increase your overall throughput, but it has low chances of not affecting your thread-wise latencies as overall contention increases.
I forgot about the contention aspect of the lock, but that would reinforce my position, since I'm the one arguing the point that multiple cores bring larger contention and synchronization issues.

SPM's position is that due to context switching, a single core has a worse lock contention problem between the switched threads that is somehow not a problem when they are spread out on other cores.

I would argue that, while not completely penalty free, a single core forced to churn through synchronization structures for whatever reason, would likely resolve them very quickly with a cache access.

you're neglecting the fully-associative caches. currently, though, i'm left with the impression that real low-latency caches tend to step back from high/full-associativity. but somebody with a closer look at various architectures may correct me here.

Any cache clocked near core speed with reasonable latencies has to step back or be incredibly small. While Itanium 2's ALAT isn't exactly a cache, since it doesn't store more than an address, it is fully associative and only has 32 entries.
Resolving more than 2 or 4 cache tags within three cycles is hard to pull off at GHz speeds in the L1 data cache.

A fully associative cache at megabyte sizes would have an access time worse than a DRAM module, since not even an entire DRAM module is accessed all at once.
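To put the cost in concrete terms, here's a toy software model of what an N-way lookup has to do on every access; in hardware the N tag compares happen in parallel comparators, and that is exactly the part that blows up as N heads toward fully associative (where N equals the number of lines):

```cpp
#include <cstdint>
#include <vector>

// Toy model of an N-way set-associative lookup: the address selects one set,
// and the tag is compared against every way in that set. Hardware does the N
// compares in parallel, which is why large N costs area and cycle time.
struct Cache {
    unsigned ways, sets, line_bytes;
    std::vector<uint64_t> tags;     // sets * ways entries
    std::vector<bool>     valid;

    Cache(unsigned w, unsigned s, unsigned lb)
        : ways(w), sets(s), line_bytes(lb),
          tags(w * s, 0), valid(w * s, false) {}

    bool lookup(uint64_t addr) const {
        uint64_t line = addr / line_bytes;
        uint64_t set  = line % sets;
        uint64_t tag  = line / sets;
        for (unsigned w = 0; w < ways; ++w) {      // N tag comparisons per access
            size_t i = set * ways + w;
            if (valid[i] && tags[i] == tag) return true;
        }
        return false;
    }
    // A fully associative cache is the degenerate case: sets = 1, ways = lines.
};
```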
 
FWIW I think most game engines have more resemblance to OSes than any other piece of software.
The core of a game engine is about sequencing and scheduling; this is especially true on the gameplay side. In fact I've always been somewhat surprised that most game engines don't steal more from the world of real-time OSes.

If you're going to run across different cores you're going to have to deal with contention for shared resources (most frequently memory), and what people refer to as lock-free data structures are just spin locks on small code fragments with a low likelihood of collision.
If we're going to look at games (and we already had this same argument a few times, IIRC), it's about the direction you're thinking. Instead of trying to determine all the states of all the objects in advance, you should have them decide what to do by themselves and see where you end up.

Like, instead of a single AI thread that loops through all the objects in sequence while expressing the rules, you first have the objects do as they feel like, and then take a look around to see what happened, and apply the rules.

You make triggers: when two objects interact (that's what you're interested in), you add them to the end of the list for reprocessing and flag them for interaction. Otherwise, just let them roam free.

For example: are you interested in a bullet that is in flight? Of course not. You only become interested when it hits a target.
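Something like this, roughly (a sketch of the idea with made-up names, not anyone's engine): objects run freely, and only the pairs that actually flag an interaction get queued for the rules pass.

```cpp
#include <vector>

// Rough sketch of the "let them roam, reprocess only on interaction" idea.
// All names here are illustrative.
struct Object { /* position, velocity, state ... */ void update() {} };

struct Interaction { Object* a; Object* b; };

std::vector<Object>      world;
std::vector<Interaction> pending;   // only flagged pairs get reprocessed

void tick() {
    for (Object& o : world)
        o.update();                 // every object does as it feels like

    // whenever detection (collision, proximity, trigger volume...) finds two
    // objects interacting, the pair is appended to 'pending' and flagged

    for (const Interaction& i : pending) {
        (void)i;                    // look around, see what happened, apply the rules
    }
    pending.clear();
}
```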
 
If we are just going to repeat tired old arguments I'll just repeat my truism: most of the compute-intensive work in games boils down to simulation ... and reality is massively parallel.

ERP, having code getting write locks on data in a non-deterministic manner just seems ... wrong to me. As far as I can see, in between I/O refreshes (i.e. the vast majority of the time in a game engine) your code should be deterministic. Spin locks to me seem more of a symptom of a problem than a solution.

My point had little to do with how serial or parallel game code is; either way, most games at their core are just sequencers/schedulers. To be fair, I have seen people apply DB models and structure games in terms of queries on a database; I prefer the OS model and structuring in terms of actions on the data.

Ignoring the AI argument for a minute. Let's take a trivial problem.

I want to submit graphics primitives on separate threads. This is useful even if it's just for debugging.
Since the GPU is basically a big state machine, I have basically two choices: either I build N independent lists, where N is the number of "threads" submitting primitives, and combine them at the end, or I have a shared data structure (likely a double-buffered array of lock-free stacks to do trivial bucket sorting on some criteria). Which of these is faster is going to be dependent on the cost of the contention on the lock-free structures versus the cost of the merge operation.

Even if you have largely independent tasks, at some point you have to resolve access to your shared resources. In a shared memory environment, spinlocks (in lock-free data structures) will in general be orders of magnitude faster than mutexes or similar synchronisation primitives.
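For what it's worth, the first of those two options looks roughly like this (a generic sketch, not actual renderer code): each submitting thread appends to its own private list with no synchronisation at all, and the whole cost of sharing is the merge/sort at the end of the frame.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Option 1 sketched generically: N private lists, merged once per frame.
// The alternative (a shared lock-free structure) trades this merge step for
// contention on every single submit.
struct Primitive { uint64_t sort_key; /* state, vertex data, ... */ };

using CommandList = std::vector<Primitive>;

void submit(CommandList& my_list, const Primitive& p) {
    my_list.push_back(p);                 // thread-private: no lock, no contention
}

CommandList merge_for_frame(std::vector<CommandList>& per_thread_lists) {
    CommandList merged;
    for (CommandList& l : per_thread_lists) {
        merged.insert(merged.end(), l.begin(), l.end());
        l.clear();
    }
    // bucket/sort by whatever criteria the GPU state machine wants
    std::sort(merged.begin(), merged.end(),
              [](const Primitive& a, const Primitive& b) {
                  return a.sort_key < b.sort_key;
              });
    return merged;
}
```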
 
My point had little to do with how serial or parallel game code is; either way, most games at their core are just sequencers/schedulers. To be fair, I have seen people apply DB models and structure games in terms of queries on a database; I prefer the OS model and structuring in terms of actions on the data.

Ignoring the AI argument for a minute. Let's take a trivial problem.

I want to submit graphics primitives on separate threads. This is useful even if it's just for debugging.
Since the GPU is basically a big state machine, I have basically two choices: either I build N independent lists, where N is the number of "threads" submitting primitives, and combine them at the end, or I have a shared data structure (likely a double-buffered array of lock-free stacks to do trivial bucket sorting on some criteria). Which of these is faster is going to be dependent on the cost of the contention on the lock-free structures versus the cost of the merge operation.

Even if you have largely independent tasks, at some point you have to resolve access to your shared resources. In a shared memory environment, spinlocks (in lock-free data structures) will in general be orders of magnitude faster than mutexes or similar synchronisation primitives.
I think that is the best argument you can make for adding a unified (GP)GPU to the CPU and treating it like a Cell. Which is essentially where everyone is heading.
 
Yes, because single-thread performance now means: doing all kinds of different workloads at the same time, from a single instruction stream. That will change.

That isn't all that single-threaded performance is. Even in a perfectly threaded application, the fastest the task will ever complete is the length of time it takes for the longest executing thread to complete its pass.
That's proportional to the single-threaded performance of the core it's on.

The amount of time a given control thread or critical data result takes to complete is determined by the single-threaded performance of the core it's running on.
The amount of time every other core is stuck doing nothing waiting on that critical thread is directly proportional to the single-threaded performance of that one unlucky core.

The faster parallel code is run, the more important the bottleneck code becomes.
Whatever serial sections of code there are, and there will always be such sections, will take as long as it takes for the cores they reside in. Depending on the single-thread performance of the core.

That's not going to change. Ever.
It's math, it's reality. It's a law of diminishing returns. A dude named Amdahl wrote up a nifty equation about it.

I'm not saying there can't be a lot of specialized processors and smaller cores on a chip. I'm just saying that it would be incredibly helpful to have at least one strong OOE chip to step in on those critical code paths.

What would the cost be in a 2010 chip to have a Conroe II processor on a die that would otherwise have 64 small cores?
What if that means I could only have 50 micro cores to go with the Conroe II?
What's the worst that could happen?
In a purely parallel case, it's 51 cores vs 64.
A 20% disadvantage, if for some reason the Conroe II only counted as a single mini-core performance wise.
By 2011, it will be 128 cores, but the Conroe II is probably smaller. It's 128 vs 115, or 10% less.

The instant one thread's completion time becomes a bottleneck, that Conroe II becomes worth more than a thousand micro cores. So it's either you're really unlucky and you have a perfectly scaling parallel problem, and it loses by the exact percentage of core shortfall, or it does way better, depending on how often the chip has to wait on a single thread.

Here's the fun part, it won't ever lose by 20%. 2 cores are a little short of being twice as good as 1 core, and 4 cores a little less than twice as good as 2 cores, and so on.
It could be almost 20% behind, or maybe 2% behind in a number of special tasks.
It will be ahead everywhere else.
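To put rough numbers on that (a serial fraction of 1% picked purely for illustration), Amdahl's law already closes most of the gap before the big core's advantage on the serial stretches is even counted:

```latex
S(64) = \frac{1}{0.01 + 0.99/64} \approx 39.3
\qquad
S(51) = \frac{1}{0.01 + 0.99/51} \approx 34.0
```

So the 51-core chip trails by roughly 13% rather than 20%, and every serial section the Conroe II chews through faster tips the balance the other way.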

In the near future, you will need to make a separate task for each type of specialized workload, and in 5-10 years from now you will probably have to make a single task for each pipe. That's the only way you can improve performance all around.
There is such a thing as too much of a good thing. Regressing a design that far back in single-threaded performance will actually make problems less parallelizable, defeating the purpose of having a lot of cores.

Yes, good point. But that's only a problem with deep pipelines and/or when you feed multiple, dependent pipes through the same instruction decoder. If your micro-core only consists of a single pipe, it (might) make more sense to spend the transistors in making the execute stage a single clock. Broaden it, instead of deepening it.
That is not what OOE is about. OOE is about a processor trying to execute whatever instructions it can as soon as their data is available, regardless of whether an instruction that just happened to come before it has stalled.

Instructions always come in an order, because they have to be written in some order. There's no reason why instructions in a given order must execute in that order if the data is available.

It doesn't matter what kind of decoder a chip has or how deep the pipeline is. If an instruction on the chip can be run, it will be run as soon as possible, so that the chip doesn't waste time doing nothing when it could be doing something.

OOE is very useful in superscalar processors, but it still has benefits on a single pipeline. Unless you believe no instruction will ever be wanting for data again after 2010, OOE will still give a performance boost.

Yes, that's true. But the single thread of today will be split into multiple, workload-dependent tasks tomorrow. It's the only way!
It can't be split infinitely. The more threads that run, or the more cores that are present, the more they need to play nice. You don't want to add work where you don't need to add it. You only go as far as you need.

Yes, but you should think in streams and micro-threads, not individual, dependent instructions.
But there's a point where the multicore approach gets to such an absurd degree that something as simple as satisfying a dependent instruction takes 10 times longer to complete.
Dependencies aren't going away, they happen all the time. Why make something that happens all the time take even more of all of the time?

That doesn't matter if you look at a single set of staggered pipes, that all stream their data from one specialized unit to the next.

That doesn't answer the question of how treating a non-embarrassingly parallel problem as an embarrassingly parallel one makes all the problems go away.

If embarrassingly parallel were how we could describe every problem we could run into, it wouldn't be called embarrassingly parallel; it would be called solved, along with every other problem there ever was and ever will be, before, during, and after the instant we asked.

Since reality as we know it insists on saying certain things can't happen before others, that's obviously not the case.

Threads cannot take you where the data refuses to go. You cannot make work where the workload refuses to divide.
 
I think calling the retry of the memory transaction in lock free algorithms a "spin lock" is confusing terminology. Lock has a well defined meaning, one of exclusive access to a resource for a particular extent, followed by a release of the lock. Lock free algorithms typically attempt an atomic memory write, if they fail, they recompute the memory transaction, and try again. They don't have a release operation, and don't hold onto any exclusivity.

Yes, through a vague analogy one can call them "spinlocks", but that would be confusing. Mutexes are also often implemented as spinlocks, but the performance of mutexes vs. atomic CAS/TAS under contention loads on a large number of processors differs by orders of magnitude. Does one call database transactions on tables with multiple readers and writers which don't use table-level locking "spinlocks"?
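To make the distinction concrete, the typical shape of that attempt/recompute/retry pattern is a CAS loop, e.g. a generic Treiber-style stack push (written here purely as an illustration):

```cpp
#include <atomic>

// Generic lock-free stack push: there is no lock to acquire or release.
// If another thread got in first, the CAS fails, the expected head is
// refreshed, the link is recomputed, and we try again.
template <typename T>
struct Node { T value; Node<T>* next; };

template <typename T>
class LockFreeStack {
    std::atomic<Node<T>*> head{nullptr};
public:
    void push(T value) {
        Node<T>* n = new Node<T>{value, head.load(std::memory_order_relaxed)};
        while (!head.compare_exchange_weak(n->next, n,
                                           std::memory_order_release,
                                           std::memory_order_relaxed)) {
            // the failed CAS stored the current head into n->next;
            // just retry the commit
        }
    }
};
```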

Here is an interesting paper on scaling performance up to 45x on a 300+ core Java machine http://developers.sun.com/learning/javaoneonline/2006/coolstuff/TS-5354.pdf that uses several techniques (lock free, lock splitting, batching, etc).

Also, not every lock-free algorithm needs any kind of "retry" at all. There are some concurrent-read concurrent-write algorithms that work via a randomized partition function that ensures each thread only touches a certain memory location at one time (typically a PRNG based on hash of data), and still others that work by having threads that write to the same memory location at the same time end up all writing the same value anyway.

In any case, what Mfa was saying is that reality is highly parallel, but I would add that it is also highly local, in that actors and entities only touch a few of their neighbors and do not, in general, exert meaningful force (at least for game simulations) on objects far away. And even when objects far away are influenced (like seeing movement on the horizon), it is often not each individual object that is being perceived, but an aggregate representation. Which to me says that reality should be represented at multiple levels in a hierarchy, where each thread runs the simulation locally, but also computes some macroscopic or aggregate representation for "at a distance" interaction. (e.g. I compute the explosion simulation for local viewers, but promote a high-level "explosion has occurred" mnemonic with parameters higher up. Long-distance viewers can perceive the explosion by replacing the mnemonic with a local, but less accurate, facsimile.)
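The shape of that hierarchy might look something like this (illustrative names only); the point is just that distant observers consume the coarse mnemonic and never touch the detailed state:

```cpp
#include <string>
#include <vector>

// Illustrative sketch: each region simulates in full detail locally and also
// publishes a coarse "mnemonic" event upward; far-away observers read only
// the coarse level.
struct Vec3 { float x, y, z; };

struct CoarseEvent {                      // e.g. "explosion occurred at p, magnitude m"
    std::string kind;
    Vec3        position;
    float       magnitude;
};

struct Region {
    // detailed, thread-local simulation state lives here ...
    std::vector<CoarseEvent> published;   // aggregate view for distant observers

    void simulate_explosion(Vec3 p, float m) {
        // full particle/physics simulation for local viewers goes here
        published.push_back({"explosion", p, m});   // promote the mnemonic up
    }
};

void perceive_from_afar(const Region& r) {
    for (const CoarseEvent& e : r.published) {
        (void)e;   // replace the mnemonic with a cheap local facsimile
    }
}
```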

The problem with OS models is that scheduling becomes harder and harder as the number of cores goes up. How will you efficiently schedule work and maintain your mutex'ed data on 80 cores? With something like software transactional memory, you might pay a performance penalty of 20-50% depending on implementation (hardware support could knock this way down), but your actual performance will, even under high contention, scale very well on a high number of cores, and what's more, your code will be incredibly simple with practically zero deadlocks, race conditions, or over-contended locks.

I think no matter what, at least until the death of silicon lithography, the future is going to be multi-core, and with a high number of cores, it will make more and more sense for hardware vendors to add acceleration for STM, because it's the only programming model that non-uber-elite-ultra-brainiacs can use to write highly parallel, performant, and safe code on a large number of CPUs.
 
That isn't all that single-threaded performance is. Even in a perfectly threaded application, the fastest the task will ever complete is the length of time it takes for the longest executing thread to complete its pass.
That's proportional to the single-threaded performance of the core it's on.
As in: of course, there is a totally serial master thread that waits for all of them to complete? But isn't that exactly what you want to get away from?

The amount of time a given control thread or critical data result takes to complete is determined by the single-threaded performance of the core it's running on.
The amount of time every other core is stuck doing nothing waiting on that critical thread is directly proportional to the single-threaded performance of that one unlucky core.

The faster parallel code is run, the more important the bottleneck code becomes.
Whatever serial sections of code there are, and there will always be such sections, will take as long as it takes for the cores they reside in. Depending on the single-thread performance of the core.

That's not going to change. Ever.
It's math, it's reality. It's a law of diminishing returns. A dude named Amdahl wrote up a nifty equation about it.
Yes. It is. But, ONLY if you look at a serial managing thread that spawns off child threads and waits for them. A serial paradigm to implement multiple tasks. You're simply doing the OOOE on a higher level! No, that isn't a good way to go.

I'm not saying there can't be a lot of specialized processors and smaller cores on a chip. I'm just saying that it would be incredibly helpful to have at least one strong OOE chip to step in on those critical code paths.

What would the cost be in a 2010 chip to have a Conroe II processor on a die that would otherwise have 64 small cores?
What if that means I could only have 50 micro cores to go with the Conroe II?
What's the worst that could happen?
In a purely parallel case, it's 51 cores vs 64.
A 20% disadvantage, if for some reason the Conroe II only counted as a single mini-core performance wise.
By 2011, it will be 128 cores, but the Conroe II is probably smaller. It's 128 vs 115, or 10% less.
Ah, but I'm not denying you want a single core to manage stuff. Although you could distribute that as well.

Rather, I think scheduling and spawning threads and waiting for their completion is stupid. Use streams instead. That makes our model into an asynchronous model. Which is exactly what we want for such an architecture.

Don't spawn tasks, but add workloads to the queue of the core that can handle the workload and is almost finished with its current one.
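A minimal version of that, assuming nothing fancier than one queue per core and a "push to the shortest queue" rule as a stand-in for "almost finished with its current one" (all names invented for the sketch; a real version needs per-queue synchronisation):

```cpp
#include <deque>
#include <functional>
#include <vector>

// Sketch only: one work queue per core; new workloads go to the queue with
// the smallest backlog instead of spawning a thread per task and joining.
// Per-queue synchronisation is omitted here for brevity.
using Workload = std::function<void()>;

struct CoreQueue { std::deque<Workload> pending; };

void enqueue(std::vector<CoreQueue>& cores, Workload w) {
    CoreQueue* best = &cores.front();         // assumes at least one core queue
    for (CoreQueue& c : cores)
        if (c.pending.size() < best->pending.size())
            best = &c;                        // crude "almost finished" proxy
    best->pending.push_back(std::move(w));
}
```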

Sure, deadlocks will still be a problem, but no more than they already are with a single-task model.

The instant one thread's completion time becomes a bottleneck, that Conroe II becomes worth more than a thousand micro cores. So it's either you're really unlucky and you have a perfectly scaling parallel problem, and it loses by the exact percentage of core shortfall, or it does way better, depending on how often the chip has to wait on a single thread.
So, what you're advocating is: increase the IPC count and only use a single, faster core!

Alternatively (because that isn't going to happen), if you were forced to use a model consisting of at most a single monolithic OOOE core and many specialised ones, what would you propose to make efficient use of all those other cores? Because that is, in short, what is going to happen.

Simply saying that all those other cores are no use is of equal value to the speed of your application: none.

Here's the fun part, it won't ever lose by 20%. 2 cores are a little short of being twice as good as 1 core, and 4 cores a little less than twice as good as 2 cores, and so on.
It could be almost 20% behind, or maybe 2% behind in a number of special tasks.
It will be ahead everywhere else.
Yes, when you could pick and choose that single, monolithic core to run all your applications, and as long as task switching is the only valid model.

Streams and micro-tasks (to create those streams) are much better.

There is such a thing as too much of a good thing. Regressing a design that far back in single-threaded performance will actually make problems less parallelizable, defeating the purpose of having a lot of cores.
But, you don't have a choice in the matter. Nobody does. I am extremely sure, that Intel and AMD would give very much to be able to build a monolithic core that is much faster at single task execution than the current ones. They simply aren't able to do that, no matter how hard they try.

Sure, a few percent improvement is in the picture. But that's it.

That is not what OOE is about. OOE is about a processor trying to execute whatever instructions it can as soon as their data is available, regardless of whether an instruction that just happened to come before it has stalled.

Instructions always come in an order, because they have to be written in some order. There's no reason why instructions in a given order must execute in that order if the data is available.

It doesn't matter what kind of decoder a chip has or how deep the pipeline is. If an instruction on the chip can be run, it will be run as soon as possible, so that the chip doesn't waste time doing nothing when it could be doing something.

OOE is very useful in superscalar processors, but it still has benefits on a single pipeline. Unless you believe no instruction will ever be wanting for data again after 2010, OOE will still give a performance boost.
Even when the execution stage only takes a single clock???

And that is just as feasible as a long pipeline. Even more so: it's the only thing remaining to speed up a single thread!

It can't be split infinitely. The more threads that run, or the more cores that are present, the more they need to play nice. You don't want to add work where you don't need to add it. You only go as far as you need.
That's why you want independent streams, that run until completion.

But there's a point where the multicore approach gets to such an absurd degree that something as simple as satisfying a dependent instruction takes 10 times longer to complete.
Dependencies aren't going away, they happen all the time. Why make something that happens all the time take even more of all of the time?
Because you can, if you use micro-tasks and streams. They are, by their very definition, independent. You need a different programming paradigm, though.

That doesn't answer the question of how treating a non-embarrassingly parallel problem as an embarrassingly parallel one makes all the problems go away.

If embarrassingly parallel were how we could describe every problem we could run into, it wouldn't be called embarrassingly parallel; it would be called solved, along with every other problem there ever was and ever will be, before, during, and after the instant we asked.

Since reality as we know it insists on saying certain things can't happen before others, that's obviously not the case.

Threads cannot take you where the data refuses to go. You cannot make work where the workload refuses to divide.
Both are, at their core, streaming algorithms.
 
In any case, what Mfa was saying is that reality is highly parallel, but I would add that it is also highly local, in that actors and entities only touch a few of their neighbors and do not, in general, exert meaningful force (at least for game simulations) on objects far away.
Yes, that is the important part. As long as you can get your spatial indexing right that is, of course.

You don't need to loop the entire game world to see if there is something that interacts with each object. They can (for the most part) take care of themselves.

And that's also why collision detection can be embarrassingly parallel (as you explained), which is often the hardest bit. Even more, if you use vectors and flag local collisions, you don't even need to do a very expensive global collision pass that requires backing up locations of objects and calculating all their volumes at all.
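Roughly what that spatial indexing amounts to, sketched as a simple uniform-grid hash (illustrative only): each object lands in a cell keyed by its position, only objects sharing a cell are even considered for collision, and the cells are independent pieces of work.

```cpp
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

// Uniform-grid broad phase: hash each object into a cell; only pairs in the
// same cell get a narrow-phase test. Buckets are independent, so they can be
// farmed out across cores with no shared writes. (A real version also checks
// neighbouring cells for objects sitting near a border.)
struct Obj { float x, y, z; };

static uint64_t cell_key(const Obj& o, float cell) {
    int64_t cx = (int64_t)(o.x / cell), cy = (int64_t)(o.y / cell), cz = (int64_t)(o.z / cell);
    return (uint64_t)(cx * 73856093LL) ^ (uint64_t)(cy * 19349663LL) ^ (uint64_t)(cz * 83492791LL);
}

std::vector<std::pair<int, int>> broad_phase(const std::vector<Obj>& objs, float cell) {
    std::unordered_map<uint64_t, std::vector<int>> grid;
    for (int i = 0; i < (int)objs.size(); ++i)
        grid[cell_key(objs[i], cell)].push_back(i);

    std::vector<std::pair<int, int>> candidates;
    for (const auto& kv : grid) {
        const std::vector<int>& bucket = kv.second;     // independent per-cell work
        for (size_t a = 0; a < bucket.size(); ++a)
            for (size_t b = a + 1; b < bucket.size(); ++b)
                candidates.emplace_back(bucket[a], bucket[b]);
    }
    return candidates;
}
```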

The problem with OS models is that scheduling becomes harder and harder as the number of cores goes up. How will you efficiently schedule work and maintain your mutex'ed data on 80 cores? With something like software transactional memory, you might pay a performance penalty of 20-50% depending on implementation (hardware support could knock this way down), but your actual performance will, even under high contention, scale very well on a high number of cores, and what's more, your code will be incredibly simple with practically zero deadlocks, race conditions, or over-contended locks.
Absolutely. At some point, the bookkeeping needed will take more time and memory than available, for one. Like 32-bit Windows trying to address more than 4 GB of memory: it simply isn't possible.

I think no matter what, at least until the death of silicon lithography, the future is going to be multi-core, and with a high number of cores, it will make more and more sense for hardware vendors to add acceleration for STM, because it's the only programming model that non-uber-elite-ultra-brainiacs can use to write highly parallel, performant, and safe code on a large number of CPUs.
 
As in: of course, there is a totally serial master thread that waits for all of them to complete? But isn't that exactly what you want to get away from?
Not exactly. What will happen is that some subset of threads can't be safely allowed to pass if another thread hasn't completed.

Depending on the application, the rest may not be allowed to go too far ahead before it starts to endanger correctness.

What happens is that the faster you compute the parallel portions of the application, the more those small serial segments start to get in the way, even though they didn't before.

Yes. It is. But, ONLY if you look at a serial managing thread that spawns off child threads and waits for them. A serial paradigm to implement multiple tasks. You're simply doing the OOOE on a higher level! No, that isn't a good way to go.

No it is not only with that threading model. Amdahl's law doesn't apply only to threads, it applies to the composition of a problem.
It is trivially true that there is a serial component of a problem. As a silly illustration, if there's a start, middle, and end, that's three serial steps right there. That's a minimum of three clock cycles for a chip.
The problem could be 90% parallel, but once you process the parallel part twice as fast, the serial portion remains the same. It becomes a larger and larger portion of execution time the better you get at the parallel segment.
The serial segment also increases in size the more you divide a problem up. Something has to recombine the results, and that takes a non-zero amount of time (though hopefully it scales at logN or better).

Once the time needed to run the parallel parts of a program is small enough, the serial part determines how long it takes to finish a task. One additional IPC-oriented core could significantly speed up the task, as opposed to the bazillions of percent of extra performance an extra jillion micro cores won't be adding.
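Using the 90%-parallel example above, with the total single-core time normalised to 1:

```latex
T(N) = 0.1 + \frac{0.9}{N}, \qquad T(8) \approx 0.2125, \quad T(16) \approx 0.156
```

Doubling the core count from 8 to 16 only shaves about a quarter off the runtime, because the 0.1 of serial work never shrinks, and no matter how many micro cores are added the speedup is capped at 1/0.1 = 10.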

Ah, but I'm not denying you want a single core to manage stuff. Although you could distribute that as well.
It's not just managing stuff, it's there to step in when a thread pops up that's too heavy for one of the mini cores. Just saying a task can be distributed doesn't make it so.

Weak analogy:
It's the difference between asking Andre the Giant to body slam Hulk Hogan or asking two dozen dwarfs. Sure, the two dozen dwarfs could do it, given enough time, but Andre could pick him up, toss him, and be sitting down to dinner before they did it.

Rather, I think scheduling and spawning threads and waiting for their completion is stupid. Use streams instead. That makes our model into an asynchronous model. Which is exactly what we want for such an architecture.
You can't make streams if the parallel work just doesn't exist. A person can't do 1+1 faster if you tell ten billion people to help him.

Don't spawn tasks, but add workloads to the queue of the core that can handle the workload and is almost finished with its current one.
What decides to add workloads to the core in question? There's obviously something keeping track, and that means that something has a harder and harder time the more cores you add.

Sure, deadlocks will still be a problem, but no more than they already are with a single-task model.
Not really. That would just be a really buggy program. It's really special if a chip can deadlock if it's the only one doing the locking.

So, what you're advocating is: increase the IPC count and only use a single, faster core!

Only if you ignore every sentence where I said the exact opposite.

Alternatively (because that isn't going to happen), if you were forced to use a model consisting of at most a single monolithic OOOE core and many specialised ones, what would you propose to make efficient use of all those other cores? Because that is, in short, what is going to happen.
Whatever it is you would propose to make efficient use of all those mini cores if that monolithic core weren't there.

Simply saying that all those other cores are no use is of equal value to the speed of your application: none.
If the parallelism isn't there for a problem, then ten billion additional cores will not matter. Once the number of cores exceeds the number of threads that can use them, there's absolutely no point for the other cores to get involved.

Yes, when you could pick and choose that single, monolithic core to run all your applications, and as long as task switching is the only valid model.

Streams and micro-tasks (to create those streams) are much better.
I think you missed the point. The point of the larger core is to step in when something critical is holding back some or a lot of the cores and finish it faster so everyone else can get back to work faster.

Unless you think it's possible to produce a perfectly parallel and streamed workload on the fly every time, it's not going to go well.

But, you don't have a choice in the matter. Nobody does. I am extremely sure, that Intel and AMD would give very much to be able to build a monolithic core that is much faster at single task execution than the current ones. They simply aren't able to do that, no matter how hard they try.
It's almost trivially easy to make a core that's faster than a single-piped mini-core, which is what I'm arguing.

Sure, a few percent improvement is in the picture. But that's it.
You scoff at a few percent gain, so you decide to fall back by several hundred?

Even when the execution stage only takes a single clock???
Execution for a lot of instructions takes a single clock now, hasn't made OOE any less useful.

You also can't use a billion transistors to make something like a multiply or divide that much faster. It's physically impossible: the time it takes for signals to propagate across the unit is too much to cram into a clock cycle without making the cycle way too long.
Don't blame me, blame Einstein and that annoying speed of light thing.

Second, is that memory is still the wild-card. A load can take anything from several cycles to hundreds, to many hundreds, to millions of cycles. The millions part is hard to handle, the rest can be.

They will still have loads in the future, and probably multiplies and divides unless streams make such things obsolete as well.


And that is just as feasible as a long pipeline. Even more so: it's the only thing remaining to speed up a single thread!
You're right, it is just as feasible as having a pipeline. Are you planning on throwing that out too?
There are many degrees of OOE, they aren't all massive hardware hogs.

That's why you want independent streams, that run until completion.
Have complicated workloads where parts of a program depend on other parts ceased to exist? If the data flow of a problem says there is a dependency, you can't just hand-wave it away.

Because you can, if you use micro-tasks and streams. They are, by their very definition, independent. You need a different programming paradigm, though.
And you need to ignore every problem that can't be made by definition independent. A small price to pay, I'm sure.

Both are, at their core, streaming algorithms.

I'm not sure what that has to do with what I was saying.
It does manage to ignore things that don't stream, so I guess that's a bonus.
 
Ok, fair enough.

But don't you think that a chip with, say, 16 x86 cores is, well, overkill for most tasks, and pretty inefficient for the things that might want the speed? As you said yourself, using two cores to run a current task generally gives barely a 40% improvement.

So, we need to come up with a better way to run programs. Throw von Neumann in the garbage bin, and Amdahl with him, so to speak. And build something that does work in this changed world.
 
There has been a post about the PS4.
So I will be way less technical than some here (especially in this post), but I have to speculate.

I think Sony is likely to stay with IBM.
So I guess that for the PS4 we will have a Cell-like design with stronger general-purpose cores and more, and stronger, SPEs.

For MS I think the situation is more "iffy".
They're likely to stay with ATI.

I can see MS using the same Cell design as Sony, and I don't think there is a need (after reading this thread) for a chip with a lot of identical "do it all" cores for the workloads future console CPUs are aiming at.

AMD wants to include real DSPs on the cores of their CPUs. When I look at AMD's projects, this could end with a way more asymmetric CPU design than what STI are likely to deliver.

So do you think they will leave IBM and go for a full AMD/ATI solution (both CPU and GPU)?

What would you prefer: some general-purpose cores with a lot of identical SPEs, or some general-purpose cores with some SPE-like units and some more specialized units?

Uneducated questions, but any clues you can give me are welcome.
 
I've said this before and I'll say it again.
I don't think anyone is arguing that large-scale parallelism isn't the future. The interesting question isn't what the processor will look like; it's, for a system with say 100 cores:
What will communication between the cores look like?
What will the memory system look like?
 