For the desktop, certainly - most desktop applications and games will depend on single-thread performance. For other applications like servers, maybe not, at least not to the exclusion of multi-thread performance, and in those cases, given that single-thread performance is reaching its limit, multi-core is the only way to go.
The time of seemingly exponential single-threaded performance growth is over, not all growth. Incremental gains will probably continue for decades. At a bare minimum, silicon scaling should continue to ~2020.
Actually this is completely wrong. With a large core, you have to multi-task since you have a limited number of cores (unless of course you are running a single tasking OS like MS-DOS). It requires time to interrupt other processes, and for the CPU to make a context change - bigtime latency there!
I'm not saying there should only be one core, I was saying that many dozens of simple single-pipeline cores are not the best solution for a lot of problems.
The performance penalty for multi-tasking is not as bad as you think. Most threads spend much of their time idling, and with a decent OS scheduler, compute intensive threads get a bigger share of the processor's time. With several cores of any type, the cost of multitasking goes from minor to negligible in most cases.
The penalty of context switching is also implementation-dependent. The Intel Montecito core can do a context switch in about 12 cycles, and it is rare that every thread in every process needs 100% attention all the time. Some threads can afford to be sidelined if they only run every other minute.
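If anyone wants a feel for the hand-off cost themselves, here's a rough toy sketch (my own, nothing to do with Montecito): two threads take turns on a condition variable, so every iteration forces at least one hand-off. Pin it to a single core (e.g. with taskset) if you want forced context switches rather than cross-core wake-ups; either way the number includes scheduler and cache effects, so treat it as a loose upper bound, not a 12-cycle figure.

```cpp
// Toy estimate of thread hand-off cost: two threads take turns via a
// condition variable, so each iteration forces at least one hand-off.
// Build with: g++ -O2 -pthread pingpong.cpp
#include <chrono>
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <thread>

int main() {
    constexpr int kIters = 100000;   // arbitrary; just enough to average out noise
    std::mutex m;
    std::condition_variable cv;
    bool ping = true;                // whose turn it is

    auto worker = [&](bool my_turn) {
        for (int i = 0; i < kIters; ++i) {
            std::unique_lock<std::mutex> lock(m);
            cv.wait(lock, [&] { return ping == my_turn; });
            ping = !my_turn;         // hand the turn to the other thread
            cv.notify_one();
        }
    };

    auto start = std::chrono::steady_clock::now();
    std::thread a(worker, true), b(worker, false);
    a.join();
    b.join();
    auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(
                  std::chrono::steady_clock::now() - start).count();

    // Roughly two hand-offs per iteration.
    std::cout << "~" << ns / (2.0 * kIters) << " ns per hand-off\n";
    return 0;
}
```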
Tasks that have overlapping data sets or memory footprints can benefit, since they can use what the other has cached or stored.
This means that in the grand scheme of things, multitasking is not the biggest factor.
Overall, the individual thread latency is shorter if it is on a core that is more robust, period.
If there are a lot of threads, reduced single-threaded performance isn't too bad, but only if there are enough threads to hide the shortfall.
What will always come back to haunt a processor is that it is not always possible to spawn as many threads as one would like.
With a large number of small cores like Cell, you can dedicate a processor to a critical task, so no context switch is required.
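For reference, on a conventional OS the mechanism being described is just core pinning. A minimal Linux sketch of the idea follows (the core index is arbitrary, and this is not how Cell actually hands work to an SPE):

```cpp
// Sketch of core pinning on Linux: the calling thread is restricted to one
// core so a critical task never has to share it. The core index is arbitrary.
// Build with: g++ -O2 -pthread pin.cpp  (pthread_setaffinity_np is a GNU extension)
#include <pthread.h>
#include <sched.h>
#include <cstdio>

bool pin_to_core(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set) == 0;
}

int main() {
    if (!pin_to_core(3)) {
        std::fprintf(stderr, "could not pin to core 3\n");
        return 1;
    }
    std::puts("critical task now owns core 3, as far as placement goes");
    return 0;
}
```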
That would be beyond overkill. I could have a chip with a quarter of the number of cores needed for that, and it would probably be several times more efficient if I allowed them to context-switch.
In a typical desktop environment, there could be over a hundred threads. Of them, probably only one or two are actually doing anything most of the time. For a server, there could be hundreds of active threads, but they'll have a bunch of associated threads that also don't do anything most of the time. Direct mapping of threads to a processor means shutting down most of the cores for most of the time.
In addition to this, it would completely destroy locality for threads that share some of their memory footprint. If tasks do share data or code, they will either need their own copies (cache/local store gets wasted) or have to snoop and broadcast results on the chip. This isn't so bad, unless the design takes the many simple core idea to an absurd degree. (More on this later)
Even CELL context switches regularly, it just tries to avoid doing it too frequently, because the SPEs aren't very good at it.
As a result, the overall performance impact of multi-tasking will be drowned out by other factors past maybe four cores (even at two, system responsiveness for a desktop environment is pretty good).
Also running instructions from local store - no cache misses. Latency should be a hell of a lot lower with a large number of small cores like Cell.
Local store has nothing to do with OOE, and it has its own drawbacks that can affect performance in certain workloads. An OO core can have local store; it doesn't really matter to the local store what kind of core it's hooked to.
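To make the local-store point concrete, here's a toy sketch of software-managed staging: data is explicitly copied into a small fixed buffer, processed there, and written back, the way an SPE would DMA tiles in and out. The buffer size and names are made up; the point is that misses become predictable, but only for access patterns that fit the staging scheme.

```cpp
// Toy illustration of a software-managed "local store": data is staged into
// a small fixed buffer, processed there, and written back -- think of the
// copies as the DMA in/out on a Cell SPE. Buffer size is made up.
#include <algorithm>
#include <cstddef>
#include <vector>

constexpr std::size_t kLocalWords = 4096;   // pretend this is all the on-chip memory

void scale_all(std::vector<float>& data, float factor) {
    float local[kLocalWords];               // stand-in for the local store
    for (std::size_t base = 0; base < data.size(); base += kLocalWords) {
        const std::size_t n = std::min(kLocalWords, data.size() - base);
        std::copy_n(data.begin() + base, n, local);    // "DMA in"
        for (std::size_t i = 0; i < n; ++i)
            local[i] *= factor;                        // compute with no miss surprises
        std::copy_n(local, n, data.begin() + base);    // "DMA out"
    }
}

int main() {
    std::vector<float> v(100000, 2.0f);
    scale_all(v, 0.5f);
    return 0;
}
```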
What begins to dominate at high numbers of cores is the cost of communications and synchronization.
Since the simple cores can't match a single heavy-duty core, they need to work together. The simpler they are, the more cores need to be working on a common problem. That means they need to talk to each other more often.
CELL has a ring bus that serves a small number of cores. The bus only offers peak bandwidth if a transfer is between immediately adjacent cores, and it imparts a significant latency penalty. It's not impossible to manage if the problem being worked on doesn't care about the latency, or if it can be easily divided up.
It gets very hard to guarantee good communication or divide up a problem well if there are 32 cores.
The more cores that are needed to match a single monolithic core, the more each core must communicate with its partners.
This means a given operation or set of operations must send a message out of its core to reach another processor, sending signals that may cross a distance as great as or greater than that of a lower-level cache access.
Depending on the way the cores are hooked together, it could take longer to pass the request along.
For tasks that are not easily parallelizable, the amount of intercommunication is higher.
Inter-core communication has a cost, and it is much higher than the internal forwarding of a large core.
The key is to divide tasks only as far as it does not cause the cost of communication to outweigh the gain in throughput. Talk is not cheap at the silicon level.
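A trivial example of the trade-off, in plain threads rather than anything Cell-specific: a parallel sum where each extra core adds another slice to start up, wait for, and merge. Past some core-count-to-input-size ratio (which is machine-dependent), that overhead eats whatever the extra cores bought you.

```cpp
// Parallel sum sketch: each extra core adds another slice to start, wait for,
// and merge. On small inputs that start-up/merge "talk" can outweigh the
// compute it was supposed to hide.
#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <numeric>
#include <thread>
#include <vector>

double parallel_sum(const std::vector<double>& v, unsigned nthreads) {
    std::vector<double> partial(nthreads, 0.0);
    std::vector<std::thread> pool;
    const std::size_t chunk = (v.size() + nthreads - 1) / nthreads;

    for (unsigned t = 0; t < nthreads; ++t) {
        pool.emplace_back([&, t] {
            const std::size_t lo = std::min(v.size(), static_cast<std::size_t>(t) * chunk);
            const std::size_t hi = std::min(v.size(), lo + chunk);
            partial[t] = std::accumulate(v.begin() + lo, v.begin() + hi, 0.0);
        });
    }
    for (auto& th : pool) th.join();   // everyone has to meet up here
    // The merge (and the thread start-up above) is the communication cost;
    // it grows with core count while each core's slice shrinks.
    return std::accumulate(partial.begin(), partial.end(), 0.0);
}

int main() {
    std::vector<double> data(1000000, 1.0);
    std::printf("sum = %.0f\n", parallel_sum(data, 4));
    return 0;
}
```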
If overhead becomes prohibitive at 32 threads, having 64 cores won't do a thing. Having 32 cores that can do their jobs twice as fast, however, would make a difference.
In other cases, it doesn't matter how a task is divided, since there is some critical stretch of code that can't be split up.
If that stretch of code controls how the task is handled by the hive of cores, then every core is going to sit and wait for it to finish.
Would you rather it run on a wide OOE core that can finish it in 15 cycles, or the weaker one that takes 30?
Remember, it's not just that one core taking 15 extra cycles, it's every one of the other cores waiting on it as well.
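That's really just Amdahl's law. A quick back-of-the-envelope sketch, using a hypothetical workload that is 10% serial and the 15-vs-30-cycle idea from above (the serial stretch simply takes twice as long on the weak core):

```cpp
// Amdahl's law back-of-the-envelope: the serial stretch isn't helped by more
// cores, so making it slower hurts every core. Fractions below are made up.
#include <cstdio>

// Time for one run, with the single fast core as the baseline (= 1.0).
double run_time(double serial, double parallel, double ncores) {
    return serial + parallel / ncores;
}

int main() {
    const double serial = 0.10, parallel = 0.90;               // hypothetical 10%-serial task
    const double fast = run_time(serial, parallel, 32);        // strong core runs the serial bit
    const double slow = run_time(2.0 * serial, parallel, 32);  // weak core: serial bit takes 2x longer
    std::printf("speedup with strong serial core: %.1fx\n", 1.0 / fast);
    std::printf("speedup with weak serial core:   %.1fx\n", 1.0 / slow);
    return 0;
}
```

With 32 cores that works out to roughly 7.8x versus 4.4x, so the doubled serial time costs far more than its own share of the work.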
I'm of the opinion that it is better for a general chip solution to have a few powerful cores added to the mix.
If having 4 big cores means a chip can only have 50 other small cores while a competitor has 64 small ones, the mixed variant would appeal to more people if it could do perhaps 50% better on a wider set of workloads, despite an apparent core deficit of roughly 16% (54 cores versus 64).
ADEX said:
Intel make most of their money on servers - exactly where Terascale will be good.
It will work well for a given subset of servers, not all of them. Intel would be better off giving up ten terascale cores out of 80 if it meant one or two Conroe-type cores were there to keep the chip from giving up when things get complicated.
One of the problems with CPUs with tons of cores is cache coherence: when your core needs data, it needs to know if any other core has a copy of that data cached, and that's going to send latency through the roof, killing single-threaded performance. So Intel are looking at things like "speculative threading" in the compiler in place of OOO. They're also doing a lot of work on the software side, as that is the biggest problem.
It's not just cache coherence, it's communications overhead. Many cores means many will have to talk to each other. Speculation doesn't eliminate the problem with coherency, and there's no way Intel's going to give up on cache. That's a guaranteed drop of at least 100x in performance, period.
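A concrete, easy-to-reproduce example of coherence traffic getting in the way is false sharing, sketched below with made-up iteration counts: two threads each bump their own counter, but because the counters share a cache line, the line ping-pongs between cores on every increment.

```cpp
// False sharing sketch: both counters live on the same cache line, so the
// line ping-pongs between the two cores on every increment (pure coherence
// traffic). Adding alignas(64) to each member usually speeds this up a lot.
#include <atomic>
#include <cstdio>
#include <thread>

struct Counters {
    std::atomic<long> a{0};   // alignas(64) here would give a its own line
    std::atomic<long> b{0};
};

int main() {
    Counters c;
    auto bump = [](std::atomic<long>& x) {
        for (long i = 0; i < 20000000; ++i)
            x.fetch_add(1, std::memory_order_relaxed);
    };
    std::thread t1(bump, std::ref(c.a));
    std::thread t2(bump, std::ref(c.b));
    t1.join();
    t2.join();
    std::printf("a=%ld b=%ld\n", c.a.load(), c.b.load());
    return 0;
}
```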
Chips like Cell and Niagara may look weird today, but in 5 years' time all processors are going to look like that. OOO is the best solution for general-purpose code *today*, but once latency problems really start hitting, OOO isn't going to help, as the core will just be sitting doing nothing. It'll be hard to justify a feature which burns a lot of power but won't improve performance much.
Niagara II will double the single-threaded performance of Niagara. The magic word "parallel" can't produce work if a problem just doesn't parallelize.
Cell will likely prove very interesting going forward as local stores do not need to be kept coherent, I think this will turn out to be a major advantage.
Unless you want them to be coherent, in which case they suck. Not every workload lets each core play in its own sandbox.
It is more likely that Intel will compromise, as IBM did with Xenon, and allow software control over locking cache lines, perhaps even more control over the cache snoop protocols.