Future console CPUs: will they go back to OoOE, and other questions.

3dilettante said:
I'm speaking from the point of view that Microsoft wanted to bring in the broadest possible developer base from the PC side, so it could build up a game library to buffer the PS3's arrival. This would involve the smaller teams with little multi-threading experience and little low-level scheduling experience.
Small teams and next-gen games don't exactly mesh together, regardless of the underlying architecture. At any rate, I don't see how the PC-to-console transition could be much broader than it already is (and frankly, I don't think its effects have been all positive either) - unless of course MS or Sony start offering devkits to PC devs for free.

ERP said:
I'm trying to ship games today and I think we could have had comparable experiences on narrower architectures.
I'm not gonna argue with that - single-threaded performance is still very valuable, and having at least one good GPR core in the new consoles would have been nice.
While it's clear the new PPC cores were not built for general-purpose performance, IBM still dropped the ball on that regardless, IMO. In-order doesn't mean things have to suck 'that' much - many of the MIPS core designs demonstrate that.

That aside, I also know there are a lot of things I wouldn't even think about if we didn't have SPEs - and from a console maker's perspective it's hard to argue that some kind of 'future proofing' is a bad idea. So I'd really ask: how will the experience differ 5 years from now, and were the tradeoffs worth it?

What percentage of your current team would you trust to write good thread safe code?
Obviously not many - and as numbers scale up... But a question back to you: if we had machines that still focused mostly on single-threaded performance like last gen, do you honestly think the percentage of people capable of writing good threaded code would have changed much (if at all) over the next 5 years?

ban25 said:
PowerPC actually strays from many of those basic design tenets. Of course many console developers are probably experienced with more traditional RISC architectures like the R3000, R4300, and R5900.
You'd be surprised how little a "traditional RISC" architecture like the R5900 differs from your stray PowerPC (in terms of instruction set).
Except, of course, for the fact that it has much better per-clock performance.

But I digress - you do realize x86 hasn't been CISC internally for over 7 years now? And for that matter, the "traditional RISC" R5900 had a higher instruction count, with more flexibility, than most older CISCs.
 
Well, bear in mind that the Northwood has 2 double-pumped ALUs and runs at the same base clock speed as the Cell/Xenon, in addition to having OOOE (enabling it to do useful work during cache misses and better exploit ILP), and a better branch predictor.

Quite frankly, the reason we haven't seen these kind of in-order superscalar architectures since the Pentium (P5x) is because they're slow. Now, if you are going to throw a ton of simple cores on a chip with lots of threading to run code with inherently low ILP (i.e. like Niagara), then you are going to have something of a niche. But this is hardly a general case or even the common case.
What "same base clock speed"?
I was saying that the IOE RISC at 3.2 GHz could potentially have with general purpose code the same performance of a 2.4 GHz P4 Northwood. This means like a 25% to 30% performance degradation.

Still the problem could be somewhere in the compiler like what happened to the Intel´s i860: http://en.wikipedia.org/wiki/Intel_i860
 
How do the 360's memory latencies compare to a PC's? About the same? Significantly worse? Better?

IMO it isn't the main memory latencies that are the issue, more the inability to hide the latency of L1 and L2 hits.

If you're going to main store you are hurting on any platform; no amount of OOO execution is going to hide hundreds of cycles. But hiding the single-digit latencies of L1 or the low double-digit latencies of L2 is where OOO can make a big difference.

It should be noted, though, that there are a number of other issues in the X360 and PS3 architectures aside from the lack of OOO that make executing general code painful. The cost of a hit on an element in the store queue is extreme, and you can get false hits, which can destroy performance.

So it isn't the lack of orthogonality I don't like about Cell - I think SPUs are cool, and I suspect that tailored processors are part of where we're going. But I'm not sure I want 8 of them unless they have more general access to main memory.

I think where Cell will differ from future wide designs is in the memory architecture; I think we'll see hierarchical, tiered cache-like systems servicing processor groups, and the logic to handle it automatically. But we'll have to see how that goes.

100+ hardware threads will be an interesting problem.
 
Obviously not many - and as numbers scale up... But a question back to you: if we had machines that still focused mostly on single-threaded performance like last gen, do you honestly think the percentage of people capable of writing good threaded code would have changed much (if at all) over the next 5 years?

Being something of a cynic, I'll say about the same number as right now, whether or not the current machines are multiprocessor.

I think some of us will learn something from the exercise - we'll get to try various architectural ideas and at least eliminate the ones that don't work. But I think the problem is going to radically change again in the next transition (well, unless Wii is massively successful), and I'm not sure how much of the learning will transfer directly.
 
What "same base clock speed"?
I was saying that the IOE RISC at 3.2 GHz could potentially have with general purpose code the same performance of a 2.4 GHz P4 Northwood. This means like a 25% to 30% performance degradation.

Still the problem could be somewhere in the compiler like what happened to the Intel´s i860: http://en.wikipedia.org/wiki/Intel_i860

I've heard people blame compilers for 20 years. Along with predictions of how new "smart" compilers will solve the problem.
Compilers do some things really well and a lot of things really badly; most modern processors are built the way they are to run compiled code fast.
 
But I digress - you do realize x86 hasn't been CISC internally for over 7 years now? And for that matter, the "traditional RISC" R5900 had a higher instruction count, with more flexibility, than most older CISCs.

Of course I realize this. My whole point, which you just made for me (thanks :p), was that the whole RISC/CISC debate is dead and buried and hasn't mattered for years.
 
What "same base clock speed"?
I was saying that the IOE RISC at 3.2 GHz could potentially have with general purpose code the same performance of a 2.4 GHz P4 Northwood. This means like a 25% to 30% performance degradation.

Still the problem could be somewhere in the compiler like what happened to the Intel´s i860: http://en.wikipedia.org/wiki/Intel_i860

But how will it manage 2/3rds the performance with half the execution resources, worse branch prediction, and no ability to cover L1 cache miss latency? You're saying it's 2/3rds the performance, I'm saying it's 2/3rds *slower* ... kinda similar I suppose. :)
 
Looks like cache misses and pipeline stalls are what's mainly holding it back.

Hey developers: time to hand-code some assembly :p

Maybe I can write some libraries of hand-coded assembly routines for the Xenon CPU :cool:
 
What "same base clock speed"?
I was saying that the IOE RISC at 3.2 GHz could potentially have with general purpose code the same performance of a 2.4 GHz P4 Northwood. This means like a 25% to 30% performance degradation.

Still the problem could be somewhere in the compiler like what happened to the Intel´s i860: http://en.wikipedia.org/wiki/Intel_i860

Oh my gosh, the i860 was used as the geometry unit in SGI workstations (Elan etc.?)
I'm not a developer, so I have a question about SIMD.

Single instruction, multiple data? As in translating all vectors by the same motion, for example?
Can you give me a deeper example (not too deep plz) of what SIMD does well? And why can't you do these things without SIMD?
 
And more... what are in-order and out-of-order execution??
I mean, I don't understand why order is necessary.

For example: if I need to add and then multiply, I can't multiply before doing the add?

Is OOOE the ability to "freeze" the multiply work, do the add first, and do that without losing too many cycles?

(I really would like to know.)

*PS: I think my first line is probably BS, because the order of add-then-multiply is defined in the source code itself anyway, so this whole IOE vs. OOO story is totally unclear to me <_>)
 
Whether or not anyone can write "safe" multithreaded code isn't so much a function of the programmer as it is of the programming language, libraries, and abstractions used. Someone writing multithreaded code in Erlang, Occam, Concurrent Clean/Haskell, or a language with transactional memory will have to struggle to create deadlocks or race conditions.
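For concreteness, here is a minimal C++ sketch (my own illustration, identifiers hypothetical) of the kind of data race those languages make hard or impossible to express, together with the lock a C/C++ programmer has to remember to apply by hand:

#include <mutex>
#include <thread>
#include <vector>

long long counter = 0;      // shared mutable state
std::mutex counter_lock;    // nothing forces anyone to actually use this

void unsafe_add(int n) {
    for (int i = 0; i < n; ++i)
        ++counter;          // data race: the read-modify-write is not atomic
}

void safe_add(int n) {
    for (int i = 0; i < n; ++i) {
        std::lock_guard<std::mutex> guard(counter_lock);
        ++counter;          // correct, but only because the programmer remembered the lock
    }
}

int main() {
    std::vector<std::thread> workers;
    for (int t = 0; t < 4; ++t)
        workers.emplace_back(safe_add, 1000000);  // swap in unsafe_add to see lost updates
    for (auto& w : workers)
        w.join();
}

In an Erlang- or STM-style language the shared mutable counter simply can't be written this way, which is the point being made above.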

Whether or not someone can write safe multithreaded code isn't the only problem, though. The real question is whether your programmers have the skills to design their own parallel algorithms in the first place, or whether they are mostly "off the shelf" programmers who use STL or Boost and reimplement Graphics Gems, A*, and other off-the-shelf algorithms, as opposed to more academically inclined coders.

Someone like Carmack or Sweeney has the luxury of sitting around for months doing R&D on algorithms before deciding to start work on a game engine. Most developers either aren't inclined toward basic R&D or don't have the time for it. Research then consists of searching SIGGRAPH, ACM, IEEE, and CiteSeer for useful papers.

Safe code is one thing, but to take advantage of multiple cores you need to write more than just bug-free code - you need to know how to take advantage of parallelism, and it isn't always as obvious as offloading monolithic parts of your app onto different cores.
 
I dunno, DemoCoder. I would place myself as a fairly noobish, off-the-shelf coder (still at uni, 2.5 years - well, summers really - of experience in C++), but threading and multicore programs make sense to a lot of people; in some ways it's very unnatural to think of doing one "thing" at a time, which is what a non-threaded approach enforces.

Take, for example, the old "make a cup of tea" exercise: most people can describe the steps, and do so linearly, but if you ask them what they would actually do, they immediately start to multitask and create jobs for each hand. Of course, applying this to a program is somewhat harder, but the principle is there.

While the algorithms I produce may not be the best in the universe, they are functional and beat my single-threaded versions. Game development generally needs higher performance if it's to achieve AAA status, but the coders at that level likely have much more experience, so they can find the appropriate and best solutions faster than I can.

I would agree, however, that most commercial coders don't have time to work through months of research to find the best implementation for a project - but most of us aren't pushed to the same level, or required to reach the same level of efficiency, as engine designers. (Though if anyone has code to ensure I can get small Ethernet packets at > 200 Hz, I would be most appreciative of a pointer!)
 
I've heard people blame compilers for 20 years. Along with predictions of how new "smart" compilers will solve the problem.
Perhaps it's also something to do with the languages they have to work with? <shrug>
 
Is the PPC really the exemplar of what we're discussing here, though? I think if you look at the present day, then basically what you want is an architecture like Cell, sans the PPU and with something more 'robust' in its place. ...and Cell itself is not some hypothetical 'what if' situation either; although most of the comparisons have brought it back to the PC, remember that this is a real architecture with real design wins outside of the console realm.
It seems to me a lot of people are missing the proper argument here. It isn't really 'is OOOE better than IOE' or 'is SMP better than UMP' or 'is lots of cache better than little cache'. It's a question of whether we will get rid of IOE in future console processors, so that whatever multicore system you have is all OOOE.

Putting a better alternative to the PPE into Cell makes sense, and is something I'm sure everyone wants - but does improving Cell also mean adding OOOE to the SPEs? Or will future processors get rid of SPE-like processors altogether and go back to multi-core GP processors?

There's a lot of discussion here but I think most of it isn't getting to the heart of the debate ;)
 
Darkblade,
Most of the multithreaded code I see produced today is lock-based worker paradigms: 1 thread per I/O channel, 1 thread per CPU - typically nothing more than serial algorithms for mutually orthogonal parts of an application run in different threads (e.g. rendering code in one thread overlapping disc I/O, network, and AI/game-logic threads, or 1 thread per HTTP request or DB query from a thread pool, etc.).

That works for trivially parallelizable problems, but it won't work when you have a serial algorithm that you need to parallelize. How many programmers know how to parallelize list traversal, depth-first search, etc.? How many know about randomized parallel algorithms, pointer jumping, graph contraction, ear decomposition, etc.?

There is a difference between running monolithic chunks of code in separate threads coordinated by mutexes, semaphores, and other concurrency primitives, and being able to take a serial algorithm and data structure and properly parallelize them.

The latter is often a PhD research problem; there is no general-purpose way to do it, and it is usually unique to each algorithm.

Now, programming languages can help prevent deadlocks and race conditions (e.g. software transactional memory, pi-calculus-based languages, etc.), but they cannot turn a serial list traversal into a pointer-jumping parallel traversal for you.
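As a rough illustration of the distinction being drawn here (my own sketch in modern C++, identifiers hypothetical): a trivially parallelizable job like summing an array maps straight onto the worker-thread pattern, while a linked-list traversal offers no such obvious split without restructuring the algorithm itself.

#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Trivially parallel: each worker sums an independent slice of the array.
long long parallel_sum(const std::vector<int>& data, int workers) {
    std::vector<long long> partial(workers, 0);
    std::vector<std::thread> pool;
    std::size_t chunk = data.size() / workers;
    for (int w = 0; w < workers; ++w) {
        std::size_t begin = w * chunk;
        std::size_t end = (w == workers - 1) ? data.size() : begin + chunk;
        pool.emplace_back([&, w, begin, end] {
            partial[w] = std::accumulate(data.begin() + begin, data.begin() + end, 0LL);
        });
    }
    for (auto& t : pool) t.join();
    return std::accumulate(partial.begin(), partial.end(), 0LL);
}

// Inherently serial: each step depends on the pointer loaded by the previous one,
// so there is no "split into chunks" trick without a genuinely parallel algorithm
// (pointer jumping and the like).
struct Node { int value; Node* next; };
long long serial_sum(const Node* head) {
    long long total = 0;
    for (const Node* n = head; n != nullptr; n = n->next)
        total += n->value;
    return total;
}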
 
It seems to me a lot of people are missing the proper argument here. It isn't really 'is OOOE better than IOE' or 'is SMP better than UMP' or 'is lots of cache better than little cache'. It's a question of whether we will get rid of IOE in future console processors, so that whatever multicore system you have is all OOOE.
..or would it be better to have SMT with IOE and rely on the multiple threads to keep the pipelines filled?

And more... what are in-order and out-of-order execution??
I mean, I don't understand why order is necessary.

For example: if I need to add and then multiply, I can't multiply before doing the add?
The difference between IOE and OOE really matters when the code is doing (locally) independent things, e.g.
A:= B + C * D;
E:= F + H;

In this situation there are instructions that can be reordered. An IOE core relies purely on the compiler to reorder them, while an OOE core can also do the reordering in hardware.
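To make that concrete, a hedged C++ sketch of the same point (variable names hypothetical): if the first statement misses in the L1 cache, an in-order core stalls unless the compiler already scheduled the independent work around the load, while an out-of-order core can keep going on its own.

int compute(const int* b, int c, int d, int f, int h) {
    int a = *b + c * d;   // suppose the load of *b misses in L1: tens of cycles of latency
    int e = f + h;        // independent of 'a'

    // In-order core: unless the compiler hoisted the load of *b early or moved
    // 'e = f + h' ahead of the use of *b, the pipeline waits for the load.
    // Out-of-order core: the hardware sees that 'e = f + h' does not depend on the
    // pending load, executes it during the stall, and retires both in program order.
    return a ^ e;
}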
 
Oh my gosh, the i860 was used as the geometry unit in SGI workstations (Elan etc.?)
I'm not a developer, so I have a question about SIMD.

Single instruction, multiple data? As in translating all vectors by the same motion, for example?
Can you give me a deeper example (not too deep plz) of what SIMD does well? And why can't you do these things without SIMD?
One link for you about SIMD: http://arstechnica.com/articles/paedia/cpu/simd.ars

ArsTechnica is a great site :smile2:
Here are more articles related to CPUs: http://arstechnica.com/articles/paedia/cpu.ars
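To make the "one instruction, many data elements" idea concrete, here is a hedged sketch using x86 SSE intrinsics (my own example; the same idea applies to AltiVec/VMX on the console CPUs discussed in this thread): four floats are translated by the same offset in a single add.

#include <xmmintrin.h>  // SSE intrinsics

// Add the same (dx, dy, dz, dw) offset to an array of 4-component vertices.
// Scalar code would issue four separate adds per vertex; the SIMD version
// does all four components in one instruction.
void translate_vertices(float* verts, int count, float dx, float dy, float dz, float dw) {
    __m128 offset = _mm_set_ps(dw, dz, dy, dx);  // _mm_set_ps takes arguments high-to-low
    for (int i = 0; i < count; ++i) {
        __m128 v = _mm_loadu_ps(verts + 4 * i);  // load x, y, z, w
        v = _mm_add_ps(v, offset);               // one instruction, four adds
        _mm_storeu_ps(verts + 4 * i, v);
    }
}

You can do the same thing without SIMD - a plain scalar loop produces identical results - it just takes roughly four times as many arithmetic instructions, so the win is throughput, not capability.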
 
It is true in general, but I don't think it's the whole story. If data structures are organized for cache locality, then prefetching will often bring the pointer destinations into the cache. If the structures are allocated regularly enough, one can even use a form of loop induction to predict the pointer addresses to be fetched.

Combined with a programming language that ensures locality (e.g. heap-compacting generational garbage collection), it's a big win.

It wouldn't require any out of order execution to get the same result.

The first part concerning loading destinations into cache happens with any cached processor, and the loop induction part can just as easily be speculated in-order.

Since pointer-chasing is in fact serial, trying to do it out of order for the sake of doing it out of order just adds checks that don't do anything helpful.

OO doesn't dispatch instructions until their operands are ready, so what you are advocating is a form of value prediction for the operands. It's been bandied about in a few papers I've come across.

It's heavy-duty speculation, but it's agnostic about whether a core is in-order or not.
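A minimal sketch of the software-prefetch idea being described (my own example using the GCC/Clang __builtin_prefetch intrinsic; the regular-allocation assumption is exactly the one stated above): if nodes are laid out at a roughly fixed stride, the address of a node a few hops ahead can be guessed arithmetically and fetched early, with no out-of-order hardware involved.

#include <cstddef>

struct Node { int payload; Node* next; };

// Walk a list while hinting the cache about nodes we expect to need soon.
// The guess only pays off if the allocator/GC gave the nodes good locality,
// which is the "allocated regularly enough" case discussed above.
long long sum_with_prefetch(const Node* head, std::ptrdiff_t stride_bytes) {
    long long total = 0;
    for (const Node* n = head; n != nullptr; n = n->next) {
        // Prefetch the line we expect to need a few iterations from now; a wrong
        // guess costs nothing but the hint.
        __builtin_prefetch(reinterpret_cast<const char*>(n) + 4 * stride_bytes);
        total += n->payload;
    }
    return total;
}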
 
It seems to me a lot of people are missing the proper argument here. It isn't really 'is OOOE better than IOE' or 'is SMP better than UMP' or 'is lots of cache better than little cache'. It's a question of whether we will get rid of IOE in future console processors, so that whatever multicore system you have is all OOOE.

Putting a better alternative to the PPE into Cell makes sense, and is something I'm sure everyone wants - but does improving Cell also mean adding OOOE to the SPEs? Or will future processors get rid of SPE-like processors altogether and go back to multi-core GP processors?

There's a lot of discussion here but I think most of it isn't getting to the heart of the debate ;)

Well, personally I think the SPE model is definitely here to stay. The lean instruction set, speed, low power draw, and low complexity make it a sort of ideal candidate for the tasks suited to it - which also happen to be some of the tasks traditional CPUs struggle with the most. So it's a win on both the software and the fab side of the equation. Whether the SPEs themselves stick around, I guess, has more to do with Cell's eventual success and adoption in the industry at large. :)

Whether the 'central' core has to go OOE or not to get closer to parity with its more GP-oriented contemporaries... well, the PPU core is certainly not the standard-bearer for the ideal IOE option to begin with. I think that just for the sake of versatility in dealing with different code environments, OOE would probably make the most sense.

I guess it depends on what the 'ideal' IOE central-core solution (ruling the current PPU out) could or would look like to begin with, and what the relative advantages in power, die-size, and complexity savings would be compared to its drop in GP code efficiency. Toshiba and Sony wanted a MIPS-based central core before IBM pushed for PowerPC, so right off the bat there was the potential for what many would have felt was a 'better' central core in Cell.
 
Both versions were compiled with VS2005 at maximum optimizations, including LTCG. It's not a pathological case; more likely the P4's bigger caches and OOOE cope better with memory latencies. The benchmark was strictly single-threaded, nothing fancy - just typical usage of lists, vectors, maps, strings, shared pointers, allocations, etc.

I was watching a presentation on programming Xenon the other day, and according to it you should *never* hit memory directly. On the other hand, the data sets you were using may have partially or fully fit inside the P4's caches, and that would make a major difference to the numbers.

The compiler will also make a difference, as compilers take a few years to reach maturity.

However, your result was not unexpected - indeed, if anything it's better than I'd expect. I've seen worse results.


Try a number of threads doing SIMD processing on arrays and see how they compare.
 