Future console CPUs: will they go back to OoOE, and other questions.

ban25 · Sep 13, 2006

Arwin said:
You are comparing the PPC970 'whole-chip' to the PPE being 1/9th of a chip?

Yes.

ban25 · Sep 13, 2006

Fran said:
I use the latest VA, with its refactoring tool, but it's not even remotely as good as Eclipse (and NetBeans as I've heard of it).

Fran/Fable2

No it definitely doesn't come close, that's for sure...but it is an improvement.

Actually, I hear Eclipse has a vi plugin, that's something NetBeans is sorely lacking (there's support for external editors, but it kinda defeats the purpose of an IDE).

ERP · Sep 13, 2006

I didn't use any version of VS before VS6, which I used purely for project management and debugging.

When I transitioned to 2003, I started using the editor mostly because I found it irritating to have different editor behavior between the debugger and my editor of choice.

I've never really understood the IDE haters, having everything integrated makes life a lot nicer. I would like to see some of the refactoring tools available in C++, MS has dropped the ball on this outside of C#. I sometimes use Understand macros to refactor major changes in projects.

VS 's debugger is damn good for C++ but it's a little underwhelming as a low level debugger, the SN tools for PS3 are pretty much the exact opposite. Unfortunately I'd estimate about 90% of my debugging time is spent at the C++ level.

Frank · Sep 13, 2006

Phew. I read all of it. Time for some comments.

I generally agree with Demo. Out of Order exists mostly because of two reasons: the extremely bad concurrency allowed by the x86 instruction set and the total lack of registers, and the focus on single thread performance over all.

Both are on the verge of extinction, so while there are still some minor improvements that can be implemented, it's end of the road for both. Single thread is an open door, and the instruction set is close behind. More on that later.

But first, let's talk about cache memory. Most current generation CPUs spend more than half the total die area on cache memory. Because most of the transistors in the functional units are idle most of the time already.

But cache memory has already passed the point of diminishing returns severely some time ago. Like this: improving the L2 cache for an Athlon from 128 KB to 256 KB gives you a maximum speed increase of a few percent. Likewise for 256 to 512, or 512 to 1024. And from 1024 to 2048 there isn't any noticeable improvement anymore.

Further, if you want to go seriously multicore, you need to store all data used in the shared L3 cache, to prevent locks and severe stalls. Essentially, you want the whole memory area that is used at any one time in local storage on the die. Or at least anything that might be locked. But how does the chip knows that in advance?

So, next to the large redundancy and inefficiency of using multiple x86 cores (that were designed as single core CPUs) on a single die, if you go beyond four of them, you need to put just about the whole memory system on the die as well to show speed increased for anything but the best cases. Add a GB of L4 cache memory, so to say. And you'll need a much larger die.

Alternatively, you can break down all the functional units into simple, independent pipes that have their own local storage and add a manager that handles all of them. You can free up a significant amount of die area (cache) as a bargain. And support virtual machines at that level. Including the additional stuff like the GPU.

And, as there is only a single data entry point to each chip, it makes sense to mix and match all the instruction models into a higher level language and have the chip interpret and execute them. Not much different than what is done for some time already, as the machine code that drives the units is nothing like the assembly language that goes in.

And, why stop there? If you're going to do something even remotely like that, start with the biggest bottleneck: the memory system.

That might have made sense long ago (von Neuman), but today you need most of your die space simply to make it workable. We need something better than the brute force (speed/bandwidth) approach, and trying to store just about all of it in an on-die cache. A transaction based database engine. Throw away the waste and implement something smart.

But to be able to operate all that, we need virtual machines. Like the old virtual mode of the 80386. And a meta mode that allows you to command the chip directly. A new instruction set, that likes nothing more than sending instructions for multiple VMs to the command parser on the chip. And gives all of them their own memory subsystem, GPU and peripherals.

Arwin · Sep 13, 2006

Nice way of bringing us back on topic, DiGuru.

Also very good, easy access summary.

darkblu · Sep 13, 2006

DiGuru said:
Out of Order exists mostly because of two reasons: the extremely bad concurrency allowed by the x86 instruction set and the total lack of registers, and the focus on single thread performance over all.

pardon my french, but what does OOOe have to do with x86?

and no matter how much you focus on macro-level parallelism/reordering, some tasks are just freakingly not susceptible to it. believe it or not, the CPU still has what to offer at the micro level, i.e. parallelism/reordering within a thread.

i don't see OOOe going anywhere until we get perfect compilers/opimisers (tm)

N00b · Sep 13, 2006

DiGuru said:
But to be able to operate all that, we need virtual machines. Like the old virtual mode of the 80386.

That statemtent makes me wonder if you have any clue at all what virtual mode really is. :runaway:

Ouch. Big Ouch.

Frank · Sep 13, 2006

darkblu said:
pardon my french, but what does OOOe have to do with x86?

and no matter how much you focus on macro-level parallelism/reordering, some tasks are just freakingly not susceptible to it. believe it or not, the CPU still has what to offer at the micro level, i.e. parallelism/reordering within a thread.

i don't see OOOe going anywhere until we get perfect compilers/opimisers (tm)

You're still thinking single task. We've just about reached the limit there. Nothing more can be done. Period.

Think about all the things we can improve.

Frank · Sep 13, 2006

N00b said:
That statemtent makes me wonder if you have any clue at all what virtual mode really is. Ouch. Big Ouch.

I was in love with the 3086 when it just emerged. I read the books, experimented with it when I got one, and did most of that in assembly. Did my own mini-OS.

Any more questions?

Think about Java, or .NET.

N00b · Sep 13, 2006

DiGuru said:
I was in love with the 3086 when it just emerged. I read the books, experimented with it when I got one, and did most of that in assembly. Did my own mini-OS.

Any more questions?

Think about Java, or .NET.

So how come you compare V86 mode with a virtual machine? They don't have so much in common. No more than a normal windows process has in common with a virtual machine. I don't understand the connection. When the V86 bit is set (18th bit in EFLAGS register int the task's TSS) the cpu simply emulates a 16-bit mode 8086 with 1 MB address space.

BTW, I know my way around the 80386, too. I wrote my own dos extender. ;-)

Frank · Sep 13, 2006

Well, the new generation CPUs still need to be backwards compatible, but offer a painless upgrade path for the new possibilities. So, you need to be able to run old code, but that won't make much sense in the new model. And the processor needs to be able to calculate the scope of the data accesses and set slow ones aside.

All that requires a virtual IO subsystem and an API. So, it's like the V86 mode in that you can run old code, in their own closed VM, that you need a mechanism to hand the API access to the drivers/OS and that you get access to (some of) the extended possibilities at the same time, like registers that control the IO and GPU subsystems.

Frank · Sep 13, 2006

N00b said:
When the V86 bit is set (18th bit in EFLAGS register int the task's TSS) the cpu simply emulates a 16-bit mode 8086 with 1 MB address space.

But you got access to the 32 bits registers, fs, gs, and most of the new instructions and some of the privileged modes as well.

darkblu · Sep 13, 2006

DiGuru said:
You're still thinking single task. We've just about reached the limit there. Nothing more can be done. Period.

Think about all the things we can improve.

i'm not saying OOOe is the the end-all-be-all. i'm asking what you would gain by giving up OOOe. what else can those transistors be used for that would give you a better gain at general-purpose code?

N00b · Sep 13, 2006

DiGuru said:
But you got access to the 32 bits registers, fs, gs, and most of the new instructions and some of the privileged modes as well.

Sic. And a lot of software took advantage of this. Actually when you loaded EMM386.EXE you were running in V86 mode anyway and no longer in real mode. Ah, those good old times. ;-)

DOS times aside, I'm not sure I understand the programming model for the cpu you are proposing. Could you describe how that would work?

ban25 · Sep 13, 2006

DiGuru said:
I generally agree with Demo. Out of Order exists mostly because of two reasons: the extremely bad concurrency allowed by the x86 instruction set and the total lack of registers, and the focus on single thread performance over all.

Both are on the verge of extinction, so while there are still some minor improvements that can be implemented, it's end of the road for both. Single thread is an open door, and the instruction set is close behind. More on that later.

Please. With this statement, your whole argument is relegated to the void along with everyone else who has predicted the death of x86 in the past 20 years. In case you haven't noticed, the Core 2 Duo the fastest CPU in SPECint and very close in FP. From the way you make it sound, SPARC would be dominating everything because it has tons of registers, and aside from the Fujitsu chips, it's in-order. I've long touted the virtues of Niagara and its TLP-focused architecture for highly-parallel server-oriented applications. It excels in those, but it doesn't venture far beyond that set of applications (try a moderate compile on a Niagara box sometime and you'll see what I mean).

Just because two console manufacturers sourced an in-order core from a single chip company doesn't mean the fat lady has sung on OOO or x86 -- quite the opposite, really.

ADEX · Sep 14, 2006

I generally agree with Demo. Out of Order exists mostly because of two reasons: the extremely bad concurrency allowed by the x86 instruction set and the total lack of registers, and the focus on single thread performance over all.

That's why it's there but I think it's having much more impact in it's ability to read ahead and fetch memory early. If this can be moved into software then OOO becomes a great deal less important and it can be potentially moved into software also.

Please. With this statement, your whole argument is relegated to the void along with everyone else who has predicted the death of x86 in the past 20 years.

Real x86 died years ago, the last company to build them was Cyrix. The internal ISA of all modern x86 processors is not the one you program, it's converted in hardware. You can't however change the number of registers directly so OOO is used to indirectly increase the number of registers.

An x86 decoder is still included in the processors but Transitive are now very successfully showing that the ISA is becoming almost an irrelevance. Remove the decoder and you end up with a Transmeta chip - an in-order VLIW processor, pretty much as far from an x86 as you can get.

With multicore single threaded performance becomes less and less important. When you get to large numbers of cores per chip your cores are going to have a pitiful power budget. Intel are working on a 32 core chip - that's 3 or less watts per core max, they will look in exacting detail at everything in those cores, anything not highly efficient will have to be removed.

When you begin to look at things in terms of 32 cores solutions such as Transmeta's look like not only a good option - but probably the *only* option.

It's when you begin to consider the issues that are designers are looking at, the Cell makes an awful lot of sense, the SPEs only use 3Watts.

ban25 · Sep 14, 2006

ADEX said:
Real x86 died years ago, the last company to build them was Cyrix. The internal ISA of all modern x86 processors is not the one you program, it's converted in hardware. You can't however change the number of registers directly so OOO is used to indirectly increase the number of registers.

An x86 decoder is still included in the processors but Transitive are now very successfully showing that the ISA is becoming almost an irrelevance. Remove the decoder and you end up with a Transmeta chip - an in-order VLIW processor, pretty much as far from an x86 as you can get.

Most architectures implement something similar. You have effectively proven my point, however.

With multicore single threaded performance becomes less and less important. When you get to large numbers of cores per chip your cores are going to have a pitiful power budget. Intel are working on a 32 core chip - that's 3 or less watts per core max, they will look in exacting detail at everything in those cores, anything not highly efficient will have to be removed.

This is only true for the subset of applications which can make use of these highly parallel architectures. Just because some people are making heavily multi-core architectures doesn't mean single-threaded performance is irrelevant. Like I said, go build a project on a Niagara and tell me how you like the speed.

archie4oz · Sep 14, 2006

Entropy said:
Original POWER started out with OOO back in 1990. (As did the first PowerPC, the 601 in, uhm, 92/93?.)
However, IBM has produced PPC cores that more suitable for embedded use, where OOO may not have been implemented. But here we are still talking produced and marketed PPC microprocessors, God knows how many CPU design experiments they have lying around at IBM as a whole.

Actually thinking a little harder, I can recall that the 403/405 (the 440 is OoOE) and many of the Moto^H^H^H^H Freescale 5xx and 8xx cores are in-order...

ERP said:
I've never really understood the IDE haters, having everything integrated makes life a lot nicer. I would like to see some of the refactoring tools available in C++, MS has dropped the ball on this outside of C#. I sometimes use Understand macros to refactor major changes in projects.

I don't think there's really any IDE haters as so much as those who hate some IDEs...

ERP said:
Unfortunately I'd estimate about 90% of my debugging time is spent at the C++ level.

Well as they used to say, in C you wrote your own bugs, in C++ you inherit them..

ADEX said:
Yes, but most are 32 bit. The POWER series are 64bit and have been for some time but these were designed using automated tools whereas the PPE was designed by hand. None of them were designed for the type of frequencies PPE / Xenon run at.

Operand width isn't relevant... Neither is manufacturing techniques... Motorola/Freescale have had long history of hand tweaked designs vs. IBM's automated design methods. Nor were all of IBMs POWER and RS64 cores designed using automated tools either.

ADEX said:
More than that, OOO was first proposed by an IBM guy back in the 1960's.

I think that credit should go to Seymour Cray while he was Control Data Corporation, as they launched the CDC 6600 several years before IBM launched the 360/91 which used Robert Tomasulo's algorithm.

Gubbi · Sep 14, 2006

ban25 said:
SPARC would be dominating everything because it has tons of registers, and aside from the Fujitsu chips, it's in-order.

And incidentally the 2160MHZ Fujitsu OOO SPARC64 is by far the highest performing SPARC CPU; SpecINT=1501, SpecFP=2094 vs Sun's 1600MHz In-order UltraSPARC IIIi's SpecINT=739, SpecFP=1209 (all base scores).

Cheers

Arwin · Sep 14, 2006

I was just browsing to see if I could find some SpecINT figures for Cell ... I came across this presentation; probably already posted somewhere, but I think still interesting and on-topic:

http://www.stanford.edu/class/ee380/Abstracts/Cell_060222.pdf#search="SpecINT Cell processor ibm"

EDIT: Maybe even more interesting, albeit specialistic, is Mercury's surprisingly detailed and open performance benchmark publications

http://www.mc.com/literature/literature_files/Cell-Perf-Simple.pdf

This one is less interesting maybe but has nice graphics on second page showing an example of processing through Cell:

http://www.mc.com/literature/literature_files/CellPerfAndProg-02May06.pdf

EDIT: And they have lots more great stuff, like a promotion video with some details on the medical imaging, and webcasts that go into detail on how the Cell is being programmed

http://www.mc.com/cell/webcast.cfm

Future console CPUs: will they go back to OoOE, and other questions.

ban25

ban25

ERP

Frank

Certified not a majority

Arwin

Now Officially a Top 10 Poster

darkblu

N00b

Frank

Certified not a majority

Frank

Certified not a majority

N00b

Frank

Certified not a majority

Frank

Certified not a majority

darkblu

N00b

ban25

ADEX

ban25

archie4oz

ea_spouse is H4WT!

Gubbi

Arwin

Now Officially a Top 10 Poster

Similar threads