Future console CPUs: will they go back to OoOE, and other questions.

My point wasn't about Apple's decision, more that neither the X360's core nor Cell is as custom as some people would believe.

They were both basically customised parts that IBM already had lying around.

I do recall hearing about a PowerPC chip similar to Cell, but not one in design from IBM. (though I guess it stands to reason that if one of the other Power supporting companies was developing such a chip, it may have been based on IBM's work) I'm sure Sony got in on the Cell idea very early though, even if it wasn't there at the beginning.

Personally, I'd point to the communications/data-flow aspects and the corresponding resources as the core of the architecture. Not a FLOP to be found there, though, and its features are rather technical; maybe that's why it doesn't get much attention.
But then, I'm a data flow kind of guy.

I'm not very knowledgeable about the inner workings of CPUs (or hardware in general), but the data flow in Cell was one of the things that initially stuck out to me (and probably many others) as possibly making it superior to OOOE cores, provided the programmer can manage it properly. I expect most of the FLOPs will be burned though, but it doesn't need 10x the performance to show itself as superior. It's still unproven though, so I'll stay with the OOOE camp for now. Anyhow, if Cell had just been 7 cores with a traditional cache design, it wouldn't be even half as interesting and I'd guess it wouldn't have shown any noticeable advantage over Xenon.

Indeed, this seems to be one very important aspect of the 'Cell Broadband Engine', implied in the name even. If you look at the stuff processors do today that really brings them down, it comes down to handling and modifying large streams of data.
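To make the data-flow point a bit more concrete, here's a minimal sketch of the double-buffered streaming pattern Cell pushes you towards. `dma_get_async()`, `dma_wait()` and `process()` are hypothetical stand-ins (the real thing would use the MFC DMA intrinsics on an SPE); the only point is that the fetch of the next chunk overlaps the computation on the current one:

```c
/* Double-buffered streaming: a hypothetical sketch of the data-flow style
 * Cell encourages. dma_get_async()/dma_wait() stand in for the real MFC
 * DMA calls; process() is whatever kernel you run on each chunk. */
#include <stddef.h>

#define CHUNK 4096

extern void dma_get_async(void *local, const void *remote, size_t bytes, int tag);
extern void dma_wait(int tag);
extern void process(float *data, size_t count);

void stream_process(const float *remote, size_t total_floats)
{
    /* For brevity, assume total_floats is a nonzero multiple of CHUNK. */
    static float buf[2][CHUNK];
    size_t nchunks = total_floats / CHUNK;
    int cur = 0;

    dma_get_async(buf[cur], remote, CHUNK * sizeof(float), cur);   /* prime */

    for (size_t i = 0; i < nchunks; i++) {
        int next = cur ^ 1;

        /* Kick off the next transfer while we work on the current chunk. */
        if (i + 1 < nchunks)
            dma_get_async(buf[next], remote + (i + 1) * CHUNK,
                          CHUNK * sizeof(float), next);

        dma_wait(cur);               /* wait for the current chunk to land */
        process(buf[cur], CHUNK);    /* compute overlaps the in-flight DMA */
        cur = next;
    }
}
```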

So what did the Emotion Engine's name relate to?
 
So what did the Emotion Engine's name relate to?
I know the answer! The name comes from the fact that writing optimized code for it (for its VUs) with no decent compiler whatsoever and a bugged preprocessor/optimizer was giving you a lot of different emotions... mostly pain, anger and rage :LOL:
 
I do recall hearing about a PowerPC chip similar to Cell, but not one in design from IBM. (though I guess it stands to reason that if one of the other Power supporting companies was developing such a chip, it may have been based on IBM's work) I'm sure Sony got in on the Cell idea very early though, even if it wasn't there at the beginning.

Yes, Sony and Toshiba had a plan for Cell before and independent of IBM - and the Power architecture - whose seed originated in 1999. Whether IBM had a similar design in the works, I certainly don't know... but as far as it went with regard to STI's Cell, if an IBM Cell-like vision existed at the time, it doesn't seem it ever played a role in this particular design.

Seriously, read this thread and it will explain the chain of events in the early days of Cell's development:

http://www.beyond3d.com/forum/showthread.php?t=20563
 
Regarding Apple's decision, it's a very appropriate decision as far as I can see in a benchmark like this ;)
http://www-128.ibm.com/developerworks/forums/dw_thread.jsp?forum=739&thread=135031&cat=46
This is the same binary running on PowerPC and Cell PPE. PPE runs 5-6 times slower than PowerPC. This slowness may be due to compiler options (http://www-128.ibm.com/developerwor...reeDisplayType=threadmode1&forum=739#13868067) but it's easy to guess it's intolerable for a platform like Mac where you can't recompile all apps.
 
That makes sense. So it ignores data dependencies and speculates control dependencies to load as much as possible. It'll be interesting to see if it's actually a win, since it's effectively using two threads to do one thread's work.

As long as memory latencies remain stagnant compared to clock frequencies, it can be a big win in a lot of areas. OO cores can speculate past branches, but they can only speculate as far as they have space in load/store buffers and queues. If the processor just fires off loads for the cache to pick up, its speculation would reach much farther than an OO core that goes for maybe 20 instructions and then sits there.

From your description it's clear it'll help single-thread performance in most cases, since the actual executing thread will have a higher hit rate in the caches than without the scout. The only exception would be if the scout speculates off on a wild goose chase, wasting bandwidth that would have been used for loading needed data (i.e. it will only happen when the memory bus is saturated).
In a single-threaded instance, both the OO core and the scouting core would be speculating past a long-latency event like a cache miss, and both would be waiting on a critical memory access to go through. The difference is that the OO core is going to stop speculating and warming the cache much sooner. The OO core will pick up the misses that stay in cache, the scout won't. But OO can't beat the memory wall.

A naive scout would trash the cache, but a good implementation would capture a lot of the low-hanging memory-level parallelism that OO cores harness. An OO core would probably get minimal benefit, since it already does a lot of what the scout does, but a simpler, higher-clocked in-order or weaker OO core would benefit tremendously.
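For what it's worth, here's a toy C model of the scouting idea as I understand it from the posts above. It's purely illustrative (a direct-mapped "cache" with no latency model, and a scout depth I made up), not how any real core is wired:

```c
/* Toy model of hardware scouting: on a long-latency miss the core keeps
 * running ahead purely to touch future load addresses (warming the cache),
 * then throws the speculative results away and restarts at the checkpoint.
 * The direct-mapped "cache" and the fixed scout depth are made up. */
#include <stdbool.h>

enum { CACHE_LINES = 64, LINE_SHIFT = 6, SCOUT_DEPTH = 256 };

static unsigned cache_tag[CACHE_LINES];

static bool cache_hit(unsigned addr)
{
    unsigned line = addr >> LINE_SHIFT;
    return cache_tag[line % CACHE_LINES] == line;
}

static void cache_fill(unsigned addr)      /* no latency modelled */
{
    unsigned line = addr >> LINE_SHIFT;
    cache_tag[line % CACHE_LINES] = line;
}

void run(const unsigned *load_addrs, int n)
{
    for (int pc = 0; pc < n; pc++) {
        if (cache_hit(load_addrs[pc]))
            continue;                       /* hit: normal execution */

        /* Miss: checkpoint here. While "waiting" on memory, scout ahead and
         * prefetch every address we can compute; results are discarded. */
        int limit = pc + SCOUT_DEPTH < n ? pc + SCOUT_DEPTH : n;
        for (int s = pc + 1; s < limit; s++)
            cache_fill(load_addrs[s]);

        cache_fill(load_addrs[pc]);         /* the original miss completes */
        /* State rolls back to the checkpoint and execution resumes at pc+1,
         * now with a much warmer cache than a stalled in-order core had. */
    }
}
```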

A ROB should scale linearly with size. It will of course be slower, and in order to run it at the same latency, power would go up. But look at two wildly different approaches:
1.) P4 with one, big, fat 128 entry instruction ROB.
2.) K8 with multiple smaller ROBs.

In a standard Tomasulo OO core, the ROB may scale linearly in terms of rename registers and even remain fixed in terms of register ports and result buses.

What does not scale linearly is the cost of dependency checking, which can be done with hardware coupled closely with the ROB or in scheduling hardware. That will scale quadratically. N^2-N is the trend in the number of necessary checks, though it usually is less by some fixed factor.

OO is brute force: every entry in the ROB must check every other entry for register dependencies. That is a lot of wires, a lot of silicon, and a lot of switching in the critical loop. This is why modern cores have stagnated in terms of ROB size.
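To put the N^2 - N figure above into something concrete, here's the brute-force check written out as a loop. This is only an illustration of the comparison count; in hardware it's a sea of parallel comparators and broadcast wires, not software:

```c
/* Illustrative only: the all-to-all tag comparison a conventional wakeup/
 * dependency network has to cover. With N in-flight entries, every entry's
 * source tags are compared against every other entry's destination tag,
 * which is where the roughly N^2 - N figure comes from. */
enum { ROB_ENTRIES = 128, NUM_SRC = 2 };

struct rob_entry {
    int valid;
    int completed;           /* producer has executed and broadcast its result */
    int dest_tag;            /* physical register this entry writes            */
    int src_tag[NUM_SRC];    /* physical registers this entry waits on         */
    int src_ready[NUM_SRC];
};

void wakeup(struct rob_entry rob[ROB_ENTRIES])
{
    for (int i = 0; i < ROB_ENTRIES; i++) {           /* consumer */
        if (!rob[i].valid)
            continue;
        for (int j = 0; j < ROB_ENTRIES; j++) {       /* producer */
            if (i == j || !rob[j].valid || !rob[j].completed)
                continue;
            for (int s = 0; s < NUM_SRC; s++)
                if (rob[i].src_tag[s] == rob[j].dest_tag)
                    rob[i].src_ready[s] = 1;          /* operand now available */
        }
    }
}
```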

K8 is particularly interesting IMO. It bundles instructions in groups of three, and thereby effectively has 3 smaller ROBs in parallel for its global scheduler.
Originally, K7 would stall if an instruction got stuck in the wrong slot. K8 dedicates an entire pipeline stage to avoiding this, but that's why the pipeline is longer, and instruction scheduling is in the critical timing path.

This is the same technique used by IBM in the Power4/5 and PPC970, which have 5 (4 + branch) instructions in a group. From the global scheduler, instructions are issued to the int and FP schedulers, which are smaller and faster.
Not exactly; the instruction grouping scheme saves on scheduling resources by tracking instruction groups instead of individual instructions, as long as the instruction stream meets the rules for full issue. When it doesn't, the chip either stalls or inserts some kind of empty slot.

There's no reason this principle of hierarchical ROBs couldn't be extended (indefinitely, as we see with caches).
Besides physics and mathematics, sure.
 
Ban25 said:
I wouldn't be surprised if the dual PPC970 X360 devkits yielded better overall CPU performance than the final machine.
Afaik stock single threaded performance was considerably higher - but that much should be obvious. The 360 PPC requires specialized coding practices and lots of hand tuning if you want your "general purpose code" to run well. IBM's take on general purpose computing I guess...

nAo said:
I know the answer! The name comes from the fact that writing optimized code for it (for its VUs) with no decent compiler whatsoever and a bugged preprocessor/optimizer was giving you a lot of different emotions... mostly pain, anger and rage
Yeah, I still remember writing my first VU clipper - all in hand-written assembly with no compiler/optimizer at all. Ironically, the code hasn't changed much since I ported it to VCL; it just got a million times easier to maintain.

xbdestroya said:
Yes, Sony and Toshiba had a plan for Cell before and independent of IBM - and the Power architecture - whose seed originated in 1999.
The million-dollar question here is whether that plan involved "real man's" SIMD as well. Yeah, I just want someone to blame (I mostly blame IBM at the moment anyway, but this would give me definite proof I'm right).
 
BTW, for those who are interested, I found a publicly accessible (non-$$$ IEEE) version of the Out of Order Commit (OoOC) architecture paper: http://www.cs.utah.edu/classes/cs7940-010-rajeev/spr04/papers/oooc.pdf

I'd be interested in hearing 3dilettante's opinion on this, and perhaps a combination of this with TLP/scouting architectures.

It seems to me an OoOC with a checkpoint table size of 1 entry and no load/store queue/commit is sort of like the scouting case (all commits are rolled back to the checkpoint). If you already have a TLP core which can snapshot register-map state, then simply increasing the number of checkpoint entries and adding a load/store queue gives you either scouting behaviour or OoOC behaviour, depending on how you choose to handle the end point of a transaction.

1) if all loads of interest have succeeded since the last checkpoint, and you have executed some more loads after a new checkpoint, and you roll back to the last checkpoint, then you have effectively "scouted" forward.

2) if instead, you enable a load/store queue, and commit all memory writes in order since last checkpoint, you have OoOC/OoOe

3) and for a bonus, if you allow *software control* of where to begin/end transaction boundaries in addition to the automatic heuristic mode described in the OoOC paper, one can gain Software Transactional Memory up to the limit of the load/store queue size.

Already, there is sort of user control over this, as exceptions/traps cause a rollback and retry from the last checkpoint. However, imagine adding 2 new instructions: CHK_BEGIN (force a new checkpoint) and CHK_COMMIT valid_register (OK to commit memory writes when all loads since the last checkpoint have finished, or abort, depending on the value of the register). Essentially, an instruction which tells the CPU which register to check before committing. The trick would be how to calculate validity efficiently, which would seem to require write-barrier detection of regions of interest or some kind of scan of the load/store queue.
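Just to make that concrete, here's how those two hypothetical instructions might look from software, written as made-up C intrinsics. Nothing like `chk_begin`/`chk_commit` exists in any shipping ISA, and computing the validity value cheaply is exactly the open problem mentioned above:

```c
/* Hypothetical intrinsics for the proposed instructions. Stores issued
 * between chk_begin() and chk_commit() sit in the load/store queue; the
 * commit either drains them in order or rolls back to the checkpoint. */
extern void chk_begin(void);        /* CHK_BEGIN: force a new checkpoint        */
extern int  chk_commit(long valid); /* CHK_COMMIT: nonzero -> commit buffered
                                       stores; zero -> roll back. Returns
                                       nonzero only if the commit happened.     */

struct account { volatile long owner; double balance; };

/* Transactional-memory-flavoured update: only commit if nobody else had
 * claimed the record when we read it; otherwise roll back and retry. */
void deposit(struct account *a, double amount, long my_id)
{
    for (;;) {
        chk_begin();
        long owner = a->owner;      /* load tracked since the checkpoint    */
        a->balance += amount;       /* buffered in the LSQ, not yet visible */
        if (chk_commit(owner == 0 || owner == my_id))
            return;                 /* stores drained atomically            */
        /* rolled back to the checkpoint; retry */
    }
}
```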

Might be workable on a single core where you might have a unified load/store queue between threads, but in an SMP situation you'd need to snoop across buses, so you'd still need to fall back to the more expensive techniques.
 
Regarding Apple's decision, it's a very appropriate decision as far as I can see in a benchmark like this ;)
http://www-128.ibm.com/developerworks/forums/dw_thread.jsp?forum=739&thread=135031&cat=46
This is the same binary running on PowerPC and Cell PPE. PPE runs 5-6 times slower than PowerPC.
This poses an interesting question. If OOOe adds only 8% die area to that PPU core, for a 5x improvement in speed in non-optimized code, and you want to sell this processor to as many customers as possible, including Apple, why wasn't it added? Especially given these were supposed to be existing parts IBM shoe-horned into a Cell. I don't understand chip designing, so I have no idea how much effort it is to add that or not, but if you're already producing a processor with OOOe, why not take the same designs for that part and add them to the PPU?
 
Regarding Apple's decision, it's a very appropriate decision as far as I can see in a benchmark like this ;)
http://www-128.ibm.com/developerworks/forums/dw_thread.jsp?forum=739&thread=135031&cat=46
This is the same binary running on PowerPC and Cell PPE. PPE runs 5-6 times slower than PowerPC. This slowness may be due to compiler options (http://www-128.ibm.com/developerwor...reeDisplayType=threadmode1&forum=739#13868067) but it's easy to guess it's intolerable for a platform like Mac where you can't recompile all apps.

Yes. For completeness' sake, let's add IBM's reply. ;)

Dan Greenberg/IBM said:
You are making an excellent start. We appreciate your efforts and your posting here. Thank you. Here's a bit of information that will hopefully help:

First, please remember that the simulator is cycle-accurate only for the SPUs. Therefore, performance tests run on a simulated Cell as a whole... including the PPU, DMA, etc.... are not accurate, full stop. We're working to make broader performance measurement available on the simulator, but the version in SDK 1.1 does not support the test you did.

Second, I believe you answered your own question:
"...please keep in mind that none of the software I discuss runs on the SPU's."
"The binary executed on the JS20 and the Cell Blade were identical -- I did not recompile the code."
Unlike the more conventional G5, Cell was designed to use different and new techniques to attain high performance. Therefore, while the G5 (more properly, the PowerPC 970) uses many modern techniques like OOO, branch prediction, etc. to speed up its performance, Cell uses a simplified Power core and frees up silicon space and power budget for synergistic processors -- the SPUs -- to accelerate performance. (The price you pay is that you need a lot more memory bandwidth to keep all of the threads fed... which Cell has.) Therefore, it's a matter of course that a binary that does not tap the SPUs will run faster on a 970 than a Cell. So add an SPU or two, and watch your code accelerate!

Finally -- on gcc. The IBM compiler (XL C) has a great deal of technology built in to take advantage of the VMX units on the 970 and Cell. Much of this technology has not been duplicated in gcc yet. Consider switching compilers and adding two more lines to your table for Power processors using XL C. The 970 will still outpace the Cell until you use the SPUs, but the performance you achieve may improve for both.
 
Yes. For completeness' sake, let's add IBM's reply.

If you keep reading the thread, someone points out you get very similar results in G4-G5 comparisons if you're not careful. The G5 has a hardware pre-fetcher and the G4 doesn't; if you don't add pre-fetch code to the G4, its performance dives. Conversely, adding pre-fetch instructions to the G5 actually slows it down.
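For anyone who hasn't run into this: the kind of software prefetch in question looks roughly like the sketch below (using GCC's `__builtin_prefetch`, which as far as I remember turns into dcbt on PowerPC). On a G4 it can be the difference between streaming and crawling; on a G5 the hardware prefetcher already spots the sequential pattern, so the extra hint instructions mostly just add overhead:

```c
/* Manually prefetch ahead of a streaming read. The lookahead distance (64
 * elements here) is a guess that has to be tuned per chip: too short and the
 * data isn't there in time, too long and useful lines get evicted. */
#define LOOKAHEAD 64

double sum(const double *a, long n)
{
    double s = 0.0;
    for (long i = 0; i < n; i++) {
        __builtin_prefetch(&a[i + LOOKAHEAD]);  /* hint only: never faults */
        s += a[i];
    }
    return s;
}
```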

The original poster also said the Cell in question is a DD2, this chip was known to be limited by TLB thrashing and there were changes in DD3 to rectify this. One benchmark it made a big difference to is FFTs - exactly what this guy was testing.

The PPE was designed to be a single-threaded monster, and as pointed out above it needs manual optimisation. That a G5 beats it should not be surprising; that said, once optimised, the PPE should pull ahead on clock- or bandwidth-limited tasks.

As for the Apple connection, there were rumours of a PowerPC 350/360 to be based on Cell; I assume this is the same PPE core. It's not, however, a purely Apple thing: the PPE itself is derived from an older PPC design started in 1997/98 which was built to test high frequencies. It implemented only 90% of the PPC ISA but managed to do 1 GHz three years before anyone else did. Look up "GuTS".
 
This poses an interesting question. If OOOe adds only 8% die area to that PPU core, for a 5x improvement in speed in non-optimized code, and you want to sell this processor to as many customers as possible, including Apple, why wasn't it added? Especially given these were supposed to be existing parts IBM shoe-horned into a Cell. I don't understand chip designing, so I have no idea how much effort it is to add that or not, but if you're already producing a processor with OOOe, why not take the same designs for that part and add them to the PPU?

The only other design with OOO is the 970, but it's a completely different design produced in a completely different way. You'd have to design a new OOO section for this chip.
As for why, it was most likely not added due to power concerns.

Consider that Xenon with 3 cores at 3.2GHz is burning 85W, the 970MP with 2 cores at 2.5GHz takes up to 125W.

The two CPUs in the alpha kit are IBM PowerPC 970 CPUs. These chips do not have the new features that will be in the Xbox 360 CPUs. There are also some performance differences to be aware of:
· The two CPUs in the alpha kits don’t support simultaneous multithreading (SMT), whereas the final hardware will have three SMT CPUs.
· The two CPUs in the alpha kit are separate chips, each with its own independent 512-KB L2 cache, whereas the final hardware will have three CPUs on one chip with 1 MB of L2 cache on the same chip.
· The PowerPC 970 can dispatch up to five instructions per clock cycle with aggressive out-of-order execution. In contrast, each Xbox 360 CPU core will be able to issue a maximum of two instructions per clock cycle in a more linear fashion, meaning that the Xbox 360 CPU may execute fewer instructions per clock cycle. However, the Xbox 360 CPU will run at a much higher clock speed and will have much higher peak vector performance.
· The instruction latencies on the alpha kit CPU are typically fewer clock cycles compared to the Xbox 360 CPU, but the clock cycles will be shorter on the Xbox 360 CPU.
· The vector units in the alpha kits have fewer registers and are missing some important instructions, such as dot product.

There are other differences as well:
The cache features cache locking to allow streaming data between cores and the GPU.
The cache can be read directly by the GPU.
The alpha kits used single-core 970s, so each would have had its own memory bus; Xenon has a fast memory bus, but the memory pool is shared with the GPU.
 
The original poster also said the Cell in question is a DD2, this chip was known to be limited by TLB thrashing and there were changes in DD3 to rectify this. One benchmark it made a big difference to is FFTs - exactly what this guy was testing.
I've never seen any info on DD3. Can you provide more detail on the changes, and any links?
 
This poses an interesting question. If OOOe adds only 8% die area to that PPU core, for a 5x improvement in speed in non-optimized code, and you want to sell this processor to as many customers as possible, including Apple, why wasn't it added? Especially given these were supposed to be existing parts IBM shoe-horned into a Cell. I don't understand chip designing, so I have no idea how much effort it is to add that or not, but if you're already producing a processor with OOOe, why not take the same designs for that part and add them to the PPU?

There are many more differences between the PPC and the PPU than just the lack of OOOe. I doubt the 5-6x speed advantage can be totally attributed to that, or even a significant percentage of it.
 
There are many more differences between the PPC and the PPU than just the lack of OOOe. I doubt the 5-6x speed advantage can be totally attributed to that, or even a significant percentage of it.

I think my earlier quote from IBM's Dan Greenberg makes this point rather clearly.
 
I've never seen any info on DD3. Can you provide more detail on the changes, and any links?

They're on or past DD3.1 by now at a minimum; Barry Minor (I think it was him) sort of revealed the DD3 revision to the public (or at least for the purposes of this forum discussion) when he let it be known that he was using a DD3.1 revision chip in his ray-casting comparisons against that NVidia GPU.

...ok yeah, here it is: http://gametomorrow.com/blog/index.php/2005/11/30/gpus-vs-cell/

Anyway when asked on the IBM/Cell boards it was stated that DD3.1 was a revision for yield improvements, but I personally think there's more to it than that, because yields alone don't warrant a jump in an entire revision digit.

And indeed Adex's above post speaks a little to some of the 'real' changes I guess.
 
I think my earlier quote from IBM's Dan Greenberg makes this point rather clearly.

Yes, I didn't read that before making my post. The rather large difference in execution units between the two cores should also be considered. The PPU isn't close to the G5 in this regard, aside from the VMX unit of course.
 
I've never seen any info on DD3. Can you provide more detail on the changes, and any links?

They've never posted any details! You can't even get a die photo past DD2.


You can however read between the lines ;-)

When the first Cell ISA book was released it included a section describing "things we're thinking about adding in the future"; one of these was support for large pages.

Then a while later, suddenly they start talking about how to use large pages and why you'd want to.

It's obvious that there have been changes other than tweaks for yield; aside from large pages, I don't know what else has changed though.
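For reference, the way applications actually got at large pages on Linux at the time was via hugetlbfs, something like the sketch below. The "/huge" mount point and the 16 MB page size are just the values I remember from the Cell material, so treat them as assumptions:

```c
/* Map a buffer backed by large pages from a mounted hugetlbfs filesystem.
 * "/huge" is an example mount point and 16 MB an assumed large-page size;
 * the win is that a big streaming buffer needs far fewer TLB entries. */
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

#define HUGE_PAGE_SIZE (16UL * 1024 * 1024)

void *alloc_huge(size_t bytes)
{
    int fd = open("/huge/buffer", O_CREAT | O_RDWR, 0600);
    if (fd < 0)
        return NULL;

    /* hugetlbfs mappings must be a multiple of the large-page size. */
    size_t rounded = (bytes + HUGE_PAGE_SIZE - 1) & ~(HUGE_PAGE_SIZE - 1);
    void *p = mmap(NULL, rounded, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);   /* the mapping stays valid after closing the fd */

    return p == MAP_FAILED ? NULL : p;
}
```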
 
They've never posted any details! You can't even get a die photo past DD2.

You can however read between the lines ;-)

It's obvious that there have been changes other than tweaks for yield; aside from large pages, I don't know what else has changed though.
So it's down to Sherlock Holmes-ing it, huh? As DD3s are available to buy (right? no point producing lower-yield DD2s), it seems odd they don't provide full tech specs, unless they do but under NDA :???:
 
We run all our benchmarks on PPU, SPU, X360 CPU and usually a PC, and I've yet to see the PC not totally dominate on a single processor benchmark.

Are you saying the PC beats Cell and Xenon on FP intensive benchmarks? Are you saying that a PC beats a de/compression program running in SPU local store? Or are you just running standard PC benchmarks recompiled to run on a single Xenon or PPE core without utilising the Xenon's or Cell's strengths?
 