Ars Technica Write-Up

I am not sure "flexible" is the right word, at least not in how I look at the processors. It really depends on what you are looking at in the designs.

Per the caches, I would not necessarily disagree. Looking at the caching setup on the XeCPU I can see where it would be more limiting, but I think that is a tradeoff of a multicore PPC design versus a streaming processor. The flexibility of the SPEs is a necessity of CELL's streaming design, since you need to do a lot of operations on a small amount of data to benefit from a stream processor. So I look at it more as a design feature that the multicore PPC does not necessarily need. The XeCPU does seem "less flexible" in how it can "lock" the cache for procedural synthesis (compared to how the CELL PPE and SPEs look to use the FlexIO pretty openly... btw, the FlexIO and low-latency XDR are often overlooked; I think they will have a big impact on performance). But even then, Xbox programmers can determine how, how much, and even whether the L2 cache is used for this special task. Similarly, the XeCPU can bypass the L1 and L2 caches. So for a general processing core they do seem fairly flexible within their design architecture.

But if you define "flexible" by the type of information the processor can handle, then I think the XeCPU is much more flexible than the FP/streaming-focused CELL.

Up to this point games have been written, and done well, using "typical" processors. And that is because games do require processing tasks that are NOT FP based. You have loads, stores, integer work, branches, random memory accesses, etc... to take into consideration. These are areas where 3 PPC cores should have an edge and are much more flexible than the SPEs. Since the XeCPU has more processing power in these areas, and 115 GFLOPS of floating point performance on top of that, I would call it the more "flexible" or "balanced" design, because it can deal with more types of processing tasks at a better overall level.

CELL, on the other hand, is a streaming design aimed primarily at chewing through floating point oriented code. Instead of a good general processor, Sony went with a single PPC core and loaded up on floating point processing in a streaming design. Since CELL requires the developer to really focus on using the SPEs (i.e. code that is small and floating point friendly) to get the most out of them, I would consider that LESS flexible. Asking developers to make their integer-based code FP friendly, because CELL has 1/3 of the integer performance of the XeCPU, is less flexible in my book. Similarly, the VMX units are said to be a lot more flexible than the SPEs, but that is a tradeoff of flexibility against pure brute power.

But less flexible does NOT mean "not better" or "not as good". Multithreading is forcing developers to face new hurdles they have not had in game design before. Sony's solution is to have dedicated physical units in a stream processing array. MS has gone the more traditional route of going multicore. If flexible is a reference to how multithreaded apps are dealt with, they both pose hurdles and we will have to wait and see. CELL may very well be better suited, and more flexible, for *multithreaded games*. The question is what game developers will find best/easiest to use.

Or we might see that each design excels in different areas and genres.

I do wonder which is the best gamble: ~50% of the FP performance and ~300% of the general processing power or ~200% of the FP performance and ~33% of the general processing power (*Yes, I know these are all peak and multicore CPUs rarely get anywhere near the theoretical jump in performance compared to a single core.) So far games have done pretty well with typical designs, and the XeCPU has ~5x the FP power of a top end PC chip. But on the flip side, a streaming processor may be a better way to deal with threads. Personally I think developers will really struggle for the first couple of years (especially with launch titles), and only as the tools MS and Sony make available improve, and developers find new and unique ways to thread their apps, will we know whether one design or the other was better.

Until then I expect 3rd party cross-platform apps to mainly use the PPC cores and take it slow. 1st parties will lead the charge, and hopefully we will get a glimpse of what these monsters can do soon.

Anyhow, that is my take on the flexibility issue :D
 
Very interesting topic and link ...

When I read all the documentation about the CPUs (multi-thread oriented) and the GPUs (unified shaders, for example), to keep it simple, I have one question: when?

From a development POV, there is really a shift. When will we see games starting to use a significant part of the power that potentially resides in these machines? 2nd generation, 3rd one? 2 years... more?

That worries me, because if you look at the current gen, graphics have often been "dragged down" to the least powerful machine (and it has been stated that the VUs in the PS2 have often not been used to their fullest, for example).

My 2 cents ...
 
Acert93 said:
I do wonder which is the best gamble: ~50% of the FP performance and ~300% of the general processing power or ~200% of the FP performance and ~33% of the general processing power (*Yes, I know these are all peak and multicore CPUs rarely get anywhere near the theoretical jump in performance compared to a single core.)
Is that a fair estimate of FP and general purpose capabilities though? People talk of Cell's SPEs as being only capable of FP chugging. But they are proper processors, not just VMX units, and they can work as quickly on integers as on FP and can do out-of-order. They take a big hit with OOO as they lack the specific designs to cope with that tech, but that doesn't make them totally useless. If an SPE is half as good at general purpose as a PPE, Cell has 1+(7x0.5) = 4.5 PPC cores' worth. Even if an SPE is only 25% as efficient at general purpose code, that's 1+(7x0.25) = 2.75 PPC cores' worth.

I don't know what the capabilities of the SPEs are in regard to general purpose work, but from what I can gather people totally overlook that and think of them as only good for FP crunching, which misrepresents what Cell can and can't do.

edit: Also, aren't the XeCPU cores in-order? In which case, don't they lose a lot of their advantage for general purpose code? When people say the XeCPU is better than Cell in this respect, what criteria are they comparing?
 
Shifty Geezer said:
Acert93 said:
I do wonder which is the best gamble: ~50% of the FP performance and ~300% of the general processing power or ~200% of the FP performance and ~33% of the general processing power (*Yes, I know these are all peak and multicore CPUs rarely get anywhere near the theoretical jump in performance compared to a single core.)
Is that a fair estimate of FP and general purpose capabilities though? People talk of Cell's SPEs as being only capable of FP chugging. But they are proper processors, not just VMX units, and they can work as quickly on integers as on FP and can do out-of-order. They take a big hit with OOO as they lack the specific designs to cope with that tech, but that doesn't make them totally useless. If an SPE is half as good at general purpose as a PPE, Cell has 1+(7x0.5) = 4.5 PPC cores' worth. Even if an SPE is only 25% as efficient at general purpose code, that's 1+(7x0.25) = 2.75 PPC cores' worth.

I don't know what the capabilities of the SPEs are in regard to general purpose work, but from what I can gather people totally overlook that and think of them as only good for FP crunching, which misrepresents what Cell can and can't do.

edit: Also, aren't the XeCPU cores in-order? In which case, don't they lose a lot of their advantage for general purpose code? When people say the XeCPU is better than Cell in this respect, what criteria are they comparing?

Someone noted in another topic that the cores are not in-order.
 
Shifty Geezer said:
I don't know what the capabilities of the SPEs are in regard to general purpose work, but from what I can gather people totally overlook that and think of them as only good for FP crunching, which misrepresents what Cell can and can't do.

edit: Also, aren't the XeCPU cores in-order? In which case, don't they lose a lot of their advantage for general purpose code? When people say the XeCPU is better than Cell in this respect, what criteria are they comparing?

The biggest deficiencies the SPEs have in regard to execution of GP workloads are the lack of demand-loaded caches to exploit spatial and temporal locality automatically, and the lack of a branch prediction unit.
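
To make that concrete, here is a rough sketch (mine, simplified, and assuming the interface names from IBM's spu_mfcio.h) of what an SPE has to do instead of relying on a cache: the programmer explicitly DMAs each chunk of data into the 256KB local store, waits for it, works on it, and DMAs the results back out.

Code:
#include <spu_mfcio.h>

#define CHUNK 4096
static float buf[CHUNK / sizeof(float)] __attribute__((aligned(128)));

/* Process one chunk living at effective address 'ea' in main memory. */
void process_chunk(unsigned long long ea)
{
    const unsigned int tag = 0;

    /* Explicit DMA into local store -- this replaces the demand-loaded
       cache a conventional core gives you for free. */
    mfc_get(buf, ea, CHUNK, tag, 0, 0);
    mfc_write_tag_mask(1 << tag);
    mfc_read_tag_status_all();          /* block until the DMA completes */

    for (unsigned int i = 0; i < CHUNK / sizeof(float); i++)
        buf[i] *= 2.0f;                 /* stand-in for the real work */

    /* Explicit DMA back out to main memory. */
    mfc_put(buf, ea, CHUNK, tag, 0, 0);
    mfc_write_tag_mask(1 << tag);
    mfc_read_tag_status_all();
}

In practice you would double-buffer so the next DMA overlaps the current computation, but the point stands: spatial and temporal locality becomes the programmer's problem instead of the hardware's.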

OOO wasn't implemented in either CELL or the XeCPU, where the focus seems to have been on SIMD throughput instead.

Out-of-order execution helps you execute past data-dependency stalls and hence helps hide latency (but is normally limited to level 2 cache latencies). It also helps you schedule across function calls, which makes for more compact code (and higher performance, since you make better use of your i-cache).
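
A trivial, made-up fragment to illustrate the data-dependency point:

Code:
float example(const float *p, float y, float z)
{
    float x = p[0];       /* load -- may miss cache, long latency          */
    float a = x * 2.0f;   /* depends on the load: an in-order core stalls  */
    float b = y * z;      /* independent: an OOO core starts this early,   */
                          /* hiding part of the load latency               */
    return a + b;
}

On an in-order core you rely on the compiler to hoist the independent work above the dependent use; OOO hardware does that scheduling for you at run time, across cache misses and function call boundaries.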

Cheers
Gubbi
 
Okay. How much advantage does that bring to conventional general purpose code? How well/badly do you expect the SPEs to fare with it, assuming the code isn't optimised for them?
 
Shifty Geezer said:
Okay. How much advantage does that bring to conventional general purpose code? How well/badly do you expect the SPEs to fare with it, assuming the code isn't optimised for them?

Every 7th instruction is a branch in the SpecINT suite. A state-of-the-art branch predictor like the one in Prescott guesses wrong less than once every 100 instructions, so less than 1 in 14 branches is guessed wrong. That is *very* important with big mispredict penalties (20+ cycles for the SPEs).
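
Back-of-envelope, using those numbers (the 30% rate for a core with no real predictor is purely made up, just for contrast):

Code:
#include <stdio.h>

int main(void)
{
    const double branch_freq = 1.0 / 7.0; /* branches per instruction (SpecINT)              */
    const double penalty     = 20.0;      /* cycles lost per mispredict                      */
    const double good        = 0.07;      /* ~1 wrong per 100 instructions = ~7% of branches */
    const double poor        = 0.30;      /* hypothetical rate with no predictor             */

    printf("good predictor: %.2f stall cycles per instruction\n", branch_freq * good * penalty);
    printf("no predictor:   %.2f stall cycles per instruction\n", branch_freq * poor * penalty);
    return 0;                             /* roughly 0.20 vs 0.86 */
}

The exact numbers don't matter much; the cost scales linearly with both the mispredict rate and the penalty, which is why a 20+ cycle penalty with no predictor hurts so much.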

Cheers
Gubbi
 
Acert93 said:
I do wonder which is the best gamble: ~50% of the FP performance and ~300% of the general processing power or ~200% of the FP performance and ~33% of the general processing power (*Yes, I know these are all peak and multicore CPUs rarely get anywhere near the theoretical jump in performance compared to a single core.) So far games have done pretty well with typical designs, and the XeCPU has ~5x the FP power of a top end PC chip.

5x the FP power of a top end PC chip? Are you actually counting the XeCPU as 6 CPUs?

Even though the XeCPU might be able to 'claim' 6 logical CPUs, you really only have the power of 3 units there.

The performance increase caused by HyperThreading on Intel CPUs is marginal at best. Why do you think it will be any better in IBM's implementation?

These solutions only help you use the resources in a single core more efficiently; it is nowhere near doubling performance.
 
The 100+ GFLOPS of the XeCPU were official specs, not calculated from assuming 6 cores. If 100+ GFLOPS isn't to be believed you need to question the source (IBM?)
 
That was one of the earliest public reactions from MS, that their CPU does better on integer or GP code. That is what Ballmer said in the interview the day after the press conferences.

So where does GP code come into play in gaming?

Or will MS port Office to the X360 and PS3 and say, "Our machine is faster!"
 
wco81 said:
Or will MS port Office to the X360 and PS3 and say, "Our machine is faster!"

Now possible to type 9 billion words a second! Faster coding! More games coming out sooner than ever before!
 
Has anyone noticed that Hannibal thinks the XeCPU has two vector units per core? Wouldn't that increase the FLOPS from 76 GFLOPS to 152 GFLOPS?
 
bbot said:
Has anyone noticed that Hannibal thinks the XeCPU has two vector units per core? Wouldn't that increase the FLOPS from 76 GFLOPS to 152 GFLOPS?

No, the second one is the vector permute unit, like in any other VMX implementation.
 
Zeross said:
bbot said:
Has anyone noticed that Hannibal thinks the XeCPU has two vector units per core? Wouldn't that increase the FLOPS from 76 GFLOPS to 152 GFLOPS?

No, the second one is the vector permute unit, like in any other VMX implementation.

please explain
 
inefficient said:
5x the FP power of a top end PC chip? Are you actually counting the XeCPU as 6 CPUs?

No, I am not ;) The XeCPU is rated at 115 GFLOPS peak performance. The mid 20s (GFLOPS) is what I have seen reported on websites for the FP performance of top end P4/AMD64 chips. Others have found similar numbers, but I have yet to find anything official from AMD/Intel. Maybe their peak is higher than that, but I have not seen anything official yet. I bet it is much lower than 115 GFLOPS though.

The performance increase caused by HyperThreading on Intel CPUs is marginal at best. Why do you think it will be any better in IBM's implementation?

They are different designs. I am not saying it is better than an Intel chip, but the claimed peak FLOPS for the XeCPU is substantially higher than that of modern PC chips. Part of this may have to do with the refined VMX units in the PPC cores. Obviously some extra FP power would be helpful for games that use a lot of procedural synthesis, physics, collision detection, etc... So the extra FP power is a design decision. Intel/AMD chips are first and foremost general processing chips, gaming being a niche market compared to total PC sales. So it is a custom multicore PPC design for gaming (which may need more FP) vs. a standard desktop chip, and the difference in FP performance should not necessarily be surprising.
 
bbot said:
please explain

Look at this diagram: [image: 970-vector-simple.png]

The VMX unit can issue two vector instructions per cycle, but only one can be a floating point operation, so the floating point throughput is just 8 operations per cycle (one vec4 FMAC) per VMX unit: 3.2 * 3 * 8 + 3.2 * 3 * 2 * 2 = 115.2 GFLOPS.

What Hannibal illustrated with his diagram of the Xbox CPU is just that the VMX unit is dual issue, not that there are two vector units.
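
For anyone who wants to check the arithmetic (the split below just follows the formula above; I'm not asserting exactly which execution unit the second term maps to):

Code:
#include <stdio.h>

int main(void)
{
    const double clock = 3.2;                  /* GHz                                     */
    const int    cores = 3;

    double vmx   = clock * cores * 8.0;        /* one vec4 FMAC = 8 flops/cycle per VMX   */
    double other = clock * cores * 2.0 * 2.0;  /* remaining flops/cycle as counted above  */

    printf("VMX   : %5.1f GFLOPS\n", vmx);          /*  76.8 */
    printf("other : %5.1f GFLOPS\n", other);        /*  38.4 */
    printf("total : %5.1f GFLOPS\n", vmx + other);  /* 115.2 */
    return 0;
}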
 