Strengths and weaknesses of GameCube relative to its peers *spawn

Looking at that vid I didn't see the deformable terrain you speak of.
The particles wouldn't be an issue they're just polys.
Lots of particles can mean lots of polygons, certainly means lots of fill, and can mean lots of physics calculations to animate them all. Look at how PS3 struggled with particles and contrast that with how PS2 used them to amazing effect in so many titles. That was one of the things I most lamented about PS3, how it lost the classy particles of its predecessor.
 

When you run into the emblem and the terrain changes.

Problem is, Xbox does the same particles in that particular game, and with Wii's clock boosts I can't think it would be an issue. GC had more available bandwidth than Xbox and Wii has additional GDDR3. But the PS2 still outclasses Wii in particles. For me ZOE 2 is the best use of PS2's fill rate.

But yeah, PS3 sucked at alpha and particles.
 

Excite Truck looks really nice. What's going on may not be as demanding as physics based deformation of structures though.

If you're moving terrain in a predetermined manner (I'm guessing that's what the game is doing but I don't know for sure), then the calculation could be as simple as interpolating between two predetermined Y positions for each vertex, perhaps with a time based element (a curve I suppose you'd call it) to make it speed up and slow down near the start and end of the movement.

Hitting the emblem would trigger a simple script that would animate the model space Y values of the affected vertices as they were fed into the T&L unit. Something like that, if you get the idea.
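To make that guess a bit more concrete, here's a rough Python sketch of the kind of interpolation I mean. It is not taken from the game: the function names, the smoothstep curve and the vertex heights are all made up for illustration, and the real thing (if it works this way at all) would run per vertex on the console, not in Python.

```python
def smoothstep(t):
    """Ease-in/ease-out curve: 0 at t=0, 1 at t=1, flat slope at both ends."""
    t = max(0.0, min(1.0, t))
    return t * t * (3.0 - 2.0 * t)


def deform_heights(start_y, end_y, elapsed, duration):
    """Blend each vertex's Y between two predetermined heights."""
    w = smoothstep(elapsed / duration)
    return [a + (b - a) * w for a, b in zip(start_y, end_y)]


# Hypothetical data: three affected vertices, flat ground rising into a hill.
flat = [0.0, 0.0, 0.0]
hill = [2.0, 5.0, 2.0]

# A 60-frame animation triggered by the emblem, sampled every 15 frames.
for frame in range(0, 61, 15):
    print(frame, deform_heights(flat, hill, frame, 60))
```

The per-vertex cost is one multiply-add plus a curve value shared by every vertex in the frame, which is why I'd expect this to be far cheaper than simulating the deformation.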

Wii offered a nice boost in performance over the GC. I think if Nintendo had been prepared to spend more on power and cooling, and to accept lower yields, the GC could have clocked rather higher than it did. But that's more of a guess than anything, based on the 750 lineup and the speed of contemporary GPUs of similar size, features and process node.
 
Yeah I thought the terrain change maybe wasn't as intensive because there's no physics involved. But still I haven't seen anything like that on the cube.

Nintendo could've clocked GC higher, but they wanted the best reliability, and increasing the clocks would've required a bigger console with more expensive cooling and worse reliability, so they made the right choice at the time. If anything I think Wii should've been clocked higher (we know the CPU could've been double GC's speed looking at the 750 arch on PC, not sure about the GPU though); they could've made it slightly bigger and decreased heat with die shrinks later on.
 
All right, so this is what I've (not) found: I could barely find any direct benchmark comparisons between the Pentium III and the 750CXe (Gekko is a 750CXe plus increased registers and floating point performance). Initially I thought I had looked at a few game benchmarks of Quake III Arena, but checking them again, they were either running at different resolutions on each system (lower res on the Pentium, hence me thinking it outperformed) or used different graphics cards. I've found benchmarks of individual systems that are as close as possible to the Xbox and GC, but then I couldn't find the same benchmark for the other CPU for an apples to apples comparison. Lots of dead links and data that makes no sense.

These are the most concrete comparisons I've found (they're the only direct comparisons I could find that weren't just forum posts). They point to the PowerPC beating the Pentium III by quite a bit, but they only show one or two benches, so it's far from comprehensive:

http://www.infohq.com/Computer/appleG4-pentiumIII-showdown.htm


^ The above video makes sense given what I've read about the 750's superior floating point performance (which GameCube would have even more of per clock), even if this is surely a best case scenario that Apple cherry picked. And I understand FP doesn't tell the whole story. After all, if it did I'd imagine the GameCube's fixed T&L pipeline would've been a non-issue.

http://www.anandtech.com/show/858/13

In AnandTech's 2001 comparison of the GC and Xbox consoles, they claim "In terms of raw performance, the Celeron 733 (4-way set associative L2) will outperform the PowerPC 750 running at 500MHz in any of the synthetic benchmarks we've seen. We can only assume that a 733MHz CPU with a 133MHz FSB and 8-way set associative L2 cache would only be faster than the Gekko giving the Xbox the CPU performance advantage." I couldn't find anything to back up this claim.

Besides floating point performance, the other Gekko advantage would be 4x the number of general purpose registers the Xbox CPU has, which to my understanding has a lot to do with memory efficiency. No surprise there looking at the rest of the GameCube's memory setup. Not to mention the faster front side bus of the GameCube (166 vs. 133).

So yeah, again it seems like the PowerPC 750 commands a per clock advantage, though to what extent (realistically) I can't tell. It's probably safe to say both Gekko and the Xcpu had advantages depending on the type of task, and they were overall pretty comparable.

If anyone has any more concrete comparisons or developer statements, that'd be appreciated. Besides Factor 5's (apparently PR) statements, the only other devs I remember commenting on Gekko were the guys behind Gladius; they were really impressed with the speed of the chip and how painless it was to get things running. Can't find the comment now though.
 
Are you sure you're looking at comparisons of the right parts? That video isn't looking at a 7xx-family device at all. G4 is 74xx, a different microarchitecture.
Hmm, yes, you're right. Same goes for the other linked comparison. In fact the only comparison I could find was the G3 (though it's not the 750CXe) vs. the Pentium II.

bytemark-9801-320x361.png


Kind of at a dead end here :p
 

The G4 was essentially a PowerPC with AltiVec units. That gave it 4xFP32 FMAs as well as full 128-bit integer SIMD. These functions were very useful for image processing and it's no secret that Adobe heavily optimized their software for PowerPC. This was probably true even outside of AltiVec usage, and they most likely gave it a lot more attention than their x86 ports. Apple (and Mac enthusiasts) exploited this heavily and constantly featured it in their marketing. Needless to say it was extremely cherry picked.

Pentium III has weaker SIMD in comparison. There are only 8 128-bit registers and while they support 4x32-bit FP operations it only completes 2x32-bit per cycle. Its integer SIMD is limited to the original MMX instruction set over a separate set of 8 64-bit registers. Both MMX and SSE were pretty deficient compared to AltiVec.

But that's the G4. Gekko doesn't have AltiVec. Instead it just has a fairly limited set of "paired singles" operations which can do 2x32-bit FMAs per cycle. So Gekko certainly doesn't have the level of FP performance that G4 has, let alone more like you claim. And it has no integer SIMD. All things considered, a good programmer can do comparably well with SSE (per cycle) and can do a lot more with MMX if the algorithms can use packed 8-bit or 16-bit data types within the confines of its instruction set.

There are a lot of other little nits to consider when comparing the two CPUs. P3 is wider and has more reordering capabilities than Gekko. Xbox's version may only have 128KB of L2 cache, but it's 8-way associative vs 2-way on Gekko. Conventional wisdom is that going from 2-way to 4-way associativity yields a similar hit rate improvement as doubling capacity; this will vary a lot from case to case, but there's a good chance that Xbox's L2 cache was more effective despite its size. On the other hand, GC's 1T-SRAM was probably significantly lower latency than Xbox's DDR SDRAM, although both had to go through the GPU chip so I don't know how much of a difference it made for CPU stuff. Gekko has a larger and more associative L1 cache. P3 probably has better branch prediction. And so on.
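To show what "varies a lot from case to case" can look like, here's a toy set-associative cache simulator in Python. It is only a sketch under loud assumptions: LRU replacement, 32-byte lines on both sides, no L1 in front and no timing model, so it says nothing about real game code. The 256KB figure for Gekko's L2 is its known size; the two access patterns are invented to sit at opposite extremes.

```python
from collections import OrderedDict


class SetAssocCache:
    """Toy set-associative cache with LRU replacement (no timing model)."""

    def __init__(self, size_bytes, ways, line_bytes=32):
        self.ways = ways
        self.line_bytes = line_bytes            # assumption: 32-byte lines
        self.num_sets = size_bytes // (ways * line_bytes)
        self.sets = [OrderedDict() for _ in range(self.num_sets)]
        self.misses = 0

    def access(self, addr):
        line = addr // self.line_bytes
        lines = self.sets[line % self.num_sets]
        if line in lines:
            lines.move_to_end(line)             # mark as most recently used
            return
        self.misses += 1
        if len(lines) == self.ways:
            lines.popitem(last=False)           # evict least recently used
        lines[line] = True


def count_misses(size_kb, ways, addrs, passes):
    """Replay the address list `passes` times and return total misses."""
    cache = SetAssocCache(size_kb * 1024, ways)
    for _ in range(passes):
        for addr in addrs:
            cache.access(addr)
    return cache.misses


KB = 1024
# Pattern A: four hot lines sitting exactly 128 KB apart (same set index).
conflict = [i * 128 * KB for i in range(4)]
# Pattern B: a linear walk over a 192 KB working set, one touch per line.
streaming = list(range(0, 192 * KB, 32))

for name, addrs in (("conflict", conflict), ("streaming", streaming)):
    two_way = count_misses(256, 2, addrs, passes=4)
    eight_way = count_misses(128, 8, addrs, passes=4)
    print(name, "256KB/2-way misses:", two_way, "128KB/8-way misses:", eight_way)
```

In this toy run the 2-way cache misses on every access of pattern A while the 8-way one only misses on the first pass, and pattern B flips the result because the working set fits in 256KB but not in 128KB. Real workloads land somewhere in between.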
 
But that's the G4. Gekko doesn't have AltiVec. Instead it just has a fairly limited set of "paired singles" operations which can do 2x32-bit FMAs per cycle. So Gekko certainly doesn't have the level of FP performance that G4 has, let alone more like you claim. And it has no integer SIMD. All things considered, a good programmer can do comparably well with SSE (per cycle) and can do a lot more with MMX if the algorithms can use packed 8-bit or 16-bit data types within the confines of its instruction set.

I see. Reading a bit more and comparing all the numbers, it seems as if the chips are actually pretty comparable floating point wise per clock. Gekko seems to have a 1.9 GFLOPS rating, while I've found this http://www.alternatewars.com/BBOW/Computing/Computing_Power.htm saying a 500MHz Pentium III would be around 1 GFLOPS, meaning the Xcpu would be about 1.5. Don't know how accurate this is.

There are a lot of other little nits to consider when comparing the two CPUs. P3 is wider and has more reordering capabilities than Gekko. Xbox's version may only have 128KB of L2 cache, but it's 8-way associative vs 2-way on Gekko. Conventional wisdom is that going from 2-way to 4-way associativity yields a similar hit rate improvement as doubling capacity; this will vary a lot from case to case, but there's a good chance that Xbox's L2 cache was more effective despite its size. On the other hand, GC's 1T-SRAM was probably significantly lower latency than Xbox's DDR SDRAM, although both had to go through the GPU chip so I don't know how much of a difference it made for CPU stuff. Gekko has a larger and more associative L1 cache.

*P3 probably has better branch prediction. And so on.

By this are you saying it's more of an out of order chip than the gekko?

Now that you told me that I read a little bit about cache, and it seems having more l1 (and more associative) would be preferable to a more associative l2 cache but less l1. Because L2 takes a lot more cycles to find the data, and the more associative the cache is the more time is spent looking in it. Coupled with the low latency memory in the gamecube with the faster bus it wouldn't be as necessary to have a more associative l2 as it is on Xbox.

*I've also read that the power pc 750 has a bigger branch misprediction penalty. Which i've read the shorter pipeline in the gekko would be preferable because of that, in case of a mispredict it takes less time to start over. Since the Pentium has better branch prediction a wider pipeline would be preferable.
 
I see. Reading a bit more and comparing all the numbers, it seems as if the chips are actually pretty comparable floating point wise per clock. Gekko seems to have a 1.9 GFLOPS rating, while I've found this http://www.alternatewars.com/BBOW/Computing/Computing_Power.htm saying a 500MHz Pentium III would be around 1 GFLOPS, meaning the Xcpu would be about 1.5. Don't know how accurate this is.

You've got to be careful comparing numbers from different sources because they can mean different things. For instance, one could be a peak number from a simple micro-benchmark while the other could be from a more real world benchmark that has components that are dragging the number down.

On Gekko the ps_madd family of instructions execute in one cycle and perform two FP32 FMA operations so the overall throughput is four FLOPs per cycle.

On Pentium III addps and mulps perform four FP32 operations in two cycles. They can execute in parallel on ports p0 and p1, so the overall throughput is also four FLOPs per cycle.
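For anyone who wants to check the per-cycle arithmetic, here's a small Python sketch of how those two figures come out and what they give at each console's clock (roughly 485MHz for Gekko, 733MHz for the Xbox CPU). The helper name and its parameters are mine, and these are peak numbers only; they say nothing about sustained throughput.

```python
def flops_per_cycle(lanes, flops_per_lane, units, cycles_per_issue):
    """Peak FLOPs per clock for `units` parallel units, each issuing one
    `lanes`-wide instruction every `cycles_per_issue` cycles."""
    return lanes * flops_per_lane * units / cycles_per_issue


# Gekko: one ps_madd per cycle, 2 lanes, each lane doing a multiply and an add.
gekko = flops_per_cycle(lanes=2, flops_per_lane=2, units=1, cycles_per_issue=1)

# Pentium III: addps and mulps are 4 lanes wide with one op per lane, each
# issues once per two cycles, and the two can run in parallel on two ports.
p3 = flops_per_cycle(lanes=4, flops_per_lane=1, units=2, cycles_per_issue=2)

print("per cycle:", gekko, p3)
print("peak MFLOP/s:", 485 * gekko, 733 * p3)
```

Both land on four FLOPs per cycle, so the gap in peak MFLOP/s comes entirely from clock speed.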

By this are you saying it's more of an out of order chip than the gekko?

Yes. Gekko has very limited OoOE. It can decode two instructions per cycle (plus a branch instruction, sometimes) to a six instruction completion buffer, and instructions get dispatched to one of five execution units: two integer, one load/store, one FPU and one for system/branch registers. The load/store and FPU have two reservation stations and the other units have one. So the reordering window is six instructions, and one or two instructions can be scheduled ahead of others of the same type. And the two reservation stations for the load/store unit are more like a FIFO since the load/store instructions all start executing in-order.

The P6 family (PPro, P2, P3) can decode three instructions per cycle to a 40 entry reorder buffer which feeds into a 20 entry unified scheduler connected to five ports (that each have varying types of execution units). Newer loads can execute ahead of older loads. And it can execute a load and store in parallel.

So on paper P3 is both wider and has a lot more reordering capability. But there are many caveats that can limit this depending on the code, for instance several fetch and decode bottlenecks and depending on the register forwarding network to get enough register access bandwidth. So a lot of code has to be carefully optimized to get the most out of the throughput of the system despite the reordering.

Now that you told me that I read a little bit about cache, and it seems having more l1 (and more associative) would be preferable to a more associative l2 cache but less l1. Because L2 takes a lot more cycles to find the data, and the more associative the cache is the more time is spent looking in it. Coupled with the low latency memory in the gamecube with the faster bus it wouldn't be as necessary to have a more associative l2 as it is on Xbox.

It really depends on a lot of other factors and the code.

Since the clock speeds were a lot lower back then the L2 cache latency, in cycles, was also a lot lower. On Gekko it's 5 cycles and on P3 it's 7 cycles. So Gekko has an L2 latency advantage too (probably why it's only two-way set associative), but it has less ability to reorder around L1 misses.

The larger and more associative L1 cache certainly helps, although going from 4 way to 8 way helps a lot less than going from 2 way to 4 way.

I don't have actual numbers on the main RAM latency. While the 1T-SRAM is rated at 10ns there's no way you get anywhere close to that after taking into consideration end to end factors going through the external

*I've also read that the power pc 750 has a bigger branch misprediction penalty. Which i've read the shorter pipeline in the gekko would be preferable because of that, in case of a mispredict it takes less time to start over. Since the Pentium has better branch prediction a wider pipeline would be preferable.

PPC750 has a smaller branch misprediction penalty than P3 because it has a shorter pipeline, and it has better latency on correctly predicted branches because it uses a BTIC, which helps fetch faster. But it uses a simple one-level predictor with a 512 entry 2-bit history table, while P3 has a much larger and much more associative BTB (512 entries vs the 64 entry BTIC) and uses local 4-bit history for a second level predictor (with a much larger 4096 entry GHT). So P3 should mispredict less.
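If the "2-bit history table" part is unclear, here's a toy Python model of a one-level predictor built from 2-bit saturating counters. It's a generic textbook sketch, not PPC750's actual logic: the indexing by address, the starting counter value and the two branch patterns are all assumptions picked for illustration.

```python
class TwoBitPredictor:
    """One-level predictor: a table of 2-bit saturating counters."""

    def __init__(self, entries=512):
        self.entries = entries
        self.table = [1] * entries          # assumption: start weakly not-taken

    def _index(self, pc):
        return (pc >> 2) % self.entries     # assumption: 4-byte instructions

    def predict(self, pc):
        return self.table[self._index(pc)] >= 2   # 2 or 3 means "taken"

    def update(self, pc, taken):
        i = self._index(pc)
        step = 1 if taken else -1
        self.table[i] = max(0, min(3, self.table[i] + step))


def mispredictions(outcomes, pc=0x1000):
    """Run one branch (at a hypothetical address) through a fresh predictor."""
    predictor = TwoBitPredictor()
    missed = 0
    for taken in outcomes:
        if predictor.predict(pc) != taken:
            missed += 1
        predictor.update(pc, taken)
    return missed


loop_branch = ([True] * 9 + [False]) * 100   # back-edge of a 10-iteration loop
alternating = [True, False] * 500            # taken, not taken, taken, ...

print("loop:", mispredictions(loop_branch), "of", len(loop_branch))
print("alternating:", mispredictions(alternating), "of", len(alternating))
```

The counters handle the loop branch well (about one miss per loop exit) but fall apart on the strictly alternating branch, which is the kind of repeating pattern a predictor that tracks per-branch history is meant to pick up.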

Really they're two very different CPUs which both have different strengths and weaknesses.
 
You've got to be careful comparing numbers from different sources because they can mean different things. For instance, one could be a peak number from a simple micro-benchmark while the other could be from a more real world benchmark that has components that are dragging the number down.

On Gekko the ps_madd family of instructions execute in one cycle and perform two FP32 FMA operations so the overall throughput is four FLOPs per cycle.

On Pentium III addps and mulps perform four FP32 operations in two cycles. They can execute in parallel on ports p0 and p1, so the overall throughput is also four FLOPs per cycle.

Is that for Gekko specifically or the PowerPC 750cxe? Because it's stated the gekko has some extra simd functions.

I've also seen the number 3 gflops thrown around multiple times for the Xcpu, and this post I found here seems pretty sound - https://community.futuremark.com/forum/showthread.php?27475-Original-Xbox, all the other info there seems to be correct.

The post rates it at 2.9 but says the sse has drawbacks which limit real world performance. If we assume the number is true that would put the Xcpu and gekko pretty much even per clock and then it would just be down to comparing efficiency of the architectures which I wouldn't know where to start. But I don't know if the gekko's 1.9 gflops rating is real world or best case scenario.

Also are you a programmer?
 
Is that for Gekko specifically or the PowerPC 750cxe? Because it's stated the gekko has some extra simd functions.

It's for Gekko. PowerPC 750CXe can perform one FMA per cycle, or two FLOPs. Therefore it has half the FLOP throughput of Gekko. The paired single instruction set IS the extra SIMD functionality you refer to.

I've also seen the number 3 gflops thrown around multiple times for the Xcpu, and this post I found here seems pretty sound - https://community.futuremark.com/forum/showthread.php?27475-Original-Xbox, all the other info there seems to be correct.

That's what I said. Four per cycle. 733MHz clock rate. 733 * 4 = 2932 MFLOP/s.

The post rates it at 2.9 but says the sse has drawbacks which limit real world performance. If we assume the number is true that would put the Xcpu and gekko pretty much even per clock and then it would just be down to comparing efficiency of the architectures which I wouldn't know where to start. But I don't know if the gekko's 1.9 gflops rating is real world or best case scenario.

485MHz * 4 = 1940 MFLOP/s. It's the best case scenario.

Look, this isn't some great mystery. Gekko is meticulously documented in its user manual.

http://datasheets.chipdb.org/IBM/PowerPC/Gekko/gekko_user_manual.pdf

From section 1.2.2.4.2:

"The multiply-add array allows Gekko to efficiently implement multiply and multiply-add operations. The FPU is pipelined such that one single-, paired single- or double-precision instruction can be issued per clock cycle."

This is reinforced throughout various other places in the document.

Pentium III's FP capability is noted in Intel's optimization guide.

http://download.intel.com/design/PentiumII/manuals/24512701.pdf

Table D-1 states addps (which is 4xFP32 FADD) can issue once per two cycles on port 1, and mulps (which is 4xFP32 FMUL) can issue once per two cycles on port 0. Like I said already.

It's also spelled out in this article:

"Since the Pentium III is a superscalar implementation (that is, it has multiple execution ports), it can perform four floating-point operations every clock cycle."

And probably many other sources. Agner Fog's microarchitecture descriptions and cycle tables are another good source.

Are there limitations on how close you can realistically get to this peak throughput? Sure, and they're pretty different for P3 and Gekko. They're also very dependent on what you're doing and how you're doing it, and therefore aren't very easy to actually measure.

Also are you a programmer?

Yes, and I've written a lot of assembly with SIMD, mainly for ARM (NEON) and x86 (MMX, SSE). But I would encourage you to better learn to understand official documentation and how claims are derived, rather than relying on the alleged credentials of people making claims.
 
I've just realized that the PS2's CPU has the same power as all the CPUs in Dreamcast, GameCube and Xbox put together.
Dreamcast - 1.4 GFLOPS, GameCube - 1.9 GFLOPS, Xbox - 2.9 GFLOPS. 1.4 + 1.9 + 2.9 = 6.2, same as EE. :D
 
I owned all 3 systems during that generation. After playing them all daily for around 4 years straight, I would place them in the following order graphics performance wise.

1. Xbox
2. GCN
3. PS2

I am basing this only on my personal play experience on an old CRT 720p HD TV. They all have some wonderful exclusives with amazing graphics, but image quality was always a major issue for me on PS2 games. Still, games like GoW 1/2 and Twisted Metal: Black were beautiful to me.
 