Gekko vs. EE vs. Pentium3-733

Glonk · Mar 5, 2003

Tagrineth said:
Basically, Gekko > XCPU, hands down. The only things XCPU has going for it are ace compilers and SSE instructions... other than that Gekko has the upper hand.

I'm wondering why you think this at all?

From what I've heard from theoretical MIPS and FLOPS benchmarks, a 733 Celeron is faster.
From what I've heard from developers in real world performance, the XCPU is generally faster but certain ops are faster on Gekko...

randycat99 · Mar 5, 2003

He did say other than "ace compilers and SSE instructions".

Crazyace · Mar 5, 2003

EE misconceptions...

To compare only the CPUs of the EE, Gekko and Pentium, just disregard VU1, which performs the TCL (transform, clipping, lighting) in almost exactly the same manner as the Xbox or Flipper. You are then left with the EE core + VU0.

In simple integer ops all 3 run 2 ALUs, so clock speed is the comparison. FPU-wise, the Gekko and Pentium support double precision and extended double precision in scalar mode - a big plus if you use doubles (which wouldn't be recommended for EE).
In SIMD the VU0 peaks at 8 flops per cycle (fused FMAC), the Gekko at 4 flops per cycle (paired FMAC) and the Pentium at 8 flops per 2 cycles (alternating 4 FMUL then 4 FADD) - that's where the EE grabs performance back.
The big difference is the caches though - the bigger (and L2) caches make a massive difference in general code, and as someone mentioned before, having better compiler technology improves the practical performance.
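
A rough sketch of the peak SIMD arithmetic above, in Python. The flops-per-cycle figures are the ones quoted in this post; the clock speeds (about 300 MHz for the EE, 485 MHz for Gekko, 733 MHz for the Pentium) are assumptions added for illustration, and peak numbers say nothing about sustained throughput or cache behaviour.

```python
# Peak SIMD throughput = flops per cycle * clock rate.
# Flops-per-cycle values are taken from the post above.
PEAK_FLOPS_PER_CYCLE = {
    "EE VU0": 8.0,       # fused FMAC
    "Gekko": 4.0,        # paired FMAC
    "Pentium III": 4.0,  # 8 flops every 2 cycles
}

# ASSUMPTION: clock speeds are illustrative and not stated in the post.
ASSUMED_CLOCK_MHZ = {
    "EE VU0": 300.0,
    "Gekko": 485.0,
    "Pentium III": 733.0,
}


def peak_gflops(flops_per_cycle, clock_mhz):
    """Return theoretical peak GFLOPS for a given per-cycle rate and clock."""
    return flops_per_cycle * clock_mhz / 1000.0


peaks = {
    name: peak_gflops(PEAK_FLOPS_PER_CYCLE[name], ASSUMED_CLOCK_MHZ[name])
    for name in PEAK_FLOPS_PER_CYCLE
}

for name, gflops in peaks.items():
    print(f"{name}: {gflops:.2f} GFLOPS peak")
```

Under these assumed clocks, the higher per-cycle rate of VU0 is what offsets its lower clock speed.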

Gubbi · Mar 5, 2003

randycat99 said:
Gubbi said:

I think that a 3GHz P4 will beat the EE in T&L quite handily. With a peak performance of 12GFLOPS (as compared to 6.4) and the memory subsystem to boot.

Note: that's 10x the clock speed just to gain 2x the GFLOPS. I'd say the EE hangs in there quite admirably. Now compare die sizes. Can you imagine how easy it would be to scale a prototype EE 2x (maybe not even the entire core, but just throw in 2 more vector units and an a$$load of local cache) and still not match the die size of a P4? You might get equal GFLOPS at only 1/10 the P4 clock rate on this EE.

Didn't you read the previous paragraph in my post? P4 is a general purpose CPU, the EE is a special purpose CPU.

If I wanted to make the EE look bad, I could make a similar ignorant comparison:

I couldn't find any Spec2K results for any R5xxx (the cpu core in the EE, VU0/1 not included).

However, in Spec95 a 150MHz SGI Indy R5000 with no lvl 2 cache (i.e. like the EE) scores 3.07/4.20 in Spec95 int/fp (both peak). So let's double that (150MHz->300MHz) and we get 6.14/8.40.

No P4 figures, but a 1GHz Athlon (the old one with external lvl 2 cache) gets 42.9/29.4 in int/fp. A 3GHz P4 will be somewhere between 2 to 3 times the performance of this. You then get 85.8-128.7 / 58.8-88.2 for int/fp.
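
The scaling in the last two paragraphs can be sketched in Python as follows. It assumes, as the post does, that SPEC95 scores scale linearly with clock speed and that a 3GHz P4 lands at 2x to 3x the 1GHz Athlon; neither is a measured result, and the function and variable names are only illustrative.

```python
def scale_score(score, factor):
    """Scale a benchmark score by a factor (assumes linear scaling)."""
    return score * factor


# SPEC95 int/fp (peak) figures quoted in the post.
r5000_150mhz = {"int": 3.07, "fp": 4.20}
athlon_1ghz = {"int": 42.9, "fp": 29.4}

# 150MHz -> 300MHz, assuming performance doubles with clock speed.
ee_core_estimate = {k: scale_score(v, 300 / 150) for k, v in r5000_150mhz.items()}

# 3GHz P4 guessed at 2x to 3x the 1GHz Athlon.
p4_low = {k: scale_score(v, 2) for k, v in athlon_1ghz.items()}
p4_high = {k: scale_score(v, 3) for k, v in athlon_1ghz.items()}

for k in ("int", "fp"):
    print(f"{k}: EE core ~{ee_core_estimate[k]:.2f}, "
          f"P4 ~{p4_low[k]:.1f} to {p4_high[k]:.1f}")
```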

Cheers
Gubbi

Gubbi · Mar 5, 2003

Tagrineth said:
How is PowerPC more efficient?

Research.

x86 is one of the most poorly-designed ISAs ever; pretty much every x86 CPU so far has tried desperately to 'patch over' all the inherent pitfalls. Even the ISA's creator, Intel, is working to get rid of it, slowly but surely.

Actually you can argue that the x86 ISA wasn't designed at all, which is only slightly better than someone putting an effort into making it look like it does.

However, modern x86 implementations (PPRO and onwards) are quite similar to their RISC counterparts; all are built around a dynamically scheduled microengine (restricted dataflow). This reduces the disadvantages of the x86 to:

1.) Decoding instructions for the execution backend.
2.) Low number of registers.
3.) 2-address instruction format.

x86 implementations since the PPRO have been quite aggressive and use different techniques to overcome these restrictions. The P4 decodes instructions into a trace-cache, completely removing decoding stages from the branch misprediction flow (while the complete pipeline of the P4 is rather long, the fetch->schedule->execute->writeback pipeline is not).

Register pressure is a problem. But again, aggressive implementations using either low latency lvl 1 cache (P4) or dual ported lvl 1 cache (Athlon), store->load forwarding and extensive register renaming reduce the penalties that would otherwise be incurred.

Cheers
Gubbi

Tagrineth · Mar 5, 2003

randycat99 said:
He did say other than "ace compilers and SSE instructions".

She ^_~

zurich said:
Considering that Flipper's T&L is fixed function, I'd think that quite a lot of transform and lighting ops are done on Gekko. I remember reading an interview with the IBM guys that make the chip, and they commented that every time they played Luigi's Mansion they could spot the lighting effects that were being done on the host CPU.

Like I said, except in specific cases.

randycat99 · Mar 5, 2003

Gubbi said:
Didn't you read the previous paragraph in my post? P4 is a general purpose CPU, the EE is a special purpose CPU.

No reason for you to take my remarks as a challenge (certainly no reason for you to start throwing terms like "ignorant" around). You said the P4 soundly beats the EE in GFLOPs and memory subsystem. I only gave topical reference to what it had to do to pull off the feat, so to speak.

randycat99 · Mar 5, 2003

Tagrineth said:
She ^_~

My apologies for the confusion.

overclocked · Mar 5, 2003

You should also consider transistor count..
EE has 10 million as far as I can remember; don't recall what Gekko and Celeron have.

Gubbi · Mar 6, 2003

randycat99 said:
No reason for you to take my remarks as a challenge (certainly no reason for you to start throwing terms like "ignorant" around).

Fair enough, that came out wrong, my bad.

Didn't mean that you are ignorant, just that comparing two such complex chips on peak MFLOPS while ignoring all other parameters is.

Cheers
Gubbi

Jabjabs · Mar 7, 2003

For overclocked, the transistor numbers are 10 million for the EE, 15 million for the Gekko (I'm not sure about this figure) and 27 million for the XCPU.

Tagrineth, you're kind of jumping the gun a bit there. What I was saying is that the EE does not do the final rendering. I didn't mention the T&L, which I know the EE does; I was talking about applying and rendering all the textures to the polygons, not T&L. Just a little misunderstanding there.

randycat99 · Mar 7, 2003

I was to understand the XCPU (PIII/Celeron with 128 kB cache) was more around 19 M transistors. 27 sounds like a big jump, no? A P4 with 512 kB cache weighs in around 55 M transistors! (That's just the number I have on file. Feel free to correct, if necessary.) Amazingly, the L2 cache constitutes nearly as many transistors as the actual CPU in the aforementioned Celeron and P4 designs. So if you cut the value in half, that is pretty close to what the actual CPU is.

Gubbi · Mar 7, 2003

19 million sounds right.

Coppermine P-III, which has 256KB lvl2 cache, has 27 million.

Back of the envelope calculations would suggest that 128KB would take up: 128KB * 8b/B * 6t/b = 6.3M transistors, add a little extra for control logic etc.
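
The same back-of-the-envelope estimate as a small Python sketch. It counts only the data array at 6 transistors per bit, as in the calculation above, and ignores tags and control logic; the function name is just illustrative.

```python
def sram_data_transistors(size_kb, transistors_per_bit=6):
    """Estimate transistors in a cache data array (tags and control excluded)."""
    bits = size_kb * 1024 * 8
    return bits * transistors_per_bit


l2_128kb = sram_data_transistors(128)
l2_256kb = sram_data_transistors(256)

print(f"128KB: {l2_128kb / 1e6:.1f}M transistors")
print(f"256KB: {l2_256kb / 1e6:.1f}M transistors")
```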

Cheers
Gubbi

jvd · Mar 7, 2003

Just wondering, what does the Athlon XP get in flops? Also, it's been a long time since I read about it, but I remember the SH-4 modded for the Dreamcast was quite a beast for its time. How does that chip compare with the Gekko and Celeron?

MfA · Mar 7, 2003

Just about all modern x86 processors do 4 FLOPS per cycle peak with SIMD.

Fafalada · Mar 7, 2003

Athlons are 4 flops SIMD too... but their raw FPU performance is already fast enough that it makes 3DNow! redundant in most cases.
SH4 was 7 or 8 flops/cycle (I forget if there was a MADD instruction; dot product is 7).
Though I would add that the actual efficiency of these instruction sets also varies greatly, much more than peak numbers alone would suggest.

jvd · Mar 7, 2003

At 200 MHz the SH-4 does 1.4 GFLOPS. Not bad for a chip designed in '97. It does 0.9 GFLOPS sustained. Wonder what Sony's future wonder chip can do sustained?
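
For what it's worth, the 1.4 figure lines up with the 7 flops/cycle dot product mentioned earlier in the thread. A quick Python sketch of that arithmetic, taking the 7 flops/cycle and the 0.9 sustained figure at face value:

```python
SH4_CLOCK_MHZ = 200.0
SH4_FLOPS_PER_CYCLE = 7.0   # dot-product figure quoted earlier in the thread
SH4_SUSTAINED_GFLOPS = 0.9  # sustained figure quoted in this post

# Peak GFLOPS = flops per cycle * clock (MHz) / 1000.
sh4_peak_gflops = SH4_FLOPS_PER_CYCLE * SH4_CLOCK_MHZ / 1000.0

# Fraction of peak that the sustained figure represents.
sh4_efficiency = SH4_SUSTAINED_GFLOPS / sh4_peak_gflops

print(f"peak: {sh4_peak_gflops:.1f} GFLOPS, sustained/peak: {sh4_efficiency:.0%}")
```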

randycat99 · Mar 7, 2003

jvd said:
At 200 MHz the SH-4 does 1.4 GFLOPS. Not bad for a chip designed in '97. It does 0.9 GFLOPS sustained. Wonder what Sony's future wonder chip can do sustained?

With the amounts of local cache/memory being talked about in the supposed PS3 design, it will be a far brighter situation than any FSB-constrained computer example that has preceded it (which includes just about any PC and even the heralded Dreamcast).

jvd · Mar 7, 2003

randycat99 said:
jvd said:

At 200 MHz the SH-4 does 1.4 GFLOPS. Not bad for a chip designed in '97. It does 0.9 GFLOPS sustained. Wonder what Sony's future wonder chip can do sustained?

With the amounts of local cache/memory being talked about in the supposed PS3 design, it will be a far brighter situation than any FSB-constrained computer example that has preceded it (which includes just about any PC and even the heralded Dreamcast).

Maybe, maybe not. We will have to see. Nothing is perfect. Btw, every chip/motherboard combo I have ever seen has constraints, same with every console. It's what happens. Remember, I'm not trying to take away from the Cell chip. I call it the wonder chip because I am really impressed with it, just like I call the Athlon 64 the comeback kid... But anyway, I was just trying to compare everything.

Oh, btw, in '97 we had a 1 GFLOP chip at 200 MHz. In 2005/6 we will have a 3 GHz chip doing a TFLOP. That's not bad at all.

MfA · Mar 7, 2003

Actually, a bit of a miracle is what it is: it means that the chip has to achieve about 5 times the number of FLOPS per gate per cycle (assuming average gate dimensions scale linearly with feature size and the chip is the same size). Oh, and they have to pull off a bigger miracle not to increase the needed power by an order of magnitude...
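
One way to arrive at a number like that, sketched in Python. The 1 GFLOP at 200 MHz and 1 TFLOP at 3 GHz figures are the ones from the post above; the feature sizes (250 nm for the '97 chip, 65 nm for the future one) are assumptions picked for illustration and are not stated anywhere in the thread.

```python
# Figures from the post above.
OLD_GFLOPS, OLD_CLOCK_MHZ = 1.0, 200.0
NEW_GFLOPS, NEW_CLOCK_MHZ = 1000.0, 3000.0

# ASSUMPTION: process feature sizes in nm, chosen only for illustration.
OLD_FEATURE_NM, NEW_FEATURE_NM = 250.0, 65.0

# How much more work each clock cycle has to deliver.
flops_per_cycle_ratio = (NEW_GFLOPS / NEW_CLOCK_MHZ) / (OLD_GFLOPS / OLD_CLOCK_MHZ)

# Same die area with linearly shrinking gates: gate count grows with
# the square of the feature-size ratio.
gate_count_ratio = (OLD_FEATURE_NM / NEW_FEATURE_NM) ** 2

# Required improvement in flops per gate per cycle.
flops_per_gate_ratio = flops_per_cycle_ratio / gate_count_ratio

print(f"per-cycle work: {flops_per_cycle_ratio:.1f}x, "
      f"gates: {gate_count_ratio:.1f}x, "
      f"per-gate: {flops_per_gate_ratio:.1f}x")
```

With different assumed process nodes the ratio moves quite a bit, so treat it as an order-of-magnitude check only.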
