PS3 vs X360: Apples to Apples high level comparison...

j^aws

Veteran
ps3_r02.jpg


With recent misinformation flying around, I thought I'd *try* to 'normalise' the available metrics for both systems to give an apples to apples, high level architectural comparison so you can draw your own conclusions.

I'm only going to provide 'normalised' total system metrics compared to the above image as this is all we can compare across both systems at the moment until more details are released.

1) Shader ops

Shader ops in isolation are not very meaningful, but I'll try to compare to the above

Earlier discussion on a shader op,

http://www.beyond3d.com/forum/viewtopic.php?t=23169


-PS3

claimed PS3 ~ 100 billion shader ops per second

Cell ~ 8 shader ops per cycle (7 SPU + VMX)

8*3.2GHz ~ 25.6 billion shader ops per second

RSX ~ 136 shader ops per cycle

136*0.55GHz ~ 74.8 billion shader ops per second

total= 74.8+25.6 ~ 100 billion shader ops per second

PS3 ~ 100 billion shader ops per second


-X360

xGPU ~ 96 Shader ops per cycle

96*0.5 GHz ~ 48 billion shader ops per second

xCPU

6*3.2~ 19.2 billion shader ops per second (3 VMX + 3 FPU)

total= 48+19.2~ 67.2 billion shader ops per second

X360 = 67.2 billion shader ops per second
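
The shader op arithmetic above can be re-run as a quick sanity check. This is only a sketch: the per-cycle counts and clocks are the figures quoted in this post, and the function and variable names are made up for illustration.

```python
# Peak shader ops per second = ops per cycle * clock in GHz (result in billions/s).
# All inputs are the figures quoted in the post; names are illustrative only.

def billions_per_second(ops_per_cycle, clock_ghz):
    return ops_per_cycle * clock_ghz

# PS3: Cell (7 SPU + VMX = 8 per cycle at 3.2 GHz) + RSX (136 per cycle at 0.55 GHz)
ps3_shader = billions_per_second(8, 3.2) + billions_per_second(136, 0.55)

# X360: xCPU (3 VMX + 3 FPU = 6 per cycle at 3.2 GHz) + xGPU (96 per cycle at 0.5 GHz)
x360_shader = billions_per_second(6, 3.2) + billions_per_second(96, 0.5)

print(f"PS3  ~ {ps3_shader:.1f} billion shader ops/s")
print(f"X360 ~ {x360_shader:.1f} billion shader ops/s")
print(f"PS3/X360 ratio ~ {ps3_shader / x360_shader:.2f}")
```

These are peak figures only, so the ratio says nothing about sustained throughput on real code.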



2) Dot products


-PS3

claimed PS3 ~ 51 billion dot products per second

Cell ~ 8 per cycle (7 SPU + VMX)

8*3.2GHz~ 25.6 billion dot products per second

RSX ~ 51-25.6 ~ 25.4* billion dot products per second

* deduced from claim

PS3 ~ 51 billion dot products per second



-X360

claimed xCPU ~ 9 billion dot products per second

xCPU~ 3 dot products per cycle (3 VMX)

3*3.2 GHz ~ 9.6 billion dot products per second

xGPU ~ 48 dot products per cycle (48-way vec4)

48*0.5 GHz ~ 24 billion dot products per second

total ~ 9.6 + 24 ~ 33.6 billion dot products per second

X360 ~ 33.6 billion dot products per second
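
Same idea for the dot product figures. A rough sketch using only the numbers quoted above; the RSX value is deduced from the 51 billion claim rather than from a known per-cycle spec, and the "implied per cycle" number at the end is just that deduction divided by the 0.55 GHz clock, not a published figure.

```python
# Dot products per second in billions, from the figures quoted in the post.
# Variable names are illustrative only.

ps3_claimed = 51.0             # claimed PS3 total
cell_dp = 8 * 3.2              # 7 SPU + VMX, one dot product each per cycle
rsx_dp = ps3_claimed - cell_dp # deduced from the claim, not from an RSX spec

xcpu_dp = 3 * 3.2              # 3 VMX units
xgpu_dp = 48 * 0.5             # 48-way vec4 at 0.5 GHz
x360_dp = xcpu_dp + xgpu_dp

# What the deduced RSX figure would imply per cycle at 0.55 GHz (an inference).
rsx_implied_per_cycle = rsx_dp / 0.55

print(f"Cell ~ {cell_dp:.1f}, RSX (deduced) ~ {rsx_dp:.1f}")
print(f"X360 ~ {x360_dp:.1f}")
print(f"RSX implied dot products per cycle ~ {rsx_implied_per_cycle:.1f}")
```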



3) TFLOPS

Some theory to the madness,

http://www.beyond3d.com/forum/viewtopic.php?p=523362#523362


PS3 ~ 2 TFLOPS

X360 ~ 1 TFLOPS

I cannot derive these figures, but both companies have used peak total system flops, which cannot be compared with single/double precision programmable flops. On their own they do not mean much, but they are apples to apples between X360 and PS3, IMHO.


4) Memory

FYI, earlier bandwidth discussion,

http://www.beyond3d.com/forum/viewtopic.php?t=23011

I'm going to normalise bandwidths and memory so that they are more comparable. What I mean by this is that 25 GB/s access to 256 MB is equivalent to 50 GB/s access to 128 MB or equivalent to 100 GB/s access to 64 MB etc etc...and assuming the same latencies apply...
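
To make the assumption concrete, here is a minimal sketch of the normalisation rule: bandwidth times capacity is held constant, so raising the bandwidth shrinks the 'equivalent' memory in proportion. The function name is hypothetical, and the rule itself is only the assumption stated above, not an established way to compare memory systems.

```python
def normalise(bandwidth_gbs, memory_mb, target_bandwidth_gbs):
    """Equivalent memory size at target_bandwidth_gbs, keeping
    bandwidth * capacity constant (the assumption made in this post)."""
    return bandwidth_gbs * memory_mb / target_bandwidth_gbs

# The examples from the paragraph above: 25 GB/s access to 256 MB
print(normalise(25, 256, 50))   # 128.0
print(normalise(25, 256, 100))  # 64.0
```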

Currently AFAIK,

* The 256 GB/s is not a physical inter-connect bandwidth, it's the intra-EDRAM bandwidth *within* the EDRAM module. The inter-connect bandwidths between xGPU and the EDRAM module are 32 GB/s write and 16 GB/s read. These are the numbers from the 'leak', and the 256 GB/s is the 'effective' bandwidth. Since both systems will use compression/bandwidth saving techniques, I'm using physical inter-connect bandwidth for a better apples to apples comparison.


Starting point,


[X360: CPU<==21.6 GB/s==>GPU]----48 GB/s* ----[10 MB]
|
|
22.4 GB/s
|
|
[512 MB]



[PS3: CPU<==35 GB/s==>GPU]----22.4 GB/s ----[256 MB]
|
|
25.6 GB/s
|
|
[256 MB]


>>>>>memory b/w and memory amounts normalise for PS3 to match X360<<<<<<<<


[X360: CPU<==21.6 GB/s==>GPU]----48 GB/s* ----[10 MB]
|
|
22.4 GB/s
|
|
[512 MB]



[PS3: CPU<==35 GB/s==>GPU]----48 GB/s----[119.5 MB]
|
|
22.4 GB/s
|
|
[293 MB]


>>>>>FSB, CPU-GPU normalise for X360 to match PS3<<<<<<<<


[X360: CPU<==35 GB/s==>GPU]----48 GB/s* ----[10 MB]
|
|
22.4 GB/s
|
|
[316 MB]



[PS3: CPU<==35 GB/s==>GPU]----48 GB/s ----[119.5 MB]
|
|
22.4 GB/s
|
|
[293 MB]


It's now easier to compare physical bandwidths and memories across both PS3 and X360 to give a better sense of data flows and data access. If the 256 GB/s* effective bandwidth of the EDRAM replaces the 48 GB/s* physical bandwidth, then it's easier to map and compare both architectures' data flows IMHO.


[X360: CPU<==35 GB/s==>GPU]----256 GB/s* ----[10 MB]
|
|
22.4 GB/s
|
|
[316 MB]


>X360 normalised total system + VRAM = 326 MB


[PS3: CPU<==35 GB/s==>GPU]----48 GB/s ----[119.5 MB]
|
|
22.4 GB/s
|
|
[293 MB]


>PS3 normalised total system + VRAM =412.5 MB
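
For anyone who wants to check the numbers in the diagrams, here is a sketch that reproduces them under the same bandwidth * capacity assumption. The function name is hypothetical. Reading the 316 MB figure as 512 MB scaled by the FSB ratio 21.6/35 is an inference from the diagrams, not something spelled out above. The totals quoted in the post use the rounded per-pool values.

```python
def normalise(bandwidth_gbs, memory_mb, target_bandwidth_gbs):
    # Keep bandwidth * capacity constant (the assumption used in this post).
    return bandwidth_gbs * memory_mb / target_bandwidth_gbs

# PS3: match the X360 bandwidths (48 GB/s to VRAM, 22.4 GB/s to main memory).
ps3_vram = normalise(22.4, 256, 48.0)   # post rounds to 119.5 MB
ps3_main = normalise(25.6, 256, 22.4)   # post rounds to 293 MB

# X360: main memory scaled by the CPU-GPU link ratio (inferred from the diagrams).
x360_main = normalise(21.6, 512, 35.0)  # post rounds to 316 MB
x360_edram = 10                          # left unchanged in the diagrams

print(f"PS3  VRAM ~ {ps3_vram:.1f} MB, main ~ {ps3_main:.1f} MB")
print(f"X360 main ~ {x360_main:.1f} MB, EDRAM = {x360_edram} MB")
print(f"PS3 total  ~ {round(ps3_vram, 1) + round(ps3_main):.1f} MB")
print(f"X360 total ~ {round(x360_main) + x360_edram} MB")
```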


5) Summary


ps3_r02.jpg


So the normalised, apples to apples figures against the above total system spec for PS3 are,

PS3 vs X360

PS3 ~ 100 billion shader ops per second
X360 = 67.2 billion shader ops per second

PS3 ~ 51 billion dot products per second
X360 ~ 33.6 billion dot products per second

PS3 ~ 2 TFLOPS
X360 ~ 1 TFLOPS

PS3 normalised total system + VRAM =412.5 MB
X360 normalised total system + VRAM = 326 MB

Normalised,

Code:
[PS3: CPU<==35 GB/s==>GPU]----48 GB/s ----[119.5 MB] 
| 
| 
22.4 GB/s 
| 
| 
[293 MB] 



[X360: CPU<==35 GB/s==>GPU]----256 GB/s* ----[10 MB] 
| 
| 
22.4 GB/s 
| 
| 
[316 MB]

This is as close an apples to apples comparison that can be made with available info.

No flames please. If there are any mistakes or inconsistencies, then please let me know and I'll amend the data above. Also, I'm assuming equal efficiency across both systems with compilers, code etc.

I'll re-iterate, it's a peak, apples to apples comparison, or as close to what we can get with available info at the moment without isolating any single components like CPUs, GPUs, bandwidths, total RAM etc...it's a total system vs system.

IMHO, they'll both have their strengths and weaknesses and will both be great systems, but the PS3 has overall balance and power suited to a games console.

Hopefully this helps and you can make your own conclusions...
 
Vaan said:
I wonder if your RAM normalisation is correct at all :?

As I've mentioned, it's based on the assumption that 25 GB/s to 256 MB is equivalent to 50 GB/s to 128 MB or 12.5 GB/s to 512 MB etc...keeping latencies the same...
 
Well you've just "proven" that caching and data compression don't work.

You can't normalise memory bandwidths like that.

Jawed
 
Why would that be? Surely a listed figure of 25 GB/s is 25 GB/s for the RAM amount it connects to. Neither party has provided bandwidth/pin or bandwidth/megabyte RAM figures.
 
one said:
Why not add the bandwidth to Local Store in Cell ;)
Well..we could also add EIB bandwidth and register file bandwidth too! :)
register files : 4*16*7*3.2 GHz = 1.4 TByte/s
local store : 16*7*3.2 GHz = 360 GByte/s
EIB : 96 * 3.2 GHz = 300 GByte/s

Don't worry guys, I'm kidding :)
 
one, I don't think that is accurate, because we aren't counting all physical interconnects within the processing system but only in the rendering system. I think you do have to account for the X360 eDRAM bandwidth somewhere in order to normalise it to PS3 bandwidths.

From a layman's point of view (and as usual I need someone to correct me if I'm wrong), whereas the PS3 has to use its 35 GB/s bandwidth to do all of its backbuffer work, the X360 eliminates that need by doing almost all of its backbuffer work before it goes back to the system RAM pool and CPU for display. You have to account for all the internal buses where that work is done to compare it to the PS3's 35 GB/s where that work is done...
 
Could someone also compare the shading units in the GPUs? I'm still confused about it to an extent...
Like, ATI has 48 shader ALUs, but what does it use to do texture address calculations? Are those separate from these 48, or is it like Nvidia's architecture where the pixel pipes' shader ALUs have to do it as well? And just how many ALUs are there per pixel pipe in the RSX?
 
Anyways, the thing should be something like this...

X360.png




So I don't find any method to measure or compare memory bandwidths between the two systems. Let's see in a couple of months with the full final specs.
 
blakjedi said:
From a layman's point of view (and as usual I need someone to correct me if I'm wrong), whereas the PS3 has to use its 35 GB/s bandwidth to do all of its backbuffer work, the X360 eliminates that need by doing almost all of its backbuffer work before it goes back to the system RAM pool and CPU for display. You have to account for all the internal buses where that work is done to compare it to the PS3's 35 GB/s where that work is done...
But the connection to Xenos's backbuffer, held in eDRAM, is a 32 GB/s bandwidth. The super-fast stuff is the LOGIC on the eDRAM. What can this logic do?

I guess that's the ultimate question. What work is done in eDRAM, not by the conventional GPU unit? How much does that logic contribute to the rendering?
 
Laa-Yosh said:
Could someone also compare the shading units in the GPUs? I'm still confused about it to an extent...
Like, ATI has 48 shader ALUs, but what does it use to do texture address calculations? Are those separate from these 48, or is it like Nvidia's architecture where the pixel pipes' shader ALUs have to do it as well? And just how many ALUs are there per pixel pipe in the RSX?

There are three threads of shader programs being processed on those 48 ALUs at any one time, but only 16 texture processors. The texture pipes are clients to the shader pipelines, so it makes no sense for the shader pipes themselves to be handling the texture address processing; that is done by the texture pipelines.
 
Vaan said:
Anyways, the thing should be something like this...

X360.png




So I don't find any method to measure or compare memory bandwidths between the two systems. Let's see in a couple of months with the full final specs.

vaan-

I know the Xenos is also the system memory controller... the CPU can access the GDDR3 directly, no? Otherwise it's bottlenecked at 11 GB/s / 11 GB/s, reads/writes?
 
blakjedi said:
Vaan said:
Anyways, the thing should be something like this...

X360.png




So I don't find any method to measure or compare memory bandwidths between the two systems. Let's see in a couple of months with the full final specs.

vaan-

I know the Xenos is also the system memory controller... the CPU can access the GDDR3 directly, no? Otherwise it's bottlenecked at 11 GB/s / 11 GB/s, reads/writes?

I suppose it is "bottlenecked" at this bw, yes :?
 
No, that's not right AFAIK. The CPU can directly access memory of course, as they are on the same bus.
 
Vector 1: x1 x3 x5 x7
Vector 2: x2 x4 x6 x8
Vector 3: y1 y3 y5 y7
Vector 4: y2 y4 y6 y8
Vector 5: z1 z3 z5 z7
Vector 6: z2 z4 z6 z8
Vector 7: w1 w3 w5 w7
Vector 8: w2 w4 w6 w8

Assuming one can issue a Vector MUL/ADD/MADD each cycle (throughput being 1 and that the pipeline is full), we can draw the following for these 4 dot products.

(x1*x2) + (y1*y2) + (z1*z2) + (w1*w2)

(x3*x4) + (y3*y4) + (z3*z4) + (w3*w4)

(x5*x6) + (y5*y6) + (z5*z6) + (w5*w6)

(x7*x8) + (y7*y8) + (z7*z8) + (w7*w8)

We have two ways of doing this:

1st way... 4 vector MUL's and 3 vector ADD's (7-8 cycles)


So, we first do 1 vector MUL (1 cycle):

(x1*x2) = A0

(x3*x4) = B0

(x5*x6) = C0

(x7*x8) = D0


We do now 1 vector MUL (1 cycle):

(y1*y2) = A1

(y3*y4) = B1

(y5*y6) = C1

(y7*y8) = D1


We do now 1 vector MUL (1 cycle):

(z1*z2) = A2

(z3*z4) = B2

(z5*z6) = C2

(z7*z8) = D2


We do now 1 vector MUL (1 cycle):

(w1*w2) = A3

(w3*w4) = B3

(w5*w6) = C3

(w7*w8) = D3


We then have:

A0 + A1 + A2 + A3

B0 + B1 + B2 + B3

C0 + C1 + C2 + C3

D0 + D1 + D2 + D3


We have to do the 3 ADD's in parallel, it should be obvious how this is done: A0 B0 C0 D0 can be called vector AA, A1 B1 C1 D1 can be called vector BB, A2 B2 C2 D2 can be called vector CC and A3 B3 C3 D3 can be called vector DD.

So we add the first two pairs:

AA + BB

CC + DD

Then we sum the results:

(AA+BB) + (CC+DD)


About 7 cycles to do 4 dot products.
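
As a rough illustration of the first way, here is a sketch that models each 4-wide vector register as a Python list and each vector instruction as a helper function. The helper names and the input values are made up for the example; real SIMD code would use the hardware's own intrinsics, and this says nothing about actual cycle counts or latencies.

```python
# Each list of 4 numbers stands in for one 4-wide vector register.
def vmul(a, b):
    return [x * y for x, y in zip(a, b)]

def vadd(a, b):
    return [x + y for x, y in zip(a, b)]

# Four pairs of (x, y, z, w) vectors, arbitrary values for illustration.
p = [(1, 2, 3, 4), (5, 6, 7, 8), (9, 10, 11, 12), (13, 14, 15, 16)]
q = [(1, 0, 0, 0), (0, 1, 0, 0), (1, 1, 1, 1), (2, 2, 2, 2)]

# Transpose into the layout above: one register holds all the x's, one all the y's, etc.
px, py, pz, pw = (list(c) for c in zip(*p))
qx, qy, qz, qw = (list(c) for c in zip(*q))

# 4 vector MULs
aa = vmul(px, qx)
bb = vmul(py, qy)
cc = vmul(pz, qz)
dd = vmul(pw, qw)

# 3 vector ADDs
dots = vadd(vadd(aa, bb), vadd(cc, dd))
print(dots)  # [1, 6, 42, 116]
```

Seven vector instructions give four dot products, one per lane.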

2nd way... 1 vector MUL, 3 vector MADDs


So, we first do 1 vector MUL (1 cycle):

(x1*x2) = A0

(x3*x4) = B0

(x5*x6) = C0

(x7*x8) = D0

We do now 1 vector MADD:

A0 + (y1 * y2) = A1

B0 + (y3 * y4) = B1

C0 + (y5 * y6) = C1

D0 + (y7 * y8) = D1

We do now 1 vector MADD:

A1 + (z1 * z2) = A2

B1 + (z3 * z4) = B2

C1 + (z5 * z6) = C2

D1 + (z7 * z8) = D2


We do now 1 vector MADD:

A2 + (w1 * w2) = A3

B2 + (w3 * w4) = B3

C2 + (w5 * w6) = C3

D2 + (w7 * w8) = D3

This is much faster.

It should be quite fast... hopefully I did not write the slowest approach possible.

Edit: I already applied the fix ;).
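
The second way can be sketched the same way, again with Python lists standing in for 4-wide registers. The `vmadd` helper is a hypothetical stand-in for a fused multiply-add instruction, and the input values are arbitrary. It only shows the instruction count and the data layout, not real timings.

```python
# Each list of 4 numbers stands in for one 4-wide vector register.
def vmul(a, b):
    return [x * y for x, y in zip(a, b)]

def vmadd(a, b, c):
    # Multiply-add: a * b + c per lane (stand-in for a hardware MADD).
    return [x * y + z for x, y, z in zip(a, b, c)]

# Four pairs of (x, y, z, w) vectors, arbitrary values for illustration.
p = [(1, 2, 3, 4), (5, 6, 7, 8), (9, 10, 11, 12), (13, 14, 15, 16)]
q = [(1, 0, 0, 0), (0, 1, 0, 0), (1, 1, 1, 1), (2, 2, 2, 2)]

# One register per component across the four dot products.
px, py, pz, pw = (list(c) for c in zip(*p))
qx, qy, qz, qw = (list(c) for c in zip(*q))

acc = vmul(px, qx)        # 1 vector MUL
acc = vmadd(py, qy, acc)  # 3 vector MADDs
acc = vmadd(pz, qz, acc)
acc = vmadd(pw, qw, acc)
print(acc)  # [1, 6, 42, 116]
```

Four vector instructions instead of seven for the same four dot products, though each MADD depends on the previous result.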
 
when you look at the block diagram the northbridge is the connection between all three areas:

GPU <- 33.2 GB/s R / 22.4 GB/s W -> Northbridge
(actually the read value is the sum of the read bandwidth from L2 cache = 10.8 GB/s, plus the normal northbridge bandwidth = 22.4 GB/s; see (7) on the block diagram) = 55.6 GB/s total

CPU <- 10.8 GB/s R / 10.8 GB/s W -> Northbridge = 21.6 GB/s total

Northbridge <- 22.4 GB/s R/W -> 512 MB RAM = 22.4 GB/s total

99.6 GB/s total, not including eDRAM access
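
A quick sketch to check the sums above. The figures are the ones read off the block diagram as quoted in this post; the variable names are made up, and simply adding link bandwidths together is this post's own approach rather than a meaningful system metric.

```python
# All values in GB/s, as quoted from the block diagram in this post.
l2_read = 10.8                             # GPU reads from the CPU L2 cache
nb_link = 22.4                             # normal northbridge bandwidth

gpu_read = l2_read + nb_link               # 33.2
gpu_write = 22.4
gpu_total = gpu_read + gpu_write           # 55.6

cpu_total = 10.8 + 10.8                    # 21.6 (read + write)
ram_total = 22.4                           # shared read/write to 512 MB

total = gpu_total + cpu_total + ram_total  # eDRAM access not included
print(f"GPU {gpu_total:.1f} + CPU {cpu_total:.1f} + RAM {ram_total:.1f} = {total:.1f} GB/s")
```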

The confusing thing is that the northbridge sits on the GPU... but the maximum bandwidth the northbridge supports at any time seems to be 22.4 GB/s in any one direction...

The question is this: the CPU read bandwidth is only half that of the northbridge... wouldn't it have been better to have the same bandwidth for reads and writes as the GPU?

Again feel free to correct this as I may not be presenting this correctly.
 
Panajev2001a said:
It should be quite fast... hopefully I did not write the slowest approach possible.
The second approach is quite fast, almost as fast as it can be; you can further reduce the instruction count from 5 to 4 with a sequence of fmul, fmadd, fmadd, fmadd.
Don't worry about latencies here, because in any reasonably lengthy 'shader' or inner loop you're going to do some other (non-dependent) calculation that would eventually be interleaved with your dot4 product.
 
Qroach said:
Um, were those calculations on how Xenon would process data, or Cell? You guys got me confused.
Cell. Xenon's extended VMX unit would use one or more dot instructions instead.
 