PS3 vs X360: Apples to Apples high level comparison...

Jawed said:
For a while people were saying that the 256GB/s figure inside the EDRAM was fictional...

It was originally described as "effective".

It would appear it's real, not effective.

Though we're still waiting to get hard facts, so I'll continue to "believe" rather than treat it as a hard fact (admittedly, hard to do).

Jawed

I think people are getting really mixed up over everything. Between 32GB/s external bus throughput to 256Gb/s external bus throughput to 256GB/s internal bus throughput. People end up mixing up terms and buses. I think most people were of the opinion that the 256GB/s external bus width was probably fictional, but anything on the edram chip internally could talk much faster. Granted, I don't know if people expected the edram "processor" to be able to do blending/aa/etc.

Nite_Hawk
 
Jaws:
PS3 ~ 2 TFLOPS

X360 ~ 1 TFLOPS
Microsoft claimed 'more than 1 TFLOP'. The X360 GPU probably rates well over 1 TFLOP by itself by counting its fixed functionality as floats in the same way nVidia did. It appears Microsoft was counting this way and just rounded to the nice 1 TFLOP spec, and Sony outdid them in their announcement by not rounding down.
 
Titanio said:
Jaws said:
They're not exactly identical. However the CELL PPE and XeCPU are both Power based, 12 Flops per cycle, 2-way SMT, in-order cores...

Was there confirmation of this? Specifically the in-order bit?

No official details but the 2-way SMT and 12 flops per cycle was inferred from 115 and 218 GFlops @ 3.2 GHz for XeCPU and CELL.
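
The inference can be checked with a little arithmetic. This is only a sketch: it assumes 3 identical XeCPU cores and, for CELL, one PPE plus 7 SPUs (the counts used elsewhere in this thread), and the names are made up for illustration.

```python
# Back out flops per cycle from the quoted peak GFlops figures.
# Assumptions: 3 identical XeCPU cores; CELL = 1 PPE + 7 SPUs; 3.2 GHz clock.

CLOCK_GHZ = 3.2


def flops_per_cycle(peak_gflops, clock_ghz=CLOCK_GHZ):
    """Total flops issued per clock across the whole chip."""
    return peak_gflops / clock_ghz


xecpu_total = flops_per_cycle(115)      # roughly 36 flops/cycle for the chip
xecpu_per_core = xecpu_total / 3        # roughly 12 flops/cycle per core

cell_total = flops_per_cycle(218)       # roughly 68 flops/cycle for the chip
# If the PPE matches an XeCPU core at 12 flops/cycle, the rest is the SPUs.
spu_each = (cell_total - 12) / 7        # roughly 8 flops/cycle per SPU

print(round(xecpu_per_core, 2), round(cell_total, 2), round(spu_each, 2))
```

Note this only recovers the flops-per-cycle figure; the SMT and in-order details cannot be derived from peak GFlops.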


blakjedi said:
Jaws said:
Xenos is capable of 24 billion dot products per second. If you allocate 37.4 billion to RSX, that's a helluva increase considering they're both on 90nm, no? :oops:

I had asked in the xenos thread whether the work output spoken of related to Xenos includes the edram or is just the shader part...

"However, using Sony's claim, 7 dot products per cycle * 3.2 GHz = 22.4 billion dot products per second for the CPU. That leaves 51 - 22.4 = 28.6 billion dot products per second that are left over for the GPU. That leaves 28.6 billion dot products per second / 550 MHz = 52 GPU ALU ops per clock."

Sorry, your question and your quote don't seem related, or I'm missing what you're asking here? If you're asking whether the fixed function logic/ALUs on the EDRAM module are included, then no...it's only shader ALUs.

The fixed function stuff would be included in the 1TFLOP number of X360 though...

Your above quote is the 51 Giga dots/sec for both CELL and RSX. I took 8 dots/cycle for CELL (VMX+7 SPU)...but the above assumes 7, excluding the VMX for CELL.
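
For reference, the subtraction in the quote can be sketched like this. It assumes one dot product per unit per cycle, CELL at 3.2 GHz and RSX at 550 MHz; the function name is illustrative only.

```python
# Dot-product budget from the quoted claim: 51 billion/s for CELL + RSX.
# Assumptions: one dot product per unit per cycle, CELL at 3.2 GHz,
# RSX at 550 MHz.

TOTAL_GDOTS = 51.0     # billion dot products per second, CELL + RSX
CELL_GHZ = 3.2
RSX_GHZ = 0.55


def rsx_dots_per_clock(cell_units):
    """Dot products per RSX clock left over after CELL's share."""
    cell_gdots = cell_units * CELL_GHZ
    rsx_gdots = TOTAL_GDOTS - cell_gdots
    return rsx_gdots / RSX_GHZ


# The quote's assumption: 7 SPUs, VMX excluded.
print(round(rsx_dots_per_clock(7)))
```

Passing a different `cell_units` value shows how sensitive the leftover RSX figure is to what gets counted on the CELL side.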

This would suggest that the '52' number is 52 vec4 units contributing to the 136 shader ops per cycle for RSX, then 136-52 ~ 84 ALUs would be scalar ALUs or ones not capable of dot products on the RSX...i.e.

52 Vec4 units + 84 vec?/scalar units?

Vec4 + scalar units can be paired,


RSX

52 Vec4 + 52 Scalar + 32 vec? units?

:?

rwolf said:
http://www.extremetech.com/article2/0,1558,1818127,00.asp

The 48 ALUs are divided into three SIMD groups of 16. When it reaches the final shader pipe, each of the 16 ALUs has the ability to write out two samples to the 10MB of EDRAM. Thus, the chip is capable of writing out a maximum of 32 samples per clock. At 500MHz, that means a peak fill rate of 16 gigasamples. Each of the ALUs can perform 5 floating-point shader operations. Thus, the peak computational power of the shader units is 240 floating-point shader ops per cycle, or 120 billion shader ops per second at 500MHz
...
8)

I agree Xenos is cool! 8)

But some of these sites are really just confusing all these numbers.

It's 48 Billion shader ops per second for Xenos in the *official* specs,

http://www.xbox.com/assets/en-us/xbox360downloads/FactSheets.zip

Also the "240 floating-point shader ops per cycle" they mention can be easily confused with single precision 240 floating-point ops per cycle (flops)! Which is not accurate as that would be 480 flops per cycle with FMADD! :p
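
To keep the units straight, here is a rough sketch of the arithmetic behind those figures. It takes the article's numbers at face value and assumes each shader op could be a multiply-add worth 2 flops; it does not settle which way of counting is right.

```python
# Xenos figures quoted from the ExtremeTech article, at 500 MHz.
# The "flops" line assumes each shader op is a multiply-add (2 flops).

CLOCK_GHZ = 0.5
ALUS = 48
OPS_PER_ALU = 5             # floating-point shader ops per ALU per cycle
SAMPLES_PER_CLOCK = 16 * 2  # 16 ALUs in the final pipe, 2 samples each

shader_ops_per_cycle = ALUS * OPS_PER_ALU                # 240
gshader_ops_per_sec = shader_ops_per_cycle * CLOCK_GHZ   # 120 billion
flops_per_cycle_fmadd = shader_ops_per_cycle * 2         # 480
fill_rate_gsamples = SAMPLES_PER_CLOCK * CLOCK_GHZ       # 16

# The official sheet quotes 48 billion shader ops/s, which per clock is:
official_ops_per_cycle = 48 / CLOCK_GHZ                  # 96

print(shader_ops_per_cycle, gshader_ops_per_sec, official_ops_per_cycle)
```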

Anyway, the numbers on the first page of this thread are accurate from the info we have...and these random sites are throwing all sorts of conflicting numbers around...


Lazy8s said:
Jaws:
PS3 ~ 2 TFLOPS

X360 ~ 1 TFLOPS
Microsoft claimed 'more than 1 TFLOP'. The X360 GPU probably rates well over 1 TFLOP by itself by counting its fixed functionality as floats in the same way nVidia did. It appears Microsoft was counting this way and just rounded to the nice 1 TFLOP spec, and Sony outdid them in their announcement by not rounding down.

IIRC, from official specs,

RSX ~ 1.8 TFlops
CELL ~ 0.218 TFlops

X360 is still quoted at system total ~ 1 TFlops
XeCPU ~ 0.115 TFlops
Xenos ~ 0.885 TFlops

Not sure why one would 'round down' and the other 'round up' given the opportunity. But it could well be that the RSX has a lot of fixed function logic on-board that counts towards that number, whilst the Xenos transistor count includes 10 MB of eDRAM which wouldn't contribute to that number...
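
The split above is simple addition and subtraction, sketched here. Note the Xenos figure only falls out if the system total is taken as exactly 1.0 TFlops, which is an assumption.

```python
# Peak TFlops figures as quoted in this thread (marketing numbers).
rsx, cell = 1.8, 0.218
x360_total, xecpu = 1.0, 0.115

ps3_total = rsx + cell        # about 2.018, quoted as "~2 TFLOPS"
xenos = x360_total - xecpu    # 0.885, valid only if the total is exactly 1.0

print(round(ps3_total, 3), round(xenos, 3))
```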
 
Same way Xenos rates at 900 GFLOPS... it's called misleading the consumer. For instance, RSQ could be counted as 1 FLOP, but not in marketing-land. Instead, we'll count the lookup as one FLOP, and count all the FLOPs used in the NR refinement, and then you'd get something like 15-odd flops in a single shader instruction. Or perhaps you can imagine that it does SIN/COS using the first 4/5 terms of the Maclaurin Series and geometrically mirroring the results. That would amount to... what... 30 FLOPs per instruction? So all you have to do is consider how much computing power the GPU would have if you did nothing but SIN and/or COS and/or RSQ for every single instruction you'll ever execute. There's a few TFLOPs for you.
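
A rough sketch of that kind of accounting, in Python. Everything here is an assumption for illustration: the initial guess stands in for a hardware lookup, three Newton-Raphson steps are used, the sine uses five Maclaurin terms with no range reduction or mirroring, and each multiply, add or subtract is charged as one flop. The totals depend entirely on those choices, so they won't necessarily match the figures above.

```python
import math

# Count floating-point operations hidden inside one "instruction".
# Assumptions: the initial guess is charged 1 flop (stand-in for a lookup);
# each multiply, add, or subtract is charged 1 flop; constants are free.


def rsq_counted(x, iterations=3):
    """Reciprocal square root via Newton-Raphson; returns (value, flops)."""
    _, exponent = math.frexp(x)      # x = m * 2**exponent, 0.5 <= m < 1
    y = 2.0 ** (-exponent / 2.0)     # crude stand-in for the table lookup
    flops = 1
    for _ in range(iterations):
        # y <- y * (1.5 - 0.5 * x * y * y): 4 multiplies + 1 subtract
        y = y * (1.5 - 0.5 * x * y * y)
        flops += 5
    return y, flops


def sin_counted(x):
    """First 5 Maclaurin terms of sin, Horner form; returns (value, flops)."""
    x2 = x * x                       # 1 multiply
    flops = 1
    coeffs = [1.0 / 362880, -1.0 / 5040, 1.0 / 120, -1.0 / 6]
    acc = coeffs[0]
    for c in coeffs[1:]:
        acc = acc * x2 + c           # 1 multiply + 1 add
        flops += 2
    value = x * (acc * x2 + 1.0)     # 2 multiplies + 1 add
    flops += 3
    return value, flops


rsq_value, rsq_flops = rsq_counted(4.0)
sin_value, sin_flops = sin_counted(0.5)
print(rsq_flops, sin_flops)
```

Counted as instructions, each call is one op; counted this way, each is many flops. That gap is the whole trick.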
 
Nite_Hawk said:
Jawed said:
For a while people were saying that the 256GB/s figure inside the EDRAM was fictional...

It was originally described as "effective".

It would appear it's real, not effective.

Though we're still waiting to get hard facts, so I'll continue to "believe" rather than treat it as a hard fact (admittedly, hard to do).

Jawed

I think people are getting really mixed up over everything. Between 32GB/s external bus throughput to 256Gb/s external bus throughput to 256GB/s internal bus throughput. People end up mixing up terms and buses. I think most people were of the opinion that the 256GB/s external bus width was probably fictional, but anything on the edram chip internally could talk much faster. Granted, I don't know if people expected the edram "processor" to be able to do blending/aa/etc.

Nite_Hawk


"ATI: The 2-terabit (256GB/sec) number comes from within the EDRAM, that’s the kind of bandwidth inside that RAM, inside the chip, the daughter die. But between the parent and daughter die there’s a 236Gbit connection on a bus that’s running in excess of 2GHz. It has more than one bit obviously between them."

http://firingsquad.com/features/xbox_360_interview/page3.asp
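
Since a lot of the mix-up is bits versus bytes, here is a small conversion sketch using only the figures quoted in this thread. The helper names are illustrative.

```python
# Convert between gigabits and gigabytes per second (8 bits per byte).
# Figures are the ones quoted in this thread.


def gbit_to_gbyte(gbit_per_s):
    return gbit_per_s / 8.0


def gbyte_to_gbit(gbyte_per_s):
    return gbyte_per_s * 8.0


internal_gbit = gbyte_to_gbit(256)  # 2048 Gbit/s, the "2-terabit" figure
link_gbyte = gbit_to_gbyte(236)     # 29.5 GB/s between parent and daughter die
small_b = gbit_to_gbyte(256)        # 32 GB/s, if "256Gb/s" means gigabits

print(internal_gbit, link_gbyte, small_b)
```

Read that way, "32GB/s" and "256Gb/s" for the external bus would be the same number in different units, while 256GB/s is the internal figure.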


also, old diagram:
http://www.xbitlabs.com/misc/picture/?src=/images/news/2004-04/xbox2_scheme_bg.gif&1=1
 
Nite_Hawk said:
Jawed said:
For a while people were saying that the 256GB/s figure inside the EDRAM was fictional...

It was originally described as "effective".

It would appear it's real, not effective.

Though we're still waiting to get hard facts, so I'll continue to "believe" rather than treat it as a hard fact (admittedly, hard to do).

Jawed

I think people are getting really mixed up over everything. Between 32GB/s external bus throughput to 256Gb/s external bus throughput to 256GB/s internal bus throughput. People end up mixing up terms and buses. I think most people were of the opinion that the 256GB/s external bus width was probably fictional, but anything on the edram chip internally could talk much faster. Granted, I don't know if people expected the edram "processor" to be able to do blending/aa/etc.

Nite_Hawk

FiringSquad: What types of operations do the EDRAMs 192 processors perform?

ATI: Well they do z-compares, they do alpha blends, they do blends of samples to make a pixel. That kind of thing. They do stencil operations also. And this is the first time memory has access to something like this, right in the memory, so it never leaves the memory die. The memory and the logic is all built into one die. And it’s also a power savings by the way.

http://firingsquad.com/features/xbox_360_interview/page3.asp
 
Jaws:
IIRC, from official specs,

RSX ~ 1.8 TFlops
CELL ~ 0.218 TFlops

X360 is still quoted at system total ~ 1 TFlops
XeCPU ~ 0.115 TFlops
Xenos ~ 0.885 TFlops
I don't think the total "targeted" FLOPS "power" of the Xenos graphics chipset has ever been disclosed. The PR rough guideline for total system performance is too vague to consider it an absolute quantity useful in deriving 885 GFLOPS for the GPUs. Considering the NV40 was already rated around 1 TFLOP by similar nVidia accounting, I suspect X360's next generation graphics chipset probably delivers something comparable and more.
Not sure why one would 'round down' and the other 'round up' given the opportunity.
Microsoft probably felt claiming the magical TFLOP barrier would be spoiling enough, and Sony was left in the position to be more exact in order to show that there would still be some improvement in power for their system.
 
blakjedi said:
How in the world does the Nvidia rate at 1.8 Teraflops? No matter what I've read, it just doesn't add up.
NVidia claims 360 Gflops for NV40, counting PS, VS, texturing and blending. That figure is a bit on the high side, but probably not too far off.
If we take the 136 to 53 shader ops comparison as RSX being "2.57 times NV40", we arrive at 920 Gflops. And btw, it could very well mean RSX has 28 of 32 pixel pipelines (a parallel to Cell ;))
 
Xmas said:
NVidia claims 360 Gflops for NV40, counting PS, VS, texturing and blending. That figure is a bit on the high side, but probably not too far off.
Do you remember where you read those numbers? some official nvidia document?
BTW, you have a PM :)
 
jvd said:
", we arrive at 920 Gflops
Which is 880 Gflops less than they claim.
That's where they counted the other parts: triangle setup, the whole Z subsystem, LOD calculation, interpolators, whatever.
Given the emphasis on HDR, they have probably doubled the capabilities of the TMUs handling FP textures, so sampling a FP16 texture is very likely single clock. And texturing is more than 40% of that NV40 figure.
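
The scaling argument in this exchange works out roughly as below. It is only a sketch: it assumes flops scale linearly with the quoted shader-op counts and takes the 1.8 TFlops claim as 1800 Gflops.

```python
# Scale NVidia's claimed NV40 figure by the shader-op ratio quoted above.
NV40_GFLOPS = 360.0
RSX_SHADER_OPS = 136
NV40_SHADER_OPS = 53
RSX_CLAIMED_GFLOPS = 1800.0

ratio = RSX_SHADER_OPS / NV40_SHADER_OPS        # about 2.57
rsx_scaled = NV40_GFLOPS * ratio                # about 924 ("920" in the post)
unexplained = RSX_CLAIMED_GFLOPS - rsx_scaled   # about 876 ("880" in the post)

print(round(ratio, 2), round(rsx_scaled), round(unexplained))
```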
 
Isolating XeCPU and CELL isn't strictly a total system, apples to apples comparison, but I've noticed a few peak metrics missing alongside GFlops, namely integer and scalar metrics. I haven't seen official numbers on these yet, but here are some peak numbers from what we know so far (please feel free to correct me),

-XeCPU, integer, 32bit

1 core ~ 1VMX + 1 IU ~ 4 + 1 ~ 5 integer ops per cycle

3 cores ~ 3*5 ~ 15 integer ops per cycle
15*3.2 GHz ~ 48 Billion integer ops per second

-XeCPU, scalar

1 core ~ FPU + IU ~ 2 scalar ops per cycle

3 cores ~ 3*2 ~ 6 scalar ops per cycle
6*3.2GHz ~ 19.2 Billion scalar ops per second

-XeCPU, FP, 32 bit

115 GFlops


-CELL, integer, 32 bit

PPE ~ 1VMX + 1 IU ~ 4 + 1 ~ 5 integer ops per cycle

7 SPUs ~ 7*4 ~ 28 integer ops per cycle

CELL ~ 33 integer ops per cycle
33*3.2GHz ~ 105.6 Billion integer ops per second

-CELL, scalar

PPE ~ FPU + IU ~ 2 scalar ops per cycle

7 SPUs ~ 7*1 ~ 7 scalar ops per cycle

CELL ~ 9 scalar ops per cycle
9*3.2 GHz~ 28.8 billion scalar ops per second

-CELL, FP, 32 bit

218 GFlops


CELL vs XeCPU

CELL~ 105.6 Billion integer ops per second, 32bit
XeCPU~ 48 Billion integer ops per second, 32bit

CELL~ 28.8 Billion scalar ops per second, 32bit
XeCPU ~ 19.2 Billion scalar ops per second, 32bit

CELL~ 218 GFlops, 32bit
XeCPU~ 115 GFlops, 32bit

Of course these are peak numbers...
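
The same sums as a short script, in case anyone wants to plug in different per-unit counts. The counts are the estimates listed above, not official figures, and the helper name is made up.

```python
# Peak ops per second = ops per cycle * clock (GHz gives billions/s).
# Per-unit counts are the estimates listed above, not official figures.

CLOCK_GHZ = 3.2


def peak_gops(ops_per_cycle, clock_ghz=CLOCK_GHZ):
    return ops_per_cycle * clock_ghz


xecpu_int = peak_gops(3 * (4 + 1))       # 3 cores * (VMX 4 + IU 1)
xecpu_scalar = peak_gops(3 * 2)          # 3 cores * (FPU + IU)
cell_int = peak_gops((4 + 1) + 7 * 4)    # PPE (VMX 4 + IU 1) + 7 SPUs * 4
cell_scalar = peak_gops(2 + 7 * 1)       # PPE (FPU + IU) + 7 SPUs * 1

for name, value in [("XeCPU int", xecpu_int), ("XeCPU scalar", xecpu_scalar),
                    ("CELL int", cell_int), ("CELL scalar", cell_scalar)]:
    print(f"{name}: {value:.1f} billion ops/s")
```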
 
Jaws your integer numbers are all over the place and basically off on some points.

Among other things, SPEs are dual issue - so if you want to make sweeping generalizations about performance you need to count them as 2 integer instructions per clock (scalar or vector for that matter :p ).
 
The things that consume the most integer execution time are generally not the integer math ops, but things like store, fetch, and branch.

Comparing the # of instructions per second peak doesn't give you a meaningful number at all.
 
Comparing the # of instructions per second peak doesn't give you a meaningful number at all.
Of course not, but if you do go writing it out at least you should make it accurate.

For that matter the idea that dual-issue will double your instruction throughput couldn't be farther from the truth on in-order CPUs either. Especially in any kind of general purpose code.

Actually the places where dual issue makes the most difference is what SPEs tend to be optimized for.
 
archie4oz said:
Should be 96GFlops unless they've got a sneaky instruction that adds another 19Gflops...
Some people have speculated that XCPU FPU could possibly have Gekko-esque 2-way SIMD mode in single precision, adding 2 more flops/cycle to peak numbers.

Personally I would find it ironic if that's the case, given how little use that would have outside specsheets and how they harp on Sony all the time about pushing peak numbers.
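
For what it's worth, the 96 versus 115 gap can be sketched like this. The per-unit breakdown is an assumption, not an official spec: a 4-wide multiply-add on VMX, one multiply-add on the scalar FPU, and 3 cores at 3.2 GHz.

```python
# Per-core peak flops for XeCPU under two hypotheses.
# Assumptions (not official): VMX does a 4-wide multiply-add (8 flops/cycle),
# the scalar FPU does one multiply-add (2 flops/cycle), 3 cores at 3.2 GHz.

CORES = 3
CLOCK_GHZ = 3.2
VMX_FLOPS = 4 * 2
FPU_FLOPS = 1 * 2
PAIRED_FPU_FLOPS = 2 * 2   # speculated 2-way SIMD mode on the FPU

baseline = CORES * (VMX_FLOPS + FPU_FLOPS) * CLOCK_GHZ             # ~96
with_paired = CORES * (VMX_FLOPS + PAIRED_FPU_FLOPS) * CLOCK_GHZ   # ~115.2

print(round(baseline, 1), round(with_paired, 1))
```

Under these assumptions the extra 2 flops/cycle per core accounts for the missing 19 GFlops.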
 