PS3 vs X360: Apples to Apples high level comparison...

So, that means that Xenos actually does have a PPP for creating/modifying/deleting vertices, then?

As a layman, I wanted to ask a question in one of the other threads where DeanoC spoke of Xenos' ability to read/write anywhere in main RAM...what purpose would that have for game visuals, other than the academic possibilities? Physics?
 
DaveBaumann said:
Hopefully I'll have some more up later next week.

So much time still to wait :cry:
Well, at least it will be worth the wait :D (I hope that I am able to understand the article).

BTW, thanks in advance for the article :D
 
Jawed said:
We're still waiting for a detailed comparison of the theoretical shader performance profiles of NV40 and R420.

And we've had them in our "labs" now for a year...

Curious, who's "we"? B3D or...?

Anyway, just stating the obvious, but those comparisons would be between an SM3.0 and an SM2.0 architecture, with even more disparity...

Jawed in the earlier link said:
...The leak for XB360 claims 96G ops per second. It seems to me that the leak is counting ops in the same way that NVidia does, rather than how ATI does.

So...

So...

Leak said:
The Xenon GPU is a custom 500+ MHz graphics processor from ATI. The shader core has 48 Arithmetic Logic Units (ALUs) that can execute 64 simultaneous threads on groups of 64 vertices or pixels. ALUs are automatically and dynamically assigned to either pixel or vertex processing depending on load. The ALUs can each perform one vector and one scalar operation per clock cycle, for a total of 96 shader operations per clock cycle. Texture loads can be done in parallel to ALU operations. At peak performance, the GPU can issue 48 billion shader operations per second.

http://www.beyond3d.com/forum/viewtopic.php?t=13470

So...you've read the 'leak' wrong. It seems 'Billions' and 'cycles' and 'seconds' and 'numbers' are being confused...
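To keep the per-cycle and per-second figures apart, here is a rough sketch of the leak's arithmetic in Python. It is only an illustration: the function name is made up, and the 0.5 GHz clock is an assumption taken from the leak's "500+ MHz".

```python
# Hypothetical helper (not from any spec): peak ops per cycle and per second.
def peak_shader_ops(alus, ops_per_alu_per_cycle, clock_ghz):
    ops_per_cycle = alus * ops_per_alu_per_cycle
    gops_per_sec = ops_per_cycle * clock_ghz  # billions of ops per second
    return ops_per_cycle, gops_per_sec

# Leak: 48 ALUs, each issuing one vector + one scalar op per clock.
# Assumption: clock taken as exactly 0.5 GHz.
ops_per_cycle, gops_per_sec = peak_shader_ops(48, 2, 0.5)

print(ops_per_cycle, "shader ops per cycle")
print(gops_per_sec, "billion shader ops per second")
```

So the "96" is a per-cycle count and the "48" is a per-second count in billions; they are the same claim at a 0.5 GHz clock.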

What makes this more frustrating is that we've discussed these leaks many times in the other threads...

Jawed said:
This talk by ATI in London "confirms" a number of things:

http://www.driverheaven.net/showthread.php?t=75843

- Xenos cannot create vertices (no tessellation) (13:20)
- the ALUs are not organised into quads (or any other group size) (14:30)
- 120 Gsops (12:23)

See above. That's a peak 'component' operation and NOT a 'shader' operation as mentioned earlier.

Please read my first post on the first page...

If you have, then it would be obvious that, CELL + RSX ~ 100 Billion shader operations per second.

And you're quoting ONLY the Xenos GPU ~ 120 Billion shader operations per second??

Let's keep this logical here... ;)
 
Jaws, we have reasonably detailed architectural diagrams for NV40 and R420 plus explanations on how they work. Care to explain, in detail, how they perform against each other, based purely on theory?

In other words, can you convert the theoretical capabilities of these two architectures into a realistic prediction of the performance of them?

NV40 has 2x the SM2.0/3.0 ALU capability of R420, which should overhaul its core-clock disadvantage. But it doesn't. etc.

What I've learnt over the last few days is this is a road to nowhere. I'm aghast that you still think it's worth pursuing this.

I'm quite happy to speculate on the architectures, but I'm going to stick to throwing around stupid performance numbers for the sake of taking the piss out of the marketing. ATI's now counting 120Gsops for Xenos. It's now time for NVidia to counter that.

Jawed
 
Oh yeah, you're right, I confused the "96 ops per cycle" and "96 Gops per second" numbers.

Sigh.

Jawed
 
NV40 has 2x the SM2.0/3.0 ALU capability of R420, which should overhaul its core-clock disadvantage. But it doesn't. etc.

Saying something has "2 ALUs" doesn't mean anything in either case. R420's primary ALU has all the instructions that it supports, whilst the 2 ALUs of NV40 have a distribution of instructions between the two - this means that it can opportunistically dual-issue in some cases, but not necessarily two instructions of the same type.
 
DaveBaumann said:
NV40 has 2x the SM2.0/3.0 ALU capability of R420, which should overhaul its core-clock disadvantage. But it doesn't. etc.

Saying something has "2 ALUs" doesn't mean anything in either case. R420's primary ALU has all the instructions that it supports, whilst the 2 ALUs of NV40 have a distribution of instructions between the two - this means that it can opportunistically dual-issue in some cases, but not necessarily two instructions of the same type.

Dave, why are you posting? You should be working on your XGPU article :LOL: . J/K
 
Jawed said:
Jaws, we have reasonably detailed architectural diagrams for NV40 and R420 plus explanations on how they work. Care to explain, in detail, how they perform against each other, based purely on theory?

In other words, can you convert the theoretical capabilities of these two architectures into a realistic prediction of the performance of them?

If you've read the first post in this thread then you'd know that,

Jaws said:
...
I'm only going to provide 'normalised' total system metrics compared to the above image as this is all we can compare across both systems at the moment until more details are released.
...

I'm trying to put perspective to these numbers from *both* sides. The 'peak' metrics have been derived on the first page and are valid for both as they have been consistently derived. So when some random 'numbers' come along from different, conflicting sources, pulling numbers out of context, there's some reference and a perspective. I've made clear they are 'peak' numbers and as close to an apples to apples comparison as you can get with the info we have from *official* docs from E3 PR.

This is essentially no different from putting the *official*, released spec metrics side by side. At least it has context and discussion in this thread.

And NOBODY has *real* world numbers. My point? FACTS not BS that's been flying round recently. Even if these facts are *purely* theoretical but nevertheless can be derived logically.

I've read some threads recently with people STILL crying foul at PS2 specs and yet conveniently forgetting to cry foul at XBOX specs and vice-versa. I smell fanb**s...

Jawed said:
What I've learnt over the last few days is this is a road to nowhere. I'm aghast that you still think it's worth pursuing this.

As I've just mentioned above, I'm after FACTS, be they theoretical or not, but nevertheless authentic and not conflicting. Once these FACTS can be agreed on, then we can have sensible discussions on whether they can be realised or not, how realistic they are, or where potential weaknesses may lie. But unless that's agreed on, all we'll see are pointless, WRONG numbers spreading misinformation...


Jawed said:
I'm quite happy to speculate on the architectures, but I'm going to stick to throwing around stupid performance numbers for the sake of taking the piss out of the marketing. ATI's now counting 120Gsops for Xenos. It's now time for NVidia to counter that.

I don't care if technology is from ATI, nVidia, Sony, MS, Nintendo or ACME...I care about interesting technology no matter who it's from. If the PR can't agree on consistency, then surely so-called *smart* people on this forum can?

Have you read my derivations of the 'peak' numbers on the first page? If you can dispute them then please feel free, as I want FACTS that can be consistently and independently derived from *official* info. I've used a consistent method that's held solid so far with *official* numbers. It's explained all the other, conflicting numbers, including yours. If you can't accept/dispute those numbers then I've learnt something new today...
 
Maybe you want to look at page 13 of the PDF I linked:

- Pixel shader operations/pixel 8
- Pixel shader operations/clock 128

These are the claimed numbers for NV40.

51.2Gsops. Roughly half of what's claimed for RSX.

How much more black and white do you want?...

If only we could talk in terms of pixel shader instructions, comparisons would start to get meaningful. This example shows SM3 executing 102 instructions in 46.75 cycles, 2.2 instructions per cycle:

http://www.beyond3d.com/forum/viewtopic.php?p=327176#327176

It's also interesting to ask about the effect of RSX's likely SIMD pixel shader architecture. NV40 appears to be SIMD across all 16 pipelines, i.e. only one shader can be executing at a time:

http://www.beyond3d.com/forum/viewtopic.php?t=23295

R420 is 4-way MIMD across 16 pipelines, i.e. each quad can execute a different shader. Counting transistors, this means that R420 has prolly got a greater overhead in instruction decode logic than NV40.

I wonder if Xenos will be 48-way MIMD, i.e. each ALU can be running a different shader. I'm sorta doubtful, to be honest, because that's an awful lot of decode-logic overhead - though I admit to not knowing what that amounts to in percentage terms. I aint got the foggiest!

RSX and Xenos are looking as incomparable as NV30 and R300 did a few years ago.

All of this still leaves us high and dry on Cell versus XB360 CPU.

Jawed
 
DaveBaumann said:
NV40 has 2x the SM2.0/3.0 ALU capability of R420, which should overhaul its core-clock disadvantage. But it doesn't. etc.

Saying something has "2 ALUs" doesn't mean anything in either case. R420's primary ALU has all the instructions that it supports, whilst the 2 ALUs of NV40 have a distribution of instructions between the two - this means that it can opportunistically dual-issue in some cases, but not necessarily two instructions of the same type.

Dave, that's precisely my point. That's why I highlighted that specific nonsense comparison.

Jaws is determined to compare architectures with absolutely no regard for their respective architectures.

Jawed
 
Jawed said:
Maybe you want to look at page 13 of the PDF I linked:
- Pixel shader operations/pixel 8
- Pixel shader operations/clock 128

Let's see...128 * 0.4 GHz ~ 51.2 GSops/sec

It's another component operation specific to pixels, as I've already pointed out to you earlier in the thread with your 'page 3' reference. And if you're going to use 'components' again, you've missed out the 'vertices' too for the total...

Jawed said:
These are the claimed numbers for NV40.

51.2Gsops. Roughly half of what's claimed for RSX.

RSX ~ 136 shader ops per cycle ~ 136 * 0.55 GHz ~ 74.8 GSops/sec

Considering you've also missed out 'vertex' ops from the '51.2 GSop', it's nowhere near "half" of what was claimed and would in fact be similar.
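As a quick sanity check on the two figures being argued over, the arithmetic can be sketched in Python. This is an illustration only: the function name is invented, the clocks (0.4 GHz for NV40, 0.55 GHz for RSX) are the ones used in the posts, and the NV40 figure covers pixel ops only.

```python
# Hypothetical helper: billions of ops per second from ops per cycle and clock.
def gsops_per_sec(ops_per_cycle, clock_ghz):
    return ops_per_cycle * clock_ghz

nv40_pixel_only = gsops_per_sec(128, 0.40)  # NV40 PDF figure, pixel shader ops only
rsx_claimed = gsops_per_sec(136, 0.55)      # RSX claim, 136 shader ops per cycle

ratio = nv40_pixel_only / rsx_claimed

print(round(nv40_pixel_only, 1), "GSops/sec (NV40, pixel only)")
print(round(rsx_claimed, 1), "GSops/sec (RSX)")
print(round(ratio, 2), "NV40 pixel-only figure as a fraction of the RSX figure")
```

On these numbers the NV40 pixel-only figure is roughly two thirds of the RSX figure rather than half, before any vertex ops are added to the NV40 side.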

Jawed said:
How much more black and white do you want?...

Perhaps you should try disputing my numbers on the first page instead of clutching at straws and throwing random numbers into the mix. But it doesn't really matter now because you've answered my question from my previous post....


Jawed said:
...
I wonder if Xenos will be 48-way MIMD, i.e. each ALU can be running a different shader. I'm sorta doubtful, to be honest, because that's an awful lot of decode-logic overhead - though I admit to not knowing what that amounts to in percentage terms. I aint got the foggiest!

From papers I've read, they suggest nVidia would move to a complete MIMD architecture. Also for Xenos, MIMD would suit its GPGPU nature and would make sense.

Jawed said:
...
RSX and Xenos are looking as incomparable as NV30 and R300 did a few years ago.

Yep. But we have no low level details for RSX yet...they could still share similarities...

Jawed said:
...
All of this still leaves us high and dry on Cell versus XB360 CPU.

Been discussed to death on these forums...but I'm pretty clear on them...

Jawed said:
Jaws is determined to compare architectures with absolutely no regard for their respective architectures.

I suggest you read/understand the whole thread first before making any further ignorant comments! :rolleyes:
 
dukmahsik said:
I apologize for being such a noob here, but where is this article from Dave? thanks much. :LOL:

Not out yet.

I have a feeling you'll have no way of not knowing once he actually posts it. ;)
 
136 shader operations per cycle is what, exactly?

24 pixel pipelines doing 4 operations?

plus

10 vertex pipelines doing 4 operations?

Should we be making allowances for texture blending? Texture address calculation? What else?

Unluckily we have two different claims from ATI for Xenos, 48Gsops (two ops per cycle) and 120Gsops (five ops per cycle).

Which are you going to use in your comparison?

Why?

Jawed
 
In the code I linked to earlier:

http://www.beyond3d.com/forum/viewtopic.php?p=327176#327176

which in SM3 is 102 instructions, at an average of 2.2 instructions executed per cycle. A 6800 Ultra would shade 137 million pixels per second.

Assuming RSX operates in the same way, at 550MHz across 24 pipelines, this shader would shade 282 million pixels per second.

The same shader executed on Xenos would need to operate at 1.2 instructions per cycle to shade 282 million pixels per second.

But I have no idea if Xenos could run this shader at more than 1 instruction per cycle.
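The shading-rate figures above can be reproduced with a short Python sketch. It rests on several assumptions that are not spelled out in the post: a 6800 Ultra at 400 MHz with 16 pipelines, Xenos at 500 MHz with 48 ALUs each working on its own pixel, and the 46.75-cycle figure for the 102-instruction SM3 shader from the linked example. The function and constant names are made up for illustration.

```python
SHADER_INSTRUCTIONS = 102   # SM3 version of the linked shader
SHADER_CYCLES = 46.75       # cycles per pixel on an NV40-style pipeline

# Hypothetical helper: pixels shaded per second if every pipeline
# spends SHADER_CYCLES cycles on each pixel.
def pixels_per_sec(clock_hz, pipelines, cycles_per_pixel):
    return clock_hz * pipelines / cycles_per_pixel

nv40_rate = pixels_per_sec(400e6, 16, SHADER_CYCLES)  # assumed 6800 Ultra config
rsx_rate = pixels_per_sec(550e6, 24, SHADER_CYCLES)   # "operates in the same way"

# Instructions per cycle each of 48 Xenos ALUs (assumed 500 MHz) would need
# to sustain in order to match the RSX pixel rate on this shader.
xenos_ipc_needed = rsx_rate * SHADER_INSTRUCTIONS / (500e6 * 48)

print(round(nv40_rate / 1e6), "Mpixels/sec (6800 Ultra)")
print(round(rsx_rate / 1e6), "Mpixels/sec (RSX, assumed NV40-like)")
print(round(xenos_ipc_needed, 2), "instructions/cycle/ALU needed on Xenos")
```

None of this says what either chip would actually achieve; it only shows how the 137, 282 and 1.2 figures follow from the stated inputs.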

Jawed
 
Jawed said:
136 shader operations per cycle is what, exactly?

That metric represents exactly for Xenos what it represents for RSX. You will find this in the Xenon 'leak' and my calculations/links on the first page, i.e.

Jaws in the other thread said:
1 shader op per cycle ~ 1 shader execution unit

1 shader execution unit ~ vector unit or scalar unit

e.g. ALU = 1 scalar unit + 4-way SIMD unit ~ 2 shader ops per cycle

Or

Leak said:
The Xenon GPU is a custom 500+ MHz graphics processor from ATI. The shader core has 48 Arithmetic Logic Units (ALUs) that can execute 64 simultaneous threads on groups of 64 vertices or pixels. ALUs are automatically and dynamically assigned to either pixel or vertex processing depending on load. The ALUs can each perform one vector and one scalar operation per clock cycle, for a total of 96 shader operations per clock cycle. Texture loads can be done in parallel to ALU operations. At peak performance, the GPU can issue 48 billion shader operations per second.

RSX ~ 136 shop/cycle
Xenos ~ 96 shop/cycle

These numbers/metrics on their own are meaningless without further parameters. But both numbers also cross-reference with other metrics that I calculated on the first page without any conflicts. So they are consistent but need further analysis.

All this is essentially telling us (with the per CYCLE) is the number of execution units that run shaders, i.e. the number of shader execution units. It is not telling us the amount of work/computation being done per clock cycle nor the precision of the data being worked on.

E.g. it's not differentiating between 1-way, 2-way, 3-way or 4-way execution units. i.e. all of those shops/cycle can be from 136 scalar units or 136 vector units or a combination. Also these vector units can be vec(2-4) units! So we can't go into any further detail without further information.

However, from *official* MS spec from xbox.com,

MS spec said:
48-way parallel floating-point dynamically scheduled shader pipelines

Dave also mentions that they are 48 5D ALUs,

Dave said:
ALU's are 5D - Vec4+Scalar

http://www.beyond3d.com/forum/viewtopic.php?p=526112#526112

The 'leak' above also mentions the 48 ALUs consisting of a vector + scalar unit,

===> Xenos> 48 vec4 + 48 scalar units> 96 Shop/cycle

The 4-way vector components of vec4 units are not included in the definition.

===> RSX> x + y units ~ 136 Shop/cycle*

* we need more info to determine more detail...and this is deduced from the DOT products information below.


Jawed said:
24 pixel pipelines doing 4 operations?

plus

10 vertex pipelines doing 4 operations?

From this information, we get, 24+10~ 34 Dot products per cycle ~ 18.7 GDot/sec*

*A Vec4 unit is assumed to provide 1 Dot/cycle, and this means Dot products per cycle is an 'integer' number, e.g. 34 Dot/cycle.


And more importantly, falls way short of the claimed CELL+RSX ~ 51 GDot/sec

Taking the contribution of DOT products from CELL, either from 7SPUs or 7SPUs+1VMX, we get,

RSX ~ 25.4 OR 28.6 GDot/sec

Which one is accurate?

25.4/0.55 GHz ~ 46.18 Dot product/Cycle?

or

28.6/0.55 GHz ~ 52 Dot product/Cycle?


The 46.18 Dot/cycle is rejected in favor of the 52 Dot/Cycle because it's not an 'integer' from above assumption.

From our earlier definition of a Shop/cycle, this then suggests 52 Vec4 units contribute to RSX's 136 Shops/cycle.

RSX~ 52 Vec4 units + 84 units not contributing DOT products.

http://www.beyond3d.com/forum/viewtopic.php?t=23228&start=0

http://www.beyond3d.com/forum/viewtopic.php?p=531473#531473

===>RSX~ 28.6 GDot/sec

Jawed-RSX'~ 18.7 GDot/sec is way short of my (Jaws*) RSX~ 28.6 GDot/sec and does not have enough Dot product computation to match the CELL+RSX claim. Therefore 18.7 GDot/sec and its pipeline arrangement is unlikely.

* Yes, as if we don't have enough confusion, Jaws and Jawed are now officially confusing the shit out of me too (Jaws)! :p
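The dot-product reasoning can be laid out as a small Python sketch. Two inputs are assumptions that are not stated in the post: a 3.2 GHz Cell clock and one dot product per cycle per SPU or VMX unit. They are used here only because they reproduce the 25.4 and 28.6 figures. The names are invented for illustration.

```python
CLAIMED_TOTAL_GDOT = 51.0  # claimed CELL + RSX dot products, billions per second
CELL_CLOCK_GHZ = 3.2       # ASSUMPTION: not stated in the post
RSX_CLOCK_GHZ = 0.55

# Hypothetical helper: what is left for RSX once Cell's share is removed.
# ASSUMPTION: each Cell unit (SPU or VMX) contributes 1 dot product per cycle.
def rsx_share(cell_units):
    cell_gdot = cell_units * CELL_CLOCK_GHZ
    rsx_gdot = CLAIMED_TOTAL_GDOT - cell_gdot
    return rsx_gdot, rsx_gdot / RSX_CLOCK_GHZ  # (GDot/sec, Dot/cycle)

rsx_7spu = rsx_share(7)        # 7 SPUs only
rsx_7spu_vmx = rsx_share(8)    # 7 SPUs + 1 VMX

# The 24 pixel + 10 vertex pipeline guess, at 1 Dot/cycle per pipeline.
pipeline_guess_gdot = (24 + 10) * RSX_CLOCK_GHZ

print("7 SPUs:        ", round(rsx_7spu[0], 1), "GDot/sec,", round(rsx_7spu[1], 2), "Dot/cycle")
print("7 SPUs + 1 VMX:", round(rsx_7spu_vmx[0], 1), "GDot/sec,", round(rsx_7spu_vmx[1], 2), "Dot/cycle")
print("24+10 pipelines:", round(pipeline_guess_gdot, 1), "GDot/sec")
```

Under those assumptions, only the 7-SPU case gives a whole number of dot products per cycle for RSX, which is the basis for preferring 52 over 46.18.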


Jawed said:
Should we be making allowances for texture blending? Texture address calculation? What else?

From the above definition of a Shader operation, we don't include these metrics in Shop/cycle numbers. However, more than likely, these metrics have been included in the 'total system TFLOP' metric.

Jawed said:
Unluckily we have two different claims from ATI for Xenos, 48Gsops (two ops per cycle) and 120Gsops (five ops per cycle).

The 48 GShop/sec is used for Xenos here and I've cross-referenced that for validity in my calculations for 96 Shop/cycle on the first page.

The 120 GShop/sec for Xenos is greater than BOTH CELL+RSX ~ 100 GShop/sec. We can reject the 120 GShop/sec number for Xenos here for being inconsistent. Even though that '120' number is a valid number, the 'unit' of the metric is not consistent. It would be more accurate to call it Xenos~ 120 Billion component (5D) operations per second and leave out 'shader' from the metric. And also, 120*2FMADD ~ 240 GFlop/sec, (32bit because of SM3.0).
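The two Xenos figures can be put side by side in a few lines of Python. This is a sketch only: the 0.5 GHz clock and the 2 flops per FMADD follow the posts above, and the constant names are invented for illustration.

```python
XENOS_ALUS = 48
XENOS_CLOCK_GHZ = 0.5       # assumed from the leak's "500+ MHz"
UNITS_PER_ALU = 2           # one vec4 unit + one scalar unit
COMPONENTS_PER_ALU = 5      # 4 vector components + 1 scalar ("5D")
FLOPS_PER_FMADD = 2         # multiply + add

shader_gops = XENOS_ALUS * UNITS_PER_ALU * XENOS_CLOCK_GHZ          # counts units
component_gops = XENOS_ALUS * COMPONENTS_PER_ALU * XENOS_CLOCK_GHZ  # counts components
gflops = component_gops * FLOPS_PER_FMADD

print(shader_gops, "G shader ops/sec (per execution unit)")
print(component_gops, "G component ops/sec (per component)")
print(gflops, "GFlop/sec if every component op is an FMADD")
```

Both the 48 and the 120 come from the same 48 ALUs at the same clock; the only difference is whether a vec4 unit is counted once or four times.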

Jawed said:
Which are you going to use in your comparison?

If it's not obvious still, it's Xenos ~ 48 GShop/sec from the *official* spec. doc.! :p

Jawed said:
Why?

See above. You have to use consistent units of measurement when comparing throughout.

Taking this consistency, the following was derived,

RSX ~ 136 Shop/cycle ~ 52 Vec4 units + 84 units NOT contributing Dot products.

Xenos ~ 96 Shop/cycle ~ 48 Vec4 + 48 Scalar units.


Those 84 units for RSX can ALL be scalar for all we know or ALL be Vec3. So the measure of computation performed per cycle can vary. In that sense the aforementioned, 'component operation per cycle' metric will give more detail. But we don't have that for BOTH systems.


[Image: P1010276.jpg]


From what I've derived above,

RSX ~ 136 Shop/cycle ~ 52 Vec4 units + 84 units NOT contributing Dot products.

I'd be guessing now on the following. Usually, scalar units are paired with Vec units, so,

RSX ~ 136 Shop/cycle ~ 52 Vec4 + 52 Scalar + 32 Other units

32 Pixel Shaders ~ 32 Vec4 + 32 Scalar + 32 Other units*
20 Vertex Shaders ~ 20 Vec4 + 20 Scalar

*Other units can be Vec3 or Scalar etc...
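The guessed breakdown can at least be checked for internal consistency. The Python sketch below only adds up the units as listed; the layout itself is a guess from the post, not a known RSX configuration, and the dictionary structure is invented for illustration.

```python
# Guessed unit mix per shader type (from the post, not from any official spec).
guess = {
    "pixel":  {"count": 32, "vec4": 1, "scalar": 1, "other": 1},
    "vertex": {"count": 20, "vec4": 1, "scalar": 1, "other": 0},
}

# Sum each kind of unit across both shader types.
totals = {kind: sum(s["count"] * s[kind] for s in guess.values())
          for kind in ("vec4", "scalar", "other")}
total_units = sum(totals.values())

print(totals)
print(total_units, "shader execution units, i.e. Shop/cycle")
```

The totals come out as 52 vec4, 52 scalar and 32 other units, which matches the 136 Shop/cycle and the 52 Dot/cycle figures, though other layouts could match them too.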

Jawed said:
In the code I linked to earlier:

http://www.beyond3d.com/forum/viewtopic.php?p=327176#327176

which in SM3 is 102 instructions, at an average of 2.2 instructions executed per cycle. A 6800 Ultra would shade 137 million pixels per second.

Assuming RSX operates in the same way, at 550MHz across 24 pipelines, this shader would shade 282 million pixels per second.

The same shader executed on Xenos would need to operate at 1.2 instructions per cycle to shade 282 million pixels per second.

But I have no idea if Xenos could run this shader at more than 1 instruction per cycle.

Looking at that RSX image above, I don't think we can extrapolate pipelines and ISA from what we know of NV40 to RSX, any more than what we know of R420 to Xenos. In addition both Xenos and RSX are likely to have all their 'legacy' PC logic removed for consoles. E.g. SM1.0, SM2.0, etc. as they don't need to support the extra code paths like PC games. On second thoughts, not sure about Xenos now with B/C with Xbox-NV2a and PC games with XNA?

In any case, both Xenos and RSX will have assembly level, to-the-metal access on both consoles, irrespective of whether Xenos uses SM3+ or RSX uses OpenGL|ES.

Looking at the Xenon 'leak' text above, it suggests that one Xenos ALU ~ vec4 + Scalar, and those ALUs can dual issue to a Vec4 and a scalar unit. So,

Xenos ~ each ALU (vec4 + scalar) can dual issue per cycle
48*2 ~ 96 instructions per cycle
96*0.5 GHz ~ 48 Billion INSTRUCTIONS per second*

* Not SHADER ops per second and so another number to get confused with! I'll stop right here! :p
 