XGPU real shader ops

SentinelQW

Newcomer
The XGPU (Xenos/C1) is rated at 48 billion shader operations per second.
But what is its real value?

http://media.teamxbox.com/dailyposts/xbox360/specs/gpu_sops_xbox360vsps3.jpg

"Todd Holmdahl: The unified shader model is definitely a win due to the flexible architecture. Most of the rendering will be done on the GPU, and that’s where we have a clear advantage. Also, if you take into account the simultaneous texture fetches, control flow operations and programmable vertex fetch operations, you get 80 billion operations per second. And this doesn’t even take into account work that could be done on our 3 CPU cores."

What's your opinion?
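For reference, here is a minimal sketch of the arithmetic that most plausibly produces the 48 billion figure. The 48-ALU count, the vector+scalar co-issue, and the 500 MHz clock are assumptions drawn from commonly published Xenos specs, not from the quote above:

```python
# Back-of-the-envelope check of the "48 billion shader ops/sec" claim.
# Assumed (not stated in the post): 48 unified shader ALUs, each
# co-issuing one vector op and one scalar op per clock, at 500 MHz.
alus = 48
ops_per_alu_per_clock = 2        # 1 vector + 1 scalar co-issue (assumed)
clock_hz = 500e6                 # assumed Xenos core clock

shader_ops_per_sec = alus * ops_per_alu_per_clock * clock_hz
print(f"{shader_ops_per_sec / 1e9:.0f} billion shader ops/sec")  # -> 48
```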
 
I believe some of the devs (specifically nAo) have noted shader ops are NOT a relevant benchmark.

What power it has and how efficiently the chip uses that power is what matters most. This is why there are benchmarks on the PC. A P4 should kick an Athlon 64's butt... but it doesn't. The devil is in the details: you can have a chip like NV30 that has some genuinely great features, but a couple of bad missteps and the entire thing falls apart.

There are no benchmarks on consoles unfortunately, so the real measure will be what the final products look like. And due to the nature of launches (cutting-edge hardware arriving near the launch window) it will be another year before we get a real feel for the power of Xenos (i.e. we will have to wait until fall 2006, and more likely fall 2007, before we see a game designed from the ground up with Xenos in mind from day one... it will take time for devs to find the sweet spot, seek out all the pitfalls and bottlenecks, and design a game/art around the strengths while avoiding the weaknesses).

Overall the two GPUs look pretty competitive. The question is how the unified shader architecture really affects the end product. If it is 95% efficient, as ATI claims, it could be a little faster; if it is 60-70%, it will probably be slower... but there will be a lot of exceptions and special cases... and devs will design with that in mind.

Edit: P.S. The spec revising gets old after a bit, IMO.
 
Acert93 said:
There are no benchmarks on consoles unfortunately, so the real measure will be what the final products look like.

Is that really a bad thing? I mean really, think about it. What matters in the end? That PS3 might push more polygons? That X360 might do AA faster?

When people really start measuring those things, consoles won't be consoles anymore; they'll be PCs. And we already have PCs, with ATI and NVIDIA to test out and Intel and AMD to benchmark.

Why can't we just leave consoles alone? They are closed systems, incompatible with each other, so any benchmark would not only be invalid (no "multiplatform" benchmark would ever get top performance from both, rendering the result meaningless) but completely useless, since in the end consoles are for pushing the disc in, starting to play, having fun. Over.
All this mumbo jumbo about what's better and what's worse is SO last millennium; it reminds me of those dumb SNES vs GENESIS wars. In the end, who really cares which hardware is best? Each had advantages and disadvantages compared to the other, and both had amazing games available to them.

This technical forum shouldn't be about "benchmarking consoles". It should be about analysing each architecture on its own terms and comparing how it works with other, very different architectures.

There will never be a 3DMarkConsoles. And there shouldn't be. It already sparks immense flamewars on PCs due to the way it tests different GPUs - which are compatible with each other - I can only imagine what would happen to a console version of it.
 
In theory, the shader op count of the NV2A (not NVFLOPS) is 48 operations per clock cycle: 16 for the VS and 32 for the PS.
 
Ouch! I see now that you are talking about Xenos, not the NV2A.

1. All the shader ALUs on Xenos are identical.
2. Each shader ALU on Xenos can do 2 scalar FP ops and 8 vector FP ops, which is 10 ops per clock cycle per ALU.
3. This is only the shader part; the non-programmable parts aren't counted here.

The total comes to 240 GFLOPS.
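A quick sanity check of that figure, assuming the commonly cited 48 ALUs and 500 MHz core clock (neither is stated in the post itself):

```python
# 240 GFLOPS check: 48 identical ALUs, each doing a vec4 MADD (8 FLOPs)
# plus a scalar MADD (2 FLOPs) per clock, at an assumed 500 MHz.
alus = 48                          # assumed ALU count
flops_per_alu_per_clock = 8 + 2    # 8 vector + 2 scalar FLOPs (from the post)
clock_hz = 500e6                   # assumed core clock

gflops = alus * flops_per_alu_per_clock * clock_hz / 1e9
print(f"{gflops:.0f} GFLOPS")      # -> 240
```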
 
london-boy said:
Why can't we just leave consoles alone? They are closed systems, incompatible with each other, so any benchmark would not only be invalid (no "multiplatform" benchmark would ever get top performance from both, rendering the result meaningless) but completely useless, since in the end consoles are for pushing the disc in, starting to play, having fun. Over.
Comparing platforms will always be a pastime of fora, but thankfully all this background 'hostility' goes unnoticed by the masses who buy and play consoles. Most people never walk into a shop to buy a gaming system with its performance in mind; the deciding factors are numerous and varied.

Though a console benchmark would be fuel for the f-b's fire and cause lots of arguments, and such a benchmark might be impossible anyway, it would be nice from a learning perspective to see how the different architectures cope with different problems. On screen we might not see any difference between two platforms, but behind the scenes it'd be good to see how some efficiencies are achieved and how they contribute. E.g. a screen full of enemies (something ubiquitous to next-gen games, I fear...). If one platform benchmarks twice the baddies of the other, it makes no odds to the visuals as there are already more than anyone can count, but it might show how one architecture copes with AI or LOD or procedural animation better than the other. And that knowledge would hopefully go on to inform the next generation of systems. As it is, systems seem to be designed on theories, each generation considering the pitfalls of its predecessors and not the successes of its rivals. Maybe.
 
Freaking forum ate my reply again...

Anyhow, what Shifty said. While I agree with you L-B (that is why I offered the "wait for the games" option and not something else... like "let's look at the memory bandwidth!"), from a development standpoint it is important to benchmark to identify any bottlenecks and really map out where the hardware is strong and where it is weak, and then 1. design an engine to the strengths and 2. pick an art/theme that maximizes the look within those limits.

But none of this changes the brand. And the fact is ATI and NV are the best at what they do. All we are seeing is different philosophies... with 10th-generation (I exaggerate, but not much) GPUs they are pretty familiar with what works and what does not. The gap will be a lot smaller than between the GS and NV2A, and we saw what devs could do there. So in the end it comes down to the games... which is a good thing :D
 
I think "simultaneous texture fetches, control flow operations and programmable vertex fetch operations" are the great hidden features of Xenos that guild the lily, as it were.

You can't quantify these things so easily.

For example, texture fetches in RSX will always be painfully slow in comparison - but how slow depends on the format of the textures.

Also, control flow operations in RSX will be out of bounds because they are impractically slow - whereas in Xenos they'll be the bread and butter of good code because there'll be no performance penalty.

Dependent texture fetches in Xenos (I presume that's what the third point means), will work without interrupting shader code - again RSX simply can't do this, dependent texturing blocks one ALU per pipe.
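To make these claims concrete, here is a toy utilisation model, purely illustrative and with made-up numbers: it simply encodes Jawed's claim that an RSX/G70-style pipe loses one ALU issue slot per texture fetch, while a Xenos-style design with decoupled texture pipes ideally loses none:

```python
# Toy model of ALU issue slots lost to texture fetches. Not measured
# figures; this just encodes the "blocks one ALU per pipe" claim above.
def rsx_style_alu_issues(issue_slots, tex_fetches):
    return issue_slots - tex_fetches   # each fetch occupies a shared ALU slot

def xenos_style_alu_issues(issue_slots, tex_fetches):
    return issue_slots                 # fetches run on separate texture pipes

slots, fetches = 1000, 250
print("RSX-style:  ", rsx_style_alu_issues(slots, fetches))    # 750
print("Xenos-style:", xenos_style_alu_issues(slots, fetches))  # 1000
```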

I don't think it's worth trying to codify these differences in shader ops - that's just a crowd-pleaser technique. But these three points above all represent serious advantages for Xenos, and are all founded in the way that shader operations "are good to their neighbours" in Xenos, but not so in RSX.

Jawed
 
This exact topic has been discussed quite a few times over the last few months.
Before posting anything, remember that the Search function is your friend.
 
Acert93 said:
(i.e. we will have to wait until fall 2006, and more likely fall 2007, before we see a game designed from the ground up with Xenos in mind from day one... it will take time for devs to find the sweet spot, seek out all the pitfalls and bottlenecks, and design a game/art around the strengths while avoiding the weaknesses).

Indeed. I agree with you, Acert93.

We should not expect to see what Xenos can really do in games coming out in 2005 and H1 2006; that's just not realistic.
 
RSX VS: 2 issues * 8 VS = 16 vertex shader ops
RSX PS: 4 issues * 24 PS = 96 pixel shader ops

RSX VS + PS = 112 shader ops * 550 MHz = 61.6 billion shader ops/second
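Reproducing that arithmetic, with the issue counts and 550 MHz clock taken from the post as given (they are the poster's figures, not verified specs):

```python
# RSX shader-op arithmetic exactly as posted above.
vs_ops = 2 * 8      # 2 issues per clock * 8 vertex shader units = 16
ps_ops = 4 * 24     # 4 issues per clock * 24 pixel shader units = 96
clock_hz = 550e6    # clock figure from the post

total = (vs_ops + ps_ops) * clock_hz
print(f"{total / 1e9:.1f} billion shader ops/sec")  # -> 61.6
```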

vertex fetch: RSX = 8; C1 = 16

texture fetch: RSX = 24; C1 = 16

And the SPEs of CELL seem to be able to do texture/vertex fetches.
 
shader ops are evil

shader ops are a meaningless performance metric ;)

And the SPEs of CELL seem to be able to do texture/vertex fetches.
SPEs can do anything; that doesn't mean they're good/fast at every task.
 
RSX PS: 4 issues * 24 PS = 96 pixel shader ops
It's 5 issues for the PS, if we're to believe nVidia: 2x vector + 2x scalar + 1x normalize.
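For comparison, redoing the earlier arithmetic with this 5-issue figure, keeping everything else from the previous post (so still the poster's unverified unit counts and clock):

```python
# Same calculation as before, but with 5 PS issues per clock
# (2x vector + 2x scalar + 1x normalize, per this post).
vs_ops = 2 * 8      # unchanged from the earlier post
ps_ops = 5 * 24     # 5 issues per clock * 24 pixel pipes = 120
clock_hz = 550e6

total = (vs_ops + ps_ops) * clock_hz
print(f"{total / 1e9:.1f} billion shader ops/sec")  # -> 74.8
```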

And the SPEs of CELL seem to be able to do texture/vertex fetches.
Mmmmm.......... I'd look upon this in a more skeptical light. Yeah, it can fetch the data from memory, but it's not like it's got specialized hardware to fetch with filtering and unpacking from several texel formats, or the SMT to use TLP to cover for the latencies of having to read from whatever storage. SPEs doing more basic vertex processing that doesn't involve any sort of texture fetches, so that you can save the GPU some effort, would certainly be feasible. You can just give each SPE a packet of verts and transforms and have it run a job.
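As a rough illustration of that "packet of verts" idea, here is a minimal sketch in plain Python; the packet size, the transform, and all function names are illustrative stand-ins, not anything from the Cell SDK:

```python
# Split a vertex array into fixed-size packets, transform each packet
# independently (as an SPE job would), and gather the results.
def transform(vert, m):
    """Apply a 4x4 matrix m to a 4-component vertex."""
    x, y, z, w = vert
    return tuple(m[r][0]*x + m[r][1]*y + m[r][2]*z + m[r][3]*w
                 for r in range(4))

def run_jobs(verts, m, packet_size=1024):
    out = []
    for start in range(0, len(verts), packet_size):
        packet = verts[start:start + packet_size]    # one SPE job's worth
        out.extend(transform(v, m) for v in packet)  # no texture fetches needed
    return out

identity = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
print(run_jobs([(1.0, 2.0, 3.0, 1.0)], identity))
```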
 
For example texture fetches in RSX will always be painfully slow in comparison - but how slow depends on the format of the textures.
AFAIK, the FP texture fetch speed of G70 is much faster than NV40's, and the 1D FP texture fetch bandwidth of NV40 is four times that of R420, while the 2D/3D/4D FP texture fetch bandwidth is the same as R420's.

We don't know how fast C1's texture fetching will be; it can fetch textures from the L2 cache, the 128-bit UMA and the eDRAM. But I think it cannot compare with RSX when it has to fetch textures from main memory, because RSX can fetch textures from both its GDDR3 local memory and the XDR main memory.

Also, control flow operations in RSX will be out of bounds because they are impractically slow - whereas in Xenos they'll be the bread and butter of good code because there'll be no performance penalty.
The latency of G70's pixel shader dynamic branching is not too bad; in fact, it depends on how you use it.

From NVIDIA's GPU Programming Guide:

"Use dynamic branching when the branches will be fairly coherent. As mentioned in Section 4.1.3, dynamic branching can make code faster and easier to implement. But in order for it to work optimally, branches should be fairly coherent (for example, over regions of roughly 30 x 30 pixels)."

The large number of threads on C1 may help with this, but we don't have any performance numbers for it, so I doubt that "there'll be no performance penalty".

Dependent texture fetches in Xenos (I presume that's what the third point means), will work without interrupting shader code - again RSX simply can't do this, dependent texturing blocks one ALU per pipe.
If the application stresses texture operations, I think the bottleneck will more likely be the 128-bit UMA. If the application stresses ALU operations, it is not a big problem.
 
We don't know how fast C1's texture fetching will be; it can fetch textures from the L2 cache, the 128-bit UMA and the eDRAM
It can't fetch textures from eDRAM - this is basically a write-only space. If it's rendering to a texture then the surface is written to eDRAM, then resolved out to main memory before it can be read back as a texture.
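A minimal sketch of the round trip Dave describes, with stub functions standing in for the hardware steps (none of these names come from a real API):

```python
# Xenos render-to-texture flow: eDRAM is effectively write-only to
# shaders, so a surface must be resolved to main memory before sampling.
def render_to_edram(draw_calls):
    return {"surface": draw_calls}       # GPU writes the surface into eDRAM

def resolve(edram_surface):
    return dict(edram_surface)           # copy out to main memory

def render_to_texture(draw_calls):
    surface = render_to_edram(draw_calls)
    return resolve(surface)              # only now readable as a texture

print(render_to_texture(["pass1"]))
```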

The large number of threads on C1 may help with this, but we don't have any performance numbers for it
Xenos works with batches of 64 pixels, so that's the maximum branch cost.
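A toy model of why that batch size matters, assuming (per NVIDIA's guideline quoted above) that branch decisions are coherent over roughly 30x30-pixel regions, and taking ~1024 pixels as a stand-in for a G70-class batch (a figure floated later in the thread and disputed there, not a confirmed spec):

```python
# A batch that straddles a coherence boundary contains both taken and
# not-taken pixels, so it must execute both sides of the branch.
def divergence_rate(batch, run):
    """Approximate fraction of batches straddling a coherence boundary."""
    return min(1.0, (batch - 1) / run)

run = 30 * 30  # ~900-pixel coherent regions, per NVIDIA's guideline
for batch in (64, 1024):   # Xenos batch size vs. a rough G70-class figure
    print(f"batch {batch:4d}: ~{divergence_rate(batch, run):.0%} diverge")
```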
 
cho said:
AFAIK, the FP texture fetch speed of G70 is much faster than NV40's, and the 1D FP texture fetch bandwidth of NV40 is four times that of R420, while the 2D/3D/4D FP texture fetch bandwidth is the same as R420's.
There's as much as a 4-cycle overhead for texture fetches in NVidia hardware, as far as I can tell.

The latency of G70's pixel shader dynamic branching is not too bad; in fact, it depends on how you use it.

From NVIDIA's GPU Programming Guide:

"Use dynamic branching when the branches will be fairly coherent. As mentioned in Section 4.1.3, dynamic branching can make code faster and easier to implement. But in order for it to work optimally, branches should be fairly coherent (for example, over regions of roughly 30 x 30 pixels)."
1000-pixel batches aren't small enough, it seems :)

If the application stresses texture operations, I think the bottleneck will more likely be the 128-bit UMA. If the application stresses ALU operations, it is not a big problem.

You should read up on the stall-less (for both ALUs and TMUs), maximum-utilisation scheduling of Xenos.

Jawed
 
Jawed said:
There's as much as a 4-cycle overhead for texture fetches in NVidia hardware, as far as I can tell.
Where did you get this?

1000-pixel batches aren't small enough, it seems :smile:
G70/RSX use smaller batches
 