Running the Toy Shop demo on the 360

> "it's so EASY to understand"

Yes, and games are written to take advantage of architectural strengths. If a scene is going to cause a 70% stall or waste of resources, things are designed, especially in the console sector, to avoid such a heavy drop.

Since developers will be writing code to each GPU's strengths, the whole efficiency argument becomes less convincing.

In the end this argument is meaningless without game comparisons running on each machine.

If Xenos is so awesome, how come PGR runs at 30 fps, or you have games with aliasing? Some very real and in-your-face limits there. Try to UNIFIED SHADER that away. Unified shaders are NOT the catch-all solution to all rendering problems, especially if the game is CPU bound!!! Games are about the whole, and not just one specific component like the GPU.
 
SynapticSignal said:
to understand the numbers, the logic can be enough
in a classic SSA, in situations where pixel computing is the bound, vertex computing stalls
in situations where vertex computing is the bound, pixel computing stalls

the hit can go from 30% to 70% of units and time, depending on the shader stalls

ex. with an SSA [6 VS, 24 PS], if the 6 VS are not enough to process the data, the 24 PS can stall until the VS finish their work (70% hit)

with a USA [30 US], if 6 VS are not enough, 6 or 18 or all 24 remaining shaders will be added to the first 6, so there's no stall, full efficiency (near 100%, with 0-5% to decide how many shaders to convert to vertex computing)

it's so EASY to understand

Sorry, but you're not telling me anything new here. I understand how SSA and USA work. I've posted a whole bunch of Xenos patents here. It's the 30, 70, 95 etc. numbers being pulled out of a hat that I don't buy. Your explanations don't detail where these numbers are derived from, and neither does ATI.
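To make the disagreement concrete, here is a toy model of the quoted [6 VS, 24 PS] versus [30 US] example. It is only a sketch under loud assumptions that are not from the thread: every unit retires one work item per clock, the vertex and pixel stages overlap perfectly, and the unified pool rebalances with zero overhead. The function name and the sample workloads are made up for illustration.

```python
def static_split_utilization(vertex_work, pixel_work, vs_units=6, ps_units=24):
    """Fraction of all ALUs a fixed VS/PS split keeps busy (toy model only)."""
    # Assumption: the stages are perfectly pipelined, so the slower stage
    # alone determines how long the fixed split takes.
    static_time = max(vertex_work / vs_units, pixel_work / ps_units)
    # Assumption: a unified pool spreads all work evenly with no scheduling cost.
    unified_time = (vertex_work + pixel_work) / (vs_units + ps_units)
    # Useful work done per unit of capacity, relative to the ideal unified pool.
    return unified_time / static_time


# Hypothetical workloads (vertex items, pixel items), chosen only to show the trend.
for vertex_work, pixel_work in [(6, 24), (12, 18), (24, 6)]:
    share = static_split_utilization(vertex_work, pixel_work)
    print(f"vertex={vertex_work:2d} pixel={pixel_work:2d} -> {share:.0%} of units busy")
```

Even in this idealized model the "hit" is not a fixed 30% or 70%. It depends entirely on how far the workload mix drifts from the hardware's fixed ratio, which is why the source of the quoted percentages matters.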


SynapticSignal said:
so, here you're the king of the truth just because you have so many posts and because I only started posting here today?
if this is your logic, then I know why you don't understand the USA

and if one says that you seem very biased, he is trolling?

please grow up, and stop the discussion here, I'm not interested in personal attacks or "word fights"

No. A word of advice, you don't join a new board and start accusing me of ATI hate, MS hate and being truly biased without any foundation, implied or otherwise. If you want any constructive discussion here, that's the worst introduction you can make.
 
EDGE said:
If Xenos is so awesome, how come PGR runs at 30 fps, or you have games with aliasing? Some very real and in-your-face limits there. Try to UNIFIED SHADER that away. Unified shaders are NOT the catch-all solution to all rendering problems, especially if the game is CPU bound!!! Games are about the whole, and not just one specific component like the GPU.

It's not really fair to attribute such shortcomings to hardware when we are talking about launch titles. This naturally applies to both Xbox360 and PS3 titles.
 
Jawed said:
Oh, what magic would that be? Why is the RCP irrelevant?

Jawed

It's not irrelevant. The total is 4 FP32 instruction slots. What's irrelevant is the 'type', as the total is still 4, whether RCP is included or not.
 
Jaws said:
The peak would still be 4 as with G70.


Guys, you really need to understand what the real-world scheduling restrictions are before you start throwing meaningless numbers around.

How many input and output registers can you reference in a single clock on NV40 or X1800 for that matter?

There will be a limited number of register ports per instruction slot.

It's not very useful if you can do 5 ops on 3 registers (not that I know the real number).
 
ERP said:
Guys, you really need to understand what the real-world scheduling restrictions are before you start throwing meaningless numbers around.

How many input and output registers can you reference in a single clock on NV40 or X1800 for that matter?

There will be a limited number of register ports per instruction slot.

It's not very useful if you can do 5 ops on 3 registers (not that I know the real number).

That's why that code snippet I linked earlier runs 35% slower in FP32 rather than mixed FP16/32 (or pure FP16). That's the point I'm making.

I'm just showing how real world code falls far short in utilisation of the pipeline. Whether it's because the pipeline can't issue certain combinations of instructions, or because the register bandwidth isn't there to support all the operands that could be issued.

Instead of running in 21 cycles if you take "5 instructions per clock as the peak rate", it runs in 47 in mixed mode, or 65 if FP32.

Jawed
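The cycle counts quoted in the post above can be turned into utilisation figures directly. This is just the arithmetic on the three numbers given there (21, 47 and 65 cycles). The function name is made up for illustration, and "utilisation" here means nothing more than peak-rate cycles divided by reported cycles.

```python
PEAK_CYCLES = 21  # quoted figure, assuming "5 instructions per clock" as the peak rate
REPORTED_CYCLES = {"mixed FP16/32": 47, "FP32": 65}  # quoted figures


def utilisation(peak_cycles, reported_cycles):
    """Share of the claimed peak issue rate that the shader actually achieves."""
    return peak_cycles / reported_cycles


for mode, cycles in REPORTED_CYCLES.items():
    print(f"{mode}: {utilisation(PEAK_CYCLES, cycles):.1%} of peak")

# Ratio of the two reported cycle counts, FP32 relative to mixed mode.
ratio = REPORTED_CYCLES["FP32"] / REPORTED_CYCLES["mixed FP16/32"]
print(f"FP32 takes {ratio:.2f}x the cycles of mixed mode")
```

On these numbers the shader reaches well under half of the claimed peak in either mode, which is the gap between peak and real-world issue rates the post is pointing at.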
 
Jawed said:
From GPU Gems, since you're hard of reading:

An independent reciprocal operation can be performed in parallel with the multiply, MAD, and fp16 normalization described previously.

Page 16:

http://www.hotchips.org/archives/hc16/2_Mon/13_HC16_Sess3_Pres1_bw.pdf

Jawed

Sorry, couldn't find that quote on that page 16.

It would be something like vec3+scalar, vec3+scalar, i.e. 4D/alu and RCP would be part of scalar op. Sorry, don't see 5 FP32 still...which would be a feat...
 
According to NVShaderPerf, on G70 using half or full float registers doesn't make any difference at all most of the time.
I think from this point of view G70 is way better than NV40.
 
ERP said:
Guys, you really need to understand what the real-world scheduling restrictions are before you start throwing meaningless numbers around.

How many input and output registers can you reference in a single clock on NV40 or X1800 for that matter?

There will be a limited number of register ports per instruction slot.

It's not very useful if you can do 5 ops on 3 registers (not that I know the real number).

Well true, I started a thread on this a while ago and no one got to the bottom of it, so I suppose this search continues...
 
Jaws said:
Sorry, couldn't find that quote on that page 16.

That's because it comes from:

http://download.nvidia.com/developer/GPU_Gems_2/GPU_Gems2_ch30.pdf

Which I posted earlier, referring to page 17.

> It would be something like vec3+scalar, vec3+scalar, i.e. 4D/alu and RCP would be part of scalar op. Sorry, don't see 5 FP32 still...which would be a feat...

What are you going on about? An instruction is an instruction, whether it's 1D, 2D, 3D or 4D.

Jawed
 
nAo said:
According to NVShaderPerf, on G70 using half or full float registers doesn't make any difference at all most of the time.
I think from this point of view G70 is way better than NV40.

So on G70 can you dual-issue two MADs each of which has independent FP32 source operands (6 in total)?

Can you, as a matter of interest, take the HLSL from:

http://www.beyond3d.com/forum/showpost.php?p=294006&postcount=99

And report the shader performance stats for SM3 on G70?

Jawed
 
Jawed said:
That's because it comes from:

http://download.nvidia.com/developer/GPU_Gems_2/GPU_Gems2_ch30.pdf

Which I posted earlier, referring to page 17.


What are you going on about? An instruction is an instruction, whether it's 1D, 2D, 3D or 4D.

Jawed

Vec3+scalar, vec3+scalar (4D/alu) are the execution units with the instruction slots, with 4 FP32 slots.

Or

vec2+vec2, vec2+vec2 (4D/alu) are the execution units with the instruction slots, with 4 FP32 slots.

The RCP would be a scalar instruction.
 
Edge said:
> If Xenos is so awesome, how come PGR runs at 30 fps, or you have games with aliasing? Some very real and in-your-face limits there. Try to UNIFIED SHADER that away. Unified shaders are NOT the catch-all solution to all rendering problems, especially if the game is CPU bound!!! Games are about the whole, and not just one specific component like the GPU.

30 FPS with full HDR and full AA, cities the size of NYC and London, full pixel motion blur on the sides, and all that in a first-generation game. Unified shaders will take a little time to get used to, and the results will show by next year.
 
But why is an RCP irrelevant? Just because it's scalar doesn't mean that it's not part of the fragment pipeline's capabilities.

The vec3+scalar instructions (that's two instructions executed by one ALU) count, so what's the problem with counting RCP?

Jawed
 
In simple terms, Sony says that with Cell and RSX, the PS3 can do a theoretical peak of 51 billion shader ops per second. With a G70 architecture efficiency of 60-70% at most, that would account for real-life performance of 30-35 billion shader ops per second. ATI has said it achieves 48 billion shader ops per second with the GPU alone at 95% efficiency; that's 45 billion shader ops per second in real-life performance.
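The arithmetic behind those figures is simply peak rate times an efficiency factor. The sketch below reproduces it using only the numbers in the post. The function name is made up for illustration, and the efficiency percentages are the poster's claims (disputed earlier in the thread), not measurements.

```python
def sustained_gops(peak_gops, efficiency):
    """Sustained shader ops (billions/s), assuming a single flat efficiency factor."""
    return peak_gops * efficiency


# Claimed figures from the post: 51 G ops/s at 60-70%, 48 G ops/s at 95%.
ps3_low = sustained_gops(51, 0.60)
ps3_high = sustained_gops(51, 0.70)
xenos = sustained_gops(48, 0.95)

print(f"Cell + RSX: {ps3_low:.1f} to {ps3_high:.1f} billion shader ops/s")
print(f"Xenos:      {xenos:.1f} billion shader ops/s")
```

The result is only as good as the efficiency factors fed in, so the comparison stands or falls with the 60-70% and 95% claims themselves.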
 