Running the Toy Shop demo on the 360

nAo · Oct 20, 2005

Jawed said:
So on G70 can you dual-issue two MADs each of which has independent FP32 source operands (6 in total)?

I believe it can't

Can you, as a matter of interest, take the HLSL from:

http://www.beyond3d.com/forum/showpost.php?p=294006&postcount=99

And report the shader performance stats for SM3 on G70?

You don't need an NVIDIA card to do this; I'm sitting in an internet café right now.

Here are the results for NV40:

Code:

-------------------- NV40 --------------------
Target: GeForce 6800 Ultra (NV40) :: Unified Compiler: v77.72
Cycles: 38.75 :: R Regs Used: 5 :: R Regs Max Index (0 based): 4
Pixel throughput (assuming 1 cycle texture lookup) 168.42 MP/s
=========================================================================

Shader performance using all FP16
Cycles: 37.75 :: R Regs Used: 4 :: R Regs Max Index (0 based): 3
Pixel throughput (assuming 1 cycle texture lookup) 172.97 MP/s
=========================================================================

Shader performance using all FP32
Cycles: 55.50 :: R Regs Used: 6 :: R Regs Max Index (0 based): 5
Pixel throughput (assuming 1 cycle texture lookup) 116.36 MP/s

Newer drivers are doing a much better job...

G70:

Code:

-------------------- G70 --------------------
Target: GeForce 7800 GTX (G70) :: Unified Compiler: v77.72
Cycles: 36.00 :: R Regs Used: 4 :: R Regs Max Index (0 based):
Pixel throughput (assuming 1 cycle texture lookup) 286.67 MP/s
===============================================================

Shader performance using all FP16
Cycles: 33.00 :: R Regs Used: 4 :: R Regs Max Index (0 based):
Pixel throughput (assuming 1 cycle texture lookup) 312.73 MP/s
===============================================================

Shader performance using all FP32
Cycles: 50.50 :: R Regs Used: 6 :: R Regs Max Index (0 based):
Pixel throughput (assuming 1 cycle texture lookup) 206.40 MP/s

On this shader G70 is 'only' about 10% faster than NV40 at full precision, clock for clock and per pipeline, and using half precision does still make a big difference.
In the general case (I tested a LOT of shaders from NVIDIA demos and some games) there's not that much difference going from partial to full precision, since many shaders don't use more than 4 live registers.
You can do your own statistics: install nvshaderperf on your computer and have fun!
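For anyone checking the numbers: the reported throughput figures look consistent with pipelines × clock / whole cycles. A minimal sketch, assuming 16 pipelines at 400 MHz for the 6800 Ultra and 24 pipelines at 430 MHz for the 7800 GTX (neither figure appears in the tool output), and assuming the tool drops the fractional part of the cycle count before dividing:

```python
import math

# Assumed hardware figures, not taken from the tool output above.
CHIPS = {
    "NV40": {"pipes": 16, "clock_mhz": 400.0},
    "G70": {"pipes": 24, "clock_mhz": 430.0},
}


def pixel_throughput(chip, cycles):
    """Estimated MP/s, assuming the cycle count is truncated to a whole number."""
    spec = CHIPS[chip]
    return spec["pipes"] * spec["clock_mhz"] / math.floor(cycles)


# (chip, reported cycles, reported MP/s) copied from the results above.
REPORTED = [
    ("NV40", 38.75, 168.42),
    ("NV40", 37.75, 172.97),
    ("NV40", 55.50, 116.36),
    ("G70", 36.00, 286.67),
    ("G70", 33.00, 312.73),
    ("G70", 50.50, 206.40),
]

for chip, cycles, reported in REPORTED:
    model = pixel_throughput(chip, cycles)
    print(f"{chip} {cycles:5.2f} cycles: model {model:.2f} MP/s, reported {reported:.2f} MP/s")
```

Under those assumptions all six reported figures are reproduced to two decimals, so the MP/s lines add no information beyond the cycle counts.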

ciao,
Marco

Powderkeg · Oct 20, 2005

scooby_dooby said:
Xenos has more shader power than 1800XT, ATI has said so. RSX is a modified G70, by the numbers we've seen it looks to be overclocked to 550mhz. Therefore, by comparing 1800XT to an overclocked G70 we can get some idea of the capabilities of Xenos compared to RSX.

And by your own admission your 1800XT numbers will be too low. Do you know precisely how low, or are you just going to take a guess?

Not only that, but this should give us insight into how well the 48 USA's compare to conventional pipes as far as shading power no?

It would if the R520 core had unified shaders, but it doesn't.

scooby_dooby · Oct 20, 2005

At least you would now be able to say that the 48 unified shaders are more powerful (we don't know by how much) than the conventional 16 pixel/8 vertex shaders found in the X1800XT, which also runs at over 600 MHz, I believe. That's according to ATI.

So it speaks to how powerful the USA is: we know it's more powerful, not less, which is more than we knew before. We also know they're not 'a lot' more powerful, they're only 'slightly'.

Jawed · Oct 20, 2005

nAo said:
You don't need an NVIDIA card to do this; I'm sitting in an internet café right now.
[...]
You can do your own statistics: install nvshaderperf on your computer and have fun!

Cheers. I shall have a play tomorrow :!:

I'd downloaded RenderMonkey earlier to see if I could get any idea what ATI cards might do with this code.

But my Radeon 32MB SDR is just too dark-ages, can't create a preview (even when I set it to REF rasteriser - do I need to install something else?). And, anyway, RenderMonkey doesn't look like it'll provide any kind of performance data...

The newer drivers (do you have to have them installed?) really cut through, 17% for mixed, 14% for FP32.

G70's 7-9% improvement over NV40 implies a few dual-issues (FP32 case has same register count on NV40 and G70) where before only single issues were possible - presumably along the lines of MAD+MAD or MAD + ADD or ADD + ADD dual-issues. I will have a rummage through the code produced tomorrow...
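The "7-9%" here and nAo's "10%" can both be derived from the same cycle counts, depending on whether the change is expressed as cycles saved or as speedup. A small sketch using only the cycle figures from nAo's post (the function names are just illustrative):

```python
# (NV40 cycles, G70 cycles) per precision mode, from the nvshaderperf results quoted in the thread.
CYCLES = {
    "mixed": (38.75, 36.00),
    "fp16": (37.75, 33.00),
    "fp32": (55.50, 50.50),
}


def cycle_reduction(nv40, g70):
    """Fraction of NV40's cycles that G70 saves."""
    return (nv40 - g70) / nv40


def speedup(nv40, g70):
    """Per-pipeline, per-clock speedup of G70 over NV40."""
    return nv40 / g70 - 1.0


for mode, (nv40, g70) in CYCLES.items():
    print(f"{mode}: {cycle_reduction(nv40, g70):.1%} fewer cycles, {speedup(nv40, g70):.1%} faster")
```

Measured as cycles saved, the mixed and FP32 cases come out at roughly 7% and 9%; measured as speedup, the FP32 case is just under 10%.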

Jawed

j^aws · Oct 20, 2005

Jawed said:
But why is a RCP irrelevant? Just because it's scalar doesn't mean that it's not part of the fragment pipeline's capabilities.

The vec3+scalar instructions (that's two instructions executed by one ALU) count, so what's the problem with counting RCP?

Jawed

I've already stated that it's not irrelevant, nor is there a problem with counting it. It would still be counted in the 4 FP32 slots if it's being issued. What's 'irrelevant' is the 'type' of FP32 instruction, whether RCP or not...

Xenus · Oct 20, 2005

I believe what you're trying to say is that there are only 4 FP32 instruction slots. So it doesn't matter if it can do 5. It is limited by input, not by the number it can run simultaneously.

By the way, I have no idea what I'm talking about; I'm just trying to translate Jaws.

Jawed · Oct 20, 2005

There are five FP32 instructions possible in one clock, including RCP - two dual-issued, co-issued combinations on the ALUs plus an independent RCP. That's the end of it. Stop saying that there are only four instructions possible.

You're flying in the face of multiple documents.

Jawed

Jawed · Oct 21, 2005

Xenus said:
I believe what you're trying to say is that there are only 4 FP32 instruction slots. So it doesn't matter if it can do 5. It is limited by input, not by the number it can run simultaneously.

I think he's confused by the vec2+vec2 or vec3+scalar or scalar+scalar capability of each ALU and thinks that everything has to be processed by the ALUs to the exclusion of all else.

Jawed

j^aws · Oct 21, 2005

Jawed said:
I think he's confused by the vec2+vec2 or vec3+scalar or scalar+scalar capability of each ALU and thinks that everything has to be processed by the ALUs to the exclusion of all else.

Jawed

No, the document is misleading. From the link on p17,

"In addition, a multifunction unit that performs complex operations can replace the alpha channel MAD operation."

This is the SFU with RCP. So the following,

vec3-MAD + RCP, vec2-MUL+vec2-MUL

still satisfies,

"An independent reciprocal operation can be performed in parallel with the multiply, MAD, and fp16 normalization described previously."

...multiply, MAD and fp16 normalization plus RCP are still satisfied by 4 FP32 slots.

However, I can see how you can interpret that differently as 5 FP32...

Jawed · Oct 21, 2005

Yes, I think you're right, the first ALU can change from, e.g. vec3+scalar MAD (2 instructions) into vec3 MAD + RCP. Or vec2 MAD + RCP.

Well, I'm sorry about chasing the RCP to death.

Anyway, back at square one, this discussion illustrates the gap between peak capability and what actually results in more complex shaders.
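The slot accounting the two posters converge on can be written down as a toy model. This is only a sketch of the argument in this thread, not a description of the actual hardware scheduler: it assumes two 4-channel ALUs per pipeline, at most two co-issued instructions per ALU, and an RCP that occupies the scalar (alpha) slot of the first ALU.

```python
ALU_WIDTH = 4      # channels per shader ALU (assumption of this toy model)
MAX_CO_ISSUE = 2   # at most two instructions share one ALU per clock


def issue(alu_ops):
    """Validate one ALU's per-clock mix of (name, width) ops and return how many it issues."""
    if len(alu_ops) > MAX_CO_ISSUE:
        raise ValueError("too many co-issued instructions for one ALU")
    if sum(width for _, width in alu_ops) > ALU_WIDTH:
        raise ValueError("instruction mix needs more channels than the ALU has")
    return len(alu_ops)


def per_clock(alu1_ops, alu2_ops):
    """Total FP32 instructions issued by both ALUs in one clock."""
    return issue(alu1_ops) + issue(alu2_ops)


# vec3 MAD + scalar MAD on the first ALU, vec2 MUL + vec2 MUL on the second.
baseline = per_clock([("MAD", 3), ("MAD", 1)], [("MUL", 2), ("MUL", 2)])

# Same clock, but the RCP takes the place of the scalar (alpha) MAD.
with_rcp = per_clock([("MAD", 3), ("RCP", 1)], [("MUL", 2), ("MUL", 2)])

print(baseline, with_rcp)
```

In this model the RCP replaces an instruction rather than adding one, so both configurations issue four FP32 instructions per clock.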

Jawed

j^aws · Oct 21, 2005

Jawed said:
Yes, I think you're right, the first ALU can change from, e.g. vec3+scalar MAD (2 instructions) into vec3 MAD + RCP. Or vec2 MAD + RCP.

Well, I'm sorry about chasing the RCP to death.

No probs!

Jawed said:
Anyway, back at square one, this discussion illustrates the gap between peak capability and what actually results in more complex shaders.

Jawed

Well, just as CPUs evolved from RISC and CISC into hybrids, I'm interested to see how GPUs will evolve in this area. NV seems more CISC-like whilst ATI is more RISC-like. But that's a discussion for another day so...

Goodnight!

Nfactor · Oct 21, 2005

pakpassion said:
In simple terms, Sony says that with Cell and RSX the PS3 can do a theoretical peak of 51 billion shader ops per second, and with a G70 architecture efficiency of 60-70% at most, that would account for real-life performance of 30-35 billion shader ops per second. ATI has said it achieves, with the GPU alone, 48 billion shader ops per second at 95% efficiency; that's 45 billion shader ops per second in real-life performance.

At E3, Nvidia said in its presentation on stage that Cell and RSX together can have a peak of 100 billion shader ops per second, not 51. It's right there in the presentation on stage.

pakpassion · Oct 21, 2005

Nfactor said:
At E3, Nvidia said in its presentation on stage that Cell and RSX together can have a peak of 100 billion shader ops per second, not 51. It's right there in the presentation on stage.

I'm sorry, I just checked: it's 51 billion dot products per second, not shader ops (that's the other calculation). The 51 billion dot products per second is WITH the CPU. The 48 billion dot products per second is with the GPU only for Xbox 360, at 95% efficiency.

Lycan · Oct 21, 2005

pakpassion said:
I'm sorry, I just checked: it's 51 billion dot products per second, not shader ops (that's the other calculation). The 51 billion dot products per second is WITH the CPU. The 48 billion dot products per second is with the GPU only for Xbox 360, at 95% efficiency.

The 95% claim is yet to be proven. Until it is confirmed, I will naturally dismiss it, and so should those who think that only real-world performance can confirm or reject the numbers provided. You should take it with a grain of salt... And comparing a GPU that is on the market with one we know so little about is undoubtedly strange. I am sure people more adept at analyzing technological issues than both of us will get down to the task with great passion... :smile:

Jawed · Oct 21, 2005

Jaws said:
Well, like how CPUs evolved from RISC and CISC to hybrids, I'm interested to see how GPUs will evolve in this area. NV seem more CISC like whilst ATI more RISC like. But that's a discussion for another day so...

NVidia likes to describe NV30 as VLIW-like, with NV40/G70 derived (simplified) from that.

I think ATI's fragment shader pipeline will suffer similar constraints in its abilities to co-issue instructions (keeping all channels of the ALUs operating).

Additionally, the "dual-issue ADD" capability in the ATI pipelines is at risk of instruction dependency, though not at the risk of register bandwidth constraints.

Also, I think historically ATI didn't allow arbitrary swizzling of channels - though I think that's changed in R5xx. So that will have created a bottleneck.

Additionally, the dedicated ALU for texture address calculation must spend a fair amount of its time idle, since as shaders get more complex, there's less texture instruction intensity.

But I think it's fair to say that ATI doesn't spend as many transistors building "special functions" as NVidia does. So to a degree you could sum it up as "lots of simple combinations of instructions run at extremely fast clocks" versus "more complex and intense combinations of instructions run slightly slower".

R5xx and Xenos both make a play out of avoiding pipeline stalls (ALU and texturing) to gain efficiency, rather than relying upon the compiler juggling combinations of instructions, ALU channels and register re-use to extract the maximum efficiency. Though, obviously, ATI still needs a compiler that can recognise scalars, for example, and co-issue them against vectors, whenever possible.

Jawed

Shifty Geezer · Oct 21, 2005

Jawed said:
Yes, I think you're right, the first ALU can change from, e.g. vec3+scalar MAD (2 instructions) into vec3 MAD + RCP. Or vec2 MAD + RCP.

Well, I'm sorry about chasing the RCP to death.

Jawed

My God?!

Are Jawed and Jaws coming to a consensus on a GPU issue for once?

Can it really be that, for all their tit-for-tat document references, for once they've actually come to an agreed conclusion?

Well, kudos to both of you for arguing out a disagreement, and especially to Jawed, who's smart enough to admit he was wrong on a point. This is what debate should be about. This discussion should be awarded a B3D Medal of Merit.

Titanio · Oct 21, 2005

pakpassion said:
the 48 Billion Dot Products per second is with GPU only for Xbox 360 with 95% efficiency.

That's 48bn shader ops per sec, not dot products.

I've seen a figure of ~33bn dot products/sec for X360's CPU and GPU combined (and since Xenon can do 9bn/sec, that leaves 24bn for Xenos).
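The subtraction above can be cross-checked against the ALU count. A rough sketch, assuming Xenos runs its 48 unified ALUs at 500 MHz, that each ALU completes one dot product per clock, and that a "shader op" counts a co-issued vector plus scalar instruction as two (the clock and both per-clock rates are assumptions, not figures from this thread):

```python
# Figures quoted in the thread, in billions per second.
COMBINED_DOT_PRODUCTS = 33.0   # X360 CPU + GPU
XENON_DOT_PRODUCTS = 9.0       # CPU only
CLAIMED_SHADER_OPS = 48.0      # ATI's figure for Xenos

# Assumptions for the cross-check (not stated in the thread).
XENOS_ALUS = 48
XENOS_CLOCK_GHZ = 0.5
DOT_PRODUCTS_PER_ALU_PER_CLOCK = 1
SHADER_OPS_PER_ALU_PER_CLOCK = 2   # vector + scalar co-issue

xenos_by_subtraction = COMBINED_DOT_PRODUCTS - XENON_DOT_PRODUCTS
xenos_by_alu_count = XENOS_ALUS * XENOS_CLOCK_GHZ * DOT_PRODUCTS_PER_ALU_PER_CLOCK
shader_ops = XENOS_ALUS * XENOS_CLOCK_GHZ * SHADER_OPS_PER_ALU_PER_CLOCK

print(f"Xenos dot products by subtraction: {xenos_by_subtraction:.0f} bn/s")
print(f"Xenos dot products by ALU count:   {xenos_by_alu_count:.0f} bn/s")
print(f"Xenos shader ops by ALU count:     {shader_ops:.0f} bn/s (claimed {CLAIMED_SHADER_OPS:.0f})")
```

Under those assumptions both routes give 24bn dot products/sec, and the 48bn figure falls out as shader ops rather than dot products.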

edit - I'd also agree with Lycan's sentiment on the seemingly constant parroting of ATi's claim on "efficiency". It's a claim by the chipmaker - which may be accurate in the general case as suggested, but then again maybe not, the point is we don't know - and furthermore refers to one factor in "efficiency". Often WRT Xenos, people seem to refer blanket-like to the concept of efficiency when really they're talking about utilisation, in this case, utilisation of the unified shaders.
