PCGH - Pixelshader-Shootout: RV770 vs. GT200

Got my 285 so could run some tests myself. Not sure if something is screwy with Nvidia's drivers but my SFU numbers are coming in twice as high as expected:
I'm afraid to say I think this test holds no value at all for transcendentals (actually, for anything to the right of MAD on that graph :LOL: ). Generally NVidia's driver appears to be optimising the code, so the GPU's doing less work. Depending on flavour of the month optimisations, results vary wildly.

Who was it who said NVidia's compiler should be easy because of the "scalar" architecture?

gpubench instrissue -l 64 -a -m -c 4
I think "l" and "m" are contradictory, but I doubt it matters. As a matter of interest you might like to experiment with 63 and 65 and other non-power of 2 and non-even values to see if any of the instructions change significantly in throughput.

What does a -c 1 graph look like?

The key thing is that MUL is running faster than MAD. As to why it's not much faster, well, I suspect operand/resultant bandwidth.

I wonder if it's possible to rewrite the shader (using more registers) to investigate whether this is purely register file banking getting in the way or whether it's just basic operand bandwidth in-to/out-of the ALUs.

I wonder how GTX285 would look on the AMD tests:

http://techreport.com/articles.x/12458/3

(bottom of that page).

Jawed
 
I'm afraid to say I think this test holds no value at all for transcendentals (actually, for anything to the right of MAD on that graph :LOL: ). Generally NVidia's driver appears to be optimising the code, so the GPU's doing less work. Depending on flavour of the month optimisations, results vary wildly.

I don't know, this same driver was showing expected results on my 8800GTS so it's weird that GTX285 is behaving differently on the same driver set.

I think "l" and "m" are contradictory, but I doubt it matters.
Yeah I noticed but those are the parameters used here so I just mimicked them.

As a matter of interest you might like to experiment with 63 and 65 and other non-power of 2 and non-even values to see if any of the instructions change significantly in throughput.
No difference....

What does a -c 1 graph look like?
Same as -c 4 except ADD/MAD/MUL/SUB are about 4x speed as expected

I wonder how GTX285 would look on the AMD tests:

http://techreport.com/articles.x/12458/3
Are those available somewhere?
 
I don't know, this same driver was showing expected results on my 8800GTS so it's weird that GTX285 is behaving differently on the same driver set.
Prior to GT200 the MUL is essentially unusable (is there any driver that shows MUL faster than MAD on any NVidia G8x/9x GPU in GPUBench?), so the compiler is behaving differently depending on which chip it's targeting.

Are those available somewhere?
I thought they were documented/available somewhere but I've failed to find them so far. I'm not sure if they're much use with respect to the transcendental functions, anyway...

Jawed
 
Sorry, but I've found zero verified data wrt this issue. So I had to rerun the tests myself.

This is on Vista 64-bit SP1, X48 chipset, C2D CPU at 3.8 GHz, with the recent official Catalyst 8.12:
Code:
                              HD 4870 (Cat. 8.12 WHQL)      HD 2900 XT (Cat. 8.12 WHQL)
                              Vertex        Pixel           Vertex        Pixel

- float MAD serial            21.182.372    117.642.201     10.623.348    47.200.346
- float4 MAD parallel         21.240.413    144.876.620     10.624.156    58.129.619

- float SQRT serial           21.244.773    117.650.268     10.625.376    47.252.525
- float 5-instruction issue   53.114.714    570.917.384     26.562.044    225.130.502

- int MAD serial              21.242.146    58.579.440      10.624.713    23.669.843
- int4 MAD parallel           21.210.878    29.480.274      10.621.545    11.475.836
 
Sorry, but I've found zero verified data wrt this issue.
Are you saying that TechReport didn't run the test on HD2900XT?

So I had to rerun the tests myself.

This is on Vista 64-bit SP1, X48 chipset, C2D CPU at 3.8 GHz, with the recent official Catalyst 8.12:
Code:
                              HD 4870 (Cat. 8.12 WHQL)      HD 2900 XT (Cat. 8.12 WHQL)
                              Vertex        Pixel           Vertex        Pixel

- float MAD serial            21.182.372    117.642.201     10.623.348    47.200.346
- float4 MAD parallel         21.240.413    144.876.620     10.624.156    58.129.619

- float SQRT serial           21.244.773    117.650.268     10.625.376    47.252.525
- float 5-instruction issue   53.114.714    570.917.384     26.562.044    225.130.502

- int MAD serial              21.242.146    58.579.440      10.624.713    23.669.843
- int4 MAD parallel           21.210.878    29.480.274      10.621.545    11.475.836
Clearly the current driver has not been tweaked to make the vertex shaders look good :LOL: And the HD4870 vertex shader numbers are even less tweaked :LOL::LOL:

The pixel shader numbers for HD2900XT match the TechReport numbers.

The HD4870's pixel shader numbers are as expected.

So, any great revelations when comparing G92 and GT200 with these tests?

Are these shader tests downloadable?

Jawed
 
Hehehe - no. Obviously, I meant zero additional data, especially with newer drivers. I was under the impression that I had run these tests quite recently, but it turned out that this was not the case.


I don't know if the tests are publicly available after they were distributed for the R600 launch in May 2007. AFAIK they're based on Humus' framework and do not contain any mean or overly fancy stuff (shader-wise), as they're meant to show maximum throughput.

What I could do, though, is attach a dump of the shaders used - but the forums do not let me. Check your mail, Jawed. ;)


Code:
                              GF 8800 U. (181.22 WHQL)
                              Vertex        Pixel

- float MAD serial            7.966.160     173.258.010
- float4 MAD parallel         7.969.755     43.733.967

- float SQRT serial           7.969.775     45.048.540
- float 5-instr. issue        19.925.096    210.127.473

- int MAD serial              7.969.392     36.321.509
- int4 MAD parallel           6.440.103     9.561.770

Code:
                              GF 9600 GT (181.22 WHQL)
                              Vertex        Pixel

- float MAD serial            10.620.117    101.237.745
- float4 MAD parallel         10.522.113    25.721.769

- float SQRT serial           10.620.914    25.256.123
- float 5-instr. issue        26.552.369    116.284.367

- int MAD serial              10.618.908    20.424.944
- int4 MAD parallel           4.016.976     4.947.652


Code:
                              GTX 285 (181.22 WHQL)
                              Vertex        Pixel

- float MAD serial            10.620.539    317.590.627
- float4 MAD parallel         10.613.723    81.331.192

- float SQRT serial           10.621.059    81.464.549
- float 5-instr. issue        26.547.464    365.579.803

- int MAD serial              10.618.403    65.521.415
- int4 MAD parallel           5.118.731     16.143.853

Interesting, eh? Obviously, VS speed scales with PCIe bandwidth. Are some coordinates stored in system memory? The same thing can be observed when comparing the HD 2900 XT and the HD 3870, which are very similar pixel-wise (with 8.12 drivers). The HD 4670 performs almost identically to the HD 3870 at adjusted clocks; the number of SIMDs seems to have no influence, and neither does RAM bandwidth.
 
How are the vertex shaders looking? They're not becoming either fetch- or setup-limited?
The 4870 throughput is almost exactly double that of the 9600/285.
 
That was what I meant yes. Didn't notice they were dx10 specific though, which probably means they thought about that, unlike older tests. (And not that any of that would have to do with the double 4870 performance, just the general slow performance)
 
From the theoreticals I can work out, it seems the GPUs with fewer SIMDs are showing slightly better utilisation, i.e. 9600GT is about 98% of theoretical in comparison with 8800U at 93% and GTX285 at 92%; while HD2900XT is 98% and HD4870 is 97%.

I'm basing this on the float instructions only. Also, I'm presuming that one of the instructions in "float 5-instr issue" is a transcendental that takes 2 clocks, i.e. is 1/8 throughput, on NVidia.
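For anyone wanting to redo the utilisation sums, here is a rough sketch for the two AMD cards. The ALU counts and core clocks are reference specs that I am assuming (they are not stated in the thread, and the tested cards may have been clocked differently), and I am reading the table figures as billions of instructions per second.

```python
# Measured pixel-shader "float4 MAD parallel" rates from the tables above,
# read as billions of float4 instructions per second (assumed unit).
measured_float4_mad = {"HD 4870": 144.876620, "HD 2900 XT": 58.129619}

# ASSUMPTION: reference specs, not taken from the thread.
# (scalar ALU lanes, core clock in GHz)
assumed_specs = {"HD 4870": (800, 0.750), "HD 2900 XT": (320, 0.742)}


def utilisation(card):
    """Fraction of the assumed theoretical MAD rate that was measured."""
    lanes, clock_ghz = assumed_specs[card]
    peak = lanes * clock_ghz                      # scalar MADs/s, in billions
    achieved = measured_float4_mad[card] * 4      # one float4 MAD = 4 scalar MADs
    return achieved / peak


for card in measured_float4_mad:
    print(f"{card}: {utilisation(card):.1%} of assumed peak")
```

With those assumed specs the results land close to the 97% and 98% quoted above; the NVidia figures depend more heavily on which shader clocks one assumes, so they are left out here.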

I was kinda hoping to see some effect of the MUL in GT200, but I can't see it.

Jawed
 
I was kinda hoping to see some effect of the MUL in GT200, but I can't see it.

An ever so slight indicator is the 365 GInstr./sec. in the float 5-instruction mix. That's at least a wee bit above the theoretical maximum of pure MADD throughput.
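A quick check of that claim, as a sketch only: the 240 stream processors and 1.476 GHz shader clock are reference GTX 285 specs that I am assuming, not figures from this thread.

```python
# ASSUMPTION: reference GTX 285 specs (not stated in the thread).
sp_count = 240              # scalar stream processors
shader_clock_ghz = 1.476    # shader-domain clock

# Peak if every SP issues exactly one MAD-class instruction per clock.
mad_peak = sp_count * shader_clock_ghz

# Pixel-shader "float 5-instr. issue" result for GTX 285 from the table above,
# read as billions of instructions per second.
measured_5instr = 365.579803

headroom = measured_5instr / mad_peak
print(f"assumed MAD-only peak: {mad_peak:.2f}, measured: {measured_5instr:.2f}")
print(f"measured / peak = {headroom:.3f}")
```

A ratio above 1.0 means that something beyond one MAD-class instruction per SP per clock is being issued, which is at least consistent with the extra MUL contributing a little.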
 
Thanks for the Dump.txt you sent me.

The "float 5-instr issue" test uses a mix of 5 instructions:

MUL
MAD
MIN
MAX
SQRT

AMD's SM4 GPUs appear to have a single-cycle implementation of SQRT:

Code:
      2  x: MULADD_e    R2.x,  PV1.w,  PV1.w,  PV1.w      
         y: MIN_DX10    R0.y,  R2.y,  PV1.y      
         z: MUL_e       R0.z,  PV1.x,  PV1.x      
         w: MAX_DX10    R1.w,  R2.y,  PV1.z      
         t: SQRT_e      R1.x,  PS1

On NVidia the float SQRT serial test runs at 1 instruction every 4 clocks, the same rate as float4 MAD parallel. But float 5-instr issue runs at 4.5x this rate, not 5x. i.e. 1.125 instructions per clock, not 1.25 instructions per clock, per fragment.
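The ratios can be read straight off the pixel-shader columns of the three NVidia tables above. A small sketch, taking the "1 instruction every 4 clocks" figure for SQRT serial as given:

```python
# Pixel-shader results from the NVidia tables above (same units in both rows).
sqrt_serial = {"8800 U": 45.048540, "9600 GT": 25.256123, "GTX 285": 81.464549}
five_instr = {"8800 U": 210.127473, "9600 GT": 116.284367, "GTX 285": 365.579803}

SQRT_IPC = 0.25  # 1 instruction every 4 clocks, as stated above


def issue_ratio(card):
    """How many times faster the 5-instruction mix runs than serial SQRT."""
    return five_instr[card] / sqrt_serial[card]


for card in sqrt_serial:
    r = issue_ratio(card)
    print(f"{card}: ratio {r:.2f}x, about {SQRT_IPC * r:.3f} instr/clock")
```

All three cards come out below the 5x (1.25 instructions per clock) that a straight sum of five full-rate instructions would suggest.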

Bit of a puzzler.

Jawed
 
Normally this should also be the case with GT200:
Code:
--------------------------------------------------------------------------
                             Instruction Issue
--------------------------------------------------------------------------
512      70.9729       ADD          4         64
512      70.0217       SUB          4         64
512      94.9020       MUL          4         64
512      68.8689       MAD          4         64
[...]
512      71.4560       RSQ          4         64
 
Which matches Trinibwoy's GPUBench graph, conceptually.

According to the CUDA 2.0 guide sqrtf() has 3 ULP error whereas rsqrtf() has 2 ULP, while 1/x has 1 ULP. I'm wondering if sqrt is emulated with rsq+rcp.
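To illustrate the rsq+rcp idea, here is a rough single-precision emulation on the CPU. This is only a sketch of the hypothesis: each step is rounded to the nearest float32, which real hardware RSQ/RCP units need not do, so it shows how the error of two chained approximations stacks up rather than reproducing the actual ULP figures. All function names are made up for the example.

```python
import math
import struct


def f32(x):
    """Round a Python float to the nearest IEEE-754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def bits(x):
    """Bit pattern of a positive float32; neighbouring floats differ by 1."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def sqrt_via_rsq_rcp(x):
    """Hypothetical emulation: sqrt(x) = rcp(rsq(x)), rounded after each step."""
    rsq = f32(1.0 / math.sqrt(x))   # stand-in for a hardware RSQ
    return f32(1.0 / rsq)           # stand-in for a hardware RCP


def ulp_error(x):
    """Distance in float32 ULPs between the emulated and the rounded sqrt."""
    reference = f32(math.sqrt(x))
    return abs(bits(sqrt_via_rsq_rcp(x)) - bits(reference))


worst = max(ulp_error(f32(0.5 + 0.37 * i)) for i in range(1, 2000))
print("worst observed error in ULPs:", worst)
```

The point is simply that chaining two approximations adds their errors, which fits the 2 ULP + 1 ULP = 3 ULP pattern in the guide.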

Unless the compiler's being clever and counting the parity of sequential sqrt()s. A pair of sqrt()s is the same as a pair of rsqrt()s. So a shader test with 64 serial sqrt()s is the same as 64 serial rsqrt()s :p
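The parity argument is easy to check numerically. A minimal sketch (plain CPU floating point, nothing to do with what the driver actually emits):

```python
import math


def rsqrt(x):
    """Reciprocal square root, 1/sqrt(x)."""
    return 1.0 / math.sqrt(x)


def chain(f, x, n):
    """Apply f to x serially n times, like n dependent instructions."""
    for _ in range(n):
        x = f(x)
    return x


for x in (0.3, 2.0, 9.0):
    even_sqrt, even_rsqrt = chain(math.sqrt, x, 2), chain(rsqrt, x, 2)
    odd_sqrt, odd_rsqrt = chain(math.sqrt, x, 3), chain(rsqrt, x, 3)
    print(x, even_sqrt, even_rsqrt, odd_sqrt, odd_rsqrt)
```

For an even count the two chains agree (both give x^(1/4) for a pair), while for an odd count they are reciprocals of each other, so the swap only works when the number of serial sqrt()s is even.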

Jawed
 