Games and bandwidth

Ah, I see what you're getting at now (it would have been much clearer if you had specifically referred to the resolution I was using)

[...]

I need to go back and rerun my previous tests with GT3 instead of GT1 but I'd probably still cock things up...
Might also be worth your while doing 3DMk Vantage tests, preferably Extreme mode.

While I'm talking about scaling, MSAA and bandwidth, UT3:

http://www.computerbase.de/artikel/...008/bericht_was_gddr5_radeon_hd_4870/drucken/

is the game that scales most when comparing HD4850 and HD4870. I'm not sure if it's a good test case because of the hack-ish nature of AA in UT3, but with AA off there's still a 38% gain for HD4870.

Jawed
 
I'll give doing the 06 tests again a miss (mostly because I'm rather tired of looking at those tests now!) - I should have realised what was going on at 640 x 480, though: I'd mistakenly assumed that, because dropping the core clock resulted in such a notable change in frame rate, the test was still pretty GPU dependent and that the CPU wouldn't play a significant role. Although I'd twigged that it was the polygon count that was still working the GPU enough, I hadn't followed that all the way back through to D3D and what that would mean for the CPU (e.g. the number of calls).

Mindful of this mistake in 06, I checked Vantage out again in Extreme mode but at 640 x 480 (previous 1680 x 1050 results included again for comparison):
Code:
3DMark Vantage - Graphics Test 2: New Calico (Extreme settings)
640 x 480 - No AA / Optimal			640 x 480 - 8x AA / Optimal
576 / 1350 / 900 - 42.16 fps (39.11 @ 2GHz)	576 / 1350 / 900 - 33.95 fps
288 / 1350 / 900 - 26.35 fps 			288 / 1350 / 900 - 22.94 fps
288 /  675 / 900 - 22.05 fps			288 /  675 / 900 - 17.84 fps	
576 / 1350 / 450 - 35.19 fps			576 / 1350 / 450 - 26.92 fps

1680 x 1050 - No AA / Optimal			1680 x 1050 - 8x AA / Optimal
576 / 1350 / 900 - 11.67 fps			576 / 1350 / 900 - 9.30 fps
288 / 1350 / 900 - 7.03 fps			288 / 1350 / 900 - 6.37 fps
288 /  675 / 900 - 6.22 fps			288 /  675 / 900 - 5.09 fps	
576 / 1350 / 450 - 8.18 fps			576 / 1350 / 450 - 6.09 fps
A 33% drop in CPU speed only results in a 7.2% drop in frame rate, so it's not too CPU limited at all (thank God).
 
Setup speed probably matters a lot, though. You can generally categorize graphics workloads as per-frame and per-pixel. Resolution only increases the per-pixel load, and faster SPs only process parts of this faster (though sometimes extremely long vertex shaders can burden the SPs). The per-frame load is usually either the CPU calcs, setup speed, or shadow map rendering which is also mostly setup limited.

Since you mostly eliminated the CPU (if I calculated right, 16% of the frametime is determined by CPU clock), setup speed is probably what's responsible for framerate not decreasing as fast as pixel count.
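A minimal sketch (Python) of that frame-time split, using the 640 x 480 no-AA numbers above; I'm assuming the stock CPU clock is 3 GHz, which is consistent with the quoted 33% drop:
Code:
# Model: t_frame = t_fixed + t_cpu, where only t_cpu scales inversely with CPU clock.
t_stock = 1.0 / 42.16          # frame time at the stock CPU clock (s)
t_slow  = 1.0 / 39.11          # frame time at 2 GHz (s)

scale = 3.0 / 2.0              # the CPU part takes 1.5x as long at 2 GHz
# t_slow = t_fixed + scale * t_cpu  and  t_stock = t_fixed + t_cpu
t_cpu = (t_slow - t_stock) / (scale - 1.0)
print(t_cpu / t_stock)         # ~0.16 -> about 16% of the frame time tracks the CPU clock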
 
I seem to remember that GDDR5 uses a different signalling scheme where the actual signalling is negotiated (don't ask me how ;)), but could underclocking the memory actually result in lower performance than expected? Do we have any fairly accurate tests to show the actual available bandwidth for a given clock rate?

Well, underclocking GDDR5 to GDDR3 speeds will give it only half (or was it a quarter?) the command rate of the old RAM. So it might as well be slower in many situations.
 
I noticed in the Computer Base results that the "performance rating" of HD4870 at 625/993 is 88% of HD4850 at the same speeds. This is regardless of the AF/AA settings.

The single-texture test in 3DMark06 is meant to be a "bandwidth test". The stock HD4870 is 83% faster than the same card at 750/993. The theoretical difference is 81% (pure bandwidth difference, as the core clocks are identical).

It seems to me highly likely that GDDR5 is, therefore, giving 88% of the performance of GDDR3 at the same clock.

So on that basis HD4870 is effectively using much more of its available bandwidth than we've been accounting for. So with UT3, 4xMSAA is 56% faster with 1800MHz GDDR5 than with 993MHz, and 8xMSAA is 57% faster.

Put another way, 115.2GB/s GDDR5 appears to be equal to 101.4GB/s GDDR3, 60% more bandwidth than HD4850.
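A minimal sketch (Python) of that arithmetic; the 88% GDDR5 efficiency factor is the estimate from the ComputerBase performance rating above, not a measured constant:
Code:
# Effective bandwidth comparison, using the thread's clock figures and a 256-bit (32-byte) bus.
gddr5_raw  = 1800e6 * 2 * 32           # HD4870: 115.2 GB/s raw
gddr3_4850 = 993e6  * 2 * 32           # HD4850: ~63.6 GB/s

print(1800 / 993 - 1)                  # ~0.81 -> the 81% theoretical difference

efficiency = 0.88                      # GDDR5 giving 88% of GDDR3's performance per clock
gddr5_eff  = gddr5_raw * efficiency    # ~101.4 GB/s "GDDR3-equivalent"
print(gddr5_eff / 1e9)                 # ~101.4
print(gddr5_eff / gddr3_4850 - 1)      # ~0.60 -> ~60% more effective BW than HD4850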

Jawed
 
As imsabbel pointed out, the effective BW could easily be worse at lower clock speeds, because the chip (or at least the bios/drivers) wasn't designed for a low command rate and double the latencies. The %utilization of GDDR5 could be better for the same load at higher frequency, leading to a superlinear effect.

As for the 3DMark06 ST fillrate test, you can see that with GDDR5 at 1800 MHz, the result scales approx. linearly with clock speed, thus it is not BW limited at all and not a good test of available BW at 1800 MHz. However, with GDDR5 at 993 MHz, the results are independent of clock speed.

I don't think the 83% figure is reliable. I'm willing to bet that if you overclocked the core to maybe 800 MHz, you'd get higher results at 1800 MHz but not at 993 MHz, thus getting a number even higher than 83% and proving my superlinear scaling theory.
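A toy model of that effect, with the utilization figures invented purely for illustration (only the shape of the behaviour matters, not the absolute numbers):
Code:
# Fill result modelled as min(core-limited rate, bandwidth-limited rate).
def fill(core_mhz, mem_mhz, bw_utilization):
    core_limit = 0.016 * core_mhz                    # 16 ROPs, GPix/s
    bw_limit   = 0.00881 * mem_mhz * bw_utilization  # arbitrary constant, GPix/s
    return min(core_limit, bw_limit)

for core in (750, 800):
    slow = fill(core, 993, 0.75)     # poorer utilization when underclocked
    fast = fill(core, 1800, 0.85)    # better utilization at design speed
    print(core, round(100 * (fast / slow - 1)))  # ~83% at 750 MHz, ~95% at 800 MHz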

Interestingly, this BW heavy test is also a good case for having 8 ROPs per 64-bit channel next gen, because I think you'll see an even higher mem/core clock ratio.
 
As for the 3DMark06 ST fillrate test, you can see that with GDDR5 at 1800 MHz, the result scales approx. linearly with clock speed, thus it is not BW limited at all and not a good test of available BW at 1800 MHz. However, with GDDR5 at 993 MHz, the results are independent of clock speed.

Something doesn't make sense here. It's either my mind, your conclusion or Computerbase's tests. Between GDDR5 at 993 and 1800 MHz (eclk@625) we get +56% fillrate - so BW plays an important role. Between GDDR5@1800 and GDDR3@993 we still get 32 percent more fillrate. Only with GDDR5@1800 and different eclks is there no difference, which means that at this point the BW matches or exceeds the engine's needs for core clocks up to at least 750 MHz.
 
The bandwidth "requirement" for the 06 ST fill rate test is pretty clear cut, if you test it out:
Code:
G80 @ 1280 x 1024
Core/RAM	ST	Theor	%Diff
576 / 1000	6905	6912	-0.1%
576 / 950	6905	6912	-0.1%
576 / 900	6905	6912	-0.1%
576 / 850	6903	6912	-0.1%
576 / 800	6898	6912	-0.2%
576 / 750	6883	6912	-0.4%
576 / 700	6725	6912	-2.7%
576 / 650	6342	6912	-8.2%
576 / 600	5877	6912	-15.0%
576 / 550	5354	6912	-22.5%
576 / 500	4939	6912	-28.5%
576 / 450	4421	6912	-36.0%
It would be nice to see what the drop off values are like with chips that have more blenders than the G80. I wish more review sites would concentrate on analysing one variable over one test, rather than doing several over numerous.
 
Taking 450MHz memory as the baseline, G80 is linear up to around 700+MHz. So 1000MHz is around 40% "excess" bandwidth for that test.
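A quick back-of-the-envelope check of that figure, assuming the knee really is around 700 MHz:
Code:
# G80 / 8800 GTX: 384-bit memory bus, DDR.
bus_bytes = 384 // 8
knee_bw   = 700e6  * 2 * bus_bytes    # ~67 GB/s is enough for this test
top_bw    = 1000e6 * 2 * bus_bytes    # ~96 GB/s at the 1000 MHz data point

print(knee_bw / 1e9, top_bw / 1e9)
print(top_bw / knee_bw - 1)           # ~0.43 -> roughly 40% "excess" for this test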

Jawed
 
FWIW, I estimated the 4870 could get by with 96 GB/sec with minimal negative impact (although there isn't much of a point in doing that in reality), and the 4850 should benefit well from an increase up to ~80 GB/sec.
 
The bandwidth "requirement" for the 06 ST fill rate test is pretty clear cut, if you test it out:
Code:
G80 @ 1280 x 1024
Core/RAM	ST	Theor	%Diff
576 / 1000	6905	6912	-0.1%
576 / 950	6905	6912	-0.1%
576 / 900	6905	6912	-0.1%
576 / 850	6903	6912	-0.1%
576 / 800	6898	6912	-0.2%
576 / 750	6883	6912	-0.4%
576 / 700	6725	6912	-2.7%
576 / 650	6342	6912	-8.2%
576 / 600	5877	6912	-15.0%
576 / 550	5354	6912	-22.5%
576 / 500	4939	6912	-28.5%
576 / 450	4421	6912	-36.0%
It would be nice to see what the drop off values are like with chips that have more blenders than the G80. I wish more review sites would concentrate on analysing one variable over one test, rather than doing several over numerous.

Thanks for your tests - may I inquire which OS and driver was used? I just might try the same thing with my GT200 tonight.

Your test seems to suggest that G80 needs about 10.4 bytes of BW per theoretical fillrate pixel, whereas RV770@GDDR is content with 9.6 and shows no improvement when that is increased to about 11.5 - according to Computerbase.
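For what it's worth, a rough sketch of where a figure like 10.4 bytes per pixel comes from on the G80 side, taking ~750 MHz memory as the point where the test reaches its theoretical fill rate (the RV770 side depends on the Computerbase numbers, so I've left it out):
Code:
bw_at_750  = 750e6 * 2 * (384 // 8)   # ~72 GB/s on G80's 384-bit bus
theor_fill = 6.912e9                  # theoretical 06 ST fill rate, pixels/s

print(bw_at_750 / theor_fill)         # ~10.4 bytes of bandwidth per theoretical pixel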
 
Thanks for your tests - may I inquire which OS and driver was used? I just might try the same thing with my GT200 tonight.
Vista Ultimate x64 with WHQL 175.19 drivers. It would be interesting to see how a GPU with single cycle blending per ROP compares to the G80, given that the 3DMark06 ST is really an alpha blending test first, then a fill rate/bandwidth one. I'd also like to see the same done with Vantage's ST test for a GT200 and RV770, as that's a nice FP16 blending test.

Edit: Nicely confirms that FP16 needs double the bandwidth:
Code:
Core/RAM	Vantage	Theor	%Diff
576 / 1050	4.92	6.912	-28.8%
576 / 1000	4.69	6.912	-32.1%
576 / 950	4.45	6.912	-35.6%
576 / 900	4.24	6.912	-38.7%
576 / 850	3.94	6.912	-43.0%
576 / 800	3.73	6.912	-46.0%
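A minimal sketch of the arithmetic behind that reading, ignoring compression and texture read traffic (so these are bandwidth ceilings, not predicted results):
Code:
int8_bytes = 4 + 4     # RGBA8 blend: read destination + write result
fp16_bytes = 8 + 8     # RGBA16F blend: read destination + write result

bw = 900e6 * 2 * (384 // 8)         # e.g. 86.4 GB/s at 576/900

print(bw / int8_bytes / 1e9)        # ~10.8 GPix/s ceiling for INT8 blends
print(bw / fp16_bytes / 1e9)        # ~5.4 GPix/s ceiling for FP16 blends - half
print((bw / fp16_bytes) / 6.912e9)  # ~0.78 of G80's theoretical fill at that clock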
 
Edit: Nicely confirms that FP16 needs double the bandwidth:
Code:
Core/RAM	Vantage	Theor	%Diff
576 / 1050	4.92	6.912	-28.8%
576 / 1000	4.69	6.912	-32.1%
576 / 950	4.45	6.912	-35.6%
576 / 900	4.24	6.912	-38.7%
576 / 850	3.94	6.912	-43.0%
576 / 800	3.73	6.912	-46.0%

Seems like GT200 is not single-cycle for FP16 blends. :( I've got Vantage installed here at work, so I was able to run a little test.

Code:
Core/RAM	Vantage	Theor	%Diff
602 / 1300	7.76	19.264	-59.8%
602 / 1250	7.43	19.264	-61.4%
602 / 1200	7.13	19.264	-63.0%
602 / 1150	6.93	19.264	-64.0%
602 / 1100	6.67	19.264	-65.4%
Note: Clock rates are the set values; RT reads slightly different actual values

If we assume this [no single-cycle FP16] is correct, then the adjusted percentages would be as follows:
Code:
Core/RAM	Vantage	Theor	%Diff
602 / 1300	7.76	9.632	-19.4%
602 / 1250	7.43	9.632	-22.9%
602 / 1200	7.13	9.632	-26.0%
602 / 1150	6.93	9.632	-28.1%
602 / 1100	6.67	9.632	-30.8%
Note: Clock rates are the set values; RT reads slightly different actual values

As an interesting side note: set to the 8800 GTX's levels of ROP fill and memory BW, i.e. 432e/675m, I'm getting just 4.06 GPix/s. :(
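For context, a quick check (my arithmetic, so worth double-checking) of why 432e/675m matches the 8800 GTX on paper, which makes the 4.06 GPix/s directly comparable to the 4.24 GPix/s the G80 posts at 576/900 above:
Code:
gt200_fill = 32 * 432e6              # 32 ROPs at 432 MHz      -> ~13.8 GPix/s
g80_fill   = 24 * 576e6              # 24 ROPs at 576 MHz      -> ~13.8 GPix/s

gt200_bw   = 675e6 * 2 * (512 // 8)  # 512-bit bus at 675 MHz  -> 86.4 GB/s
g80_bw     = 900e6 * 2 * (384 // 8)  # 384-bit bus at 900 MHz  -> 86.4 GB/s

print(gt200_fill / 1e9, g80_fill / 1e9)
print(gt200_bw / 1e9, g80_bw / 1e9)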
 
Why the sad face? Even 2-cycle FP16 blends seem to be bandwidth limited, so what's the use of single-cycle blends?
Yeah, when you're doing only that, and on a full-screen quad. But I'd prefer the ROPs to clear that stuff off as fast as possible so they're available for other tasks.
 
Yeah, when you're doing only that, and on a full-screen quad. But I'd prefer the ROPs to clear that stuff off as fast as possible so they're available for other tasks.
That's not particularly compelling logic. There isn't much of a FIFO/cache of pixels after the ROPs, so write requests get processed fairly quickly. Unless you have some really bizarre repeating load of 100 pixels FP16 followed by 200 pixels INT8, your ROPs are going to get BW limited very quickly.

Even if the ROPs are available for other tasks, they won't be able to complete them because the MC is saturated.
 
Edit: Nicely confirms that FP16 needs double the bandwidth:
Looks like NVidia is not getting very high BW utilization with its memory controller in this test because even though the test is clearly BW limited, the 4850 outdoes G92 (and the 3870, for that matter) with slower RAM.

I've seen different fillrate tests report significantly different efficiencies, though, so it's not a big deal IMO.
 
As imsabbel pointed out, the effective BW could easily be worse at lower clock speeds, because the chip (or at least the bios/drivers) wasn't designed for a low command rate and double the latencies. The %utilization of GDDR5 could be better for the same load at higher frequency, leading to a superlinear effect.
I agree, this is very likely. We definitely saw something like this with GDDR4 when it was underclocked - though that had the ring bus getting in the way, too...

As for the 3DMark06 ST fillrate test, you can see that with GDDR5 at 1800 MHz, the result scales approx. linearly with clock speed, thus it is not BW limited at all and not a good test of available BW at 1800 MHz. However, with GDDR5 at 993 MHz, the results are independent of clock speed.
Yep, we need more data.

Jawed
 
That's not particularly compelling logic. There isn't much of a FIFO/cache of pixels after the ROPs, so write requests get processed fairly quickly. Unless you have some really bizarre repeating load of 100 pixels FP16 followed by 200 pixels INT8, your ROPs are going to get BW limited very quickly.

Even if the ROPs are available for other tasks, they won't be able to complete them because the MC is saturated.
No, it's definitely not. It's merely a bit of a disappointment that, when Nvidia touted GT200's ROPs as full-speed blending, or twice the blenders per ROP of G80, no one* asked whether or not that'd be true for non-INT8 formats as well.


* And I shamefully include myself.
 