NVIDIA Fermi: Architecture discussion

Yes, there seems to be some trend, at least in Europe: HD57xx cards have decent availability, the rest not so much (you're lucky if you find one in stock).
Oh, I've also seen HD5750 cards with only 512MB of RAM appear; both Sapphire and XFX seem to make one. At least the XFX was actually reasonably priced (99 Euro, versus 125 Euro for the next cheapest 5750 with 1GB at that store).
In any case, I thought the availability problems were supposed to get fixed in December. Hopefully TSMC can ramp up production a bit more when NVIDIA (and AMD, for Cedar/Redwood, in fact) need some wafers too :).
 
A lot of people are going to be fooled into thinking this half-height card with the tiny cooler is the latest, greatest, cutting-edge model because of its name. Won't they be surprised!
 
I'm puzzled why HD5870 is so poor at lower resolutions, to be honest. Something in VS/GS is killing ATI, it seems. Or could it be Z rate for high-resolution shadow maps?
I realise now, with GTX275 showing the same kind of shape as ATI in GT2, that the fillrate/Z rate shortfall is the key. It's not bandwidth, since HD4890 shows no significant advantage over HD5770OC.

Jawed
 
http://www.xbitlabs.com/articles/video/display/radeon-hd5770-hd5750_13.html#sect1

Comparing HD5770 to HD4770: with 42% more FLOPS and texture rate, 13% more fillrate and 50% more bandwidth, HD5770 is 50% faster in GT2 at 1280x1024. Curiously, at 1680x1050 it's 70% faster, which looks like one of those blips in the vein of HD4890 at 1920 in GT2.

In GT1, HD4770 is pixel limited at ~19.5MP/s (until 2560, where it collapses, presumably because it only has 512MB of RAM). Here HD5770 is 33% faster at 1280 and 52% faster at 1920.

GT2 on HD4770 looks pixel limited at 1920 and 2560, at 16.8MP/s.

It seems to me that the current architecture has a decent balance of fillrate/Z rate and bandwidth. The extra bandwidth of HD4890 is almost irrelevant and a comparison of HD5770OC with HD5770 at 1280 shows no bandwidth gain either.
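
Putting crude numbers on that: below is a quick sketch using the figures quoted above, where "efficiency" is simply the observed speedup divided by each resource increase. It's only a way to eyeball which resource the gains track, not a proof of what the actual bottleneck is.

Code:
# HD5770 vs. HD4770, using the figures quoted above: resource increases
# (FLOPS/Tex +42%, fillrate +13%, bandwidth +50%) against observed speedups.
# "Efficiency" = observed speedup / resource increase; >100% means the gain
# exceeds what that resource alone could explain.

resource_gain = {"flops+tex": 1.42, "fillrate": 1.13, "bandwidth": 1.50}
observed = {"GT1 1280": 1.33, "GT1 1920": 1.52, "GT2 1280": 1.50, "GT2 1680": 1.70}

for test, speedup in observed.items():
    per_resource = ", ".join(f"{res} {speedup / gain:.0%}"
                             for res, gain in resource_gain.items())
    print(f"{test} (+{speedup - 1.0:.0%}): {per_resource}")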

Shadow rendering still seems like a bottomless pit, though. And with no big bandwidth bump on the horizon (will GDDR5 even get to 7Gbps per pin?), what can AMD and NVidia do except something like Larrabee? NVidia will stretch it out for longer with a 512-bit bus (GF100 refresh?), I guess. With a 256-bit bus, AMD's next Z rate doubling is going to be troublesome.

Anyway, I expect GF100 will be notably faster than HD5870, as long as the ROPs are revamped. Simply adding more GT200-style ROPs is problematic, as ~50% more fillrate (i.e. 48 ROPs instead of 32) would require, I estimate, ~100% more bandwidth to show 50% more performance than HD5870.

Jawed
 
There's more interesting data in this article:
http://www.xbitlabs.com/articles/video/display/asus-matrix-gtx285_12.html#sect1
• GTX285: 648/1476/1242 MHz (core/shader/memory)
• MatrixOC: 740/1625/1461 MHz

Expanding on your picture, I've compiled the following:
Code:
                      FLOPS   Tex    Pix    Z      GB/s
HD 5870 vs. GTX 285    256%   131%   131%    66%    97%
HD 5770 vs. GTX 285    128%    66%    66%    33%    48%
HD 4890 vs. GTX 285    128%    66%    66%    33%    79%
MatrixOC vs. GTX 285   110%   114%   114%   114%   118%

GT1                    1280   1680   1920   2560
HD 5870 vs. GTX 285    125%   125%   129%   129%
HD 5770 vs. GTX 285     65%    63%    66%    66%
HD 4890 vs. GTX 285     75%    72%    76%    76%
MatrixOC vs. GTX 285   116%   113%   115%   112%

GT2                    1280   1680   1920   2560
HD 5870 vs. GTX 285    120%   118%   128%   137%
HD 5770 vs. GTX 285     64%    65%    68%    75%
HD 4890 vs. GTX 285     68%    74%    72%    77%
MatrixOC vs. GTX 285   115%   117%   117%   115%
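
If anyone wants to double-check the first table, here's roughly how the ratios fall out of published clocks and unit counts. The per-ROP Z rates (8 samples/clock for GT200 vs. 4 for the Radeons) and counting GT200's extra MUL in the FLOPS are assumptions on my part about how the figures were derived.

Code:
# Reproducing the spec-ratio table from clocks and unit counts.
# Assumptions: GT200 ROPs do 8 Z samples/clock vs. 4 for the Radeons,
# and GT200 FLOPS count the extra MUL (3 flops per SP per clock).

def specs(sps, flops_per_sp, sclk_ghz, tmus, rops, z_per_rop, cclk_mhz, gbps_pin, bus_bits):
    return {
        "FLOPS": sps * flops_per_sp * sclk_ghz,        # GFLOPS
        "Tex":   tmus * cclk_mhz / 1000,               # Gtexels/s
        "Pix":   rops * cclk_mhz / 1000,               # Gpixels/s
        "Z":     rops * z_per_rop * cclk_mhz / 1000,   # Gsamples/s
        "GB/s":  bus_bits / 8 * gbps_pin,
    }

#                  SPs  fl/SP  sclk   TMU ROP Z/ROP cclk  Gbps/pin   bus
gtx285   = specs( 240,  3,    1.476,  80, 32,  8,   648,  2 * 1.242, 512)
matrixoc = specs( 240,  3,    1.625,  80, 32,  8,   740,  2 * 1.461, 512)
hd5870   = specs(1600,  2,    0.850,  80, 32,  4,   850,  4.8,       256)
hd5770   = specs( 800,  2,    0.850,  40, 16,  4,   850,  4.8,       128)
hd4890   = specs( 800,  2,    0.850,  40, 16,  4,   850,  3.9,       256)

for name, card in [("HD 5870", hd5870), ("HD 5770", hd5770),
                   ("HD 4890", hd4890), ("MatrixOC", matrixoc)]:
    ratios = "  ".join(f"{k} {card[k] / gtx285[k]:.0%}" for k in card)
    print(f"{name} vs. GTX 285:  {ratios}")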

In GT2 there seem to be no bandwidth problems for the HD cards; in GT1, however, the HD 4890 can do something with its extra bytes per whatever.

Plus, what we must not forget are the different approaches and capabilities with regard to texture attribute interpolation in HD5k vs. HD4k: where HD4k cannot always utilize its TMUs fully, HD5k instead spends some of its excess shader power on attribute interpolation.
 
There's more interesting data in this article:
http://www.xbitlabs.com/articles/video/display/asus-matrix-gtx285_12.html#sect1
• GTX285: 648/1476/1242 MHz (core/shader/memory)
• MatrixOC: 740/1625/1461 MHz
There isn't much to add based on those numbers, I dare say. A slight preference for the extra bandwidth, at best, it seems.

In GT2 there seem to be no bandwidth problems for the HD cards; in GT1, however, the HD 4890 can do something with its extra bytes per whatever.
63% extra bandwidth and 15% performance gain is quite a gap. Though it's perhaps reasonable to argue that, since these tests are bound by "fixed-size" rendering (e.g. writing shadow buffers), HD4890's extra bandwidth would see more use in game tests. So things like:

http://www.xbitlabs.com/articles/video/display/radeon-hd5770-hd5750_7.html#sect2
http://www.xbitlabs.com/articles/video/display/radeon-hd5770-hd5750_9.html#sect2

show a huge advantage for HD4890's minimum frame rates. Gotta wonder about drivers :cry:
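
(For reference on the 63%/15% above: 124.8GB/s on HD4890 versus 76.8GB/s on HD5770 gives 124.8 / 76.8 ≈ 1.63x the bandwidth, while the GT1 table has them roughly 15% apart at every resolution, e.g. 75% vs. 65% of GTX 285 at 1280, i.e. 75 / 65 ≈ 1.15x.)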

Plus, what we must not forget are the different approaches and capabilities with regard to texture attribute interpolation in HD5k vs. HD4k: where HD4k cannot always utilize its TMUs fully, HD5k instead spends some of its excess shader power on attribute interpolation.
Yes, that's true. HD4770 doesn't have the interpolation-rate shortfall, so in theory it's better balanced. But its absolute performance is hobbled by bandwidth in GT2, less so in GT1.

Jawed
 
Shadow rendering still seems like a bottomless pit, though. And with no big bandwidth bump on the horizon (will GDDR5 even get to 7Gbps per pin?), what can AMD and NVidia do except something like Larrabee? NVidia will stretch it out for longer with a 512-bit bus (GF100 refresh?), I guess. With a 256-bit bus, AMD's next Z rate doubling is going to be troublesome.

The tech is out there:

http://www.rambus.com/us/products/xdr2/

Past PC implementations aside, Rambus has shown with both PS2 and PS3 that it can deliver very high-class technology (Direct RDRAM for PS2, FlexIO + XDR for PS3).
 
The tech is out there:

http://www.rambus.com/us/products/xdr2/

Past PC implementations aside, Rambus has shown with both PS2 and PS3 that it can deliver very high-class technology (Direct RDRAM for PS2, FlexIO + XDR for PS3).

The Rambus solution is just a stopgap towards what Jawed is hinting at. The problem is that we are getting close to the upper limits of copper, and a change in GPU architecture and, *the horror, the horror*, in the rendering pipeline is needed to switch over to more bandwidth-efficient methods. LRB and IMG are already there, and now it is AMD's and NV's turn to go that way.
 
The Rambus solution is just a stopgap towards what Jawed is hinting at. The problem is that we are getting close to the upper limits of copper, and a change in GPU architecture and, *the horror, the horror*, in the rendering pipeline is needed to switch over to more bandwidth-efficient methods. LRB and IMG are already there, and now it is AMD's and NV's turn to go that way.

True, but Rambus would allow AMD and nVIDIA to ease the transition to such a radically different (for them) system, which would be painful and need lots of resources.

Between the sub-6 Gbps that GDDR5 provides and the 12.8 Gbps that XDR2 can provide there is more than a 2x speed difference... a 512-bit XDR2 solution pushing for the peak 12+ Gbps target would give them quite a few years before being forced to deploy the new architecture (which IMHO they are already working on...).
 
True, but Rambus would allow AMD and nVIDIA to ease the transition to such a radically different (for them) system, which would be painful and need lots of resources.

Between the sub-6 Gbps that GDDR5 provides and the 12.8 Gbps that XDR2 can provide there is more than a 2x speed difference... a 512-bit XDR2 solution pushing for the peak 12+ Gbps target would give them quite a few years before being forced to deploy the new architecture (which IMHO they are already working on...).

XDR2 could help a bit, but not that much. The 12.8Gbps parts are only on the roadmap, hence they would better compare to 7Gbps GDDR5 (or whatever its top speed is supposed to be). Current XDR2 should "only" be good for 9.6Gbps, which is still almost twice as fast as 5Gbps, however. Rambus also claims a power advantage, though I think future GDDR5 will use only 1.35V, hence the advantage might not be that big.
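
To put rough numbers on the bus options: peak bandwidth is just bus width times per-pin rate. The pin rates below are the ones mentioned in this thread; the bus-width pairings are merely examples.

Code:
# Peak bandwidth (GB/s) = bus width in bits * per-pin data rate in Gbps / 8,
# ignoring any protocol overhead. Pin rates as discussed above; bus widths
# are just illustrative pairings.
def peak_gb_s(bus_bits, pin_gbps):
    return bus_bits * pin_gbps / 8

for name, bits, rate in [
    ("GDDR5  5.0 Gbps, 256-bit", 256,  5.0),
    ("GDDR5  7.0 Gbps, 256-bit", 256,  7.0),
    ("XDR2   9.6 Gbps, 256-bit", 256,  9.6),
    ("XDR2  12.8 Gbps, 512-bit", 512, 12.8),
]:
    print(f"{name}: {peak_gb_s(bits, rate):.0f} GB/s")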

I just have to bring this up: what about using eDRAM? Sure, putting full depth buffers in there isn't really feasible, but Z buffers are compressed nowadays, so if you only put the parts that are fully compressed (8:1 ratio?) in there, you'd "only" need 16MB for 2560x1600 with 8xAA, which doesn't look unreasonable. You'd still need high-bandwidth memory (to fetch the other parts of the Z buffer, colour buffers, textures, etc.), but surely this would help.
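
The 16MB figure checks out as back-of-envelope arithmetic, assuming 4 bytes per Z sample and the 8:1 compression mentioned above:

Code:
# eDRAM footprint for the fully-compressed portion of a Z buffer:
# 2560x1600, 8xAA, 4 bytes per Z sample (assumption), 8:1 compression.
width, height, aa, bytes_per_sample, ratio = 2560, 1600, 8, 4, 8
size_bytes = width * height * aa * bytes_per_sample / ratio
print(f"{size_bytes / 1e6:.1f} MB")   # ~16.4 MB (15.6 MiB)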
 