NVIDIA Fermi: Architecture discussion

A recurring observation across the websites doing in-depth analyses of Cypress was the surprisingly high density of transistors per square millimetre achieved.
I wouldn't say it was surprising.
And in this regard Fermi's deficit in density is much less than the deficit the previous generations experienced.

Seemingly a new benchmark for graphics chips. This was borne out in the performance-per-watt benchmarks, which also set new highs. I posit that qualifies for the appellation 'extremely efficient'. :oops:
Some of the efficiency arguments stem from the fact that Cypress has doubled its resources without a commensurate gain, though such ideal scaling would be impossible in any situation.

A naive adjustment of RV770's die size to the new process node, to correct for the change in density, does not show a stupendous gain in perf/area.
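
To make that concrete, here is a minimal sketch of such a naive adjustment, assuming roughly 256 mm² for RV770, roughly 334 mm² for Cypress, ideal (40/55)² area scaling, and ALU lane count as a crude proxy for performance. Real shrinks never achieve ideal scaling, so this deliberately flatters the shrunk RV770:

```python
# Naive die-shrink comparison: all figures are approximate and for illustration only.
rv770_area_55nm = 256.0   # mm^2, approximate RV770 die size at 55 nm
cypress_area    = 334.0   # mm^2, approximate Cypress die size at 40 nm

# Ideal linear shrink from 55 nm to 40 nm scales area by (40/55)^2 (~0.53).
shrink = (40.0 / 55.0) ** 2
rv770_area_40nm = rv770_area_55nm * shrink   # ~135 mm^2

# Use ALU lane count as a crude proxy for peak throughput.
rv770_alus, cypress_alus = 800, 1600
print(f"RV770 ideally shrunk to 40 nm: {rv770_area_40nm:.0f} mm^2")
print(f"ALUs/mm^2 - shrunk RV770: {rv770_alus / rv770_area_40nm:.1f}, "
      f"Cypress: {cypress_alus / cypress_area:.1f}")
```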
 
I actually meant that increasing the total core count would increase the minimum number of cores in use, which helps drive performance up. For example, the HD4870 has a minimum utilization of 160 cores; the HD5870 now has a minimum of 320.

Umm, that is not really the best way to look at things. There is no guarantee that more cores mean minimum utilization increases overall. By tweaking (in a weird manner, if you must) the rest of the chip's architecture, it is very much possible to have higher utilization on a 4870 than on a 5870. Whether such tweaks are desirable is another matter entirely.

A VLIW architecture, if running suitable workloads and paired with a decent compiler (as is the case with AMD's GPUs), can certainly be a win over purely serial architectures.
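
As a toy illustration of that point (the bundle-occupancy numbers below are invented, not measured from any real shader): a Cypress-style VLIW5 unit realizes its peak only to the extent that the compiler can fill its five slots with independent ops, but it pays almost nothing in scheduling hardware for the slots it does fill.

```python
from statistics import mean

VLIW_WIDTH = 5  # Cypress-style VLIW5: up to five independent scalar ops per bundle

# Hypothetical per-bundle packing achieved by the compiler for some shader.
packed_ops_per_bundle = [5, 4, 5, 3, 5, 5, 2, 4, 5, 3]

occupancy = mean(packed_ops_per_bundle) / VLIW_WIDTH
print(f"Average slot occupancy: {occupancy:.0%}")  # ~82% for this made-up mix

# The area trade-off: the unfilled ~18% of slots is the price paid for skipping
# the dynamic-issue logic that a CPU would spend large amounts of area on.
```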

By that do you mean: area efficiency, design efficiency, or instructions-per-clock efficiency?

Area efficiency. Since all of these things are in-order cores, instructions per clock is not really a useful metric. It is useful for CPUs, no doubt, which have massive amounts of area given over to OoOE and the scaffolding to support it.
 
Talking about density ... does anyone have any good reason why the crossbar+L2 is so fucking huge on Fermi? It's a bloody third of the die, right?
 
Since all of these things are in-order cores, instructions per clock is not really a useful metric. It is useful for CPUs, no doubt, which have massive amounts of area given over to OoOE and the scaffolding to support it.
One of the reasons for the atrocious FLOPS/mm² in NVidia's GPUs is the out-of-order instruction issue.

Jawed
 
Talking about density ... does anyone have any good reason why the crossbar+L2 is so fucking huge on Fermi? It's a bloody third of the die, right?
Rasterisation, Setup, ROPs, MCs, Global Scheduling, Tessellation and other bits and pieces.

Jawed
 
Talking about density ... does anyone have any good reason why the crossbar+L2 is so fucking huge on Fermi? It's a bloody third of the die, right?

I am guessing that all the fixed-function hardware + ROPs + MCs are concentrated in that region. 768 KB of L2 on 40 nm should be positively tiny relative to a ~550 mm² chip. It is a wee bit more than a third of the non-pad area on Fermi's die.

RV770 also has a lot of non-ALU/TEX stuff out there in its die. Off the top of my head, the ROPs+MC+L2 area is >70% of the ALU+TMU area.
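
A rough back-of-envelope for the "768 KB should be tiny" point, with the 40 nm SRAM cell size and the array-overhead factor both being assumptions rather than measured values:

```python
l2_bytes       = 768 * 1024      # Fermi's L2 capacity
bits           = l2_bytes * 8    # ~6.3 Mbit
cell_area_um2  = 0.24            # assumed 6T SRAM cell area at 40 nm, um^2 per bit
array_overhead = 2.5             # assumed multiplier for tags, sense amps, routing

l2_area_mm2  = bits * cell_area_um2 * array_overhead / 1e6
die_area_mm2 = 550.0             # approximate GF100 die size

print(f"Estimated L2 array area: {l2_area_mm2:.1f} mm^2 "
      f"({l2_area_mm2 / die_area_mm2:.1%} of a ~{die_area_mm2:.0f} mm^2 die)")
# Even with generous overhead this lands at a few mm^2, i.e. around 1% of the die,
# so the big block around it must be ROPs, MCs and fixed-function hardware.
```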
 
Extremely efficient? Not in the least when compared to RV770. It seems like you have an opinion and are trying to convince yourself of its veracity :)
Efficiency is not that bad if we consider the CPU limitation in the benchmarks.

Crysis Warhead, for example, was 27% faster on an HD5870 than on a GTX285 at 1920 with 4x AA in "gamer" quality with the 8.66 RC6 driver, but with "enthusiast" quality at the same resolution the lead increased to 38%; the same trend holds at 2560, with a 33% lead in "gamer" quality.

Many games show this, and I strongly doubt NV could show much higher numbers. In fact, many games seem to be so CPU-limited that we see better performance from HD5770 CF than from a single HD5870, because AFR hides the latency even though the games are in fact less playable.

There are still some issues with Cypress scaling. For example, heavily tessellated scenes in Heaven do not render much faster than on Juniper at stock clocks: only about 40% faster despite almost everything being doubled, which points to a triangle-setup limitation. Yet they are not that much slower at half clocks either, only ~40% slower, while the card is already ~10% slower without heavy tessellation, whereas a half-clocked Juniper shows perfect scaling, which nullifies the first assumption. There are opposites too: STALKER CoP, for example, shows nearly perfect scaling.
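
Putting the half-clock arithmetic in one place (the percentages are the ones quoted above; treating triangle setup as one triangle per core clock is an assumption):

```python
# If the heavily tessellated scene were purely setup-bound, and setup runs at the
# core clock, halving the clock should halve the frame rate (i.e. 50% slower).
expected_if_setup_bound = 0.50

# Figures reported above for Cypress at half clocks:
observed_tessellated = 0.40   # ~40% slower in the heavy-tessellation scene
observed_plain       = 0.10   # already ~10% slower without heavy tessellation

print(f"Setup-bound prediction: {expected_if_setup_bound:.0%} slower at half clock")
print(f"Observed (tessellated): {observed_tessellated:.0%} slower")
print(f"Observed (no heavy tessellation): {observed_plain:.0%} slower")
# Losing less than the clock-bound prediction, while half-clocked Juniper scales
# perfectly, is what makes a pure triangle-setup limit hard to sustain.
```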

So, Cypress seems to be limited internally, perhaps by its shared memory, which could have the same bandwidth as Juniper's, but we have proof this can be as significant as Fermi's high DP throughput, and we'll have to wait and see whether it has any effect on DX11 games, for which the only hardware currently available to experiment on is AMD's. By the way, does Fermi have much higher throughput for this shared memory? As far as I know, its combined L1+shared throughput is at most 50% higher.
 
So we aren't allowed to use games to benchmark graphics cards any more? :cry:
When discussing "efficiency" they are less useful, as you need to double every element of the system to truly understand the difference between RV770 and Cypress; conversely, look at some of the stuff going on in the GPGPU forum and you see that there are already good examples that show the full engine scaling, and also cases that show gains greater than the raw "number of units" would suggest, due to specific architecture changes. Typically, with games, to get something that can show "2x" performance you need to scale the engine >2x if you are not going to get a 2x jump in all areas of the system.
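
A minimal Amdahl-style sketch of that point, with the frame-time splits below being invented examples rather than measurements of any real game:

```python
def overall_speedup(doubled_fraction: float, scale: float = 2.0) -> float:
    """Speedup when only `doubled_fraction` of frame time runs on units scaled by `scale`."""
    return 1.0 / ((1.0 - doubled_fraction) + doubled_fraction / scale)

for frac in (1.0, 0.9, 0.8, 0.5):
    print(f"{frac:.0%} of frame time in the doubled units -> "
          f"{overall_speedup(frac):.2f}x overall")
# 2.00x, 1.82x, 1.67x, 1.33x: unless nearly everything a frame touches is doubled,
# a clean 2x result needs the doubled parts to be stressed well beyond 2x.
```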
 
There are still some issues with Cypress scaling. For example, heavily tessellated scenes in Heaven do not render much faster than on Juniper at stock clocks: only about 40% faster despite almost everything being doubled, which points to a triangle-setup limitation. Yet they are not that much slower at half clocks either, only ~40% slower, while the card is already ~10% slower without heavy tessellation, whereas a half-clocked Juniper shows perfect scaling, which nullifies the first assumption. There are opposites too: STALKER CoP, for example, shows nearly perfect scaling.
Ooh, that's some nice data.

So, Cypress seems to be limited internally, perhaps by its shared memory, which could have the same bandwidth as Juniper's, but we have proof this can be as significant as Fermi's high DP throughput, and we'll have to wait and see whether it has any effect on DX11 games, for which the only hardware currently available to experiment on is AMD's.
LDS is used for attribute data so that it can be interpolated by shader code. Distributing parameters to the SIMDs might be a bottleneck separate from the triangle-rate bottleneck.

Jawed
 
It's not bogus when what we want is to compare the performance advantage of one of the new cards over one from the previous generation :)
If, as Jawed points out, it's CPU limited, then the benchmarks are going to have limited use, as each of the architectures you are comparing is going to be limited by the CPU. Games are fairly CPU bound (even ones that people often think of as GPU killers, like Crysis, are very CPU sensitive).
 
That is because the GTX 295 is memory limited; it has 896 MB per GPU, whereas the HD5870 has 1 GB.
Have you ever seen benchmarks for overclocking the HD5870 to 1.0 GHz? The difference is barely 10%.

I recently saw benchmarks for the PowerColor watercooled version. My 20-30% guesstimate was maybe overblown, but 10-20% is certainly within the realm of the possible. An optimised stepping, upgraded cooling, upgraded components: with the 5870 at ~$400 and the 5970 at ~$650, there is a slot for a $500 5890, which leaves room for some premium component upgrades. Again, AMD will have time to tweak the design to a fare-thee-well and stockpile binned silicon for the purpose.

AMD has every incentive to sucker punch Fermi as hard as possible when it finally releases and Nvidia is not doing much to mitigate that as far as I can discern.
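
A quick sanity check on the "1.0 GHz gives barely 10%" figure quoted above, assuming the stock HD5870 core clock is 850 MHz and the memory clock is left untouched:

```python
stock_core_mhz = 850.0    # assumed HD5870 stock engine clock
oc_core_mhz    = 1000.0

clock_gain = oc_core_mhz / stock_core_mhz - 1.0   # ~17.6% more core clock
perf_gain  = 0.10                                 # reported performance gain

print(f"Core clock increase: {clock_gain:.1%}")
print(f"Reported performance increase: {perf_gain:.0%}")
print(f"Scaling efficiency: {perf_gain / clock_gain:.0%}")
# Well under 100% scaling points at a limiter that was not overclocked, such as
# memory bandwidth or the CPU, rather than the shader core itself.
```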
 
If, as Jawed points out, it's CPU limited, then the benchmarks are going to have limited use, as each of the architectures you are comparing is going to be limited by the CPU. Games are fairly CPU bound (even ones that people often think of as GPU killers, like Crysis, are very CPU sensitive).

I don't think anyone here is dismissing that. However, comparisons of new and old generations of graphics cards are done with games (whatever they are, usually the popular ones). Most people in need of a new graphics card do not care about the architecture used or its advantages over other architectures in specific tasks (like GPGPU or a selection of GPU-bottlenecking apps). All they want is higher fps in their favorite games and one or more handy features. And when new graphics cards appear, comparisons are made with games, against both the competition's cards and the company's own previous generation, to see how big the performance jump is.
 
Crysis Warhead, for example, was 27% faster on an HD5870 than on a GTX285 at 1920 with 4x AA in "gamer" quality with the 8.66 RC6 driver, but with "enthusiast" quality at the same resolution the lead increased to 38%; the same trend holds at 2560, with a 33% lead in "gamer" quality.

That only tells me that the HD5870 handles "enthusiast" quality better than the GTX 285 does, not that the HD5870 is bottlenecked by the CPU. How does the comparison look against the HD4890?

There are still some issues with Cypress scaling. For example, heavily tessellated scenes in Heaven do not render much faster than on Juniper at stock clocks: only about 40% faster despite almost everything being doubled, which points to a triangle-setup limitation. Yet they are not that much slower at half clocks either, only ~40% slower, while the card is already ~10% slower without heavy tessellation, whereas a half-clocked Juniper shows perfect scaling, which nullifies the first assumption. There are opposites too: STALKER CoP, for example, shows nearly perfect scaling.

Well, that would be more damning evidence, no? It certainly can't be used to excuse the relatively weak scaling.

When discussing "efficiency" they are less useful, as you need to double every element of the system to truly understand the difference between RV770 and Cypress; conversely, look at some of the stuff going on in the GPGPU forum and you see that there are already good examples that show the full engine scaling, and also cases that show gains greater than the raw "number of units" would suggest, due to specific architecture changes. Typically, with games, to get something that can show "2x" performance you need to scale the engine >2x if you are not going to get a 2x jump in all areas of the system.

That's fine but that raises more questions than it answers. I'm gonna assume that you guys do lots of profiling of existing and future game workloads and use that analysis to determine where to focus with future hardware. So now you're saying that doubling texture units and doubling ALUs did not result in doubled performance because the bottleneck is elsewhere. So why didn't you guys address those bottlenecks instead of doubling up stuff unnecessarily? Honest question.
 