NVIDIA Fermi: Architecture discussion

@spigz - I really think you're grossly overestimating the draw of Eyefinity. I'm not seeing this unbridled lust that you describe or a rush to purchase triple monitor setups. As Dave mentioned a few posts above, developers have to get on board and support it natively before it can even be considered a mainstream solution (and even then people have to be willing to shell out for 3 monitors). What I can guarantee you is that stuff like Eyefinity, 3D Vision, etc. will always take a backseat to good old performance leadership.

Think logarithmic.

It certainly isn't going to be a mainstream solution any time soon. But it can become a 'mainstream' computer gamer solution in the not too distant future.

Consider the situation a year from now when the 6800 series is being released. With console hardware static for at least the next two years and the continued migration of developers to programming for consoles first and computers a distant second, there will be a lot of unused power in the 6850/70 cards.

On the developer side, from what I've read, Eyefinity support is supposedly relatively painless and easy to implement. If so, why not include it with their computer ports and get a cheap 'Eyefinity supported' checkbox on their product? For computer-focused games, one might expect DX11 and Eyefinity support to become the rule. Who's going to be releasing a major computer-centric game a year from now without them?

When one has a GPU that already contains performance headroom for the foreseeable future, what's the next logical upgrade that makes use of that unused performance potential while simultaneously providing a much better and expanded gaming experience? It starts small, but as the implementation improves, performance headroom expands and monitors become ever cheaper, a gamer-specific 'perfect storm' scenario can occur. Eyefinity adoption that starts slow then begins to grow in a logarithmic manner ... until that niche market is saturated/matured ~ onto bigger and better monitors. This will also 'bleed' over into the mainstream segment, and even the more casual gaming crowd with the free green will start to adopt it, and that will keep expanding too, albeit at a much slower rate.

True, compared to the whole it will always be a niche market; for that matter, that is true of the entire discrete GPU market. But that doesn't mean there isn't some serious money to be made, or that Eyefinity won't become a compelling solution in the niche it operates in. Consider those developers currently restricting Eyefinity support in their multiplayer games because it gives 'too much of an advantage'. As multi-monitor adoption continues to grow, those restrictions will become unviable and be lifted. When that occurs, all those gamers still playing on single-monitor set-ups will have a compelling reason to upgrade to a multi-monitor set-up. The same dynamic will play out across the entire computer multiplayer arena.

And eventually across the hardcore console gamer segment and beyond. One might expect multi-monitor support to be included in the next console hardware cycle, considering what graphics capability will be available for implementation two to three years from now, and that by then a sizeable established base of gamers with multi-monitor set-ups will be in place and eager for a console that supports it.
 
A 32SP part seems unreasonable to me, as Fermi uses only about 15 sq. mm for each partition... I think that's way too low for a ~60 sq. mm part, and that's assuming the custom logic/L2 scale down well. And since they still need to fill a 16-way SIMD, that's harder than with the previous 8-way SIMD designs.
Why do you think a 15 sq. mm cluster is "too low" for a 60 sq. mm part? GT218 is below 60 sq. mm (also on 40nm), and I highly doubt its single cluster (apparently only 16-way instead of 24-way like the other GT2xx parts) is anywhere near that size...
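For what it's worth, the disagreement is really just arithmetic, so here's a minimal die-area sketch in Python. The ~15 sq. mm per SM is the figure quoted above; the fixed 'uncore' cost (L2, memory controllers, display/video blocks) is purely a placeholder guess, not a measured number.

```python
# Rough die-area sanity check for a cut-down Fermi derivative.
# The ~15 sq. mm per SM figure is the one quoted in this thread; the
# uncore estimate (L2, memory controllers, I/O, video) is a pure guess.

SM_AREA_MM2 = 15.0       # per-SM estimate from the discussion above
UNCORE_AREA_MM2 = 25.0   # assumed fixed cost shared by the whole chip

def die_area(num_sms: int) -> float:
    """Very rough total die area for a part with `num_sms` SMs."""
    return num_sms * SM_AREA_MM2 + UNCORE_AREA_MM2

for sms in (1, 2, 4):
    print(f"{sms} SM(s) / {sms * 32} cores: ~{die_area(sms):.0f} sq. mm")
```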
 
Yes, I do believe it would be competitive with Cypress's salvage part (HD 5850). Given the architectural differences and improvements over GT200, this "half GF100" could be on par with, or a bit faster than, the GTX 285, which competes with the HD 5850 right now on most occasions.
Note that a part with 256 Fermi cores might not necessarily be faster than one with 240 Tesla cores (for graphics, that is). Apart from the obvious (fewer texture units), there's another potential issue I haven't really seen much talk about: I believe there's only half the SFU capability per cluster (the material I've seen so far is a bit lacking in that regard; in the TechReport article the image doesn't match the description). So I think special functions only execute at 1/8 the rate of normal instructions, instead of 1/4 (this would also affect interpolation rate, but not the famous "missing or not" MUL, since that one is gone for good anyway). Not sure though if that's really true, and even if it is, I have no idea if it would make much of a difference...
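To make the claimed ratios concrete, here's a quick sketch using the per-SM counts discussed in this thread (8 cores / 2 SFUs for a GT200 SM, 32 cores / 4 SFUs for a GF100 SM); clocks are ignored and the GF100 numbers are of course still unconfirmed:

```python
# Special-function issue rate relative to the normal instruction rate,
# from the per-SM counts discussed in this thread (GF100 unconfirmed).

def sfu_ratio(cores_per_sm: int, sfus_per_sm: int) -> float:
    """Fraction of the ALU issue rate at which SFU ops can execute."""
    return sfus_per_sm / cores_per_sm

for name, cores, sfus in (("GT200 SM", 8, 2), ("GF100 SM", 32, 4)):
    rate = sfu_ratio(cores, sfus)
    print(f"{name}: special functions at 1/{round(1 / rate)} rate")
```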
 
I have a question for Rys. In an interview, an Nvidia product manager was asked whether Fermi will have dedicated hardware for tessellation, and he confirmed it; that dates back to Oct 07.

http://forums.nvidia.com/index.php?showtopic=149550

But when Rys wrote his piece at TechReport, he mentioned that he expects Fermi to feature a software tessellator. I sense he didn't trust Nvidia's statement. Does he still believe that, or what?

In other words, will Fermi have a hardware tessellator or not?

Read what Rys wrote, and then interpret the statement that the nVidia PR... err, PM made. They're not mutually exclusive. Having hardware for something can mean many, many things (look ma, I just added a dedicated tristate buffer here / a few extra dedicated bitlines, win!!), whereas Rys gave an in-extenso explanation of where he was coming from and what he meant by the bit you mention.
 
Note that a part with 256 Fermi cores might not necessarily be faster than one with 240 Tesla cores (for graphics, that is). Apart from the obvious (fewer texture units), there's another potential issue I haven't really seen much talk about: I believe there's only half the SFU capability per cluster (the material I've seen so far is a bit lacking in that regard; in the TechReport article the image doesn't match the description). So I think special functions only execute at 1/8 the rate of normal instructions, instead of 1/4 (this would also affect interpolation rate, but not the famous "missing or not" MUL, since that one is gone for good anyway). Not sure though if that's really true, and even if it is, I have no idea if it would make much of a difference...

Fermi has 4 SFUs per SM or Cluster. GT200 has 2 per SM or 6 per cluster.
SF operations stall a GT200 SM, but with Fermi they only stall one dispatcher. But I don't know whether the dispatcher waits 2 cycles or 8 cycles (like in GT200) before it can deliver a new half-warp to a functional unit.
 
Fermi has 4 SFUs per SM or Cluster. GT200 has 2 per SM or 6 per cluster.
Right, which is what gives it half the instruction rate of GT200/G92 (as an SM in GT200/G92 has 8 cores, but an SM in GF100 has 32). I think I was just a bit confused because the TechReport article stated there's an SFU per 16-way sub-block, but it's probably just wrong. The SFU might also be changed a bit in other ways, but it seems like a safe bet that peak special-function throughput is only half with Fermi compared to their older chips (half 'per shader core', of course).
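Putting those per-SM figures together at chip level, here's a small arithmetic sketch; it assumes GT200 = 240 cores at 8 per SM (30 SMs, 2 SFUs each) and a hypothetical "half GF100" = 256 cores at 32 per SM (8 SMs, 4 SFUs each), ignoring clock differences:

```python
# Chip-level SFU totals implied by the per-SM figures in this thread.
# GT200: 240 cores at 8 per SM -> 30 SMs with 2 SFUs each.
# Hypothetical half GF100: 256 cores at 32 per SM -> 8 SMs with 4 SFUs each.
# Clocks are ignored; the point is only the raw unit count.

chips = {
    "GT200 (240 cores)": {"sms": 30, "sfus_per_sm": 2},
    "half GF100 (256 cores)": {"sms": 8, "sfus_per_sm": 4},
}

for name, c in chips.items():
    print(f"{name}: {c['sms'] * c['sfus_per_sm']} SFUs total")
```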
 
I think I was just a bit confused because the TechReport article stated there's an SFU per 16-way sub-block, but it's probably just wrong.
Looking at it, I described it wrong in the text, but got it right in the diagram. Well spotted, will get that cleaned up.
 
SF operations stall a GT200 SM, but with Fermi they only stall one dispatcher. But I don't know whether the dispatcher waits 2 cycles or 8 cycles (like in GT200) before it can deliver a new half-warp to a functional unit.

I don't think that's accurate. At least one patent highlights the ability of the dispatcher (running at base clock) to issue instructions to the SFU and ALU pipelines in alternate cycles, even in G80-class hardware. Each ALU instruction runs for 4 hot clocks, or 2 base clocks, which provides a window to do so. The SFU and ALU pipelines presumably have dedicated operand collectors to support this as well.
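As a toy illustration of that alternate-cycle issue pattern, here's a small scheduler sketch; the two-base-clock ALU occupancy follows the description above, while giving the SFU pipe the same occupancy is just an assumption to make the interleaving visible:

```python
# Toy timeline of the alternate-cycle issue idea described above: the
# dispatcher runs at base clock, an ALU warp instruction occupies its
# pipeline for 2 base clocks, and the idle base clock in between can be
# used to feed the SFU pipeline. Giving the SFU pipe the same 2-clock
# occupancy is an assumption; this is not a model of the real hardware.

ALU_OCCUPANCY = 2   # base clocks an ALU instruction holds its pipe
SFU_OCCUPANCY = 2   # assumed, just for illustration

def schedule(num_base_clocks: int) -> None:
    alu_busy_until = 0
    sfu_busy_until = 0
    for clk in range(num_base_clocks):
        if clk >= alu_busy_until:
            alu_busy_until = clk + ALU_OCCUPANCY
            print(f"base clk {clk}: issue to ALU pipe")
        elif clk >= sfu_busy_until:
            sfu_busy_until = clk + SFU_OCCUPANCY
            print(f"base clk {clk}: issue to SFU pipe")
        else:
            print(f"base clk {clk}: nothing to issue (both pipes busy)")

schedule(6)   # prints an alternating ALU/SFU issue pattern
```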

Consider the situation a year from now when the 6800 series is being released. With console hardware static for at least the next two years and the continued migration of developers to programming for consoles first and computers a distant second, there will be a lot of unused power in the 6850/70 cards.

Oh dear, well I certainly hope developers target something less than a 5760x1200 resolution for their games because all we'll end up with is a lot of ugly. Do you really look forward to running the best that the Xbox and PS3 can do at the end of their lifetimes blown up in all its unrefined glory? There's something to be said for playing less demanding games at a higher resolution but there are much better uses of the available horsepower IMO. In this respect I think Nvidia's strategy is more potent in the longer term because at the end of it all Eyefinity is just upping the resolution.
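For a sense of scale on the "just upping the resolution" point, here's a quick pixel-count comparison; the 720p baseline is only a rough stand-in for a console render target:

```python
# Pixel counts for the resolutions mentioned above, relative to a
# 1280x720 frame used here as a rough stand-in for a console target.

resolutions = {
    "single 1920x1200": (1920, 1200),
    "Eyefinity 5760x1200": (5760, 1200),
}
base = 1280 * 720

for name, (w, h) in resolutions.items():
    px = w * h
    print(f"{name}: {px / 1e6:.1f} MPix, {px / base:.1f}x a 720p frame")
```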
 
Arguably we're in the middle of the cycle, where cross-platform studios are now doing pre-emptive next-generation research in their PC renderers and engines, because it's highly likely that at least the graphics hardware will resemble something DX11-ish at that point. Hopefully most of the cross-platform developers here would agree.
 
Yeah but will you take the odds that we'll see the new tech on PCs before the new consoles arrive? Our best bet seems to be stuff like CryEngine which targets PC hardware but scales down. Console ports rarely scale up.
 
I think the 512SP Fermi part may come much after the initial launch, once yields improve and respins are done - maybe a GTX 385 (or if they keep going with their ludicrous part renaming schemes, a GTX 480).
 
What sense would a quick A3 make, if A2 was not production ready in general? So we can assume that A2 was close to production ready.

It either is production ready or it isn't. If it hadn't been, a minor respin like A3 wouldn't have been enough and a Bx-type respin would have been necessary. A small and quick respin like A3 probably serves for nothing but minor refinements, and in this particular case to kill time.
 
It either is production ready or it isn't. If it hadn't been, a minor respin like A3 wouldn't have been enough and a Bx-type respin would have been necessary. A small and quick respin like A3 probably serves for nothing but minor refinements, and in this particular case to kill time.
I also thought about this reason, but you said the smaller chips are going to tape out when the high-end part is finished. So Nvidia loses time on the smaller chips with this A3 spin.
 
I believe the heterodyne flux transaction mogrifier needed to be fixed.

Care to elaborate a bit for us clueless readers here in order to understand what the result of that fix would be? (honest question)

I also thought about this reason, but you said the smaller chips are going to tape out when the high-end part is finished. So Nvidia loses time on the smaller chips with this A3 spin.

Let's see what silent_guy's answer to the above question will be. Unless of course it was some form of sarcasm, which I of course with my equal to 0 knowledge on matters like that cannot detect.
 
Care to elaborate a bit for us clueless readers here in order to understand what the result of that fix would be? (honest question)



Let's see what silent_guy's answer to the above question will be. Unless of course it was some form of sarcasm, which I of course with my equal to 0 knowledge on matters like that cannot detect.

Marty asked the Doc.. Doc said.. we must go back.. to the Future!
 