NVIDIA Fermi: Architecture discussion

Looks like we have an obvious trend here: G80 (90nm), GT200b (55nm) and GF100 (40nm) -- all falling within the narrow range of ~480mm². :LOL:

Remains to be seen if the latter is as exciting as the first one or as boring as the middleman ;)
 
Also, assuming that is the RV870 die shot, I could be wrong, but it looks like there are 24 SIMDs there.

Mmm, fish, yum

Yeah, I think RV870 is actually 24 SIMDs (1920 SPs) with a few disabled, and I think it was Charlie D who also hinted that the real shader count is not 1600...

5890 in about 6 months maybe?

/Kef
 
Yeah, I think RV870 is actually 24 SIMDs (1920 SPs) and a few are disabled, and I think it was Charlie D who also hinted that the real shader count is not 1600...

5890 in about 6 months maybe?

/Kef

It was Charlie D and not Suzie Q huh? Who's that Charlie bloke again and why is that story complete bullshit?
 
http://www.xbitlabs.com/news/video/...Fermi_Features_on_Gaming_Graphics_Cards.html#


At its GPU Technology Conference (GTC) last week Nvidia specifically noted that it paid a lot of attention to boosting double-precision floating point computational performance on its Fermi-G300 chip (about 750 GFLOPs) and thus will be able to address several new markets, e.g. high-performance computing. However, DP performance is rarely needed by average consumers in the desktop or laptop markets. Obviously, in order to create a power-efficient version of Fermi for notebooks or low-cost desktops, Nvidia will have to sacrifice some of its capabilities.

“We're not talking about other (chips) at this point in time, but you can imagine that we can scale this part by having fewer than the 512 cores and by having these cores have fewer of the features, for example less double-precision,” said Mr. Dally, who did not explain how it is possible to reduce double-precision floating point performance without decreasing single-precision speed, something which is needed by video games. In fact, Mr. Dally’s comment may imply that non-flagship Fermi derivatives will not only be slower in terms of performance, but will be seriously different in terms of implementation.
 
It doesn't seem all too hard to reduce or eliminate DP without affecting SP.

The data paths are already structured for peak SP, so that can remain unchanged.
The load/store units are structured for SP, so they won't mind.

One hackish possibility is to gut DP functionality from one of the pipelines in a core. It cuts DP and likely certain INT functions in half, and the schedulers will have to be updated so they only issue heavy operations on the main pipe.

It could be dumped entirely, which means yanking the heavier ALUs and putting in skinnier units.
Possibly the schedulers and operand collectors might have the logic for the dual-register special case elided.

The effectiveness of this will also depend on how much of the core is taken up by the ALUs. There are some big areas in those cores that probably won't change all that much.
 
Regarding die sizes:

It is not very probable that Fermi will be less than 530mm² with those specs (3 billion transistors).

For the design to come in under 530mm², the likely explanation is:

1. Previously, Nvidia's transistor counts did not include the cache-related transistors.
(It would be strange (though I guess not impossible) if Nvidia achieved +30% transistor density in the same process and the same fab as ATI, since the goals regarding parametric issues are the same for NV and ATI, imo.)

and

2. With this design, Nvidia started counting the dedicated cache(-related) transistors, since the cache is much larger than in previous designs (and since, from now on, this will be the trend for future GPGPUs).
(I guess the cache is higher density than the core, but the amount is so small that it can't produce a 480mm² result on its own.)
 
There's no way they would have under-represented the transistor count in prior designs. There's just too much geek cred involved.
 
Average transistor density (millions of transistors per mm²):

Cypress -- 6.45 (2154M:334mm²)
Fermi -- 6.25 (3000M:480mm²)

Pretty even this round, eh? ;)
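
For reference, those density figures are just transistor count over die area; a trivial check, using the rumoured 3000M / 480mm² numbers for Fermi from this thread rather than anything official:

```cpp
#include <cstdio>

// Average density in millions of transistors per mm² (rumoured figures, not official).
static double density(double transistors_millions, double area_mm2) {
    return transistors_millions / area_mm2;
}

int main() {
    printf("Cypress: %.2f Mtrans/mm^2\n", density(2154.0, 334.0)); // ~6.45
    printf("Fermi:   %.2f Mtrans/mm^2\n", density(3000.0, 480.0)); // ~6.25
    return 0;
}
```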
 
According to a PCGH interview with Andy Keane, each CUDA core consists of:

a DP-FMA, an SP-FMA and an integer ALU

... and they say in some cases the DP-FMA can be used for SP tasks.

So we are talking about up to 4 FLOPs per CUDA core? :???:
The way I understand it, he was talking about the total number of cores per SM. So one of the groups of 16 is DP-capable and can also be used for SP; the other is SP only. INT cannot ever produce "FL"OPs. ;)
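
As a rough sanity check on those numbers (a sketch only: counting an FMA as 2 FLOPs, 512 cores, DP at half the SP rate per the interpretation above, and a purely assumed ~1.5 GHz hot clock, since no clocks are confirmed):

```cpp
#include <cstdio>

int main() {
    const double cores     = 512.0;   // CUDA cores in the full chip
    const double hot_clock = 1.5e9;   // Hz -- assumed, not a confirmed figure
    const double fma_flops = 2.0;     // one fused multiply-add = 2 FLOPs

    double sp_gflops = cores * fma_flops * hot_clock / 1e9; // every core can do SP FMA
    double dp_gflops = sp_gflops / 2.0;                     // only half the cores do DP

    printf("SP: ~%.0f GFLOPS\n", sp_gflops); // ~1536
    printf("DP: ~%.0f GFLOPS\n", dp_gflops); // ~768, near the ~750 quoted earlier
    return 0;
}
```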


Those terms seem much better than the transistor count-based scaling idiocy I see everywhere I look. Also, I don't see how I could make the caveat stronger.

Jawed

The way you posted it, it sounds like you considered the GTX 285 to be faster than the HD 4890 only because of the ROPs/TMUs. The whole Fermi chip (TWFC) with 512 ALUs will probably be losing its edge in terms of those two (comparatively speaking), but otoh gain (quite a lot?) on the compute side of things - even if it's only GT200-style FLOPs, Cypress will only have 77% more SP-MAD FLOPs (using a 1500ish hot clock), whereas it used to be a ~92%ish advantage.
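
Working those percentages backwards from the theoretical MAD rates, with the GTX 285 and HD 4890 shader clocks and the same assumed ~1.5 GHz Fermi hot clock as above, purely for illustration:

```cpp
#include <cstdio>

int main() {
    // Theoretical SP MAD throughput in GFLOPS: ALUs * 2 FLOPs (MAD) * clock in GHz
    double hd4890  =  800.0 * 2.0 * 0.850; // 1360 GFLOPS
    double gtx285  =  240.0 * 2.0 * 1.476; //  ~708 GFLOPS ("GT200-style", MAD only)
    double cypress = 1600.0 * 2.0 * 0.850; // 2720 GFLOPS
    double fermi   =  512.0 * 2.0 * 1.500; // 1536 GFLOPS, hot clock assumed

    printf("HD 4890 over GTX 285: +%.0f%%\n", (hd4890  / gtx285 - 1.0) * 100.0); // ~92%
    printf("Cypress over Fermi:   +%.0f%%\n", (cypress / fermi  - 1.0) * 100.0); // ~77%
    return 0;
}
```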

Plus, as you said, we don't know the individual effectiveness of TWFC yet.
 
The way you posted it, it sounds like you considered the GTX 285 to be faster than the HD 4890 only because of the ROPs/TMUs.
And bandwidth. Not FLOPs though.

The whole Fermi chip (TWFC) with 512 ALUs will probably be losing its edge in terms of those two (comparatively speaking), but otoh gain (quite a lot?) on the compute side of things - even if it's only GT200-style FLOPs, Cypress will only have 77% more SP-MAD FLOPs (using a 1500ish hot clock), whereas it used to be a ~92%ish advantage.
The deletion of SPI in ATI will mean the advantage is even lower, theoretically. Though you might argue that simply means higher utilisation on ATI (since it's rarely more than 85% on R700 I guess). Still don't know what the instruction throughput for attribute interpolation is, or what instructions are executed to fulfill it.

The hunt is still on for a game that's notably ALU-bound... Current reviews seem devoid.

Though you might say that such a game is not defined exclusively by "when the HD 4890 is faster than the GTX 285", since it's possible to write a shader that has terrible ALU utilisation on ATI while being fine on NVidia - but you'd have to make the majority of all shading be bottlenecked on such shader(s) for that to become obvious in benchmarking.

And, obviously in general, some shaders can be ALU-bound in a game, but the game still mostly scales with other capabilities. Hard to separate out all this stuff. E.g. the SSAO in Battleforge:

http://www.anandtech.com/video/showdoc.aspx?i=3650&p=2

looks like a candidate for being ALU-bound (compare HD4870 and GTX285) but that might actually be unfiltered texel rate (the speed-up under CS5 apparently derives from the bandwidth efficiency of fetching from shared memory)...

Plus, as you said, we don't know the individual effectiveness of TWFC yet.
With post-processing and shared memory or memory-intensive algorithms it could be a lot faster, since NVidia seems to have made more effort in that direction.

But the games have to be released. And I was basing my position on the games being benchmarked now. Of course in 6 months' time, say, when GF100 is actually available, such games might have arrived.

Jawed
 
That seems to be a big issue for maintaining compatibility amongst cards. If there are cards with no DP, then it makes things a hell of a lot harder for developers - you end up with all sorts of ugly checks to find feature sets...ironically just like a CPU.

What happens if code is written assuming DP and then runs on a card with no DP? Does the JIT just emit an error?
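
For illustration only, a minimal sketch of the kind of check this leads to in CUDA today: query the device's compute capability and branch on it, since DP needs compute capability 1.3 or higher. How a JIT/driver would behave on future DP-less parts is exactly the open question.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        fprintf(stderr, "No CUDA device found\n");
        return 1;
    }
    // Double precision requires compute capability 1.3 or newer.
    bool has_dp = (prop.major > 1) || (prop.major == 1 && prop.minor >= 3);
    printf("%s: compute capability %d.%d, double precision %s\n",
           prop.name, prop.major, prop.minor,
           has_dp ? "available" : "NOT available");
    // Without a check like this, a kernel built only for sm_13 simply fails
    // to launch on an older part.
    return 0;
}
```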

David
Will DP matter during the supposed lifetime of Fermi outside of some small professional market? And wouldn't Nvidia really love to have a strong reason for this professional market to buy the more expensive Quadros and/or Teslas instead of GeForces? This could prove way more potent than just some BIOS-level stuff to differentiate pro from consumer products, couldn't it?
 
http://www.hexus.net/content/item.php?item=20568

"Designing a CPU and designing a GPU is a remarkably different thing," he continued, noting that one was all about locality of content in cache in very high frequencies and speculative execution. "The key word is speculation, so that they can have more parallelism," said Huang, adding "we don't speculate anything. We don't speculate at all."

The NVIDIA chief went on to explain that because his firm's technology was latency tolerant, it could simply use another thread "out of the 15,000 other threads we're executing on at any other time, instead of just one or two threads. We just keep running."

"The two architectures are radically different, the two approaches are radically different, the problems faced are dramatically different," continued an unrelenting Huang.
 
That's more like it ;)
And bandwidth. Not FLOPs though.


The deletion of SPI in ATI will mean the advantage is even lower, theoretically. Though you might argue that simply means higher utilisation on ATI (since it's rarely more than 85% on R700 I guess). Still don't know what the instruction throughput for attribute interpolation is, or what instructions are executed to fulfill it.

The hunt is still on for a game that's notably ALU-bound... Current reviews seem devoid.

Though you might say that such a game is not defined exclusively by "when the HD 4890 is faster than the GTX 285", since it's possible to write a shader that has terrible ALU utilisation on ATI while being fine on NVidia - but you'd have to make the majority of all shading be bottlenecked on such shader(s) for that to become obvious in benchmarking.

And, obviously in general, some shaders can be ALU-bound in a game, but the game still mostly scales with other capabilities. Hard to separate out all this stuff. E.g. the SSAO in Battleforge:

http://www.anandtech.com/video/showdoc.aspx?i=3650&p=2

looks like a candidate for being ALU-bound (compare HD4870 and GTX285) but that might actually be unfiltered texel rate (the speed-up under CS5 apparently derives from the bandwidth efficiency of fetching from shared memory)...


With post-processing and shared memory or memory-intensive algorithms it could be a lot faster, since NVidia seems to have made more effort in that direction.
Mostly I agree. It's quite hard to filter out the different limitations, but I think a lot has been improved internally in Fermi, whereas AMD chose more or less the brute-force approach, mainly scaling the number of execution units while leaving internal bandwidth to and from the caches etc. untouched, and thus at the level of the identically clocked HD 4890.

Nvidia otoh chose to overhaul much of their architecture, to the point that we still don't even know the numbers and capabilities of their TMUs and ROPs. Plus they've improved (on paper, mind you; independent real-world tests will show!) the cache hierarchy and the size of the register file, they apparently added some kind of warp-level hyperthreading, allowing them to fill more empty slots, they've made context switching faster (though that's only relevant for interaction with PhysX; about DirectCompute I'm not sure) and so on.

Now, what I am totally unsure about is the way in which all these improvements, which seem to make perfect sense for computing, will affect real-world gaming tests, but I am willing to believe for now that they at least won't slow down all the units compared to GT200.

Additionally, they're now at an even greater bandwidth advantage compared to the GTX 285 vs HD 4890 situation.

[I deleted the part of your quote about game profiles possibly changing over the next 6 months, which I highly doubt.]

All in all, I think Fermi has made the wait until real gaming benchmarks show up very interesting again.
 


Huang seems content to bash a strawman, or at least to strawman aspects of the design that probably have better targets for criticism. As far as CPUs go these days, Larrabee's level of speculation is very modest.
The cache argument is something all architectures must worry about, since the power cost of a single computation is an order of magnitude lower than that of a cache load, which is in turn an order of magnitude lower than that of a DRAM access.
 
It doesn't seem all too hard to reduce or eliminate DP without affecting SP.

The data paths are already structured for peak SP, so that can remain unchanged.
The load/store units are structured for SP, so they won't mind.

You cannot just rip out a bunch of logic from the FPUs and achieve anything. You'd really need to redesign the FPUs from scratch and re-layout the execution resources to save power and area.

It's not like Lego blocks where you can magically disconnect them, or even like a multi-core where you just lop off one core.

To do DP you need more bits for your operands (mantissa, exponent, etc.) and you have to store and play with them somewhere. That will be very close to the logic that does single precision (which is just a smaller mantissa and exponent).

One hackish possibility is to gut DP functionality from one of the pipelines in a core. It cuts DP and likely certain INT functions in half, and the schedulers will have to be updated so they only issue heavy operations on the main pipe.

It could be dumped entirely, which means yanking the heavier ALUs and putting in skinnier units.
Possibly the schedulers and operand collectors might have the logic for the dual-register special case elided.

The effectiveness of this will also depend on how much of the core is taken up by the ALUs. There are some big areas in those cores that probably won't change all that much.

Frankly, at that point you need to significantly redesign almost the whole thing; the scheduler would be somewhat different, the dispatch as well, etc.

You really cannot easily remove DP, any more than you could easily remove x87 from a CPU. It's pretty deeply integrated in there, and you'd need to redo your layout anyway to take advantage.

David
 
1/ Nvidia can compete with 5870/5850 being shipped in volume
2/ Nvidia can ship GT300 in volume before Q1 2010 when they couldn't even produce (or at least spare 1 to show publicly) an actual prototype card at the start of Q4 2009? Because I don't think that's ever been done in this industry
I don't see any volume of ATI's DX11 chips. All they need is a (paper) launch to make people wait a little bit longer for Fermi.

The only problem for Nvidia is performance. Fermi must beat RV870 by 30-40%; then everything is fine, even if they deliver 3-4 months later than ATI.
 