NVIDIA GF100 & Friends speculation

The problem with all the answers from nVidia is that I don't see the filter units running at hot clock.
But they mentioned a few times that the clock will be higher than it is today...


http://forums.nvidia.com/index.php?showtopic=159270&view=findpost&p=1003188

And from the whitepaper:


It would be very stupid if the "TU clock" were slower than on the GTX285. Maybe there will be no "hot clock/2", because I can't find anything about it.

ROPs/L2 cache are on core clock (the TechReport article states that NV told them the difference between core clock and 1/2 hot clock should be in the 12-14% ballpark).
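A quick sanity check on what that 12-14% gap would imply. The 1400 MHz hot clock below is purely an illustrative assumption, not a confirmed spec:

```python
# Illustrative sketch: relation between hot clock, 1/2 hot clock, and core
# clock, assuming the quoted 12-14% gap. 1400 MHz is a made-up example.
hot_clock = 1400.0            # MHz, assumed for illustration
half_hot = hot_clock / 2      # 700 MHz

# If 1/2 hot clock runs 12-14% above core clock:
core_low  = half_hot / 1.14   # ~614 MHz
core_high = half_hot / 1.12   # ~625 MHz
print(f"1/2 hot clock: {half_hot:.0f} MHz, "
      f"implied core clock: {core_low:.0f}-{core_high:.0f} MHz")
```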

All indications point into the direction that TMUs will run at 1/2 hot clock; I'm just toying with the thought that they might have clocked the TFs at hot clock, which theoretically would mean it wouldn't be a problem anymore.
 
Okay, I suppose NVidia may not reuse as much logic, but 256 DP FMAs is still peanuts for a 3B transistor chip when all the data flow is already taken care of. I can't see one being larger than the equivalent of 10,000 full adders, so it should be well under 2% of the die space for all 256.
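A back-of-envelope check of that "well under 2%" claim. The per-adder transistor counts below are assumptions (a static-CMOS full adder is very roughly 20-28 transistors); the 10,000 full-adder bound per FMA and the 3B budget come from the post:

```python
# Rough transistor-count check of the "256 DP FMAs are peanuts" argument.
chip_transistors = 3e9        # "3B transistor chip" from the post
fma_units        = 256
adders_per_fma   = 10_000     # upper-bound estimate from the post

for t_per_adder in (20, 28):  # assumed transistors per full adder
    total = fma_units * adders_per_fma * t_per_adder
    share = total / chip_transistors
    print(f"{t_per_adder}T/adder: {total/1e6:.0f}M transistors, "
          f"{share:.1%} of the budget")
```

So on transistor count the claim lands in the ~2% ballpark; actual die-area share depends on how much of that 3B budget is dense cache versus logic.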

Not all transistors are the same. Things like cache cells take a lot of transistors but not a lot of area; things like high-performance multipliers take a decent amount of area and a moderate number of transistors. Know the fastest way to get more transistors? Add cache. Lots and lots of cache. Cache transistor density is significantly higher than logic. So arguing that because a chip has x billion transistors, something that isn't hundreds of millions of transistors shouldn't have an effect, is ridiculous.

A DP multiplier is a LARGE piece of logic. First you need a large multiplier array; these things scale with the square of the bit width. As an example from the FPGA world, a DP multiplier takes 6x more resources than an SP multiplier. Everything else scales at best linearly with FP width. The net result is that DP FP pipes take up a LOT of real estate, and while the difference for one unit is rather small, when you have 256 of them it adds up, significantly so. So say an SP FP unit is 0.5mm^2. A DP unit is in the range of 1-1.5, but we'll use 1mm^2 for the example. The difference between 256 SP units and 256 DP units is 128mm^2. Not at all insignificant.
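The arithmetic above can be reproduced directly; the per-unit areas are the post's illustrative figures, not measured numbers, and the mantissa widths are the IEEE 754 single/double significands:

```python
# Reproducing the post's area arithmetic with its own example figures.
sp_area_mm2 = 0.5   # assumed SP FP unit area, from the post
dp_area_mm2 = 1.0   # assumed DP FP unit area, from the post
units = 256

diff = units * (dp_area_mm2 - sp_area_mm2)
print(f"Extra area for {units} DP units: {diff:.0f} mm^2")

# The "scales with the square of bits" point, using mantissa widths
# (24-bit SP vs 53-bit DP, hidden bit included):
mul_ratio = (53 / 24) ** 2
print(f"Multiplier-array scaling: ~{mul_ratio:.1f}x")
```

That ~4.9x array scaling is consistent with the 6x FPGA figure once routing and other overheads are included.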
 
Four months seems long to me, but since a lot of the work is compute bound, it really depends on what compute resources you can throw at it.
I'm counting from synthesized netlist to metal tape out, which typically happens two weeks or so after base layer tape out and overlaps fracture.
 
So this is code for "swaaye, you're an idiot, and I shall make a spectacle of you LOLZ!".
Not my intention. I just want to counter the PR/FUD surrounding Fermi's compute capabilities. Compute and graphics feed into each other: improvements in one allow/induce improvements in the other.

To be complete, if you want the ultimate graphics chip, take Larrabee 1.0 (which is about as much a compute chip as can be) and increase its power and area efficiency by an OoM (without sacrificing the architecture; feel free to shoot x86, though :LOL:).
 
A lot of this depends on granularity. Multipliers have the greatest area expansion, and it looks like AMD specifically designed theirs so that DP didn't add a significant area increase, reducing the delivered throughput to the point where they could get by with only the existing hardware, without having to add additional hardware support. In contrast, it seems that Fermi maintains a 2:1 mul ratio, which requires adding significantly more hardware to the design. AKA, a DP mul requires roughly 4x the area of an SP mul, whereas for add it's only roughly 2x.
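A toy model of that trade-off. The 60/40 mul/add area split of an SP pipe below is an assumption made up for illustration; the 4x/2x DP expansion factors are the post's:

```python
# Sketch of why a 2:1 DP rate costs area while looping SP hardware doesn't.
sp_mul_frac, sp_add_frac = 0.6, 0.4   # assumed area split of an SP pipe
dp_mul_cost, dp_add_cost = 4.0, 2.0   # DP expansion factors from the post

# Dedicated DP-capable pipe (the 2:1 approach described for Fermi):
dp_pipe = sp_mul_frac * dp_mul_cost + sp_add_frac * dp_add_cost
print(f"DP-capable pipe: ~{dp_pipe:.1f}x the area of an SP pipe")

# Looping the existing SP multipliers over several cycles (the approach
# described for AMD) trades throughput for area instead:
print("Looped approach: ~1x area, but only a fraction of the DP rate")
```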

What is the area cost of an int32 multiplier vs an fp32 multiplier? The former was added in Fermi just to do DP FP more cheaply. int32 multiplication happens only in the T unit in ATI, so it's clearly not something that is terribly useful for graphics. The AGUs don't use the int32 multiplier either.

If anything, that is the real cost of dp in fermi.
 
Even if they saw after A1 that they couldn't reach the speeds they wanted and decided right away that a B1 could improve things, they'd still want to fix the logic bugs in metal first. It takes at least 4 months to go from a netlist to tape-out, and another 2 months until silicon, and only then can you start with qualification again.

Thx for the reply. Helps me a lot to understand how such things are actually done.

Nothing has taped out, nor has any prep been made to do so, so tapeouts are unlikely to be imminent. That puts things at 6 months or so out minimum.
-Charlie

Thx for the info. You are always a first-rate source for such info. If NV has no mainstream and entry-level DX11 chips till the end of 2010, they are broke.
 
Yes, but you can't change a fuse in the glove box and have the engine change from a V6 to a V8 now can you?

-Charlie
Okay, this line of reasoning is really, really bad, so I'm going to have to step back up again.

Look, there is no possible way that nVidia (or any company, for that matter) can produce consumer cards that perform as well as professional ones in professional apps. Here's the basic issue:

1. It takes time and money on nVidia's part to develop and validate drivers for use on professional software.
2. This money must be recouped through the sales of professional graphics hardware (i.e. Quadro/Tesla).
3. If the sales of the professional hardware are not sufficient to pay for the development (hardware and software) of said professional hardware, then nVidia will make more money by stopping production.

What this all boils down to, then, is that nVidia has lower economies of scale on professional hardware, which means higher per-unit costs, which means they must necessarily sell those units for more. If they don't, then they're better off selling no units at all. So for nVidia to have any professional hardware units at all, they cannot allow GeForces to be used for those purposes (at least not easily).
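The amortization argument in plain numbers. Every dollar and unit figure below is made up for illustration; only the structure (fixed driver cost spread over unit volume) comes from the post:

```python
# Toy model of why professional cards must carry a higher price.
driver_dev_cost = 20_000_000   # assumed fixed cost of pro driver work ($)

for name, units in (("professional (Quadro-class)", 100_000),
                    ("consumer (GeForce-class)", 10_000_000)):
    per_unit = driver_dev_cost / units
    print(f"{name}: ${per_unit:,.0f} of driver cost per card")
```

With these made-up figures the same fixed cost is $200 per professional card but only $2 per consumer card, which is the economies-of-scale gap the post describes.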

You can whine about it all you like, but business realities guarantee that nVidia simply can't ever freely support professional applications on GeForce hardware. Getting annoyed that the two cards are just a BIOS flash apart won't change the business realities.
 
If AMD can pull another "4 chips in <6 months" in 2H10 (or even by 1Q11), they will be in a shitload of trouble.

I don't know. DX11 is clearly an advantage for Juniper-class products, but below that? I doubt many people would really care.
 
With the next shrink ATI will reach 3200 SPs easily. If nvidia keeps the Tesla bloat, then they will need 2 shrinks for 1024 cores. And they are now more than 6 months late. They will need to make radical changes to the design.

The nvidia ALUs were good against ATI's vector ALUs when running over 1700 MHz while ATI had 600-700 MHz clocks. Now they can barely reach 1400 MHz and ATI chips are running 800+ MHz, so the gap is closer now with less area. And if the leakage follows the trend from 55nm to 40nm, then maybe even the next shrink won't help nvidia much.
 
With the next shrink ATI will reach 3200 SPs easily. If nvidia keeps the Tesla bloat, then they will need 2 shrinks for 1024 cores. And they are now more than 6 months late. They will need to make radical changes to the design.

The nvidia ALUs were good against ATI's vector ALUs when running over 1700 MHz while ATI had 600-700 MHz clocks. Now they can barely reach 1400 MHz and ATI chips are running 800+ MHz, so the gap is closer now with less area. And if the leakage follows the trend from 55nm to 40nm, then maybe even the next shrink won't help nvidia much.


The Northern Islands won't be an evolution, it will be AMD's DX11 Revolution.
 
With the next shrink ATI will reach 3200 SPs easily.

Since when have anyone's shrinks been exactly linear? :oops: And you seem to be forgetting that AMD also needs to make system-wide changes in its cache hierarchy to make it more efficient, not to mention make it more compatible with Bulldozer. Also, do you know (Cypress+1)'s die budget?

If nvidia keeps the Tesla bloat, then they will need 2 shrinks for 1024 cores.
Tesla bloat, WTF is that? Funny enough, you were arguing that AMD can double its ALUs in one shrink while NV apparently needs 2 shrinks to get to the same place. Did NV's wafers start getting fabbed in an alternative universe while I wasn't looking? :rolleyes:

And they are now more than 6 months late.
Remember spring of 2007? Situation was exactly the opposite.

They will need to make radical changes to the design.
They seem to already have, with Fermi. Now it's AMD's turn with NI.
 
Since when have anyone's shrinks been exactly linear? :oops: And you seem to be forgetting that AMD also needs to make system-wide changes in its cache hierarchy to make it more efficient, not to mention make it more compatible with Bulldozer. Also, do you know (Cypress+1)'s die budget?

It's 40 to 28, not 32, so why isn't a doubling of ALUs possible, given comparable die sizes? Also, is it likely that AMD will change its 5D ALUs? Sure, make memory access and sharing more granular, etc. So 3200 might be a bit too far; what about 2400-2800, depending on other changes? Especially if AMD's clock speeds keep trending up as well.
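The scaling claim is easy to check under ideal assumptions. Real processes never scale perfectly, so treat these numbers as upper bounds:

```python
# Ideal-scaling check of the 40 nm -> 28 nm doubling argument.
def density_gain(old_nm, new_nm):
    # Transistor density scales roughly with the square of the
    # feature-size ratio under ideal scaling.
    return (old_nm / new_nm) ** 2

print(f"40 -> 28 nm: ~{density_gain(40, 28):.2f}x density")
print(f"40 -> 32 nm: ~{density_gain(40, 32):.2f}x density")
```

A full node to 28nm gives roughly 2x density in the ideal case, which is what makes "double the ALUs in a comparable die" at least arithmetically plausible; a half node to 32nm would give only ~1.6x.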

A 28nm AMD card in H2 this year could be a monster...
 
Tesla bloat, WTF is that? Funny enough, you were arguing that AMD can double its ALUs in one shrink while NV apparently needs 2 shrinks to get to the same place. Did NV's wafers start getting fabbed in an alternative universe while I wasn't looking? :rolleyes:


The wafers are the same, but unfortunately nvidia has a much bigger chip. :LOL: So yes, exactly as I said. Nvidia's ALUs take much more area than ATI's anyway. And to put things together, a 1024 SP GF100 would be a monster even on the next shrink. They already need a shrink for GF100 if they want to get to ATI's level.

I didn't say that they will double everything again or that it should be much faster; I just allege that ATI could pull out a 3200 SP chip after a shrink, while nvidia can be happy with a shrunk GF100 with higher clocks and the added transistors and die area needed to make it manufacturable. And all this while being 6 months late to market.
 
Nothing has taped out, nor has any prep been made to do so, so tapeouts are unlikely to be imminent. That puts things at 6 months or so out minimum.

-Charlie

LOL, only in the world of disinformation or of someone that just likes to make things up, does that make any sense...

Do you really expect anyone to believe that "no prep has been made" for other Fermi based parts, other than the high-end ?
 