NVIDIA Fermi: Architecture discussion



I'm not surprised at all. Given the small uses for DP in gaming, Nvidia has done this so that their cheaper cards don't cannibalise the sales of their far more expensive HPC/professional cards.

Sure, hobbyists can't get relatively cheap gaming cards for their own little HPC projects, but what's that against Nvidia being able to sell the same product for $5000? Nvidia's banked a lot of transistors on HPC, and they will get their money back or else! No cheap supercomputers for you!
 
I think it just indicates NVidia thinks that neither ECC nor increased memory capacity is enough to justify buying a Tesla.

Jawed

That's probably rather close to the truth. Support would be another differentiator, but that's a bit nuanced anyhow, whilst this is pretty clear-cut.
 
You'd have a nice indication for that weird theory if the TDP for the 2070 @ 1.4 GHz had been set above 225 W.
 
Yeah, it's just that they know driver cut-outs don't work because the internet circumvents them, so they have to go for hardware limitations.
Who says it's a hardware limitation? If Nvidia didn't think about that beforehand, it could just as well be drivers.
I'm wondering, though, how they actually limit it. Only one SM per GPC DP-capable? Or all SMs in one GPC?

Jawed said:
Or, maybe GF100s will catch fire at consumer clocks if DP is run at full rate?
Seems a bit unlikely. Only the multipliers really have to work any harder compared to full-rate SP; operand fetch and the like are all the same. Plus, in DP-heavy things you're probably not going to tax texture units, ROPs, etc. that much at the same time...
 
I'm not sure, but wouldn't it be hard to limit DP performance in hardware? There's no dedicated hardware just for DP, is there?
 
Who says it's a hardware limitation? If Nvidia didn't think about that beforehand, it could just as well be drivers.
I'm wondering, though, how they actually limit it. Only one SM per GPC DP-capable? Or all SMs in one GPC?

Because as I said above the internet finds a way around software limitations. Something as simple as hacking inf files or intercepting hardware IDs has already been used to break limitations in drivers, and every big company knows it too.

If it's the difference between buying something for a few hundred dollars or a few thousand, someone will put the effort into it. Nvidia would be foolish to rely on a software-only lockout.
 
OlegSH, it seems to me it's there to claw back, for small triangles, the texture cache locality you get from higher-dimensional rasterization scan patterns with large triangles, and also to make the hierarchical Z checks a little cheaper.

PS. I don't think AMD does it though (it will almost certainly group quads from multiple triangles before shading, but I don't think it groups them by tile).
 
If it's the difference between buying something for a few hundred dollars or a few thousand, someone will put the effort into it. Nvidia would be foolish to rely on a software-only lockout.
You're probably right, but it doesn't mean companies haven't done foolish things before :).
The raw DP numbers look quite shocking this way compared to the HD 5870: DP FMA is 1/3, and DP ADD only 1/6, at least if that DP limitation affects more than the multiplier part...
I guess that's also an indication that derivative parts won't have any DP capability at all? After all, it wouldn't make sense for those chips to have DP capability and then disable it partially (otherwise they'd have more DP flops than the top part), and somehow I don't think mixing SMs in hardware (some with DP capability, some without) is what the doctor ordered.
 
The simplest thing is probably to just configure the schedulers to issue DP instructions at most every other clock. Disabling it at the SM level doesn't really make any sense, since you'll screw up your SP performance as well. I doubt it matters at all; Nvidia is banking hard on its superior software platform and more flexible architecture to differentiate its products. For those looking to do a little cheap GPU computing on the side, industry-leading peak DP performance probably isn't at the top of their must-have list.
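A back-of-the-envelope sketch of what scheduler-level throttling would do to peak throughput. The 512-core count, the hot clock, and the half-rate DP figure are my own illustrative guesses, not numbers from this thread:

```python
# Rough peak-DP model, assuming 512 CUDA cores, a hot clock at the top of the
# 1.25-1.40 GHz range mentioned later in the thread, and DP FMA at half the
# SP rate. All of these are assumptions for illustration, not confirmed specs.

def peak_dp_gflops(cores=512, hot_clock_ghz=1.40, dp_issue_duty_cycle=1.0):
    sp_gflops = cores * 2 * hot_clock_ghz       # FMA counts as 2 FLOPs
    full_rate_dp = sp_gflops / 2                # half-rate DP FMA
    return full_rate_dp * dp_issue_duty_cycle   # scheduler-imposed issue cap

print(peak_dp_gflops(dp_issue_duty_cycle=1.0))  # uncapped: ~717 GFLOPS
print(peak_dp_gflops(dp_issue_duty_cycle=0.5))  # DP every other clock: ~358 GFLOPS
```

The point being that an issue-rate cap like this leaves SP throughput completely untouched, which an SM-level fuse would not.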
 
At Tech Report, Scott Wasson mentioned two big hints about GF100's clock speeds:

1) Theoretical texture filtering rate on GF100 will be lower than GT200b (even though real world texture filtering performance for GF100 will often be superior to GT200b)

2) Running texturing hardware at half [hot clock] frequency will result in a 12-14% boost compared to running at core clock frequency.

GT200b has a core clock frequency of 648 MHz, and a hot clock frequency of 1476 MHz.

Well, look at this: if GF100 has the exact same core clock and hot clock frequencies as GT200b, then the two conditions above are met rather nicely! (Quick arithmetic check after the numbers below.)

So here are my predictions on first iteration GF100 clock frequencies:

Core Clock: 648 MHz
Hot Clock [Half Rate]: 738 MHz
Hot Clock: 1476 MHz
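
Here is that arithmetic check on condition 2, using nothing but the GT200b figures quoted above:

```python
# Does a GT200b-style clock pairing satisfy hint 2 (12-14% texturing boost at
# half hot clock vs. core clock)? Pure arithmetic on the numbers quoted above.
core_clock_mhz = 648
hot_clock_mhz = 1476
half_hot_mhz = hot_clock_mhz / 2                # 738 MHz
boost = half_hot_mhz / core_clock_mhz - 1       # ~0.139
print(f"{half_hot_mhz:.0f} MHz half hot clock, {boost:.1%} over core clock")
```

738 MHz over 648 MHz is a 13.9% gain, which sits inside the 12-14% range Scott Wasson mentioned.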

Nvidia has stated the hot clocks on the compute board are 1.25-1.40 GHz...

-Charlie
 
With the given figure of 2.8 billion tris/sec (with 8-pixel tris), I think that pretty much says they are running 1.4 GHz shaders, which means a 700 MHz half clock. And if 700 MHz is the TMU clock and they expect 12-14% more performance than at core clock, that puts your core clock range at 602-616 MHz.

Of course, given that clocks aren't 100% defined yet, they could still change from what was shown at CES...
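
Here's that back-of-envelope spelled out. The four-rasterizer, one-triangle-per-clock-at-half-clock assumption is mine, and the core clock range simply reuses the 12-14% figure the way the post above does:

```python
# From the 2.8 Gtri/s figure, assuming GF100's four raster engines each handle
# one triangle per clock at the half (TMU) clock. These are assumptions for
# illustration, not confirmed specs.
tri_rate = 2.8e9
raster_engines = 4
half_clock_mhz = tri_rate / raster_engines / 1e6    # 700 MHz
hot_clock_mhz = 2 * half_clock_mhz                  # 1400 MHz shader clock

# Treating the 12-14% texturing boost as a discount off the 700 MHz TMU clock,
# as in the post above:
core_low_mhz = half_clock_mhz * (1 - 0.14)          # ~602 MHz
core_high_mhz = half_clock_mhz * (1 - 0.12)         # ~616 MHz
print(hot_clock_mhz, core_low_mhz, core_high_mhz)
```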
 
I'm still waiting for any data that shows GPUs without ECC suffer from memory errors (once the memory has passed soak-testing for hardware problems).

Jawed

The issue isn't just hardware problems, but cosmic rays flipping your bits as well. Soak testing will do nothing to stop that. Build a big enough cluster and run it long enough, and the probability of failure becomes non-trivial. The estimate for cosmic ray bit flips is about 1 event per 256 MB of memory per month. Amazon was taken down for 24 hours in the 90s by a cosmic ray event.

People building HPC clusters are going to be using several hundred cards and running them on jobs which could run for weeks or months and consume huge $$$ in power costs and time, so having the results fscked up halfway through is a bitch.
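
To put the quoted rate against a cluster of that size (the card count and memory per card are round-number guesses, purely for illustration):

```python
# Expected soft errors using the ~1 event per 256 MB per month figure quoted
# above. Cluster size and per-card memory are illustrative guesses.
events_per_mb_per_month = 1 / 256
cards = 500
mem_per_card_mb = 3 * 1024          # assume 3 GB per card
run_months = 1

expected_flips = events_per_mb_per_month * cards * mem_per_card_mb * run_months
print(f"~{expected_flips:.0f} expected bit flips over the run")   # ~6000
```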

Even if fears are overrated, the people in the position of purchasing huge amounts of equipment, especially for government laboratories, are risk averse and like to buy safety.
 
Because as I said above the internet finds a way around software limitations. Something as simple as hacking inf files or intercepting hardware IDs has already been used to break limitations in drivers, and every big company knows it too.

If it's the difference between buying something for a few hundred dollars or a few thousand, someone will put the effort into it. Nvidia would be foolish to rely on a software-only lockout.

And who would run such a system on hacked drivers? The price of a Quadro or Tesla is nothing compared to one engineer being forced to sit around doing nothing for a day, or having to redo a calculation that took 24 hours while the whole schedule gets thrown off.

Apart from that, most software vendors won't give you support if you run GeForces on hacked drivers either.
 
Slightly ironic how NV pimped its half rate DP and DP performance in general so much and then the product comes out with 1/3rd the DP rate of the competition... ;)

Unless you buy a Tesla for 10+ times the price of course.

I'm not sure what they are thinking there: anyone who wants to build a GPU supercomputer can spend ~$3K a pop on Teslas at ~600 DP GFLOPS each, or they can buy 5870s at 544 DP GFLOPS for ~$350 a pop, and the 5870s will consume ~15% less power. Or 5970s at $600 each with 928 DP GFLOPS and 34% less power.
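
Putting those numbers side by side as DP GFLOPS per dollar (the prices and GFLOPS figures are the rough estimates from the post above, not official specs):

```python
# DP GFLOPS per dollar from the rough figures quoted in the post above.
cards = {
    "Tesla (GF100)": (600, 3000),   # (approx. DP GFLOPS, approx. price in USD)
    "HD 5870":       (544, 350),
    "HD 5970":       (928, 600),
}
for name, (gflops, price) in cards.items():
    print(f"{name:>14}: {gflops / price:.2f} DP GFLOPS per dollar")
```

On those numbers the Radeons come out roughly 8x ahead per dollar, before even counting the power difference.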
 
Layman perspective:

I'm not sure that the 5870 / 5890 are really the most important competitors of the Fermi for HPC. IMHO the real competitor could be Llano:

- wait ~half a year
- use a 4-CPU server board
- 16 GB of ECC RAM
- 4 Llano APUs with HT3 [~800 GFLOPS - 1 TFLOP DP performance]

- have fun ;)

- If AMD weren't so stupid as to include too few HT3 interfaces in the Llano APUs, then such a system would have far more HPC performance than Fermi at a comparable price (depending on how much AMD wants for the additional HT3 interfaces for server applications).
 