NVIDIA Fermi: Architecture discussion

no-X · Dec 22, 2009

Using your logic, ATi could have marketed RV770 as a 900 SP part, because they never said that any real product would offer the full configuration.

If any GPU has a set of SPs which is always used for redundancy (= the case of Tesla parts), there's no reason to advertise the full number of SPs... unless you want to spoil your competitor's launch...

Silus · Dec 22, 2009

no-X said:
Using your logic, ATi could have marketed RV770 as a 900 SP part, because they never said that any real product would offer the full configuration.

If any GPU has a set of SPs which is always used for redundancy (= the case of Tesla parts), there's no reason to advertise the full number of SPs... unless you want to spoil your competitor's launch...

That would be quite different, wouldn't it? ATI would be advertising that their architecture has 900 SPs, when in fact it only has 800. That's not what's happening here. What we have here is a product based on Fermi that has some units disabled. Is that a big deal?

Fermi has 512 ALUs. Products based on Fermi will have as many ALUs as NVIDIA wants to or has to (up to 512). Since they never revealed how many ALUs Fermi based Teslas would have (until now that is), I really don't understand why the big deal over this...

Most people in this thread are always going on about how NVIDIA is focusing so much on the HPC market (with Tesla) and that it doesn't matter for gamers, yet make a big deal about what a Tesla product is, now that full specifications are known.

As a saying around here goes (which I will roughly translate): "Punished for having a dog and punished for not having one"

A.L.M. · Dec 22, 2009

Silus said:
As usual, yes. There was never any promise that Tesla parts would have 512 SPs.

Silus said:
How is that splitting hairs ? Must assumptions that Tesla would use the full Fermi chip, be considered "facts" now ? I'm not the owner of "lessthanaccurate" you know
Until now, there was nothing that proved that Fermi based Teslas would use the full chip.

Also, everyone assumed that Fermi missed its target clocks, based on the announcement of Tesla parts, which had lower DP capabilities than expected. We know that Tesla parts go through a much stricter validation process, which usually means they are clocked much lower than their GeForce counterparts. Now we also know that Tesla won't be using the full Fermi chip, which lowers its DP capabilities even more (448 SPs @ 1400 MHz = ~627 GFLOPs).

Silus said:
That would be quite different, wouldn't it? ATI would be advertising that their architecture has 900 SPs, when in fact it only has 800. That's not what's happening here. What we have here is a product based on Fermi that has some units disabled. Is that a big deal?

Fermi has 512 ALUs. Products based on Fermi will have as many ALUs as NVIDIA wants to or has to (up to 512). Since they never revealed how many ALUs Fermi based Teslas would have (until now that is), I really don't understand why the big deal over this...

Most people in this thread are always going on about how NVIDIA is focusing so much on the HPC market (with Tesla) and that it doesn't matter for gamers, yet make a big deal about what a Tesla product is, now that full specifications are known.

As a saying around here goes (which I will roughly translate): "Punished for having a dog and punished for not having one"

Oh, come on...
Everyone, Rys included, was fooled by the words of JHH, calculating the theoretical math power of Tesla with 512 cores active. This is because they wanted to be unclear in their statements...
They were simply boasting about something they couldn't deliver, just to keep up the hype.
The whole press meeting and presentation of Fermi was about the Tesla version, not the GeForce one. Thus I expect (as everyone else out there does) that if someone tells me "I am presenting an HPC card based on a chip that can have up to 512 CUDA cores", I will find 512 CUDA cores active in the high-end version of that line of products.

All the theoretical comparisons with RV870 were wrong, actually, and they didn't clarify anything, just in order to fool people...

Sontin · Dec 22, 2009

A.L.M. said:
All the theoretical comparisons with RV870 were wrong, actually, and they didn't clarify anything, just in order to fool people...

Do you know the specification of the geforce cards?

3dilettante · Dec 22, 2009

Debate a claim or a data point.
Nobody with a decent point is going to concern themselves with what Charlie can/will/won't/can't/has the itch/hankerin' for/desire to do.

If you want to debate his motivations or whatever level of good/bad qualities, take it up on the SemiAccurate forum.

trinibwoy · Dec 22, 2009

Silus said:
As a saying around here goes (which I will roughly translate): "Punished for having a dog and punished for not having one"

The complete "english" translation is "Damned if you do and damned if you don't"

Whether or not Nvidia broke some implicit promise by disabling a few cores isn't really an issue as long as they achieve promised performance. AMD had to disable a few SIMDs to produce their second tier part and they're working with a much smaller die. It's an ominous sign for Geforce parts if the cut was made for yield reasons though. Tesla should be a whole lot more tolerant to low yields given the much lower volumes and higher ASP.

Silus · Dec 22, 2009

trinibwoy said:
The complete "english" translation is "Damned if you do and damned if you don't"

Yeah, that's basically it

trinibwoy said:
Whether or not Nvidia broke some implicit promise by disabling a few cores isn't really an issue as long as they achieve promised performance. AMD had to disable a few SIMDs to produce their second tier part and they're working with a much smaller die. It's an ominous sign for Geforce parts if the cut was made for yield reasons though. Tesla should be a whole lot more tolerant to low yields given the much lower volumes and higher ASP.

Exactly. But I doubt it will affect the 512 ALUs part that much anyway. Those parts are always low volumes too.

Silus · Dec 22, 2009

A.L.M. said:
Oh, come on...
Everyone, Rys included, was fooled by the words of JHH, calculating the theoretical math power of Tesla with 512 cores active. This is because they wanted to be unclear in their statements...
They were simply boasting about something they couldn't deliver, just to keep up the hype.
The whole press meeting and presentation of Fermi was about the Tesla version, not the GeForce one. Thus I expect (as everyone else out there does) that if someone tells me "I am presenting an HPC card based on a chip that can have up to 512 CUDA cores", I will find 512 CUDA cores active in the high-end version of that line of products.

All the theoretical comparisons with RV870 were wrong, actually, and they didn't clarify anything, just in order to fool people...

I'm not disputing that it was what they wanted us to believe. You are, however, disputing that they never actually said, in so many words, that Tesla would use Fermi's full chip. And the fact is that they didn't.

Also, they never boasted about something that they couldn't deliver.
Fermi was pitched as 8x DP over GT200. And this was at GTC, which as you said was about the Tesla version. Last time I checked, 8 * 78 GFLOPs (624) is roughly equal to the ~627 GFLOPs that the now-known Fermi-based Teslas will have. So what they claimed, thus far, is what they seem to be delivering.
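
For anyone who wants to check the arithmetic, here is a minimal sketch of where the ~627 figure comes from. It assumes peak DP is counted the usual way (a fused multiply-add counts as 2 flops, and DP runs at half the SP rate), and the function name is just something made up for illustration. The 448 cores, 1400 MHz and 78 GFLOPs inputs are the numbers quoted in this thread.

```python
def peak_dp_gflops(cores: int, shader_clock_mhz: float) -> float:
    """Peak double-precision GFLOPs under the assumptions stated above."""
    # Assumption: FMA = 2 flops per clock, DP at half the SP rate,
    # so each core contributes 2 * 0.5 = 1 DP flop per shader clock.
    dp_flops_per_core_per_clock = 2 * 0.5
    return cores * dp_flops_per_core_per_clock * shader_clock_mhz / 1000.0


tesla_448 = peak_dp_gflops(448, 1400)  # the announced Tesla configuration
full_512 = peak_dp_gflops(512, 1400)   # what a full chip would give at the same clock
gt200_x8 = 8 * 78                      # the "8x DP over GT200" pitch

print(f"448 cores @ 1400 MHz: {tesla_448:.1f} GFLOPs")
print(f"512 cores @ 1400 MHz: {full_512:.1f} GFLOPs")
print(f"8 x GT200 (78 GFLOPs): {gt200_x8} GFLOPs")
```

Under those assumptions the 448-core part lands within about half a percent of the 8x claim, while a full 512-core part at the same clock would sit noticeably above it.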

Groo The Wanderer · Dec 22, 2009

Razor1 said:
Nope, that's only for Tesla. Tesla has much tighter power constraints than GeForce, because of the amount of RAM, closed casings, and cluster configs.

If anyone listened to the web conference about Tesla, there was mention about power usage and the flop numbers given in recent documentation, but it has nothing to do with the other lines of cards.

So you are saying that it is more power efficient to have fewer higher clocked shaders than more lower clocked ones? Interesting view of physics your company has. I wonder if that is the explanation for bumpgate?

-Charlie

Groo The Wanderer · Dec 22, 2009

A.L.M. said:
Oh, come on...
Everyone, Rys included, was fooled by the words of JHH, calculating the theoretical math power of Tesla with 512 cores active. This is because they wanted to be unclear in their statements...
They were simply boasting about something they couldn't deliver, just to keep up the hype.
The whole press meeting and presentation of Fermi was about the Tesla version, not the GeForce one. Thus I expect (as everyone else out there does) that if someone tells me "I am presenting an HPC card based on a chip that can have up to 512 CUDA cores", I will find 512 CUDA cores active in the high-end version of that line of products.

All the theoretical comparisons with RV870 were wrong, actually, and they didn't clarify anything, just in order to fool people...

I disagree. I think they expected to be at 512. If you recall, they only had A1 silicon for ~2-3 weeks before the conference. I don't think they realized how screwed they were at the time. Now they do.

Short story, they may have been deluded by their own theoretical silicon prowess, but I think this may have been a rare case of corporate honesty on their part, which is why it seems so odd and unfamiliar to hear.

Once again, reality intruded and ruined their master plan. Le-sigh. I am pretty sure ORNL wasn't as pleased with the silicon they got either.

-Charlie

3dilettante · Dec 22, 2009

Groo The Wanderer said:
So you are saying that it is more power efficient to have fewer higher clocked shaders than more lower clocked ones?

I would have thought that downclocking more units could have been the way to go for the top low-volume bin. Random defects shouldn't have forced a cluster inactivation, even with the historically coarser methods Nvidia has used for redundancy.

Just wondering:
Maybe they aren't getting much of a power win in downclocking, and can't push the voltage any lower without incurring stability or subthreshold leakage problems?
edit: maybe it's not subthreshold leakage, but one of the static leakage components?

The supply voltage on what I believe was one of the official Tesla slides looked to be decently low and close to 1.0 and voltage scaling has been getting more challenging the lower it gets.

Arun · Dec 22, 2009

Now I'm biased here, because the number one thing on my wishlist for Fermi architecturally was per-cluster clock speeds and it looks like it didn't happen, but I suspect this is more likely to be power-related; parametric yields, if you wish.

Intra-chip variability is a big deal on these nodes and for such a massive chip. You need to choose one voltage for the entire chip, and some parts are going to have quite a bit of headroom left whereas others will just barely deliver. If you want to minimize voltage for a given clock frequency (i.e. optimize performance/mm²), then it helps to get rid of the lowest-clocking clusters. If you want to minimize leakage, you can disable the highest-clocking clusters which are usually the most leaky. So if you want to optimize overall power consumption, you pragmatically do a little bit of both.
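
To make the trade-off concrete, here is a toy sketch of that binning idea. Every number in it is made up for illustration (the voltage spread, the leakage figures, the linear relationship between the two), and it assumes a disabled cluster actually stops leaking, which would need power gating. The only real-world hook is that 16 clusters minus 2 gives 14, which at 32 cores per cluster matches the 448-core Tesla.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cluster:
    name: str
    vmin: float    # hypothetical: lowest voltage at which it hits the target clock
    leak_w: float  # hypothetical: leakage in watts at nominal voltage


def make_clusters(n=16):
    # Made-up linear spread: fast clusters need less voltage but leak more.
    clusters = []
    for i in range(n):
        t = i / (n - 1)  # 0.0 = fastest and leakiest, 1.0 = slowest
        clusters.append(Cluster(f"c{i}", vmin=0.95 + 0.10 * t, leak_w=3.0 - 2.0 * t))
    return clusters


def bin_chip(clusters, drop_slowest=1, drop_leakiest=1):
    # Drop the clusters that need the most voltage, then the leakiest ones.
    by_vmin = sorted(clusters, key=lambda c: c.vmin)
    kept = by_vmin[:len(by_vmin) - drop_slowest]
    by_leak = sorted(kept, key=lambda c: c.leak_w)
    return by_leak[:len(by_leak) - drop_leakiest]


def summarize(kept):
    # One supply rail: the slowest enabled cluster sets the chip voltage.
    voltage = max(c.vmin for c in kept)
    leakage = sum(c.leak_w for c in kept)
    return voltage, leakage


all_clusters = make_clusters()
full_v, full_leak = summarize(all_clusters)
kept = bin_chip(all_clusters)
bin_v, bin_leak = summarize(kept)

print(f"16 clusters: {full_v:.3f} V, {full_leak:.1f} W leakage")
print(f"{len(kept)} clusters: {bin_v:.3f} V, {bin_leak:.1f} W leakage")
```

In this made-up distribution, dropping one cluster from each end lowers both the required voltage and the total leakage, which is the "little bit of both" case. How much it helps on real silicon depends entirely on the actual spread.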

Obviously it'd be ridiculous to claim they couldn't deliver a low-volume SKU with 512 cores if they really wanted to, and I see neliz in this thread hinting they very well might still deliver a GeForce one. But given that power efficiency is super-important here and that these boards are already described as "<=225W" despite that, it probably wouldn't be a good idea to do so because variability would indirectly kill them. As I said, I might be overemphasising this point because flexible clock domains were at the top of my wishlist, but heh.

Dave Baumann · Dec 22, 2009

To make removing clusters power efficient, the chip would have to have full and fairly fine-grained power gating, which gets expensive. Perf/W usually goes in favour of lower speed (hence lower voltage) rather than disabling clusters, as you are still paying for the leakage element of the disabled parts. This is why you see GTX 295 and Hemlock in the configurations you do.
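
A rough sketch of that comparison, with loudly hypothetical constants: dynamic power is modelled as proportional to active clusters times V² times clock, leakage as proportional to the number of powered clusters times V (a simplification), and the assumed voltage drop from 1.05 V to 1.00 V at 87.5% clock is a guess, not a measured figure.

```python
def board_power(active, total, volts, rel_clock, k_dyn=10.0, k_leak=2.0, gated=False):
    """Toy power model in arbitrary units; k_dyn and k_leak are made-up constants."""
    dynamic = active * k_dyn * volts ** 2 * rel_clock
    # Without power gating, logically disabled clusters still leak.
    powered = active if gated else total
    leakage = powered * k_leak * volts
    return dynamic + leakage


def throughput(active, rel_clock):
    # Relative throughput: clusters doing work times relative clock.
    return active * rel_clock


# Option A: disable 2 of 16 clusters, keep full clock and 1.05 V.
power_a = board_power(active=14, total=16, volts=1.05, rel_clock=1.0)
# Option B: keep all 16 clusters, run at 87.5% clock and an assumed 1.00 V.
power_b = board_power(active=16, total=16, volts=1.00, rel_clock=0.875)
# Option A again, but with (expensive) per-cluster power gating.
power_a_gated = board_power(active=14, total=16, volts=1.05, rel_clock=1.0, gated=True)

perf_a = throughput(14, 1.0)
perf_b = throughput(16, 0.875)

print(f"A (14 clusters, ungated): perf {perf_a}, power {power_a:.2f}")
print(f"B (16 clusters, slower):  perf {perf_b}, power {power_b:.2f}")
print(f"A with power gating:      perf {perf_a}, power {power_a_gated:.2f}")
```

With these particular constants the two options deliver the same throughput and the wide-and-slow one draws less power, and gating only narrows the gap. Different assumptions about how far the voltage can actually drop would change the outcome.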

3dilettante · Dec 22, 2009

Well, the chip is very big, so what's a few tens of millimeters square between friends.

Upon review, the Tesla specs have voltage at 1.05.
Could a chip that big be pushed lower?

digitalwanderer · Dec 22, 2009

Dave Baumann said:
Perf/W usually goes in favour of lower speed (hence lower voltage) rather than disabling clusters as you are still paying for the leakage element of the disabled parts

You mean you still pay for the leakage on the disabled clusters? WTF? That makes no sense to me.

sethk · Dec 22, 2009

Arun's post makes a lot of sense regarding the targeted selection of voltage sensitive portions of the chip to disable in order to hit a desired clockrate@voltage number, as opposed to just lowering the voltage below 1.05 and seeing if it still runs. I have to believe that at a certain point a 'brute force' (i.e. across the board) lowering of voltage wouldn't really be possible even at lowered frequencies, unless the 'weakest link' is disabled before trying to lower the voltage.

Mize · Dec 22, 2009

digitalwanderer said:
You mean you still pay for the leakage on the disabled clusters? WTF? That makes no sense to me.

They're disabled logically, but it's not easy to disable power to them.

Razor1 · Dec 22, 2009

Groo The Wanderer said:
So you are saying that it is more power efficient to have fewer higher clocked shaders than more lower clocked ones? Interesting view of physics your company has. I wonder if that is the explanation for bumpgate?

-Charlie

Who said that? You might want to read my post again. And do you have any basic understanding of thermal output vs. leakage vs. voltage at a given frequency in silicon chips?

This is exactly why I stated you can't look at Tesla's flops and work backwards to find out much of anything about the GeForce line. Also, A2 silicon was in the range of 550 MHz to 650 MHz for base clocks, if we use a multiplier of 2.2 to 2.4 to get the GFLOPs range they are going for. So come again? If you listened to the web conference call, they stated that the flops are lower for Tesla because of the end requirements of the systems the cards are going into.

digitalwanderer · Dec 22, 2009

Mize said:
They're disabled logically, but it's not easy to disable power to them.

Ah, thanks. Makes sense now.

mczak · Dec 22, 2009

Mize said:
They're disabled logically, but it's not easy to disable power to them.

I wonder how much effort this would be. Of course Core i7/i5 is doing this, though it only has 4 cores, so it isn't that fine-grained. Nvidia would need to have 16 power-gated sections (or at least 8, if we assume shader clusters can only be disabled in pairs). OTOH, maybe this is easier if it only needs to be done statically? Obviously Core i7 does this fully dynamically too.
