NVIDIA Fermi: Architecture discussion

I wonder how much effort this would be. Of course Core i7/i5 is doing this, though it only has 4 cores so it isn't that fine-grained. Nvidia would need to have 16 power-gated sections (or, if we assume shader clusters can only be disabled in pairs, at least 8). OTOH maybe this is easier if it only needs to be done statically? Obviously Core i7 does this fully dynamically too.

Well, actually it shouldn't be that hard to do it dynamically, and I don't see static being any easier. It's all just real estate/cooling for the individual power circuitry. That said, real estate and cooling aren't always simple. :)
 
Full-on power gating for Nehalem involves sticking big transistors right on the power feeds to the cores.
I'm not a materials scientist, but getting that right is apparently a non-trivial task, and something Intel is proud of accomplishing.

I wonder if a less performant option could be possible for a disabled core or cluster, where there is no concern about turning the core on and off with the fast response time of Nehalem's power scheme.
 

Something like disabling the power input at the same time as the cluster is logically disabled during the validation process? Would it be possible to simply burn a few fuses to disable a cluster?
 
Am I missing something (or simply misinterpreting)? It seems that, according to this: Tesla 20 and Fermi details,

they're saying that the rumored change to 448 cores (from 512) is due to ECC requiring an extra byte?? A rough Google Translate pass pops this out:

"Thus, the 512, 448 are reserved for data, 56 parity, 8 bits would be lost. The question is whether the cost calculation and control of parity is important, low or zero, if treated by units dedicated to the memory controller through which the cores ..."

I must be reading it wrong or just highly confused... (I took a couple of French courses way, way back, like 20 years ago.)
 

No, Damien's talking about memory bandwidth. Each controller is 64-bit wide with 8-bit prefetch, or 64*8 = 512 bits. The original sentence says "Ainsi, sur les 512 bits, 448 seraient réservés aux données, 56 pour la parité et 8 bits seraient perdus" (that is, "of the 512 bits, 448 would be reserved for data, 56 for parity, and 8 bits would be lost"). I guess Google Translate didn't understand "bits" and just dropped it.
 
No, Damien's talking about the bus size. The original sentence says "Ainsi, sur les 512 bits, 448 seraient réservés aux données, 56 pour la parité et 8 bits seraient perdus." I guess Google translation didn't understand "bits" and just dropped it.

I thought Fermi was 384-bit? (Sorry, caffeine headache; the ol' grey-matter CPU doesn't seem to be processing at 100% utilization atm.)
 

Sorry, my post wasn't very clear; I edited it: the controllers are 64-bit wide with 8-bit prefetch, or 64*8 = 512 bits per burst.
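To make the arithmetic concrete, here's a back-of-the-envelope sketch of the accounting Damien describes, assuming a conventional 8-check-bits-per-64-data-bits code carried in-band over the 64-bit channels; the 448/56/8 split falls straight out of it. None of this is an official description of how Fermi actually lays the bits out.

```python
# Per-channel burst accounting for in-band ECC, following Damien's description.
# Assumptions: 8 check bits per 64 data bits (SEC-DED style), 64-bit GDDR5
# controllers with 8n prefetch, and 6 such controllers for the 384-bit bus.
# Illustrative only; the real layout isn't public.

CHANNEL_WIDTH_BITS = 64                      # one memory controller
PREFETCH = 8                                 # GDDR5 8n prefetch
NUM_CONTROLLERS = 6                          # 6 * 64 = 384-bit total bus

burst_bits = CHANNEL_WIDTH_BITS * PREFETCH   # 512 bits per burst per channel

DATA_BITS_PER_CODEWORD = 64
CHECK_BITS_PER_CODEWORD = 8

codewords = burst_bits // (DATA_BITS_PER_CODEWORD + CHECK_BITS_PER_CODEWORD)  # 7
data_bits = codewords * DATA_BITS_PER_CODEWORD        # 448
parity_bits = codewords * CHECK_BITS_PER_CODEWORD     # 56
lost_bits = burst_bits - data_bits - parity_bits      # 8

print(data_bits, parity_bits, lost_bits)              # 448 56 8
print(f"usable data fraction: {data_bits / burst_bits:.1%}")   # 87.5%
```

So the 448/56/8 figures are bits per burst on the bus, not shader cores.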
 
Something like disabling the power input at the same time as the cluster is logically disabled during the validation process? Would it be possible to simply burn a few fuses to disable a cluster?

That would mean sticking fuses on all the power inputs, even for the cores that remain on.

I was thinking it might be easier to do something like what Intel did if the need for a decent response time were removed.
It would still be a (edit: bunch of) large transistor capable of being low-resistance and low-leakage while supplying a high current, but it could have a poor response time because it isn't going to switch at all except at power-up.
 
Charlie, why do you insist on taking the most dire interpretation of every scrap of information? Rackmounted computing devices are limited by heat and power. You do not always want to install the highest performing part in a rack. It's a function of rack cost, power density, and cooling. Sometimes it is more cost efficient to buy two less powerful boxes than to buy 1 powerful box. Even if NVidia could manufacture 512SP devices without issues, they would not be my first choice for sticking in a rack.

Yes, one can draw conclusions that Nvidia is having problems with Fermi, but the hyperbole and sheer negativity of the conclusions you draw from every bit of information are thoroughly intellectually dishonest.
 
I would have thought that downclocking more units could have been the way to go for the top low-volume bin. Random defects shouldn't have forced a cluster inactivation, even with the historically coarser methods Nvidia has used for redundancy.

Just wondering:
Maybe they aren't getting much of a power win in downclocking, and can't push the voltage any lower without incurring stability or subthreshold leakage problems?
edit: maybe it's not subthreshold leakage, but one of the static leakage components?

The supply voltage on what I believe was one of the official Tesla slides looked to be decently low, close to 1.0 V, and voltage scaling has been getting more challenging the lower it gets.

Yeah, that is what I meant. Also, given the size of the die, I wonder if they are having problems with variability on die for minimum voltages?

-Charlie
 
Obviously it'd be ridiculous to claim they couldn't deliver a low-volume SKU with 512 cores if they really wanted to, and I see neliz in this thread hinting they very well might still deliver a GeForce one. But given that power efficiency is super-important here and that these boards are already described as "<=225W" despite that, it probably wouldn't be a good idea to do so because variability would indirectly kill them. As I said, I might be overemphasising this point because flexible clock domains were at the top of my wishlist, but heh.

What I find interesting is that all the fanbois saying the 448SP info is a good thing, because the GF-branded part will be 512SP, faster clocked, have faster memory, or all three, don't seem to take into account what that will do to TDP.

If NV can make them (very likely), and they can cherry-pick a few low leakage parts (again very likely) for the press, that doesn't mean anything WRT production. I really doubt they can pull off any reasonably sized production run with low leakage 512SP parts. Time will tell.

-Charlie
 
Who said that? You might want to read my post again ;). And do you have any basic understanding of thermal output vs. leakage vs. voltage at a given frequency in silicon chips?

This is exactly why I stated you can't look at Tesla's flops and really work backwards to find out much about the GeForce line. Also, A2 silicon was in the range of 550 MHz to 650 MHz base clocks if we use a multiplier of 2.2 to 2.4 to get to the GFLOPS range they are going for. So come again? If you listened to the web conference call, they did state there that the reason the flops are lower for Tesla is the end requirements of the systems they are going to be in.

So are you saying they lowered the flops on Tesla because it was simply too Teh Awsum for its intended market, or that they were forced to lower the power, which, as a consequence, lowered frequency, and therefore flops?

Joking aside, if they had to lower flops for power reasons, how much will a 'full power' G3xx consume? See a problem there?

-Charlie
 
Joking aside, if they had to lower flops for power reasons, how much will a 'full power' G3xx consume? See a problem there?

-Charlie

225 Watt - look: http://www.semiaccurate.com/2009/11/16/fermi-massively-misses-clock-targets/

If NV can make them (very likely), and they can cherry-pick a few low leakage parts (again very likely) for the press, that doesn't mean anything WRT production. I really doubt they can pull off any reasonably sized production run with low leakage 512SP parts. Time will tell.
-Charlie

And there will never be a GTX295. :LOL:
 
Yeah, that is what I meant. Also, given the size of the die, I wonder if they are having problems with variability on die for minimum voltages?

-Charlie

Going by published micrographs, Nvidia's designs since G80 have placed the ALUs towards the exterior of the chip.
I wondered in comparison about how RV770 concentrated them in the center.

Nvidia's design puts the highest clocked parts in a position where intra-die variance would be greatest (one end of the chip to the other), and the clocks are likely reaching or exceeding the comfort zone of a generic process like TSMC's.

RV770 concentrated ALUs in a smaller area, which aside from possible thermal issues would reduce the amount of variation experienced in the high-performance section, while less critical parts with possibly slower transistors (more resistant to variation?) that didn't need much interaction with each other could sit off to the side. The chip was also smaller and lower-clocked.

I'd wonder why Nvidia keeps its central scheduler and other units in the center, when they run at a reduced clock. In the case of Fermi, perhaps it's the interconnect fabric around the scheduler and the L2 tiles that need to be in the center to keep variation from wrecking things, but this would come at the expense of putting the hot clock regions at a disadvantage.

There are no shots of Cypress for some reason, but the fuzzy wafer shots indicate it no longer concentrates its SIMDs in the center.
The chip is still much smaller and more modestly clocked than Fermi, though.
 
Charlie, why do you insist on taking the most dire interpretation of every scrap of information? Rackmounted computing devices are limited by heat and power. You do not always want to install the highest performing part in a rack. It's a function of rack cost, power density, and cooling. Sometimes it is more cost efficient to buy two less powerful boxes than to buy 1 powerful box. Even if NVidia could manufacture 512SP devices without issues, they would not be my first choice for sticking in a rack.

Yes, one can draw conclusions that Nvidia is having problems with Fermi, but the hyperbole and sheer negativity of the conclusions you draw from every bit of information are thoroughly intellectually dishonest.

Let's look at these in order:

1) I totally agree, it is much saner to use a lower clocked wider part, or potentially two lower clocked wider parts when looking at power use. That is my argument. Razor1 seems to be arguing that NV disabled two clusters on Fermi for power efficiency reasons, not manufacturing, something that doesn't really mesh with the physics of the situation.

Assuming a linear relationship between clocks and power use, a 448SP Fermi running at X MHz would have the same performance as a 512SP Fermi running at 14/16ths of X MHz (assuming there were no problems feeding the extra SPs, etc.). Now, we know the relationship between clocks and power is not linear, so it wouldn't be much of a stretch to say that the slower 512-shader part is more power efficient.
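To spell the arithmetic out, here's a toy sketch with invented voltages, only meant to illustrate the non-linear part of the argument; it is not a model of the actual chips.

```python
# Equal-throughput comparison: 448 SPs at X MHz vs 512 SPs at 7/8 of X MHz.
# Dynamic power per unit is modelled crudely as ~ V^2 * f; the voltages below
# are made up purely to show why the wider, slower part can come out ahead.

def dynamic_power(units, freq_ghz, vdd):
    return units * vdd**2 * freq_ghz          # arbitrary units

x = 1.4                                       # hypothetical shader clock, GHz
p_448 = dynamic_power(448, x,          1.05)  # narrow part at full clock
p_512 = dynamic_power(512, x * 14/16,  1.00)  # wide part at 14/16 clock, lower V

print(f"throughput ratio: {(512 * x * 14/16) / (448 * x):.2f}")   # 1.00 -> same FLOPS
print(f"power ratio (512SP / 448SP): {p_512 / p_448:.2f}")        # < 1.0 here
```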

If the claims from NV/Razor1/others that the castration is for power reasons are true, it doesn't make sense to do it vs. downclocking a 512SP part.

2) The arguments put forth by many are that it is downclocked AND has shaders disabled. Given the volumes of high bin Fermis compared to consumer parts, and the margins that one brings in vs the other, I would suspect you could make a VERY strong case for picking low leakage, 'perfect' chips for even the low end Fermis.

Think they did this, or gave out the rejects to the GPGPU team? I would bet that the Fermis are both binned for low leakage and have shaders disabled for manufacturing reasons. Do you disagree? If not, binned 14/16ths Fermis consuming 190W 'typical', 225W TDP is quite alarming, don't you think?

3) The fact remains that the chip is hugely late, hard to manufacture, and consumes a ton of power. Last spring, NV promised AIBs that they would have cards on Oct 15th, 2009. They didn't. When the parts come out, let's see what they can manage to make in volume.

Everything I have said about them I can back up, although some of it I choose not to do publicly. I have explained several times why I find NV impossible to work with; you can search those posts out here if you are bored, but I am not going to type it all in again. Most of the people countering what I say can't come up with a decent argument, much less a technical one.

-Charlie
 
So you are saying that it is more power efficient to have fewer higher clocked shaders than more lower clocked ones? Interesting view of physics your company has. I wonder if that is the explanation for bumpgate?
Take the extremes:
If you have a process with 0% leakage, a slower-clocked, higher-area design will most likely(!) be more power efficient than a higher-clocked one.
If 99% of your power consumption is leakage, your slower-clocked design probably(!) won't have a chance.
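As a crude illustration of those two extremes, here is a toy sweep over leakage levels; the unit counts, clocks, voltages and leakage coefficients are invented for illustration and are not measurements of either chip.

```python
# Toy model: "wide and slow" vs "narrow and fast" at roughly equal throughput,
# swept across leakage levels. All numbers are invented for illustration.

def total_power(units, freq, vdd, leak_per_unit):
    dynamic = units * vdd**2 * freq           # ~ C * V^2 * f per unit
    static = units * leak_per_unit * vdd      # leakage scales with area and V
    return dynamic + static

wide   = dict(units=1600, freq=0.85, vdd=0.95)   # many slow units (illustrative)
narrow = dict(units=512,  freq=2.66, vdd=1.05)   # few hot-clocked units (illustrative)

for leak in (0.0, 0.5, 2.0):                     # arbitrary leakage-per-unit scale
    ratio = total_power(**wide, leak_per_unit=leak) / total_power(**narrow, leak_per_unit=leak)
    print(f"leakage={leak}: wide/narrow power ratio = {ratio:.2f}")
# At 0% leakage the wide design wins; as leakage dominates, the ratio flips.
```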

The balance between a Fermi design and an RV870 design is anyone's guess. And the (!)'s are there for a reason: there are other variables that can tilt things either way for something like a GPU, such as the amount of logic sitting unused for redundancy, the amount of logic running at the highest clock speed, etc.

E.g. a design with a single clock domain may have more pieces of logic running at higher speed than necessary and may require more high leakage LVT cells.

The point is: boiling a complex, multi-layered topic down to a single truth and then using it to question somebody's physics knowledge only makes you look like a fool.
 
So are you saying they lowered the flops on Tesla because it was simply too Teh Awsum for its intended market, or that they were forced to lower the power, which, as a consequence, lowered frequency, and therefore flops?

Joking aside, if they had to lower flops for power reasons, how much will a 'full power' G3xx consume? See a problem there?

-Charlie

I didn't say that either ;) I think you are looking for a way to justify your article any which way you can. We don't have final numbers on the flops yet; the estimate they gave so far is solid, and Fermi's Tesla versions will deliver that amount of flops (again, this was stated in the web conference). Now, if Tesla comes out with 630 GFLOPS, we are looking at a chip with 650+ MHz clocks, depending on what multiplier is used for the higher-clocked shader units. Do you know how much heat is given off and how much power is consumed by the RAM on these Tesla versions? Now take, let's say, a 1.5 GB card vs. a 3.0 GB card: the 3.0 GB card will probably have RAM chips on both the front and back, doubling the chips necessary (I'm just guessing at this point). How much more power usage is necessary then?
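For what it's worth, here is the rough arithmetic I assume is being done here, taking Fermi's peak DP throughput as approximately cores × shader clock (DP at half the SP FMA rate); the 630 GFLOPS target, the 448-core configuration and the 2.2-2.4 shader:base ratios are the figures floated in this thread, not official numbers.

```python
# Back-of-the-envelope clock estimates from a DP-GFLOPS target.
# Assumptions: peak DP GFLOPS ~= cores * shader_clock_GHz for Fermi
# (DP at half the SP FMA rate), a 448-core Tesla part, and the
# shader:base clock ratios mentioned in this thread. Speculative throughout.

cores = 448
target_dp_gflops = 630                               # figure floated above

shader_clock_mhz = target_dp_gflops / cores * 1000   # ~1406 MHz

for ratio in (2.2, 2.3, 2.4):
    print(f"shader:base ratio {ratio}: base clock ~{shader_clock_mhz / ratio:.0f} MHz")
# ratio 2.2 -> ~639 MHz, 2.3 -> ~611 MHz, 2.4 -> ~586 MHz
```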
 
I didn't say that either ;) I think you are looking for a way to justify your article any which way you can. We don't have final numbers on the flops yet; the estimate they gave so far is solid, and Fermi's Tesla versions will deliver that amount of flops (again, this was stated in the web conference). Now, if Tesla comes out with 630 GFLOPS, we are looking at a chip with 650+ MHz clocks, depending on what multiplier is used for the higher-clocked shader units. Do you know how much heat is given off and how much power is consumed by the RAM on these Tesla versions? Now take, let's say, a 1.5 GB card vs. a 3.0 GB card: the 3.0 GB card will probably have RAM chips on both the front and back, doubling the chips necessary (I'm just guessing at this point). How much more power usage is necessary then?
Hmmm...

First, I don't think they use a "multiplier" between the ROP domain and the shader domain; they use the same frequency generator, but that's all they share, and each one has its own multiplier. So there's no reason to assume any fixed relationship between those frequencies. Given that Tesla usage is far from being blending/texturing/rasterizing intensive, it's more than likely they'll lower the ROP domain for this product family.

Second, RAM doesn't consume that much power. Say 2 watts for a 1 Gb part, so that amounts to an astronomical 24 watts of excess power consumption.

Now, given that Tesla is already quoted at a max board power of 225 watts, that leaves us with a 200-watt absolute minimum for the exact same core configuration. Add 10% for a fully featured core and another 10% for a frequency boost, and you end up with a "GTX380" consuming no less than 250 watts max, and that's quite optimistic.
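Spelling out that roll-up with the same assumptions (roughly 2 W per 1 Gb GDDR5 device, 12 extra devices on a 3 GB board, and the two 10% adders as pure guesses):

```python
# Rough board-power roll-up for a hypothetical full "GTX380", using only the
# assumptions stated in the post above. None of these figures are official.

tesla_board_power_w = 225                  # quoted max board power for Tesla
extra_dram_w = 12 * 2                      # 12 extra 1 Gb devices at ~2 W each

core_budget_w = tesla_board_power_w - extra_dram_w   # ~201 W, same core config
full_core_w = core_budget_w * 1.10                   # +10% for all units enabled
geforce_est_w = full_core_w * 1.10                   # +10% for a clock bump

print(f"estimated full-core GeForce board power: ~{geforce_est_w:.0f} W")
# ~243 W, i.e. the ~250 W ballpark the post lands on
```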

Something I read there was clearly laughable, btw: if Tesla doesn't need full math units, why the hell would they design such an absurdly big chip? They could simply design a 256SP core with the exact same functionality, a 256-bit bus and a slightly smaller L2. Who cares, since it's the features that give it all it needs to have any appeal?

Remember, GeForce products are not going to be available for another 2 to 3 months, basically 6 months after the Cypress launch, and add to that AMD's clear statement that they are going to refresh their own GPUs on a yearly basis. So it had better be almost 50% faster than a Cypress XT, which doesn't seem to be fully usable at the moment.

And as a conclusion to this, where are the famous "Fermi derivatives"? Are they trying to sell us GT200 derivatives as such? (GeForce G310, anyone?)
 
Aside from the previous reply, I'm still wondering how they achieve ECC memory access on Tesla using any possible controller width.

If we consider some sort of data interleaving, we'll always end up with a factor of 9 which doesn't work, be it for the width or for the prefetch burst access. Bandwidth is going to be mediocre when using this feature if it requires accessing memory for just one word, be it 16, 32 or 64-bit. That would not be a "1/8th" bandwidth penalty, but a 50% penalty.

I didn't understand why DP performance sucked that much during the particle demonstration at GTC either: with a little more compute power in SP than a GT200-based Tesla, they were only able to deliver half the throughput, which points to a severe bottleneck.
 
Bandwidth is going to be mediocre when using this feature if it requires accessing memory for just one word, be it 16, 32 or 64-bit. That would not be a "1/8th" bandwidth penalty, but a 50% penalty.
The minimum burst read is 64 bytes per channel ... if you are accessing memory for 1 word, you are screwed, period. The most straightforward method would be to simply go 1:7 and use 56-bit data with 8-bit ECC per channel.
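A quick sanity check of what that 1:7 split would cost per burst, assuming the 64-byte minimum burst mentioned above; the layout itself is speculation, not a documented Fermi mechanism.

```python
# Effective bandwidth under an in-band 8-bytes-ECC-per-56-bytes-data split,
# given a 64-byte minimum burst per channel. Shows the steady-state overhead
# of the split itself is a fixed 1/8 of raw bandwidth.

burst_bytes = 64
ecc_bytes = burst_bytes // 8                 # 8 bytes of ECC per burst
data_bytes = burst_bytes - ecc_bytes         # 56 bytes of data per burst

print(f"data fraction per burst: {data_bytes / burst_bytes:.1%}")   # 87.5%
print(f"ECC bandwidth overhead:  {ecc_bytes / burst_bytes:.1%}")    # 12.5%
```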

If they really commissioned special GDDR5, then they could in theory abuse the sideband ECC data channel used for link correction to send/receive stored ECC data (so internally the GDDR5 chip would have to store one extra bit per byte somewhere).
 