NVIDIA Kepler speculation thread

AnarchX · Feb 21, 2012

whitetiger said:
Kepler GK104 has 1536 ALUs using 4.1B Transistors

Did you calculate it from the rumored die size and Tahiti's transistor density?

whitetiger · Feb 21, 2012

AnarchX said:
Did you calculate it from the rumored die size and Tahiti's transistor density?

It's just a guess...
- it could be lower if NV's densities aren't as high as AMD's...

silent_guy · Feb 21, 2012

One can make a very elaborate case with the numbers you have and then make what seems like a well reasoned conclusion, but chances are high that the whole exercise resembles the story of blind men describing an elephant based on the particular body part they're touching.

We don't know why Fermi consumed much more power than its competitors. We don't know if it was due to the shaders only or if it was all across the chip. We don't know how much is due to leakage and how much is dynamic (because of dynamic voltage scaling). We don't know if the ALUs are pure standard cells or some special fast cells with custom placement or something more special. Hell, we don't even know how large a % of each shader core actually runs on hot clock. I doubt the fetch, decode and scoreboarding do. And probably not the register files either. So at that point, you're looking at a fairly low % of the shader core (less than 50%?), and much less than that for the total chip (20%?), and since the work needs to be done anyway, hot clock or not, the savings may well be less than that.

If the hot clocks are gone for Kepler, the reasons are most likely to be power related, but I don't expect miracles on that front alone. Fermi was so far behind in power efficiency that it's almost inevitable they simply didn't pay as much attention as AMD across the whole chip, not just the shaders or the hot clock.

Optimizing for power is a multi-front battle: architecture plays a role, but you can still lose it if you don't do your part at the tactical level and kill every single redundant wire toggle on the chip: clock gate FFs efficiently, combinationally block toggling inputs to unused cones of gates (e.g. block inputs to multiple exclusively used ALUs on the same bus), smartly encode your major buses to reduce toggling, etc. It's not glorious, and it's really hard work to get all/most of it right.

Even if Kepler doesn't have a hot clock and even if it's a power savings miracle, it'd be a mistake to use that as definitive proof of the argument, though I'm under no illusion: that is exactly what's going to happen.

silent_guy · Feb 21, 2012

whitetiger said:
Quantifying:
Cayman has 1536 ALUs using 2.64B Transistors
Tahiti has 2048 ALUs using 4.31B Transistors
(substitute your own transistor counts if you disagree with these)
Comparing VLIW4 with GCN, then
--> 63% more transistors yields 33% more ALUs
--> therefore the overhead to support GCN vs VLIW4 is 22%

Fermi GF114 has 384 ALUs using 1.95B Transistors
Kepler GK104 has 1536 ALUs using 4.1B Transistors
(substitute your own transistor counts if you disagree with these)
Comparing Fermi to Kepler, then
--> 2.1x transistors yield 4x the ALUs
--> therefore no increase in GPGPU burden, but ditching the hot-clock gives massive benefit in terms of ALU density....

And if anyone was asking for an example of the blind man and elephant, it'd be hard to top this...

Take a crude number from a marketing slide, don't bother to apply any reasonable correction factors for known chip differences, add a division or two and throw it out to the world as proof. Pointless.

But, hey, numbers don't lie: http://dilbert.com/strips/comic/2007-08-08/

Man from Atlantis · Feb 21, 2012

mczak said:
This looks like a DDR3-equipped version, so probably GK107-200. It wouldn't be surprising, then, if power draw were below HD 7750 level. Maybe the other versions need more PWM circuitry?

For comparison's sake, I dug up which cards NVIDIA has used similar PWM designs on:

weaker pwm GT405 rated 25W
similar pwm GT430 rated 49W, GT530 rated 50W
stronger pwm GT545 DDR3 rated 70W, GT440 rated 65W..

So this card, whether it is DDR3 or DDR5 (most likely it is DDR3, GK107-200), is rated 50W, and the DDR5 version (GK107-300) will probably be rated 65W. I guess NV will try to fight against Cape Verde XT with GK106 salvage parts.

whitetiger · Feb 21, 2012

silent_guy said:
And if anyone was asking for an example of the blind man and elephant, it'd be hard to top this...

Take a crude number from a marketing slide, don't bother to apply any reasonable correction factors for known chip differences, add a division or two and throw it out to the world as proof. Pointless.

But, hey, numbers don't lie: http://dilbert.com/strips/comic/2007-08-08/

Let me get this straight, you're saying you don't agree?

Ailuros · Feb 21, 2012

whitetiger said:
It's just a guess...
- it could be lower if NV's densities aren't as high as AMD's...

It's a whole damn lot lower.

CarstenS · Feb 21, 2012

It was for sure, but is it still?

Arty · Feb 21, 2012

A billion or so fewer transistors on the same die area? But don't AMD and Nvidia count them with differing metrics, making the comparison meaningless?

MDolenc · Feb 21, 2012

It's not THAT close to a billion... If you compare GF114 vs. Cypress you get 1950M vs. 2150M (based on http://techreport.com/articles.x/20126) for approximately the same die area. So the difference is about 10%.

Arty · Feb 21, 2012

MDolenc said:
It's not THAT close to a billion... If you compare GF114 vs. Cypress you get 1950M vs. 2150M (based on http://techreport.com/articles.x/20126) for approximately the same die area. So the difference is about 10%.

I meant that for GK104.

whitetiger · Feb 21, 2012

For me, what's interesting is that NV & AMD are converging on the same architectures, having been divergent for several years, at least since the G80 generation.
- so this time round AMD has embraced GPGPU with GCN, and NV has moved away from the hot-clock
- so the differences in approach are more subtle now

- I would also expect Kepler to have benefited from what NV learned from Fermi - and so it should be a cleaner, more efficient architecture anyway.

- and the perf/mm^2 advantage that AMD had will be reduced because of these three factors....

So, I was asked to quantify what GCN vs VLIW4 cost AMD
- i.e. how much the decision to go for a more flexible GPGPU architecture cost them
- and the answer is 22% less raw FLOPS per tranny.
(but raw FLOPs doesn't mean actual performance, obviously)

Not sure what there is to complain about there...
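
For anyone who wants to redo the arithmetic behind that 22% figure, here is a minimal Python sketch. It uses only the ALU and transistor counts quoted earlier in the thread (which are themselves disputed), and the helper name `per_alu_growth` is made up for illustration.

```python
# ALU count and transistor count (billions) as quoted earlier in the thread.
# These are the poster's figures, not verified specs.
chips = {
    "Cayman": (1536, 2.64),
    "Tahiti": (2048, 4.31),
}

def per_alu_growth(old, new):
    """Return (transistor ratio, ALU ratio, transistors-per-ALU ratio), new vs old."""
    old_alus, old_xtors = chips[old]
    new_alus, new_xtors = chips[new]
    xtor_ratio = new_xtors / old_xtors
    alu_ratio = new_alus / old_alus
    return xtor_ratio, alu_ratio, xtor_ratio / alu_ratio

xtor_ratio, alu_ratio, per_alu = per_alu_growth("Cayman", "Tahiti")
print(f"transistors:         +{(xtor_ratio - 1) * 100:.0f}%")
print(f"ALUs:                +{(alu_ratio - 1) * 100:.0f}%")
print(f"transistors per ALU: +{(per_alu - 1) * 100:.0f}%")
print(f"ALUs per transistor: -{(1 - 1 / per_alu) * 100:.0f}%")
```

With those inputs this gives roughly +63% transistors, +33% ALUs and about +22% transistors per ALU. Strictly, that is 22% more transistors per ALU, which works out to about 18% fewer ALUs per transistor, and it says nothing about clocks or about how much of the growth went to non-shader logic.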

silent_guy · Feb 21, 2012

whitetiger said:
So, I was asked to quantify what GCN vs VLIW4 cost AMD
- i.e. how much the decision to go for a more flexible GPGPU architecture cost them
- and the answer is 22% less raw FLOPS per tranny.
(but raw FLOPs doesn't mean actual performance, obviously)

It is an interesting question, but you're counting much more than just the transistors required for the raw flops, and you don't take into account the increased TEX and MC logic compared to the previous generation. And then you make a sweeping statement about just the shaders. So you're mixing apples and oranges to calculate grapes and then compare it to apples (I'm sure there is a better car based metaphor for this.)

Edit: really not singling you out, it's endemic and probably unavoidable given the limited amount of data there is in the open...

itsmydamnation · Feb 21, 2012

silent_guy said:
(I'm sure there is a better car based metaphor for this.)

it would definitely have to involve something about engine sizes........ that's been double confirmed.

Arty · Feb 21, 2012

silent_guy said:
So you're mixing apples and oranges to calculate grapes and then compare it to apples.

Welcome to my new sig.

psurge · Feb 21, 2012

silent_guy said:
One can make a very elaborate case with the numbers you have and then make what seems like a well reasoned conclusion, but chances are high that the whole exercise resembles the story of blind men describing an elephant based on the particular body part they're touching.

LOL, fair enough. FWIW, I wasn't thinking that I'd get anything even close to definitive out of it - I was trying to transition from "blind man describing what he imagines a particular part of an elephant feels like" to "blind man feeling an actual part of something related to an elephant".

whitetiger · Feb 21, 2012

silent_guy said:
It is an interesting question, but you're counting much more than just the transistors required for the raw flops, and you don't take into account the increased TEX and MC logic compared to the previous generation. And then you make a sweeping statement about just the shaders. So you're mixing apples and oranges to calculate grapes and then compare it to apples (I'm sure there is a better car based metaphor for this.)

Well, I'm not claiming that they are anything other than sweeping generalisations, but, OTOH, things like TEX & MC tend to average out, to a first approximation.... particularly in a balanced architecture....
- e.g. bus width is +50%, and transistor count is +62%, so the proportion of the chip that is dedicated to MC is about the same
- and since we don't know the original % anyway, that's the best we can do... unless we want to speculate that the MC takes less % than before...

Or if you want to take a guess that say 10% of the die on Cayman is MC
- ok, well, on Tahiti, maybe it's 9.2%
- but perhaps they've put some extra stuff in there, ok, well, it's more than 9.2%....
- i.e. the stuff that isn't in the MC (& TEX) has gone up by about 62%...

The point being that since there are 33% more shaders, and a total of 62% more transistors, it's obvious that the CU uses significantly more transistors per ALU....

Or to put it another way, if you just assume that, for all intents and purposes, the Uncore has about the same % of die area as before, then we don't need to worry about it.
- if you think that the Uncore uses a significantly different % than before, then yes, it's a factor, otherwise it's not...
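
The memory-controller estimate above can be checked the same way. A rough sketch, assuming (as the post does, purely for illustration) that the MC is 10% of Cayman's transistors and that it grows in proportion to the +50% bus width. The transistor totals are the ones quoted earlier in the thread, and all variable names are made up for this example.

```python
# Transistor totals (billions) as quoted earlier in the thread.
cayman_xtors = 2.64
tahiti_xtors = 4.31

# Hypothetical inputs taken from the post: MC is 10% of Cayman,
# and MC transistors scale with bus width (+50%).
mc_share_cayman = 0.10
mc_scale = 1.5

mc_cayman = mc_share_cayman * cayman_xtors   # MC transistors on Cayman
mc_tahiti = mc_cayman * mc_scale             # MC transistors on Tahiti
mc_share_tahiti = mc_tahiti / tahiti_xtors   # MC share of Tahiti

# Growth of everything that is not MC.
rest_growth = (tahiti_xtors - mc_tahiti) / (cayman_xtors - mc_cayman)

print(f"MC share on Tahiti: {mc_share_tahiti * 100:.1f}%")
print(f"non-MC transistors: +{(rest_growth - 1) * 100:.0f}%")
```

Under those assumptions the MC share on Tahiti comes out near 9.2%, and everything outside the MC grows by roughly 65%, in the same ballpark as the total transistor growth. Change `mc_share_cayman` or `mc_scale` to see how sensitive the conclusion is to the guess.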

psurge · Feb 21, 2012

I think silent_guy's basic point is that a bunch of us are picking necessarily non-unique solutions to a severely under-constrained set of equations that involve a lot of hidden (to us) variables, with error bars of unknown magnitude on coefficients we only have wild-ass-guesses for. This is not the way to make a convincing argument.

whitetiger · Feb 22, 2012

psurge said:
I think silent_guy's basic point is that a bunch of us are picking necessarily non-unique solutions to a severely under-constrained set of equations that involve a lot of hidden (to us) variables, with error bars of unknown magnitude on coefficients we only have wild-ass-guesses for. This is not the way to make a convincing argument.

If silent_guy doesn't like Kepler speculation in a Kepler Speculation thread, then he doesn't have to read it ...
- perhaps he should start a 'Kepler Known Facts' thread, and have a nice quiet time there!

psurge · Feb 22, 2012

First of all, I shouldn't have spoken for silent_guy. Secondly, I apologize if I offended you - that was not my intention! I was just trying to say - it seems we're not going to be able to resolve any of these "how much did X cost" kind of questions in a very convincing way.

Still (speaking for myself only) - I think it's fun to speculate.
