NVIDIA Kepler speculation thread

More important is whether GK104 will be able to beat the HD 7970, because predicted performance is only ~10-15% faster than the GTX 580 - not enough to beat AMD's newest GPU.
 
Starting from their speculation (which is all it is), you could just as well conclude that GK104 is the same as GK110 without DP capability (or with only a 1:4 rate).
 
Maybe I'm just daft, but I'm still not seeing the difference between operand fetch from the register file and other memory operations. The whole point of buffering is to decouple operand fetch from the execution pipeline.
I don't get the connection to the instruction latency.
The operand buffering enables other instructions to overtake there (in-order issue and out-of-order dispatch to the ALUs), i.e. one instruction does not stall another independent one (but it does stall dependent ones until the result is written to the reg file to resolve RAW hazards). It is still in the critical loop that defines the latency.
It seems like a small issue compared to all the other latency hiding GPUs do anyway.
For a given problem, the issue gets more and more critical the wider the GPU gets, i.e. the fewer threads one has per vecALU.
And some constructs (like fine-grained control structures) prefer low ALU latencies, because one needs far fewer threads to fill the bubbles. Just saying one can always throw more threads at the problem isn't exactly true, as this would necessitate enlarging the register files, especially for heavier threads.
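A back-of-the-envelope way to see the thread-count cost of ALU latency being discussed here (all numbers purely illustrative, not vendor specs):

```python
# Toy model: with no ILP inside a thread (a fully dependent
# instruction chain), a thread can issue to the vecALU only once
# per ALU latency, so filling the bubbles needs roughly
# latency / issue_interval resident threads per vecALU.

def threads_to_hide_latency(alu_latency_cycles, issue_interval_cycles=1):
    """Minimum threads per vecALU to keep it busy, worst case (no ILP)."""
    return alu_latency_cycles // issue_interval_cycles

# Hypothetical latencies; register file demand grows the same way,
# since every resident thread keeps its registers allocated.
for latency in (4, 8, 18):
    print(latency, "cycle latency ->", threads_to_hide_latency(latency),
          "threads per vecALU")
```

Halving the ALU latency halves the resident-thread (and hence register file) requirement in this worst case, which is the point about fine-grained control structures above.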
 
I don't get the connection to the instruction latency.
The operand buffering enables other instructions to overtake there (in-order issue and out-of-order dispatch to the ALUs), i.e. one instruction does not stall another independent one (but it does stall dependent ones until the result is written to the reg file to resolve RAW hazards). It is still in the critical loop that defines the latency.

Meh, guess I just won't get it. I see the RAW hazard and operand buffering as two separate operations with only the former being strictly part of the ALU pipeline.

For a given problem, the issue gets more and more critical the wider the GPU gets, i.e. the fewer threads one has per vecALU.
And some constructs (like fine-grained control structures) prefer low ALU latencies, because one needs far fewer threads to fill the bubbles. Just saying one can always throw more threads at the problem isn't exactly true, as this would necessitate enlarging the register files, especially for heavier threads.

That theoretical problem would have to be short on both TLP and ILP for this to be a problem even on Fermi. So rather unsuitable for GPUs anyway.
 
Meh, guess I just won't get it. I see the RAW hazard and operand buffering as two separate operations with only the former being strictly part of the ALU pipeline.
It is quite simple: with scoreboarding, the "read operands" part of the pipeline is solely responsible for resolving RAW hazards. ;)
That theoretical problem would have to be short on both TLP and ILP for this to be a problem even on Fermi. So rather unsuitable for GPUs anyway.
Did you follow what Intel touted as one major advantage of Knights Corner over Fermi? Getting high performance for smaller problems, too. In the case of smaller matrices in a matrix multiply, a lower-latency memory system is not the only important thing. Lower-latency kernel launches and a lower number of threads (executed with lower latency) go along the same lines.
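A minimal sketch of the issue/dispatch behavior debated above: in-order issue into operand buffers, out-of-order dispatch once sources are ready, with RAW resolution happening at the "read operands" stage. The register names and the 2-cycle latency are made up for illustration:

```python
# Toy model of in-order issue with out-of-order dispatch: each
# instruction sits in an operand buffer until its source registers
# are ready (RAW resolution at "read operands"), so an independent
# younger instruction can overtake a stalled older one.

def dispatch_schedule(instrs, initially_ready, alu_latency=2):
    """instrs: list of (name, sources, dest). One dispatch per cycle,
    oldest-ready-first. Returns [(cycle, name), ...]."""
    ready_at = {r: 0 for r in initially_ready}
    pending = list(instrs)
    schedule, cycle = [], 0
    while pending:
        for i, (name, srcs, dst) in enumerate(pending):
            if all(s in ready_at and ready_at[s] <= cycle for s in srcs):
                schedule.append((cycle, name))
                ready_at[dst] = cycle + alu_latency  # result after ALU latency
                del pending[i]
                break
        cycle += 1
    return schedule

prog = [
    ("I0", ("r1",), "r2"),  # r2 = f(r1)
    ("I1", ("r2",), "r3"),  # RAW-dependent on I0: must wait for r2
    ("I2", ("r4",), "r5"),  # independent: free to overtake I1
]
print(dispatch_schedule(prog, {"r1", "r4"}))
# -> [(0, 'I0'), (1, 'I2'), (2, 'I1')]: I2 overtakes the stalled I1
```

The overtake does not remove the RAW stall itself - I1 still pays the full ALU latency behind I0, which is the "critical loop" point made earlier.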
 
So much for my big bang theory; I would suggest a section: physics/rendering.
What's the latest speculation on the new NV GPU?
1024 shaders or something? 512-bit bus... 4GB @
 
So much for my big bang theory; I would suggest a section: physics/rendering.
What's the latest speculation on the new NV GPU?
1024 shaders or something? 512-bit bus... 4GB @

You do realize that with such specs nV would just have another Fermi on their hands, requiring a "2nd gen" to actually get fully working chips out too ;)
 
You do realize that with such specs nV would just have another Fermi on their hands, requiring a "2nd gen" to actually get fully working chips out too ;)

So what's the problem with Fermi? Mainly heat, I guess, but that's everyone's problem.
Time to get those superconductors up and running.
 
Kaotik said:
You do realize that with such specs nV would just have another Fermi on their hands, requiring a "2nd gen" to actually get fully working chips out too
Repeating something doesn't make it true.
 
Actually, that's true on the Internet :D

"1024 shaders and 512-bit" say nothing about how Fermi-like Kepler will be. Numbers are meaningless without context.

You're right - that was assuming an architecture similar to Fermi's, where the biggest differences are made for DP purposes and gaming-wise it would stay more or less the same - with counts scaled up due to the process shrink, of course.

But it's quite clear that when one speculates "1024 shaders", he's not going to think nV will suddenly have, say, 1024 VLIW4 shaders similar to those in Cayman, but that they will continue on the same scalar route as before, where a single "CUDA core" stays quite similar to that of Fermi.

The only thing I've seen suggested so far as a major change is removing the hotclock, which to my understanding would allow nVidia to build smaller chips (with the same CUDA core counts), but at the same time the units would be slower too.
 
You're right - that was assuming an architecture similar to Fermi's, where the biggest differences are made for DP purposes and gaming-wise it would stay more or less the same - with counts scaled up due to the process shrink, of course.

But it's quite clear that when one speculates "1024 shaders", he's not going to think nV will suddenly have, say, 1024 VLIW4 shaders similar to those in Cayman, but that they will continue on the same scalar route as before, where a single "CUDA core" stays quite similar to that of Fermi.

It doesn't make a whole lot of difference in today's GPU world whether you call the ALUs vector units or "scalar" (which they aren't anyway in the strict CPU sense). IHVs are always looking to take as much advantage as possible of low-hanging fruit, and I don't see why NV wouldn't want to consider any efficiency-oriented changes, unless someone considers NV's ALUs "perfect".

The only thing I've seen suggested so far as a major change is removing the hotclock, which to my understanding would allow nVidia to build smaller chips (with the same CUDA core counts), but at the same time the units would be slower too.

So just remove the hotclock, increase the SP count (and by extension the SM count) by an equivalent amount, and the job is done? I wish things were that simple; while it's not impossible, you might not want to go that route in order not to hit a couple of brick walls.

Let's do some speculative math:

Tesla M2050: 515 GFLOPs DP @ 238W = 2.16 GFLOPs/W

2.16 GFLOPs/W * 3x = 6.49 GFLOPs/W

225W * 6.49 GFLOPs/W = 1460 GFLOPs DP

In other words, a Kepler Tesla could have 2921 GFLOPs SP.

Since Tesla boards are typically clocked lower than high-end desktop SKUs, in the above case it wouldn't be awkward if the high-end GPU reached 3.5 TFLOPs. In order to reach that rate with 1024 SPs you'd need a frequency of roughly 1.75GHz, which is anything but skipping the hotclock - rather increasing it. So to reach those hypothetical 3.5 TFLOPs the speculative math gave us at a frequency of 1GHz, you'd need 1792 SPs - and the lower the core frequency, the higher the required SP count, obviously.
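The arithmetic above is easy to sanity-check in a few lines. It assumes 2 SP FLOPs per CUDA core per clock (FMA), as on Fermi; all figures are the thread's own guesses, not product specs:

```python
# Redoing the speculative math above. 2 SP FLOPs per core per clock
# (FMA) is assumed, as on Fermi; everything here is guesswork.

m2050_dp_gflops = 515.0
m2050_watts = 238.0

perf_per_watt = m2050_dp_gflops / m2050_watts   # ~2.16 GFLOPs/W (DP)
kepler_ppw = perf_per_watt * 3                  # assumed 3x perf/W gain
kepler_dp_gflops = 225 * kepler_ppw             # ~1460 GFLOPs DP @ 225W
kepler_sp_gflops = 2 * kepler_dp_gflops         # ~2921 GFLOPs SP (1:2 rate)

def clock_ghz(target_sp_gflops, sp_count, flops_per_clock=2):
    """Core clock needed for a target SP throughput."""
    return target_sp_gflops / (sp_count * flops_per_clock)

print(round(kepler_sp_gflops))          # ~2921 GFLOPs SP
print(round(clock_ghz(3500, 1024), 2))  # ~1.71 GHz with 1024 SPs
print(round(clock_ghz(3500, 1792), 2))  # ~0.98 GHz with 1792 SPs
```

The two `clock_ghz` calls are the trade-off in question: hit 3.5 TFLOPs with 1024 SPs at a hotclock-like ~1.7GHz, or with 1792 SPs at ~1GHz.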

And that's not even the entire story for such a puzzle: how many GPCs, or else raster/trisetup units? How many SMs or clusters per GPC? How many stream processors per SM/cluster exactly? Etc. Because if they were to sustain the current Fermi SM scheme of 2*16 SPs/SM, they'd need 56 SMs for a theoretical 1792 SP count, which sounds - at least to me - like a damn long stretch for GPUs even of the 28nm generation.

By the way, if NV chose a target frequency for Kepler that isn't below 1GHz like Tahiti's, but quite a bit higher, they might have done some "half-manual" layout of sorts for units like TMUs/ROPs, since those aren't necessarily all that tolerant of very high frequencies - otherwise GPUs would have clocked beyond 1.5GHz a long time ago.

I'd love to stand corrected, but the whole thing doesn't sound all that simple to me, and it obviously can't just amount to getting rid of the hotclock and job done.
 