NVIDIA Kepler speculation thread

More important is whether GK104 will be able to beat the HD 7970, because predicted performance is only ~10-15% faster than the GTX 580 - not enough to beat AMD's newest GPU.
 
Starting from their speculation (which is all it is), you could just as well conclude that GK104 is the same as GK110 without DP capability (or with only a 1:4 rate).
 
Maybe I'm just daft, but I'm still not seeing the difference between operand fetch from the register file and other memory operations. The whole point of buffering is to decouple operand fetch from the execution pipeline.
I don't get the connection to the instruction latency.
The operand buffering enables other instructions to overtake there (in-order issue and out-of-order dispatch to the ALUs), i.e. one instruction does not stall another independent one (but it does stall dependent ones until the result is written to the reg file to resolve RAW hazards). It is still in the critical loop that defines the latency.
It seems like a small issue compared to all the other latency hiding GPUs do anyway.
For a given problem, the issue gets more and more critical the wider the GPU gets, i.e. the fewer threads one has per vecALU.
And some constructs (like fine-grained control structures) prefer low ALU latencies, because one needs far fewer threads to fill the bubbles. Just saying one can always throw more threads at the problem isn't exactly true, as this would necessitate enlarging the register files, especially for heavier threads.
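A back-of-the-envelope way to see the thread-count cost of ALU latency being discussed here (all numbers purely illustrative, not vendor specs):

```python
# Toy model: with no ILP inside a thread (a fully dependent
# instruction chain), a thread can issue to the vecALU only once
# per ALU latency, so filling the bubbles needs roughly
# latency / issue_interval resident threads per vecALU.

def threads_to_hide_latency(alu_latency_cycles, issue_interval_cycles=1):
    """Minimum threads per vecALU to keep it busy, worst case (no ILP)."""
    return alu_latency_cycles // issue_interval_cycles

# Hypothetical latencies; register file demand grows the same way,
# since every resident thread keeps its registers allocated.
for latency in (4, 8, 18):
    print(latency, "cycle latency ->", threads_to_hide_latency(latency),
          "threads per vecALU")
```

Halving the ALU latency halves the resident-thread (and hence register file) requirement in this worst case, which is the point about fine-grained control structures above.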
 
I don't get the connection to the instruction latency.
The operand buffering enables other instructions to overtake there (in-order issue and out-of-order dispatch to the ALUs), i.e. one instruction does not stall another independent one (but it does stall dependent ones until the result is written to the reg file to resolve RAW hazards). It is still in the critical loop that defines the latency.

Meh, guess I just won't get it. I see the RAW hazard and operand buffering as two separate operations with only the former being strictly part of the ALU pipeline.

For a given problem, the issue gets more and more critical the wider the GPU gets, i.e. the fewer threads one has per vecALU.
And some constructs (like fine-grained control structures) prefer low ALU latencies, because one needs far fewer threads to fill the bubbles. Just saying one can always throw more threads at the problem isn't exactly true, as this would necessitate enlarging the register files, especially for heavier threads.

That theoretical problem would have to be short on both TLP and ILP for this to be a problem even on Fermi. So rather unsuitable for GPUs anyway.
 
Meh, guess I just won't get it. I see the RAW hazard and operand buffering as two separate operations with only the former being strictly part of the ALU pipeline.
It is quite simple: with scoreboarding, the "read operands" part of the pipeline is solely responsible for resolving RAW hazards. ;)
That theoretical problem would have to be short on both TLP and ILP for this to be a problem even on Fermi. So rather unsuitable for GPUs anyway.
Did you follow what Intel touted as one major advantage of Knights Corner over Fermi? Getting high performance for smaller problems, too. In the case of smaller matrices in a matrix multiply, a lower-latency memory system is not the only important thing. Lower-latency kernel launches and a lower number of threads (executed with lower latency) go along the same lines.
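A minimal sketch of the issue/dispatch behavior debated above: in-order issue into operand buffers, out-of-order dispatch once sources are ready, with RAW resolution happening at the "read operands" stage. The register names and the 2-cycle latency are made up for illustration:

```python
# Toy model of in-order issue with out-of-order dispatch: each
# instruction sits in an operand buffer until its source registers
# are ready (RAW resolution at "read operands"), so an independent
# younger instruction can overtake a stalled older one.

def dispatch_schedule(instrs, initially_ready, alu_latency=2):
    """instrs: list of (name, sources, dest). One dispatch per cycle,
    oldest-ready-first. Returns [(cycle, name), ...]."""
    ready_at = {r: 0 for r in initially_ready}
    pending = list(instrs)
    schedule, cycle = [], 0
    while pending:
        for i, (name, srcs, dst) in enumerate(pending):
            if all(s in ready_at and ready_at[s] <= cycle for s in srcs):
                schedule.append((cycle, name))
                ready_at[dst] = cycle + alu_latency  # result after ALU latency
                del pending[i]
                break
        cycle += 1
    return schedule

prog = [
    ("I0", ("r1",), "r2"),  # r2 = f(r1)
    ("I1", ("r2",), "r3"),  # RAW-dependent on I0: must wait for r2
    ("I2", ("r4",), "r5"),  # independent: free to overtake I1
]
print(dispatch_schedule(prog, {"r1", "r4"}))
# -> [(0, 'I0'), (1, 'I2'), (2, 'I1')]: I2 overtakes the stalled I1
```

The overtake does not remove the RAW stall itself - I1 still pays the full ALU latency behind I0, which is the "critical loop" point made earlier.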
 
So much for my big bang theory; I would suggest a section: physics/rendering.
What's the latest speculation on the new NV GPU?
1024 shaders or something? 512-bit bus... 4GB @
 
So much for my big bang theory; I would suggest a section: physics/rendering.
What's the latest speculation on the new NV GPU?
1024 shaders or something? 512-bit bus... 4GB @

You do realize that with such specs nV would just have another Fermi on their hands, requiring a "2nd gen" to actually get fully working chips out too ;)
 
You do realize that with such specs nV would just have another Fermi on their hands, requiring a "2nd gen" to actually get fully working chips out too ;)

So what's the problem with Fermi? Mainly heat, I guess, but that's everyone's problem.
Time to get those superconductors up and running.
 
Kaotik said:
You do realize that with such specs nV would just have another Fermi on their hands, requiring a "2nd gen" to actually get fully working chips out too
Repeating something doesn't make it true.
 
Actually, that's true on the Internet :D

"1024 shaders and 512-bit" say nothing about how Fermi-like Kepler will be. Numbers are meaningless without context.

You're right - that was assuming an architecture similar to Fermi's, where the biggest differences are made for DP purposes and gaming-wise it would stay more or less the same - with counts scaled up due to the process shrink, of course.

But it's quite clear that when one speculates "1024 shaders", he's not going to think nV will suddenly have, say, 1024 VLIW4 shaders similar to those in Cayman, but that they will continue on the same scalar route as before, where a single "CUDA core" stays quite similar to that of Fermi.

The only thing I've seen suggested so far as a major change is removing the hotclock, which to my understanding would allow nVidia to build smaller chips (with the same CUDA core counts), but at the same time the units would be slower too.
 
You're right - that was assuming an architecture similar to Fermi's, where the biggest differences are made for DP purposes and gaming-wise it would stay more or less the same - with counts scaled up due to the process shrink, of course.

But it's quite clear that when one speculates "1024 shaders", he's not going to think nV will suddenly have, say, 1024 VLIW4 shaders similar to those in Cayman, but that they will continue on the same scalar route as before, where a single "CUDA core" stays quite similar to that of Fermi.

It doesn't make a whole lot of difference in today's GPU world whether you call the ALUs vector units or "scalar" (which they aren't anyway in the strict CPU sense). IHVs are always looking to take as much advantage as possible of low-hanging fruit, and I don't see why NV wouldn't want to consider any efficiency-oriented changes, unless someone considers NV's ALUs "perfect".

The only thing I've seen suggested so far as a major change is removing the hotclock, which to my understanding would allow nVidia to build smaller chips (with the same CUDA core counts), but at the same time the units would be slower too.

So just remove the hotclock, increase the SP count (and by extension the SM count) by an equivalent amount, and the job is done? I wish things were that simple; while it's not impossible, you might not want to go that route in order not to hit a couple of brick walls.

Let's do some speculative math:

Tesla M2050: 515 GFLOPs DP @ 238W = 2.16 GFLOPs/W

2.16 GFLOPs/W * 3x = 6.49 GFLOPs/W

225W * 6.49 GFLOPs/W = 1460 GFLOPs DP

In other words, a Kepler Tesla could have 2921 GFLOPs SP.

Since Tesla boards are typically clocked lower than high-end desktop SKUs, in the above case it wouldn't be awkward if the high-end GPU reached 3.5 TFLOPs. In order to reach that rate with 1024 SPs you'd need a frequency of roughly 1.75GHz, which is anything but skipping the hotclock - rather increasing it. So to reach those hypothetical 3.5 TFLOPs the speculative math gave us at a frequency of 1GHz, you'd need 1792 SPs - and the lower the core frequency, the higher the required SP count, obviously.
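The arithmetic above is easy to sanity-check in a few lines. It assumes 2 SP FLOPs per CUDA core per clock (FMA), as on Fermi; all figures are the thread's own guesses, not product specs:

```python
# Redoing the speculative math above. 2 SP FLOPs per core per clock
# (FMA) is assumed, as on Fermi; everything here is guesswork.

m2050_dp_gflops = 515.0
m2050_watts = 238.0

perf_per_watt = m2050_dp_gflops / m2050_watts   # ~2.16 GFLOPs/W (DP)
kepler_ppw = perf_per_watt * 3                  # assumed 3x perf/W gain
kepler_dp_gflops = 225 * kepler_ppw             # ~1460 GFLOPs DP @ 225W
kepler_sp_gflops = 2 * kepler_dp_gflops         # ~2921 GFLOPs SP (1:2 rate)

def clock_ghz(target_sp_gflops, sp_count, flops_per_clock=2):
    """Core clock needed for a target SP throughput."""
    return target_sp_gflops / (sp_count * flops_per_clock)

print(round(kepler_sp_gflops))          # ~2921 GFLOPs SP
print(round(clock_ghz(3500, 1024), 2))  # ~1.71 GHz with 1024 SPs
print(round(clock_ghz(3500, 1792), 2))  # ~0.98 GHz with 1792 SPs
```

The two `clock_ghz` calls are the trade-off in question: hit 3.5 TFLOPs with 1024 SPs at a hotclock-like ~1.7GHz, or with 1792 SPs at ~1GHz.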

And that's not even the entire story for such a puzzle: how many GPCs, or else raster/trisetup units? How many SMs or clusters per GPC? How many stream processors per SM/cluster exactly? Etc. Because if they were to sustain the current Fermi SM scheme of 2*16 SPs/SM, they'd need 56 SMs for a theoretical 1792 SP count, which sounds - at least to me - like a damn long stretch for GPUs even of the 28nm generation.

By the way, if NV chose a target frequency for Kepler that isn't below 1GHz like Tahiti's, but quite a bit higher, they might have done some "half-manual" layout of sorts for units like TMUs/ROPs, since those aren't necessarily all that tolerant of very high frequencies - otherwise GPUs would have clocked beyond 1.5GHz a long time ago.

I'd love to stand corrected, but the whole thing doesn't sound all that simple to me, and it obviously can't just amount to getting rid of the hotclock and job done.
 