NVIDIA Kepler speculation thread

Is a 32-bit MADD/FMA ALU exactly half of a 64-bit MADD/FMA ALU (i.e. there are no extra bits needed to calculate to that precision)? Does it change the way the registers and caches load and store data, etc.? If there are differences like the above and they cost more power and die space, then that's a likely explanation.

If you have 32x32 multiply-adders you need 4 of them to make a 64x64 multiplier. The basic decomposition is:

v64 = v32a + (v32b * (2^32))
v64x * v64y = (v32xa + (v32xb * (2^32))) * (v32ya + (v32yb * (2^32))) =
(v32xa * v32ya) + (v32xa * v32yb * (2^32)) + (v32xb * v32ya * (2^32)) + (v32xb * v32yb * (2^32) * (2^32))

Which is four multiplies (one multiply and three MADDs, rather) when you consider that the * (2^32) parts are simple shifts. It's a little different for signed vs unsigned, but that's the gist of it.
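
As a rough illustration of that decomposition (plain C, nothing NVIDIA-specific, all names mine), here is a 64x64-to-128-bit multiply built from four 32x32-to-64-bit multiplies:

```c
#include <stdio.h>
#include <stdint.h>

/* Sketch of the decomposition above: a 64x64 -> 128-bit multiply built
   from four 32x32 -> 64-bit partial products, each accumulated at its
   shift position (2^0, 2^32, 2^32, 2^64). */
static void mul64x64(uint64_t x, uint64_t y, uint64_t *hi, uint64_t *lo)
{
    uint32_t xa = (uint32_t)x, xb = (uint32_t)(x >> 32);
    uint32_t ya = (uint32_t)y, yb = (uint32_t)(y >> 32);

    uint64_t p0 = (uint64_t)xa * ya;   /* weight 2^0  */
    uint64_t p1 = (uint64_t)xa * yb;   /* weight 2^32 */
    uint64_t p2 = (uint64_t)xb * ya;   /* weight 2^32 */
    uint64_t p3 = (uint64_t)xb * yb;   /* weight 2^64 */

    uint64_t mid = p1 + (p0 >> 32);            /* cannot overflow 64 bits */
    mid += p2;
    uint64_t carry = (mid < p2) ? (1ULL << 32) : 0;

    *lo = (mid << 32) | (uint32_t)p0;
    *hi = p3 + (mid >> 32) + carry;
}

int main(void)
{
    uint64_t hi, lo;
    mul64x64(0xDEADBEEFCAFEBABEULL, 0x123456789ABCDEF0ULL, &hi, &lo);
    printf("hi = %016llx, lo = %016llx\n",
           (unsigned long long)hi, (unsigned long long)lo);
    return 0;
}
```

The three accumulations of p1, p2 and p3 are where the "three MADDs" come from in hardware.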

Of course, for floating point the multiplier is just over the significand. So for FP32 you want a 23x23 multiplier and for FP64 a 52x52 multiplier. But we know that on Fermi 32-bit integer multiplies were single-cycle, so it at least had 32x32 multipliers. It's possible the ALUs had a 32x64 multiplier, which would need two cycles for FP64 multiplication. That would also mean 64-bit integer multiply takes only two cycles. It's also possible it was only 32x52 or something like that. Or it could have had something totally different, like half FP64 and half FP32 units.

This is just for the actual multiplication part of the multiplier. There's other stuff in the floating point pipeline, like normalization, that wouldn't scale the same way.
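
To make the significand point concrete, here is a toy sketch (plain C, truncation instead of rounding, normal positive inputs only, all names mine): it multiplies two FP32 values using only an integer multiply over the significands plus the renormalization step mentioned above.

```c
#include <stdio.h>
#include <stdint.h>
#include <string.h>

/* Illustrative only: multiply the significands of two normal, positive
   floats with integer arithmetic, roughly the way a hardware FP
   multiplier does, then renormalize. Rounding is truncation here, so
   the last bit may differ from the FPU result. */
static float significand_mul(float a, float b)
{
    uint32_t ia, ib;
    memcpy(&ia, &a, 4);
    memcpy(&ib, &b, 4);

    /* 23 stored fraction bits plus the implicit leading 1 */
    uint64_t ma = (ia & 0x7FFFFF) | 0x800000;
    uint64_t mb = (ib & 0x7FFFFF) | 0x800000;
    int ea = (int)((ia >> 23) & 0xFF) - 127;
    int eb = (int)((ib >> 23) & 0xFF) - 127;

    uint64_t prod = ma * mb;          /* up to 48 bits */
    int e = ea + eb;
    if (prod & (1ULL << 47)) {        /* renormalize if product >= 2.0 */
        prod >>= 1;
        e += 1;
    }
    uint32_t frac = (uint32_t)((prod >> 23) & 0x7FFFFF);
    uint32_t out = ((uint32_t)(e + 127) << 23) | frac;
    float r;
    memcpy(&r, &out, 4);
    return r;
}

int main(void)
{
    printf("%g vs %g\n", significand_mul(1.5f, 2.25f), 1.5f * 2.25f);
    return 0;
}
```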
 
So is it true, what I think Fudzilla said in passing, that the other Kepler chips are delayed because of a lack of 28nm capacity?

Is the supposed late April or May launch of the salvage GK104s still on, anybody?
 
Didn't LuxMark show a massive decline in performance with recent drivers on GTX580?
It did -- for almost 6 months now. Sloppy OCL1.1 stack and no sign of 1.2 implementation anytime soon. Go figure where it all lies in the priority list for NV... :rolleyes:
 

You haven't offered me anything to dodge. Your guesswork is as good as mine.

All I am saying is that the compiler is not to blame for GK104's sucky compute performance. It's the cache subsystem and the register-file vs. ALU size trade-off made here. It has too little RF and cache to do justice to its ALUs in compute.

Yep, and I accepted that possibility several times. Still says nothing about big Kepler being similarly handicapped.

Static scheduling is not going anywhere. It's here to stay.

The flops/register ratio hasn't changed from GF110->GK104. So how does your theory of insufficient latency hiding explain why it is slower than GF110 in some workloads?

Why do you think that won't be the case for the big Kepler chip? Have you spent time on the presentation I linked in post 3966?

Why would they make it even harder to extract GPU performance for their HPC customers? Haven't looked at the presentation, will check it out.
 

Yeah, I think Gipsel brought this up a few months ago, stating that future nVidia architectures will follow a GCN-like approach where arithmetic latencies are known, so there's no point in doing heavyweight scoreboarding for those. Cycle counters will work just fine.

Let's see if big Kepler drops the dual-issue requirement. Would like to see how a compiler-scheduled, single-issue configuration does against Fermi.
 
I think the number of active threads per SMX is the concern in this case.

Dual-issue may result in more bubbles in the pipeline but it won't reduce net throughput. It will be less efficient in terms of occupancy but doesn't explain the drop in absolute performance.

Number of registers per thread hasn't changed AFAIK.

Edit: Actually the minimum number of registers per thread has increased according to PCPer and HWC reviews. Each Kepler SMX now manages 64 warps, up from 48 in Fermi. So 2x the registers for 1.33x the threads. That further weakens rpg.13's theory.
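
For reference, the arithmetic behind that, using the commonly quoted register-file sizes of 32K 32-bit registers per Fermi SM and 64K per Kepler SMX (those exact figures are my assumption, not from the reviews):

```c
#include <stdio.h>

/* Registers available per thread at full occupancy, assuming the
   commonly quoted register-file sizes: 32768 32-bit registers per
   Fermi SM and 65536 per Kepler SMX (my numbers, not from the post). */
static void regs_per_thread(const char *chip, int regfile, int max_warps)
{
    int threads = max_warps * 32;               /* 32 threads per warp */
    printf("%s: %d regs / %d threads = ~%d registers per thread\n",
           chip, regfile, threads, regfile / threads);
}

int main(void)
{
    regs_per_thread("GF110 SM ", 32768, 48);    /* ~21 */
    regs_per_thread("GK104 SMX", 65536, 64);    /*  32 */
    return 0;
}
```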
 
The flops/register ratio hasn't changed from GF110->GK104. So how does your theory of insufficient latency hiding explain why it is slower than GF110 in some workloads?
I'm not trying to explain anything, since the data is generally crap. My only observation is that two instructions from the same hardware thread can't be issued consecutively (whereas Fermi and prior apparently could). As far as I can tell this constraint applies whether or not the second instruction is dependent upon the first. GCN doesn't have any constraint here (well, there'll be some indexed-register modes that do raise this constraint, but I'm ignoring them).

Now that may be relevant in kernels with heavy register allocation, i.e. with a small pool of hardware threads that can be scheduled.

That's the end of my thoughts on this ALU architecture. When the detailed tests are done, we'll know more.

Why would they make it even harder to extract GPU performance for their HPC customers?
Because their research shows that relying upon "minimal compilation effort" requires hardware overheads that totally dominate power consumption. NVidia is directing its efforts towards getting the right data to the right place at the right time.

Fancy instruction scheduling just in front of the ALUs doesn't solve the much bigger problem of where to put data (DDR, L2, L1, or registers) and when to move it. And the problem gets much worse when you scale to multi-chip and multi-cabinet systems. Spending transistors on a fancy instruction scheduler doesn't make moving data around an exaflop cluster any better.

NVidia's aim is to invest in the tool chain to identify the correct memory hierarchy for apps and to increase the abstraction between programming model and architecture. Then rely upon compilation and analysis tools to optimise for the hardware you have at hand.

GK104 is the first step towards a chip that "gets out of the way" of these tools.
 
Why would they make it even harder to extract GPU performance for their HPC customers?
Well, I will be very interested to see how this plays out in the coming months, to see how much extra compute performance nVidia is able to squeeze out through better drivers.
 
Furthermore, Kepler's pipeline latency is now halved as well.

* The execution time for an instruction (those with maximum throughput, anyway) is 11 clock cycles, down from 22 in Fermi.
* Throughput of 32-bit operations is no longer identical across operation types. Max throughput per clock on an SMX is: 192 floating-point multiply-adds, 168 integer adds, or 136 logical operations. These all had the same throughput in compute capability 2.0.
* Relative to the throughput of single-precision multiply-add, the throughput of integer shifts, integer comparisons, and integer multiplication is lower than before.
* The throughput of the intrinsic special functions relative to single-precision floating-point MAD is slightly higher than in compute capability 2.0.
* Max # of blocks (16), warps (64), and threads (2048) per SMX

GF110: 22 cycles latency, max 24 single-issued warps per scheduler, min 21 registers per thread.
GK104: 11 cycles latency, max 16 dual-issued warps per scheduler, min 32 registers per thread.
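
A back-of-the-envelope way to read those numbers (my own arithmetic and assumptions, in particular that a warp contributes only one instruction of a dependent chain at a time): warps needed per scheduler to hide ALU latency is roughly latency times the issue rate.

```c
#include <stdio.h>

/* Back-of-the-envelope latency hiding: if each scheduler must find an
   independent warp for every issue slot while a dependent chain waits
   out the ALU latency, it needs roughly latency * issues-per-clock
   warps. The one-in-flight-instruction-per-warp assumption is mine. */
static void warps_to_hide(const char *chip, int latency,
                          int issues_per_clock, int max_warps)
{
    int needed = latency * issues_per_clock;
    printf("%s: ~%d warps needed per scheduler, %d available\n",
           chip, needed, max_warps);
}

int main(void)
{
    warps_to_hide("GF110 (single-issue, 22-cycle)", 22, 1, 24);
    warps_to_hide("GK104 (dual-issue, 11-cycle)",   11, 2, 16);
    return 0;
}
```

On those assumptions GF110 just fits while GK104 comes up short, which is one way to read the occupancy concern raised above.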
 
Another open question is whether Kepler's static scheduling only supports a single in-flight instruction per warp per SIMD. All architectures since G80 can have multiple instructions in flight from the same warp. If they drop this capability that's a significant source of ILP that they've abandoned. I see no reason why you couldn't statically schedule multiple in-flight instructions though.
 