From the realworldtech chart, the doubling of compute was not accompanied by a doubling of warps per SMX. So it looks like there are actually more registers available per warp than on GF104, but fewer warps relative to the compute throughput and less L1 capacity per warp.
I think the numbers in that chart about register file size per work item may be a bit misleading in a lot of cases. After all, the numbers of work-groups or work-items per core are maximum values the hardware can support for very lightweight threads. Often this isn't that important, hence Kepler made compromises there by supporting fewer work-groups (relative to its compute capabilities), which skews the numbers.
To look at it from the other side, one can check how many work-items the register file is able to support in the case of "heavy" threads, i.e. a case where each thread needs for instance 64 registers (iirc that is the maximum for nV; AMD GPUs support up to 128 regs per thread).
For GF100 and GF104 this works out to 512 work-items or 16 work-groups, for GK104 it is 1024 work-items or 32 work-groups (I assume GK110 will be different), and for GCN/Tahiti it is 1024 work-items or 16 work-groups.
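To make the arithmetic explicit, here is a quick Python sketch of how I arrive at those numbers. The register file sizes per SM/SMX/CU (32768 32-bit registers for GF100/GF104, 65536 for GK104 and for a Tahiti CU) are my assumption of the sizes behind the chart, not something stated in it:

```python
# Back-of-the-envelope check of the "heavy thread" occupancy numbers
# above: 64 registers per thread, register file sizes per core assumed
# (GF100/GF104: 32768 regs, GK104: 65536, GCN/Tahiti CU: 65536).

chips = {
    # name: (registers per SM/SMX/CU, work-group (warp/wavefront) width)
    "GF100/GF104": (32768, 32),
    "GK104":       (65536, 32),
    "GCN/Tahiti":  (65536, 64),
}

REGS_PER_THREAD = 64  # the "heavy" case discussed above

for name, (regfile, width) in chips.items():
    work_items = regfile // REGS_PER_THREAD
    work_groups = work_items // width
    print(f"{name}: {work_items} work-items = {work_groups} work-groups")
```

Running it reproduces the 512/16, 1024/32, and 1024/16 figures, the difference for GCN coming purely from the 64-wide wavefronts.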
But now one has to take the issue rate into account. For GF100 it is a single instruction for two work-groups per (base clock) cycle, which means instructions from up to 8 work-groups per scheduler can overlap in flight (the scheduler needs 8 cycles to issue one instruction from each of them), and those are what is available for latency hiding. This is in fact not enough to hide the arithmetic latencies (10 base clock cycles), let alone any memory latencies. GF104 is slightly worse: as its issue rate can be higher, one runs more often into the situation where a memory access is pending and no arithmetic instructions are left for issue to do something useful.
The same is true for GK104, where up to 8 instructions from 4 warps can be issued per cycle. That means there are again only 8 work-groups each scheduler can choose from, and it is quite likely to run out of ready wavefronts. I have no idea how the arithmetic latencies compare to Fermi's; they probably changed.
GCN issue rates are slightly harder to compare. Generally a CU can schedule only 1 vector ALU (vALU) instruction per cycle (plus 1 scalar + 0.5 local memory access + 0.25 vector memory access + 0.25 export + 1 branch + 1 internal instruction per cycle); in fact each of the 4 schedulers can issue one vALU instruction every four cycles. That means there are 16 work-groups available for latency hiding (which need at least 16 cycles to schedule one instruction from each), significantly more than Fermi and also Kepler have at their disposal.
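The same comparison as a sketch, covering all three cases above. The work-group counts are the heavy-thread numbers from before; the per-scheduler issue intervals are my reading of the pipelines (GK104's dual-issue ports ignored for simplicity), so treat the exact figures with some care:

```python
# How many heavy-thread work-groups each scheduler can rotate through,
# and how many cycles one full rotation takes (one instruction issued
# from each work-group in the scheduler's queue).

archs = {
    # name: (work-groups in flight per core, schedulers,
    #        cycles between issues from the same work-group's slot)
    "GF100":  (16, 2, 1),  # 1 instruction per scheduler per base clock
    "GK104":  (32, 4, 1),  # ignoring the second issue port per scheduler
    "GCN CU": (16, 4, 4),  # 1 vALU instruction per SIMD every 4 cycles
}

for name, (groups, scheds, interval) in archs.items():
    per_sched = groups // scheds
    rotation = per_sched * interval
    print(f"{name}: {per_sched} work-groups/scheduler, "
          f"{rotation} cycles per full rotation")
```

The point of the exercise: GF100 and GK104 both end up with 8 work-groups per scheduler and an 8-cycle rotation, shorter than the ~10 base clock cycles of Fermi's arithmetic latency, while a GCN CU rotates through its 16 work-groups over 16 cycles and so has much more slack for hiding latencies.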