NVIDIA Kepler speculation thread

Yes, that's what I wanted to say: the GF104 SMs don't look that much more efficient now, and if you extended them for faster FP64 they might not really have much of an advantage.

From what I saw, they mostly look pretty weak when the shaders are short - which is quite often the case in synthetic programs trying to measure maximum throughput.

I jumped to the wrong conclusion in the first place also…
 
From what I saw, they mostly look pretty weak when the shaders are short - which is quite often the case in synthetic programs trying to measure maximum throughput.
Well, that those SMs can't run at peak rate with only 40 instructions per shader follows directly from the 64-bit-per-clock export limit alone (there could well be other reasons too). With 40 instructions you get 48/40 pixels per clock at peak rate, which is 1.2 pixels/clock. But that's at hot clock; at normal clock that's already 2.4 pixels per clock - 20% over the 2 pixels (64-bit) per clock limit. So because of that alone you'd need at least 48 instructions to reach peak efficiency (and 32 instructions for GF100/GF110); a quick sanity check of that arithmetic is sketched at the end of this post.
(edit: that's only true for the scalar sequence of course, it shouldn't have influenced the vec4 40 instruction shader result. So there's definitely something else at work too.)
But I can't tell if that makes much of a difference for today's apps - I could definitely imagine some shaders are still quite short, but the export restriction might not be the only thing those hit. If you run something like Quake 3 the pixel shader will certainly never get close to even 32 instructions, but then again it's going to be limited by texturing too.
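For the record, here's the back-of-the-envelope check in Python. The 48 and 32 ALUs per SM, the 2x hot clock and the 64-bit (2 pixel) per clock export limit are the figures from above; everything else is just the obvious arithmetic, nothing measured.

Code:
# Toy calculation: how many ALU instructions a pixel shader needs before an SM
# is no longer throttled by the 64-bit (2 pixels) per base clock export limit.
# Assumptions: hot clock = 2x base clock, one instruction per ALU per hot clock.

EXPORT_LIMIT = 2  # pixels per base clock (64-bit colour export)

def pixels_per_base_clock(alus_per_sm, shader_length):
    """Peak pixel output of one SM for a shader of the given ALU instruction count."""
    return 2 * alus_per_sm / shader_length   # factor 2: hot clock vs. base clock

def min_shader_length(alus_per_sm):
    """Shortest shader that no longer exceeds the export limit."""
    length = 1
    while pixels_per_base_clock(alus_per_sm, length) > EXPORT_LIMIT:
        length += 1
    return length

for name, alus in (("GF104 SM", 48), ("GF100/GF110 SM", 32)):
    print(name, "-> 40-instr shader:", pixels_per_base_clock(alus, 40), "pixels/clock,",
          "needs >=", min_shader_length(alus), "instructions for full ALU utilisation")
# GF104: 2.4 pixels/clock at 40 instructions, needs >= 48 instructions.
# GF100/GF110: 1.6 pixels/clock at 40 instructions, needs >= 32 instructions.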
 
A small but practical question: when exactly is Kepler expected to be released? Will it be ready for BF3 (October 25), or are we sailing into the November-December timeframe?
 
Not until there's some pipe-cleaner SKU on the new 28nm manufacturing process, applied to the current generation. A new architecture on a new process is traditionally avoided by the IHVs as too volatile a mix.
 
A small but practical question: when exactly is Kepler expected to be released? Will it be ready for BF3 (October 25), or are we sailing into the November-December timeframe?

So far, all we have is this:

[Roadmap slide: Nvidia-GTX-600-to-Be-Released-in-Q4-2011-Using-28nm-Manufacturing-Node-Rumors-Say-2.jpg]


It says 2011, but then again it also says 2009 for Fermi and 2007 for Tesla. So I'm thinking early 2012.
 
The next GTC is May 2012. They're skipping 2011 entirely and that could be because they have nothing to show as yet.
 
The next GTC is May 2012. They're skipping 2011 entirely and that could be because they have nothing to show as yet.

Wow, I didn't know that… That's not a very good sign. I sure hope they can manage to release something before May 2012, especially if AMD succeeds in launching Southern Islands this year in 28nm, or it's going to be HD 5870 vs GTX 285 all over again.
 
Wow, I didn't know that… That's not a very good sign. I sure hope they can manage to release something before May 2012, especially if AMD succeeds in launching Southern Islands this year in 28nm, or it's going to be HD 5870 vs GTX 285 all over again.

It sorta depends on what Kepler is bringing to the table. Anand surmises that Kepler will still be CUDA 4.x. If it's just a tweaked Fermi then there's no reason to build a GTC around it. Next year's GTC could very well be focusing on 2013 and Maxwell. Just gotta wait and see but yeah if AMD has that kinda lead on 28nm nVidia will be in a rough spot.
 
Interesting paper by Lindholm and Dally. Lindholm designed G80's shader core and Dally is nVidia's current chief architect so this stuff could show up in future architectures. The goal is to reduce power consumption by simplifying the scheduler and register file access.

First bit focuses on creating a two-level scheduler hierarchy. The first level holds all warps including those waiting for long latency instructions (texture and global mem fetches). The second level contains only 6-8 active threads. An active thread is one that's executing or waiting on a low latency instruction (ALU or shared memory fetches). All instruction issue happens from the small active pool and threads are swapped between pools as necessary.
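If it helps, here's a toy Python model of that two-level idea. The 6-slot active pool and the greedy selection are as described above; the latencies, the instruction mix and the single issue slot per cycle are my own made-up numbers, so treat it as a sketch of the mechanism rather than anything quantitative.

Code:
import random
from collections import deque

# Toy model of the two-level scheduler: a big outer pool holds warps stalled on
# long-latency ops (texture/global loads); only the small inner "active" pool
# (6 slots here) is considered for instruction issue.

ACTIVE_SLOTS = 6
LATENCY = {"alu": 8, "smem": 20, "gmem": 400}   # cycles, illustrative only

class Warp:
    def __init__(self, wid):
        self.wid = wid
        self.ready_at = 0   # cycle when this warp's next instruction may issue

def simulate(num_warps=32, cycles=20000):
    warps = [Warp(i) for i in range(num_warps)]
    active = list(warps[:ACTIVE_SLOTS])
    pending = deque(warps[ACTIVE_SLOTS:])
    issued = 0
    for cycle in range(cycles):
        # Refill the active pool with pending warps whose long-latency op is done.
        while len(active) < ACTIVE_SLOTS and pending and pending[0].ready_at <= cycle:
            active.append(pending.popleft())
        # Greedy issue: pick the first ready warp in the active pool.
        for warp in active:
            if warp.ready_at > cycle:
                continue
            op = random.choices(("alu", "smem", "gmem"), weights=(8, 2, 1))[0]
            warp.ready_at = cycle + LATENCY[op]
            issued += 1
            if op == "gmem":
                # Long-latency instruction: demote the warp to the outer pool.
                active.remove(warp)
                pending.append(warp)
            break   # one issue slot per cycle in this toy model
    return issued / cycles

print("issue-slot utilisation with a 6-entry active pool:", simulate())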

The second complementary piece is register file caching. According to the paper the register file consumes 10% of chip dynamic power consumption and a single access uses more power than an FMA. The idea is to cache 6 registers per thread in a small multi-ported register file with single cycle access to all operands. So no need for an operand collector and the register cache can be located nearer to the ALUs. Only the registers from active threads are cached. Flushes to the main register file are required if a thread gets evicted from the active pool. Compiler hints help reduce flushes by marking registers as dead after their last reference in the instruction stream.
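And a similarly hedged sketch of the register file cache part: 6 entries per thread, FIFO replacement, write-back only on eviction or when the thread is swapped out of the active pool, and a compiler last-read hint that lets an entry die without a write-back. The class and its counters are purely illustrative, with two simplifications noted in the comments.

Code:
from collections import OrderedDict

# Toy register file cache (RFC): 6 entries per thread, FIFO replacement,
# write-back to the main register file only on eviction or thread swap-out.
# A compiler "last read" hint lets an entry be dropped with no write-back.
# Simplifications: every cached entry is treated as dirty, and values fetched
# from the main RF on a miss are not allocated into the cache.

class RegisterFileCache:
    def __init__(self, entries=6):
        self.entries = entries
        self.cache = OrderedDict()   # reg -> value, insertion order = FIFO order
        self.main_rf_reads = 0
        self.main_rf_writes = 0

    def read(self, reg, last_use=False):
        if reg in self.cache:
            # Hit: no main register file access. A last-read hint kills the entry.
            return self.cache.pop(reg) if last_use else self.cache[reg]
        self.main_rf_reads += 1      # miss: operand comes from the main RF
        return None

    def write(self, reg, value):
        self.cache.pop(reg, None)
        if len(self.cache) >= self.entries:
            self.cache.popitem(last=False)   # FIFO victim...
            self.main_rf_writes += 1         # ...has to be flushed to the main RF
        self.cache[reg] = value

    def flush(self):
        # Thread evicted from the active pool: everything goes back to the main RF.
        self.main_rf_writes += len(self.cache)
        self.cache.clear()

# A result produced by one instruction and consumed (for the last time) by the
# next one never touches the main register file at all:
rfc = RegisterFileCache()
rfc.write("r5", 3.14)
rfc.read("r5", last_use=True)
print(rfc.main_rf_reads, rfc.main_rf_writes)   # -> 0 0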

All this work for a grand total of 3.6% reduction in power consumption plus an increase in power consumption due to register cache flushes. :)

http://cva.stanford.edu/publications/2011/gebhart-isca-2011.pdf
 
What is a bit funny is that half of their reduced number of reads of the main register file are actually reads of results of the immediately preceding instruction. AMD's VLIW designs caught those completely with their pipeline registers PV and PS (and they even managed to use that to reduce instruction latency, as opposed to the proposed solution in the paper).

Edit:
And the whole modeling is done for an ALU latency of 8 cycles (maybe 12 cycles including register reads and writes?). Otherwise they would need roughly twice the number of active threads and therefore twice the size of the register file cache (which is expensive and limits the power savings). It would also limit the usefulness of the 2-level scheduling. Does this mean nvidia will go in the direction of simplifying things a bit and reducing latency in the future? When I think about it, Fermi already reduced latency compared to earlier designs.
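To put rough numbers on that scaling argument: the 8-cycle ALU latency is the paper's figure; 6 RFC entries per thread, 32 threads per warp, 4 bytes per register and one dependent instruction per warp per latency period are my assumptions.

Code:
# Back-of-the-envelope: active warps needed to cover ALU latency with purely
# dependent code, and the register file cache capacity that implies.

def active_warps_needed(alu_latency_cycles, issue_interval_cycles=1):
    # A warp can issue a dependent instruction only every `alu_latency` cycles,
    # so the active pool needs roughly latency / issue_interval warps to stay busy.
    return -(-alu_latency_cycles // issue_interval_cycles)  # ceiling division

def rfc_bytes(active_warps, entries_per_thread=6, threads_per_warp=32, bytes_per_reg=4):
    return active_warps * threads_per_warp * entries_per_thread * bytes_per_reg

for latency in (8, 16):
    warps = active_warps_needed(latency)
    print(f"{latency}-cycle ALU latency -> ~{warps} active warps, "
          f"~{rfc_bytes(warps) // 1024} KiB of register file cache")
# 8 cycles  -> ~8 active warps,  ~6 KiB of RFC
# 16 cycles -> ~16 active warps, ~12 KiB of RFC (twice the size, as argued above)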
 
Interesting paper by Lindholm and Dally. Lindholm designed G80's shader core and Dally is nVidia's current chief architect so this stuff could show up in future architectures. The goal is to reduce power consumption by simplifying the scheduler and register file access.

First bit focuses on creating a two-level scheduler hierarchy. The first level holds all warps including those waiting for long latency instructions (texture and global mem fetches). The second level contains only 6-8 active threads. An active thread is one that's executing or waiting on a low latency instruction (ALU or shared memory fetches). All instruction issue happens from the small active pool and threads are swapped between pools as necessary.

The second complementary piece is register file caching. According to the paper the register file consumes 10% of chip dynamic power consumption and a single access uses more power than an FMA. The idea is to cache 6 registers per thread in a small multi-ported register file with single cycle access to all operands. So no need for an operand collector and the register cache can be located nearer to the ALUs. Only the registers from active threads are cached. Flushes to the main register file are required if a thread gets evicted from the active pool. Compiler hints help reduce flushes by marking registers as dead after their last reference in the instruction stream.

All this work for a grand total of 3.6% reduction in power consumption plus an increase in power consumption due to register cache flushes. :)

http://cva.stanford.edu/publications/2011/gebhart-isca-2011.pdf

The paper says:

Our own estimates show that the access and wire energy required to read an instruction's operands is twice that of actually performing a fused multiply-add [15].
[my emphasis]

That's an important distinction. My intuition is that it takes a lot of energy not so much because the power requirements are high, but because it takes time to access the registers. This would explain how having a closer register cache might help.
 
What is a bit funny is that half of their reduced number of reads of the main register file are actually reads of results of the immediately preceding instruction. AMD's VLIW designs caught those completely with their pipeline registers PV and PS (and they even managed to use that to reduce instruction latency, as opposed to the proposed solution in the paper).

Do you reckon that would still be possible in Southern Islands?
 
Interesting paper - didn't read all of it yet, but this seems like the most important point: "Prior work examining a previous generation NVIDIA GTX280 GPU (which has 64 KB of register file storage per SM), estimates that nearly 10% of total GPU power is consumed by the register file [16]. Our own estimates show that the access and wire energy required to read an instruction's operands is twice that of actually performing a fused multiply-add". Also keep in mind the dual-scheduler approach may save a tiny bit of power beyond the RF caching itself.

This also very nicely explains how Bill Dally's Echelon presentation lane diagram on Page 39 got away with only 6 register reads for 4 FMAs plus 2 LSIs (requiring 12 register reads to feed the FMAs alone). It even explicitly calls them "main registers". Of course, it should also be able to share operands between ALUs being issued on the same cycle.

Gipsel, it's a good point that AMD's PS and PV registers result in similar savings at a much lower cost. However the savings are also much smaller; I'm not sure where you're getting the idea you can get 50% of the reuse with it as Figure 6 clearly says otherwise. And amusingly enough, I seem to remember that it was R300 which introduced that mechanism because of the extra register bandwidth required for its ADD+MADD pipeline! AFAICT there's no reason they couldn't keep doing this on GCN.

Also this does not really limit NVIDIA to an artificially low number of ALU pipeline stages. The lower efficiency seen with fewer active warps is mostly related to Instruction Level Parallelism. In fact, the efficiency loss from 8 active warps to 4 active warps should be nearly identical to going from scalar to 2-way VLIW with 8 active warps! It's a trade-off just like everything else.
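A really crude way to see that trade-off: what matters for latency hiding is roughly (active warps) x (independent operations each warp has to expose), so halving the warp count or doubling the issue width per warp puts the same ILP burden on the code. The numbers below are purely illustrative.

Code:
# How many independent operations each warp must expose to keep the pipeline
# fed, for a given number of active warps and per-warp issue width.

def required_ilp_per_warp(in_flight_needed, active_warps, issue_width=1):
    return issue_width * in_flight_needed / active_warps

# Say the pipeline needs ~8 independent operations in flight:
print(required_ilp_per_warp(8, active_warps=8))                 # scalar, 8 warps     -> 1.0
print(required_ilp_per_warp(8, active_warps=4))                 # scalar, 4 warps     -> 2.0
print(required_ilp_per_warp(8, active_warps=8, issue_width=2))  # 2-way VLIW, 8 warps -> 2.0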
 
Do you reckon that would still be possible in Southern Islands?
I would think that with proper result forwarding and pipelined register reads not contributing to the instruction latency (as in CPUs) it is very possible to get a 4 cycle latency ;)
Those pipeline registers as in R600 through Cayman (really just a form of result forwarding without carrying out the writes to the register file; edit: the writes can still be carried out, that is optional and encoded in the instruction, in which case the forwarding is only a means to reduce latency) are only possible with compiler support, as the compiler has to determine that the results don't need to be written to the register file (the paper also proposes a combined software/hardware solution for this). But yes, in that case you can get some of the benefits (reduced register file accesses) for basically zero cost.
 
However the savings are also much smaller; I'm not sure where you're getting the idea you can get 50% of the reuse with it as Figure 6 clearly says otherwise.
You are right, I only skimmed through the paper and mixed up a few numbers in the process. And the 50% reuse is what nvidia gets from their modelling; I was only saying one could get half of it almost for free.

But as admitted, that was a bit too high. The correct number derived from figure 2b would be a reduction of register reads (and writes too, as those results don't need to be written to the register file) of between 5% and 21% (results with a lifetime of 1, i.e. immediate reuse, depending on the code). So that is on average maybe 1/4 of what an RFC achieves (50%-59% reduced reads), not half.
Actually it is even a bit higher than that, as this only accounts for results read once. But as no lifetime information is provided for results with several accesses, we don't know how much can be saved on top of that. There is clearly the possibility that a result accessed twice is first needed in the directly successive instruction, which means an additional saved read (but no saved write in that case).
I would even claim that for AMD's VLIW design the amount of saved register reads is higher still, as the VLIW instructions bundle as many independent operations as possible (they are rarely completely filled). That raises the probability of dependent operations in the next instruction of that wavefront significantly, thus increasing the effectiveness. Maybe I should count the abundance of PV and PS as operands in some kernels. ;)
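Something like the little counter below is what I have in mind: it just walks an instruction stream and counts how many source operands are results of the directly preceding instruction (i.e. could come out of PV/PS instead of the register file). The tuple format for the instruction stream is made up; for real numbers you'd feed it the ISA disassembly.

Code:
# Count how many source operands could be served by PV/PS-style forwarding,
# i.e. are results of the immediately preceding instruction (or VLIW bundle).
# Each instruction is a made-up (destinations, sources) tuple.

def pv_ps_servable_reads(instructions):
    served, total = 0, 0
    prev_results = set()
    for dests, srcs in instructions:
        total += len(srcs)
        served += sum(1 for s in srcs if s in prev_results)
        prev_results = set(dests)   # only the previous instruction's results count
    return served, total

# Tiny example stream: r2 = r0*r1; r3 = r2+r0; r4 = r3*r3
stream = [(("r2",), ("r0", "r1")),
          (("r3",), ("r2", "r0")),
          (("r4",), ("r3", "r3"))]
served, total = pv_ps_servable_reads(stream)
print(f"{served}/{total} operand reads could come from PV/PS")   # -> 3/6 here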

edit:
As the inner scheduler has a greedy scheduling policy and the RFC stores the last 6 results, it is in effect quite similar to the PV/PS registers, which store up to the last 5 results. The difference is that the RFC can hold results longer (it uses a FIFO replacement scheme, so a result is gone after 6 further instructions) in case of more distant dependencies between instructions, which makes it usable for results with a longer lifetime. The PV/PS registers are limited to a lifetime of 1, which would actually make them less effective in GCN than in the VLIW architectures, because only a single result would be stored.

edit2:
I'm not at home, but I've found the ATI ISA code for an old version of the BOINC project Collatz Conjecture. I just counted the register reads in the innermost loop of the main kernel. Believe it or not, 49% of all register reads refer to the PS/PV registers. That example is probably above average, but it shows that you can get quite close to the numbers nvidia got with their RFC.
Also this does not limit NVIDIA to an artificially low number of ALU pipeline stages. The inner scheduler is said to be capable of hiding shared memory latency on its own, which is 20 cycles in the paper. Remember that the inner scheduler is greedy and not round-robin.
But higher latency means a tendency towards more active warps being necessary, which in turn would require a higher-capacity inner scheduler and also a larger RFC. So a shorter latency would increase the possible benefits.
 
I haven't read the paper yet, just the abstract and a few posts here, but doesn't this evolution strike anyone as absurd?

I mean, the registers were supposed to hide latency, and now they themselves are burning so much power that an L1 cache for the RF is being proposed. What's next? Three levels of cache for the RF?

LRB1 might have failed, but it had the good idea of allocating registers out of the general-purpose cache, with instructions to modify the cache replacement policy. Also, the combination of hw multithreading and sw multithreading had some good ideas behind it, just like a 2-level hw scheduler.

Now I don't know if caches will burn more power or not, but I do think this evolutionary step, if taken, is absurd. Though it may well be inevitable.
 