NVIDIA Kepler speculation thread

That's an important distinction. My intuition is that it takes a lot of energy not so much because the power requirements are high, but because it takes time to access the registers. This would explain how having a closer register cache might help.
The key word is "wire". When trying to drive signals on a wire, its resistance and capacitance determine how quickly it can be done and how much energy is needed to meet the timing target.
Wires at this level of geometry have had very poor scaling for several nodes now.

A wire's delay, or the effort needed to make it reach a given delay, goes up quadratically with its length. That's why long wire runs are broken up with repeaters. These small transistor blocks break a wire of length N into two sections of length N/2. Each repeater adds a little delay and power consumption of its own, but the wire's delay and energy cost are far more dominant until you get down to very small dimensions.
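A back-of-the-envelope version of that, using the usual distributed-RC approximation (the prefactor and the repeater delay $t_{rep}$ are generic textbook values, not numbers from the paper): an unrepeated wire of length $L$ with resistance $r$ and capacitance $c$ per unit length has

$t_{wire} \approx 0.4\, r c L^2$,

while splitting it into $k$ repeated segments gives

$t_{repeated} \approx k \left[ 0.4\, r c (L/k)^2 + t_{rep} \right] = \frac{0.4\, r c L^2}{k} + k\, t_{rep}$,

i.e. the quadratic term is cut by a factor of $k$ at the price of $k$ repeater delays and their switching energy.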

The closer RFC cuts wire length down, and its smaller size may also mean shorter word lines and perhaps a different design for the sense amps on the SRAM banks.
The telling difference is that a cache can be heavily multiported and still save power, so long as it is small and close by.

The FMA operation itself is energetically cheaper, despite the larger number of transistors and wires, because the distances involved are so tiny.

I haven't read the paper yet, just the abstract and a few posts here, but doesn't this evolution strike anyone as absurd?

I mean, the registers were supposed to hide latency, and now they themselves are burning so much power that an L1 cache for the RF is being proposed. What's next? Three levels of cache for the RF?
The register files of GPUs are sized like caches, so this idea is coming up sooner for them.
The need to save power is what is driving this, because power scaling is lagging so much.

LRB1 might have failed, but allocating registers out of general-purpose cache and providing instructions to modify the cache replacement policy were good ideas.
To the former, the L1 caches of active CPUs glow hot on thermal images. It's active SRAM, and the wire distances in question are as long as or longer than the millimeter discussed in the paper. This is a general problem for all designs.
To the latter, I have not seen confirmation on the ability to change the replacement policy, just some ambiguous wording.

Also, the combination of hw multithreading and sw multithreading had some good ideas behind it, just like a 2-level hw scheduler.
The Larrabee software rasterizer does create a second level of scheduling, essentially a software version of the batches and warps used on GPUs when faced with texture latency. The distances involved are not necessarily any shorter, though, depending on how far the data migrates.
As the paper noted, multilevel scheduling has been considered for CPU designs.
 
But higher latency tends to require more active warps, which in turn would call for a higher-capacity inner scheduler and a larger RFC. So a shorter latency would increase the possible benefits.
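Roughly, the number of warps the inner scheduler has to keep around scales as $W \approx \lceil \mathrm{latency} / \mathrm{issue\ interval} \rceil$; with one instruction issued per warp every 4 cycles, a 20-cycle ALU latency wants about 5 active warps while an 8-cycle latency gets by with 2 (made-up numbers, just to illustrate), and the RFC capacity has to track that warp count.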

Yup. They modeled a much shorter pipeline than the one in Fermi. Maybe a hint of things to come.

I haven't read the paper yet, just the abstract and a few posts here, but doesn't this evolution strike anyone as absurd?

I don't think any comparison to LRB1 is meaningful since that thing never materialized. It was a paper tiger after all. :)

Register file caching doesn't seem to be a novel concept. It's been investigated before on CPUs, so these guys aren't completely crazy. Though there are a few other, more straightforward things they could do to reduce register file complexity:

- Dedicated MRF per SIMD instead of one big shared MRF.
- Narrower SIMDs but more of them per scheduler.
 
Do you reckon that would still be possible in Southern Islands?

If the pipeline depth is only 4 cycles, maybe; otherwise it's doubtful. It was easy to do when they had statically compiled clauses of consecutive instructions issued to the pipeline.
 
The operand sourcing provided by the PV and PS would be covered by a bypass network, or, to view it another way, a bypass network would hide what the PV and PS functionality exposed.
If the functionality as used by Evergreen were to remain in SI, the ISA would need to mention it. Poring through the general instruction outlines, I didn't see any hint of it.

It may be harder to manage either in SI or in future designs, as there are mentions of preemption or other kinds of interruption that could break the cycle restrictions for using PV and PS, something not possible in the clause-based scheme.
 
If the pipeline depth is only 4 cycles, maybe; otherwise it's doubtful. It was easy to do when they had statically compiled clauses of consecutive instructions issued to the pipeline.
And they would need some kind of greedy scheduling policy, preferably issuing instructions from one wavefront until it stalls. But as said above, it would only be as effective as the VLIW architecture if they added a FIFO so you could access the results of the last few (let's say 4) instructions. That would basically be the same RFC the paper proposes.

The low-hanging fruit is probably just result forwarding, avoiding the fetch from the register file when a dependent instruction follows immediately (which already catches about 25% of the cases the RFC could handle). And as GCN's register files are more local to the ALUs anyway (compared to VLIW and Fermi), it will be lower power from the start.
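To make that concrete, here's a toy way of estimating how many operand reads plain forwarding (or a small result FIFO) could satisfy instead of the MRF. The trace format and the example chain are made up for illustration, not taken from the paper or any real trace:

def forwarding_hit_rate(trace, depth=1):
    # trace: list of (dest_reg, [src_regs]) per instruction of one wavefront
    recent = []            # destinations of the last `depth` instructions
    reads = hits = 0
    for dest, srcs in trace:
        for s in srcs:
            reads += 1
            if s in recent:    # operand still live in the forwarding path / FIFO
                hits += 1
        recent = (recent + [dest])[-depth:]
    return hits / reads if reads else 0.0

# hypothetical dependent chain: r2 = r0*r1; r3 = r2+r0; r4 = r3+r2; r5 = r4+r1
trace = [(2, [0, 1]), (3, [2, 0]), (4, [3, 2]), (5, [4, 1])]
print(forwarding_hit_rate(trace, depth=1))   # plain forwarding only
print(forwarding_hit_rate(trace, depth=4))   # the 4-entry FIFO / RFC-like case above

Real numbers obviously depend on how the compiler schedules dependent instructions, which is where that 25% figure comes in.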
 
The operand sourcing provided by the PV and PS would be covered by a bypass network, or, to view it another way, a bypass network would hide what the PV and PS functionality exposed.
If the functionality as used by Evergreen were to remain in SI, the ISA would need to mention it. Poring through the general instruction outlines, I didn't see any hint of it.
I didn't see a hint of it either. So I would think it will feature just a normal result forwarding/bypass network.
 
But higher latency tends to require more active warps, which in turn would call for a higher-capacity inner scheduler and a larger RFC. So a shorter latency would increase the possible benefits.
Sorry, I edited between my post and yours; I think my latest version already kinda answers that: you wouldn't really need to increase the number of active warps. It simply results in similar ILP requirements to a 2-way VLIW machine, which isn't the end of the world. I suspect the results are worse than they should be because they're using a commercial NVIDIA compiler that likely isn't optimised for ILP. And now that I think about it, it's even better than 2-way VLIW, because all that matters is the *average* amount of ILP across the set of active warps in the inner scheduler, not the minimum amount like with VLIW. So as with everything else, it's a trade-off.

And as GCN's register files are more local to the ALUs anyway (compared to VLIW and Fermi), it will be lower power from the start.
I haven't had the time to read and think about GCN as much as I would like, but I honestly don't see how that could even be the case. Is there some trick I'm missing here? As far as I can tell, each lane already has its own dedicated MRF on all DX10 architectures. Thanks! :)
 
I haven't had the time to read and think about GCN as much as I would like, but I honestly don't see how that could even be the case. Is there some trick I'm missing here? As far as I can tell, each lane already has its own dedicated MRF on all DX10 architectures. Thanks! :)

I'm going to hazard a guess at this.
In the previous architecture we had 256 KiB of register file split into 16 clusters of 16 KiB each.
Within a cluster, the register network allowed access to all registers in the 4 banks of 3-read-port memory.

With the latest design, it's 4 private register files of 64 KiB.
If the SIMD lanes are truly separate, that will cut things down to each lane having 4 KiB of register file, with no connection to the rest of the SRAM.

Potentially, this means the register file can be quite close to the execution units, assuming we don't need to worry about lane crossing.
The proximity benefit is potentially reduced at least in part by the higher port count compared to the main register file listed in the paper, which is heavily banked but only dual-ported.
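Spelling out the arithmetic of that guess (the 256 KiB, the 16 clusters/lanes and the 4 SIMDs are the figures from above; the strict per-lane privacy is my assumption):

old_pool = 256 // 16              # KiB visible to one cluster's operand network
new_pool_per_simd = 256 // 4      # KiB private to one SIMD
new_pool_per_lane = new_pool_per_simd // 16   # KiB per lane if lanes never cross
print(old_pool, new_pool_per_simd, new_pool_per_lane)   # 16 64 4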
 
I'm going to hazard a guess at this.
In the previous architecture we had 256 KiB of register file split into 16 clusters of 16 KiB each.
Within a cluster, the register network allowed access to all registers in the 4 banks of 3-read-port memory.
I once asked someone from AMD, and he told me explicitly that R600's RF had only one read and one write port. He could have been mistaken, but as far as I can tell the tricks used in G80/Fermi are just as applicable to all AMD architectures. And in the pre-unification days, I'd expect the VS had multiple read ports but the PS was already only multibanked.

If the SIMD lanes are truly separate, that will cut things down to each lane having 4 KiB of register file, with no connection to the rest of the SRAM.
But how is that any different from today's architectures? Each lane (i.e. single scalar ALU) has its own dedicated register banks already. It's not like shared memory where you need a truly general purpose collection mechanism.

Potentially, this means the register file can be quite close to the execution units, assuming we don't need to worry about lane crossing.
Hmmm, you know, I'm pretty sure you're right. G80 could read 3 registers (->24 warps/768 threads), GT200 could read 4 registers (->32 warps/1024 threads), and Fermi can read 6 registers (->48 warps/1536 threads). And yet G80 had a 32KB RF, GT200 had a 64KB RF, and Fermi has a 128KB register file. Logically each RF 'effective read port' requires one bank, but the register file on G80 and Fermi isn't a multiple of the number of 'effective read ports', so there isn't necessarily a 1:1 relationship between them and the banks themselves. This is also implied by the existence of very rare bank conflicts in the register file that couldn't possibly exist otherwise.
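Just to make the divisibility point explicit (capacities and effective read port counts as listed above; nothing else assumed):

# per-SM RF capacity in KB and 'effective read ports' for each generation
configs = {"G80": (32, 3), "GT200": (64, 4), "Fermi": (128, 6)}
for name, (rf_kb, ports) in configs.items():
    print(name, rf_kb % ports == 0)   # G80 False, GT200 True, Fermi False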

I can imagine a few reasons for that, some of which would apply to GCN, some of which wouldn't. Hmm, need to think about it more when I have the time... It's amazing how damn hard it is to get anyone to talk about their register files :) For example, even though I haven't tested it, I'm willing to bet that unlike on GF1x0, it's not possible to achieve maximum FMA throughput on GF1x4 without either: 1) sharing register operands, 2) using shared memory, or 3) using the interpolator. Of course, there is plenty of other fun stuff nobody talks about (e.g. how the dual dispatcher mechanism on Fermi really works).
 
I once asked someone from AMD, and he told me explicitly that R600's RF had only one read and one write port. He could have been mistaken, but as far as I can tell the tricks used in G80/Fermi are just as applicable to all AMD architectures. And in the pre-unification days, I'd expect the VS had multiple read ports but the PS was already only multibanked.
That may be right and I may have misremembered. I think it's 4-banked with single-ported reads over 3 cycles. Effectively, the VLIW exposes what Nvidia does via hardware.

But how is that any different from today's architectures? Each lane (i.e. single scalar ALU) has its own dedicated register banks already. It's not like shared memory where you need a truly general purpose collection mechanism.
The subdivision is potentially different. The old way is the 256 KiB divided into 16 pools of registers, with the wires to read from any register in the pool. The new way as described would divide it into 4x16 pools, although I'm not sure how register access is handled to accomplish this with no port conflicts (aside from adding more ports).
 
I think it's 4-banked with single-ported reads over 3 cycles.
Yes, that's how AMD describes it in the architecture manual: 4 banks, each with one read and one write port, and the operands are read over 3 cycles.
The subdivision is potentially different. The old way is the 256 KiB divided into 16 pools of registers, with the wires to read from any register in the pool. The new way as described would divide it into 4x16 pools, although I'm not sure how register access is handled to accomplish this with no port conflicts (aside from adding more ports).
In principle, you just need a single bank with a read and a write port of 128-bit width. One SIMD lane will execute the same instruction for 4 cycles, just for different data elements of the wavefront (but for the same nominal register!). If you allocate the registers for those 4 data elements in one 128-bit word, you can fetch the 3 operands over the usual 3 cycles, with the 4th cycle still available to copy something to the LDS or export to memory or whatever. If you really want to save something on the register file ports, you put in just a single shared read/write port to the register file (edit: actually that's how I think the current Radeons work, each bank has just a shared read/write port). That will create bubbles in the pipeline for LDS or memory instructions if you have a lot of 3-operand instructions, but those could be squeezed in when a two-operand instruction is somewhere in the instruction stream. Doesn't sound too bad.
But to sum it up, the register files get smaller, which enables placing them very close to the consumers, and the network for crossing data between banks becomes unnecessary with the instructions fixed to a certain SIMD.

Btw, the operand collection can of course overlap with the preceding instruction; it doesn't need to contribute to the back-to-back instruction latency (it doesn't on current Radeons).
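A little sketch of the port budget being described, assuming one 128-bit bank per lane and the 4-cycle cadence (the slot accounting is my reading of the scheme above, not anything from an AMD document):

def port_accesses(num_src_operands, writeback=True):
    # 128-bit accesses one instruction needs from its lane's bank per 4-cycle window
    reads = num_src_operands           # one read per source register
    writes = 1 if writeback else 0     # one write for the result
    return reads, writes

for srcs in (2, 3):
    r, w = port_accesses(srcs)
    shared_free = 4 - (r + w)   # single shared read/write port: all accesses compete
    split_free = 4 - r          # separate 1R + 1W ports: writes don't steal read slots
    print(srcs, "operands:", shared_free, "free shared slot(s),", split_free, "free read slot(s)")

So a stream of back-to-back FMAs with a shared port leaves nothing for LDS/export accesses, which is exactly the bubble mentioned above, while a two-operand instruction in the stream frees a slot up.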
 
In principle, you just need a single bank with a read and a write port of 128-bit width. One SIMD lane will execute the same instruction for 4 cycles, just for different data elements of the wavefront (but for the same nominal register!).
And what makes you think that's so different from what G80/Fermi already does? From the paper: "Figure 1(c) provides a more detailed microarchitectural illustration of a cluster of 4 SIMT lanes. A cluster is composed of 4 ALUs, 4 register banks, a special function unit (SFU), a memory unit (MEM), and a texture unit (TEX) shared between two clusters. Eight clusters form a complete 32-wide SM."
If you really want to save something on the register file ports, you put in just a single shared read/write port to the register file (edit: actually that's how I think the current Radeons work, each bank has just a shared read/write port).
As I said, I know for a fact that R600 had a 1R+1W RF (unless that AMD engineer was somehow mistaken). IIRC, the cost increase from a 1R+1W RF versus a 1R/1W RF is quite a bit smaller than the increase from 1R+1W to 2R+1W. So the greater simplicity on the logic side for 1R+1W is probably well worth the cost.
 
You may excuse the OT, but the chances are extremely slim that an engineer is mistaken about hw he probably co-developed. The only other possibility would be that he isn't a real engineer.
 
You may excuse the OT, but the chances are extremely slim that an engineer is mistaken about hw he probably co-developed. The only other possibility would be that he isn't a real engineer.
Actually, that doesn't apply because he didn't work on the R600 shader core :) He did work on the shader core of other AMD GPUs before that though, and was still working at AMD when I asked.

Anyway, this is all very boring compared to the fact that Fermi is really VLIW. Jen-Hsun didn't have a brainfart when he said it had 256 SPs. It's hilarious how much NVIDIA really doesn't care about giving us a correct view of the actual hardware and I don't expect that to be any different for Kepler. Oh well, not really the right thread for this.
 
And what makes you think that's so different from what G80/Fermi already does? From the paper: "Figure 1(c) provides a more detailed microarchitectural illustration of a cluster of 4 SIMT lanes. A cluster is composed of 4 ALUs, 4 register banks, a special function unit (SFU), a memory unit (MEM), and a texture unit (TEX) shared between two clusters. Eight clusters form a complete 32-wide SM."
Do you see the box labeled "operand buffering"? It can be simpler if there is just a single bank and a single ALU (and no SFU) attached. And the necessary effort for it scales faster than linearly with the number of ports on that thing. As I said, with the old VLIW architecture and also Fermi, each ALU needs access to several register banks; in GCN a single bank is completely enough. The operand collectors get to be simpler (but you need more of them).
 
Anyway, this is all very boring compared to the fact that Fermi is really VLIW. Jen-Hsun didn't have a brainfart when he said it had 256 SPs. It's hilarious how much NVIDIA really doesn't care about giving us a correct view of the actual hardware and I don't expect that to be any different for Kepler. Oh well, not really the right thread for this.
Can you explain this?
 
Can you explain this?
Oh, it does have 512 ALUs. But it very likely doesn't have 512 SPs nor two dispatchers per SM. I'd reveal exactly what that means, but meh, I might actually want to get an article out of this within the next 10 years, so I'll shut up for now - sorry :)
 
If the Fermi ISA isn't VLIW, then the architecture doesn't meet the definition.
There could be funky situations with multiple stages of decode where it becomes a debatable point.

There was a presentation on K8 where the AMD architect kind of argued that AMD's chip was internally VLIW, though he admitted that would be readily disputed.
 
EDIT: Actually, to be even clearer: what I'm saying is that Fermi can apparently only switch warps every 2 instructions. One explanation for this is 2-way VLIW being executed sequentially, but I realise it doesn't have to be. I had one test that implied it was static bundling, but now that I've changed some things, it looks more like superscalar. Maybe GF110 is simply already dual-issue rather than dual-dispatcher (but executed sequentially rather than in parallel). No idea what GF104 would look like then. As I said, this needs more testing; I shouldn't even have posted about it in the first place just yet...
 