NVIDIA Kepler speculation thread

I guess Kepler is a family of 22nm chips, then. That would explain the delays.
A slightly earlier report by 3DCenter puts the GK100 as (translated) "Kepler architecture" and the GK104 as (translated) "Architecture known if Kepler or Kepler Fermi / Fermi mix" (should that be "unknown"?). So if I'm reading this report right, they say GK104 might be some form of Fermi/Kepler hybrid.

So going from that report and others, NVIDIA could have a series of transition chips, starting from the hybrid 28 nm test vehicle GF117/GK117, to the GK107 and GK104, and culminating in the "true" Kepler GK100/GK112 about a year from now.

No but seriously, that's impossible unless NVIDIA made really drastic changes, like removing TMUs and ROPs and doing it all with the shaders, or maybe doing away with the hotclock domain. Neither option sounds very likely.
Well, the 1000/1500 CC stuff is in the same report as the 3DCenter one about removing the hot clock.
 
No, I am just a little moderator over there. :LOL:
@reliability: These possible specs are rumors compiled in the 3DC forum. And I would say the chances are a good margin over 50% that the dropped hot clock, >1 GHz GPU clocks, and >>2 TFLOPs for GK104 are true.
 
All GPUs do it too. No difference there, only that Larrabee is limited to 4 hardware threads (matching the latency) while GPUs can do more.

Ok, then I'm not sure where you were leading me with that question. I haven't seen any evidence that Fermi's pipeline latency includes the scheduler stages as well.

As I said, complexity (or dumbness) of the scheduling is a major factor. With Larrabee and Radeons, reading and writing registers is outside of the critical loop that defines the latency.

Ok, but how do you know nVidia's scheduling is on the critical path and not running in an independent pipeline from the ALUs? Everything I've read points to the latter.

Or a subset, or a combination with some other clever tricks. Who knows outside of nVidia? All nv has said so far is that programming gets easier, the GPUs get significantly broader (which they probably wouldn't mention if it were a factor of <=2; it makes a hotclock less likely), matrix multiplication efficiency will rise, and performance/W will be significantly improved (>=2.5) not only by the process change but also by reducing the distance data has to be driven across the die.
Make your own guesses!

So GCN clone isn't the only way forward then? :) I honestly don't know what they would do but as I said initially I'll be surprised if they dumb things down.

So that means each SM has (at least) an 8192-bit bus running from the register file across the 2 to 3 Vec16 ALUs, the L/S units and the SFUs to supply them with data (and take the results back). Sounds quite expensive and energy-consuming to me (and adds to the latency).
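Back-of-envelope for where a figure like that could come from (purely my own assumption: two FMA warp-instructions' worth of operands and results moving per base clock):
2 instructions × 32 lanes × (3 sources + 1 result) × 32 bit = 8192 bit per base clock
and that is before counting separate ports for the L/S units and SFUs.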

Yeah that's exactly what it looks like. Doesn't say much about what Kepler is doing though.
 
So we're not good enough for Ailuros any more? :D

Hey I'm wasting my time here mostly in the handheld forum. Those who know me know that I could spend days speculating. Somebody whispered months ago in my ear that I should forget any nonsense I've read or heard about Kepler so far.

So I started putting two and two together. Intel's Knights Corner will most likely reach 1.2 TFLOPs DP (or more) in its final version, and while it'll still have its very own advantages, DP FLOPs are a very powerful marketing tool for HPC.

How do you get a significantly high DP rate without making any nonsensical design decisions? Because the 1:1 SP:DP scenario I had read about some time ago definitely falls into that ballpark; it's a huge waste of transistors and would probably have led, by today's standards, to a quite underwhelming desktop GPU. Go a similar route (not necessarily exactly the same as AMD) and you'll end up with your desired N TFLOPs DP, and you'll probably also have an insane amount of single-precision FLOPs for desktop/3D, but with quite a bit lower efficiency per FLOP than up to Fermi.

Don't hurt me if it turns out to be wrong or right in the end. I just haven't read a single scenario anywhere on the net that makes sense for over 2.5x sustainable DP FLOPs per W for Kepler Teslas vs. Fermi Teslas.
 
Ok, then I'm not sure where you were leading me with that question. I haven't seen any evidence that Fermi's pipeline latency includes the scheduler stages as well.
Where do you think the 18-cycle latency of Fermi comes from (compared to the 4-cycle latency of GCN and Larrabee)? Is it just caused by the hotclock? Or why does the latency rise by two clock cycles for each register conflict? How do you explain that if the time needed for reading and writing the registers were not included? In fact, those are exactly the drawbacks of bog-standard scoreboarding, i.e. it fits together perfectly. For me that is evidence.
Ok, but how do you know nVidia's scheduling is on the critical path and not running in an independent pipeline from the ALUs? Everything I've read points to the latter.
See above. There is no result forwarding in Fermi.
So GCN clone isn't the only way forward then? :)
There was a runaway smiley behind that statement. Of course it is only one possibility. But not one without some merits. ;)
Yeah that's exactly what it looks like. Doesn't say much about what Kepler is doing though.
So you agree that it doesn't look like the most power-efficient solution, and that GCN (or even the old VLIW architectures) probably does quite a bit better in this respect? :smile:
 
Where do you think the 18-cycle latency of Fermi comes from (compared to the 4-cycle latency of GCN and Larrabee)? Is it just caused by the hotclock?

That's my understanding yes.

Or why does the latency rise by two clock cycles for each register conflict? How do you explain that if the time needed for reading and writing the registers were not included? In fact, those are exactly the drawbacks of bog-standard scoreboarding, i.e. it fits together perfectly. For me that is evidence.

Do you have a link demonstrating that behavior? Would be interesting to see but for now I'll take your word for it.

There was a runaway smiley behind that statement. Of course it is only one possibility. But not one without some merits. ;)

I'm buying the no-hot-clock bit but that's it. Still not expecting a GCN clone :)

So you agree that it doesn't look like the most power-efficient solution, and that GCN (or even the old VLIW architectures) probably does quite a bit better in this respect? :smile:

Yes I'll concede that GCN is more power efficient than G80 :D
 
That's my understanding yes.
18 cycles latency at a clock below 2 GHz for a single-precision FMA? Surely not!
Do you have a link demonstrating that behavior? Would be interesting to see but for now I'll take your word for it.
NVidia itself just says 18-22 cycles arithmetic latency, but you can test it by measuring instruction latency with the same register used as source operands multiple times (Fermi reads it several times). So the behaviour basically looks like this:
fadd dest_reg, reg1, reg2        // 18 cycles
fadd dest_reg, reg1, reg1        // 20 cycles
ffma dest_reg, reg1, reg2, reg3  // 18 cycles
ffma dest_reg, reg1, reg1, reg2  // 20 cycles
ffma dest_reg, reg1, reg1, reg1  // 22 cycles

I lost the link detailing that explicitly, but it can be seen for all instructions - here, for instance, in some test measuring the latency of integer instructions.

The latency is normally measured as read-after-write latency: each instruction in the chain is dependent on the output of the preceding one. If you look up how standard scoreboarding scheduling works (nvidia claims to use a scoreboarding scheme), you will see that it resolves RAW hazards simply by delaying the reading of the source operands until the write to the register by the preceding instruction has completed (a well-known weakness); there is no result forwarding in the "standard" scoreboarding implementation. So reading and writing the registers are necessarily included in the RAW latency by the scheduling scheme.

By the way, one also gets some information about the scheduling behaviour of Fermi from such microbenchmarking (one can deduce some kind of zigzag pattern for the chosen instructions). The minimum delay between the issue of independent instructions from the same warp is 6 (hotclock) cycles (on the asfermi page they count in scheduler cycles). This paper refers to that as "independent instruction reissue latency" or "issue-to-issue latency". I hope this shows that the link to the asfermi project holds some water (I think I've read about it in some detail somewhere in nvidia's developer forum, but can't find it right now).
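If somebody wants to reproduce the dependent-instruction numbers on their own card, here is a minimal clock64()-based CUDA sketch of the kind of test I mean (my own code, not from the papers linked above; the loop count and constants are arbitrary, and with only a single resident warp it measures the whole issue/read/execute/writeback path, not just the ALU):

#include <cstdio>

__global__ void raw_chain(float *out, long long *cycles, float a, float b)
{
    float x = a;
    long long start = clock64();
    #pragma unroll
    for (int i = 0; i < 256; ++i)
        x = fmaf(x, b, a);      // each FMA depends on the previous result (RAW chain)
    long long stop = clock64();
    out[threadIdx.x] = x;       // keep the result live so the chain isn't optimized away
    if (threadIdx.x == 0)
        *cycles = stop - start;
}

int main()
{
    float *d_out; long long *d_cycles; long long c = 0;
    cudaMalloc(&d_out, 32 * sizeof(float));
    cudaMalloc(&d_cycles, sizeof(long long));
    raw_chain<<<1, 32>>>(d_out, d_cycles, 1.0f, 0.999f);   // one warp only, so nothing hides the latency
    cudaMemcpy(&c, d_cycles, sizeof(c), cudaMemcpyDeviceToHost);
    printf("~%.1f SM clocks per dependent FMA\n", (double)c / 256.0);
    cudaFree(d_out); cudaFree(d_cycles);
    return 0;
}

The loop counter and branch add a little independent work per iteration, so treat the per-FMA figure as approximate rather than exact.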
 
And if nVidia really makes their GPUs quite a bit broader (which will happen without a hotclock), they almost need simpler scheduling, as they will have quite a few more schedulers.
There's no reason why the lack of a hotclock would result in more schedulers. You just need to double the ALU width from Vec8 to Vec16 - in fact, that's already what it looks like to the scheduler.

18 cycles latency at a clock below 2 GHz for a single-precision FMA? Surely not!
Agreed. It has to include the RF (but probably not all of the scheduler). It also makes sense that each bank conflict would increase the cycle count by 2 clocks rather than 1, because that runs at the scheduler clock rate and you can't really make it less than 1 scheduler cycle (which becomes 2 ALU cycles).
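That reading also fits the measured numbers above, if one assumes (my guess) that every duplicated source register costs one extra scheduler cycle:
latency ≈ 18 + 2·n hotclock cycles = 9 + n scheduler cycles
with n = 0, 1, 2 duplicated registers giving exactly the observed 18, 20, 22.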

The latency is normally measured as read-after-write latency: each instruction in the chain is dependent on the output of the preceding one. If you look up how standard scoreboarding scheduling works (nvidia claims to use a scoreboarding scheme), you will see that it resolves RAW hazards simply by delaying the reading of the source operands until the write to the register by the preceding instruction has completed (a well-known weakness); there is no result forwarding in the "standard" scoreboarding implementation. So reading and writing the registers are necessarily included in the RAW latency by the scheduling scheme.
So would you agree that NVIDIA's proposed register file cache would solve most of these issues for real-world workloads? While in a way it's obviously orthogonal to what we're discussing and there might (or might not) still be real advantages to having a more AMD-like scheduler/RF, this would make it significantly less important for several reasons.

If you want a more GCN-like scheduler the really big question to ask is this: how do you issue multiple loads in parallel as early as possible while needing their results as late as possible? A register scoreboard provides an elegant 'perfect' solution to the problem (obviously limited in practice by how many instructions you can keep waiting). AMD's VLIW clauses are less effective and less elegant.

GCN uses a very smart trick you very nicely described in one of your posts back in July. I think it's a very good solution - not theoretically as 'perfect' as a register scoreboard but pretty close and clearly less expensive in hardware. NVIDIA could theoretically do something like this as well but I honestly don't see it happening, especially not for Kepler.
 
AMD's VLIW clauses are less effective and less elegant.
R600 architecture scoreboards at the clause level, not the instruction level, and does this for ALU and TEX clauses.

Intra-clause stalls happen with incoherent gather of constants (from constant buffers) and extremely obscure register access techniques.
 
There's no reason why the lack of a hotclock would result in more schedulers. You just need to double the ALU width from Vec8 to Vec16 - in fact, that's already what it looks like to the scheduler.
It is already at Vec16 in Fermi. NVidia would need to go to physical vec32 ALUs. ;)
Logically, the vector ALUs are basically vec32 anyway, as this is the warp size and one instruction is always executed for 32 elements (the issue just takes two cycles).

But yes, this allows them to double the ALU count (so the normal increase just comes on top of it) without a disproportionate increase in schedulers (I've written about that possibility in the 3DC forum). In fact, if they do that (going to vec32 ALUs), it is certain they will lose the hotclock (or the schedulers will have to run at that clock too).
So would you agree that NVIDIA's proposed register file cache would solve most of these issues for real-world workloads? While in a way it's obviously orthogonal to what we're discussing and there might (or might not) still be real advantages to having a more AMD-like scheduler/RF, this would make it significantly less important for several reasons.
If you think about it, a register file cache may make less sense with a single register file serving several vecALUs: either it would need to be bigger, or the scheduler would need to apply some bias so that instructions from the same warp are preferentially issued to a certain vecALU. The reasoning is simple: if an instruction from a warp gets issued to one of the vecALUs and the next instruction to another one, the regfile cache will be basically useless. That means instructions from one warp have to be issued to the same vector ALU anyway. But if you do that, you can just bind the vecALU to one scheduler and embed the regfile into it (embed a small regfile into each vecALU lane), i.e. replace the regfile cache in each ALU lane with the regfile itself.
Somehow this regfile cache is neither fish nor fowl in my opinion. It may bring some improvements, but to me it looks a bit like stopping halfway.
If you want a more GCN-like scheduler the really big question to ask is this: how do you issue multiple loads in parallel as early as possible while needing their results as late as possible? A register scoreboard provides an elegant 'perfect' solution to the problem (obviously limited in practice by how many instructions you can keep waiting). AMD's VLIW clauses are less effective and less elegant.
I think with the typical code run on GPUs it is currently often easier to run a few more threads to cover long latencies. Fermi's scoreboarding isn't buying them much in this respect (especially as data dependencies are basically known at compile time). The main advantage in my opinion is that they have a single entity managing all dependencies, while in the case of GCN there are several spots, each handling only one particular area. That makes the scoreboarding conceptually simpler, but also more expensive to implement (even with the small window of only 4 instructions in flight per warp).
GCN uses a very smart trick you very nicely described in one of your posts back in July. I think it's a very good solution - not theoretically as 'perfect' as a register scoreboard but pretty close and clearly less expensive in hardware. NVIDIA could theoretically do something like this as well but I honestly don't see it happening, especially not for Kepler.
I don't see it either. In the 3DC forum post linked above I already clarified that a Kepler SM may show an apparent similarity to GCN in some aspects, but that it will be merely superficial. It will probably work entirely differently in the main aspects. After all, my statement about the GCN clone (I posted "I'm almost inclined to think that Kepler will look somewhat similar to GCN!" and a runaway smiley [emphasis added]) was just some kind of bait. ;)
 
NVidia itself just says 18-22 cycles arithmetic latency, but you can test it by measuring instruction latency with the same register used as source operands multiple times (Fermi reads it several times).

Oh I'm not arguing that point. The RAW latency for dependent instructions will definitely impact observed latency. I'm asking if the scheduler itself and the operand buffering process are on the critical path for the ALU pipeline. Is RAW latency a significant problem for GPUs where you typically have multiple threads and/or independent instructions in flight?

If you think about it, a register file cache may make less sense with a single register file serving several vecALUs: either it would need to be bigger, or the scheduler would need to apply some bias so that instructions from the same warp are preferentially issued to a certain vecALU. The reasoning is simple: if an instruction from a warp gets issued to one of the vecALUs and the next instruction to another one, the regfile cache will be basically useless. That means instructions from one warp have to be issued to the same vector ALU anyway. But if you do that, you can just bind the vecALU to one scheduler and embed the regfile into it (embed a small regfile into each vecALU lane), i.e. replace the regfile cache in each ALU lane with the regfile itself.

I had forgotten about this paper; we discussed it a few months ago. Seems to me they're focusing more on reducing the active working set than on simplifying the scheduler. In which case having multiple instructions in flight per warp will actually make the RFC more effective, as it increases the chance of reuse.

Btw, they're proposing an RFC that's private not to a vecALU but to a single ALU :D:

"Each RFC bank is private to a SIMT lane, greatly reducing distance from the RFC banks to the ALUs. The tags for the RFC are located close to the scheduler to minimize the energy spent accessing them."

http://research.nvidia.com/sites/default/files/publications/2011_06_NVIDIA_ISCA.pdf
 
If you think about it, a register file cache may make less sense with a single register file serving several vecALUs: either it would need to be bigger, or the scheduler would need to apply some bias so that instructions from the same warp are preferentially issued to a certain vecALU. The reasoning is simple: if an instruction from a warp gets issued to one of the vecALUs and the next instruction to another one, the regfile cache will be basically useless. That means instructions from one warp have to be issued to the same vector ALU anyway. But if you do that, you can just bind the vecALU to one scheduler and embed the regfile into it (embed a small regfile into each vecALU lane), i.e. replace the regfile cache in each ALU lane with the regfile itself.
Somehow this regfile cache is neither fish nor fowl in my opinion. It may bring some improvements, but to me it looks a bit like stopping halfway.
I think with the typical code run on GPUs it is currently often easier to run a few more threads to cover long latencies. Fermi's scoreboarding isn't buying them much in this respect (especially as data dependencies are basically known at compile time). The main advantage in my opinion is that they have a single entity managing all dependencies, while in the case of GCN there are several spots, each handling only one particular area. That makes the scoreboarding conceptually simpler, but also more expensive to implement (even with the small window of only 4 instructions in flight per warp).
IIRC, in Fermi the warp->ALU mapping is static, i.e. even warps issue to one ALU and odd ones to another.
 
Oh I'm not arguing that point. The RAW latency for dependent instructions will definitely impact observed latency. I'm asking if the scheduler itself and the operand buffering process are on the critical path for the ALU pipeline. Is RAW latency a significant problem for GPUs where you typically have multiple threads and/or independent instructions in flight?
If you have to stall for RAW latency too, it's not desirable, as memory latency isn't exactly decreasing. I just hope they bother to put 2x or more cache in Kepler than Fermi has.
 
Oh I'm not arguing that point. The RAW latency for dependent instructions will definitely impact observed latency. I'm asking if the scheduler itself and the operand buffering process are on the critical path for the ALU pipeline.
Operand buffering is basically reading the operands from the register file. So yes, operand buffering is included in the critical path, as you can see from the varying latency in the case of register conflicts.
Is RAW latency a significant problem for GPUs where you typically have multiple threads and/or independent instructions in flight?
You need more of them and hence a slightly larger register file to accommodate them. It is not insignificant in my opinion.
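To put a rough number on it (my own simplifying assumptions: purely dependent code, one warp-instruction issued per scheduler every 2 hot clocks):
warps needed per scheduler ≈ RAW latency / issue interval ≈ 18…22 / 2 ≈ 9-11
i.e. on the order of 18-22 resident warps per GF100-style SM just to keep the ALUs fed, each holding its registers live the whole time. With a 4-cycle latency and a 4-cycle issue cadence, a single wavefront per SIMD would already cover the ALU latency.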
Btw, they're proposing an RFC that's private not to a vecALU but to a single ALU :D:
"Each RFC bank is private to a SIMT lane, greatly reducing distance from the RFC banks to the ALUs. The tags for the RFC are located close to the scheduler to minimize the energy spent accessing them."
As the mapping of the "thread" number within a warp to the vALU lane number is fixed, "private to a vALU" and "private to a vALU lane" essentially mean the same thing. Embedding the reg file cache into each ALU lane enables the short distances (almost zero) the data has to travel to the ALUs, as nv says. AMD simply puts the entire reg file (private to each lane) in there (i.e. very close to the ALUs). That appears simpler and maybe even more effective to me.
IIRC, in Fermi the warp->ALU mapping is static, i.e. even warps issue to one ALU and odd ones to another.
The mapping to the schedulers is static (odd and even warps; just as with the VLIW architectures, there are odd and even wavefronts coming from the two sequencers). But the mapping of schedulers to vALUs is not static, and the L/S units and the SFUs are also in there. Can you explain how the GF104/114-style SMs would work with a static mapping? ;)
 
Operand buffering is basically reading the operands from the register file. So yes, operand buffering is included in the critical path, as you can see from the varying latency in the case of register conflicts.

Maybe I'm just daft, but I'm still not seeing the difference between operand fetch from the register file and other memory operations. The whole point of buffering is to decouple operand fetch from the execution pipeline.

You need more of them and hence a slightly larger register file to accommodate them. It is not insignificant in my opinion.

It seems like a small issue compared to all the other latency hiding GPUs do anyway.

As the mapping of the "thread" number within a warp to the vALU lane number is fixed, "private to a vALU" and "private to a vALU lane" essentially mean the same thing. Embedding the reg file cache into each ALU lane enables the short distances (almost zero) the data has to travel to the ALUs, as nv says. AMD simply puts the entire reg file (private to each lane) in there (i.e. very close to the ALUs). That appears simpler and maybe even more effective to me.

The proposed two-level regfile hierarchy simply builds on the co-location concept. It might be more complex to implement, but on the flip side it's probably more efficient to access the RFC than the much larger regfile. We may never know for sure, but it would be interesting to see something like it in practice. Or maybe you're right and nVidia will just chuck the full regfile in there - we'll know sooner or later :)
 