NVIDIA Kepler speculation thread

Is the GK104 the low-end piece? 2 TFLOPs is pretty weak these days, isn't it? The 7970 is rumoured to be 3.5-7 TFLOPs...
 
If the 7970 is ~1.4x a GTX 580 (~2x a GTX 560 Ti), then the 7950 (~83% of a 7970) could be ~1.7x a GTX 560 Ti.
If Kepler's game performance scales with GFLOPs the way GF104/114's does, then ~2 TFLOPs would be enough to reach the 7950's performance level.
But GK104 is said to be "clearly over 2 TFLOPs", which could compensate for some decrease in per-GFLOP performance.
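Back-of-the-envelope version of that (the GTX 560 Ti figure of 384 SPs at a 1645 MHz hotclock is my own input, the ratios are the rumours above):

[code]
# Rough check of the scaling argument (rumoured ratios, not confirmed specs).
gtx560ti_tflops = 384 * 2 * 1.645e9 / 1e12   # 384 SPs * 2 FLOPs (FMA) * 1.645 GHz hotclock
hd7950_vs_560ti = 2.0 * 0.83                 # 7970 ~2x GTX 560 Ti, 7950 ~83% of a 7970

# If Kepler's game performance scaled with GFLOPs like GF104/114:
kepler_tflops_needed = gtx560ti_tflops * hd7950_vs_560ti
print(f"GTX 560 Ti: {gtx560ti_tflops:.2f} TFLOPs")                                     # ~1.26
print(f"~{kepler_tflops_needed:.1f} TFLOPs to match a 7950 at GF104-like efficiency")  # ~2.1
[/code]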
 
Is the GK104 the low-end piece? 2 TFLOPs is pretty weak these days, isn't it? The 7970 is rumoured to be 3.5-7 TFLOPs...

GK104 is the successor to GF104/114 (GTX 460/560), so it's the performance part. If the rumors about 768 SPs are correct, that means a hotclock above 1.3 GHz (I don't buy that the whole chip will be clocked that high).

This would mean performance well above the GTX 560 Ti—raw FLOPS really don't matter much.
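Quick sanity check on the implied hotclock (assuming the usual counting of 2 FLOPs per SP per clock for an FMA):

[code]
# Hotclock implied by the "768 SPs" and "clearly over 2 TFLOPs" rumours.
sps = 768
target_tflops = 2.0
flops_per_sp_per_clock = 2   # one FMA = 2 FLOPs

required_hotclock_ghz = target_tflops * 1e12 / (sps * flops_per_sp_per_clock) / 1e9
print(f"required shader clock: {required_hotclock_ghz:.2f} GHz")   # ~1.30 GHz
[/code]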
 
I'm almost inclined to think that Kepler will look somewhat similar to GCN! :runaway:

I would be really surprised if nVidia dumbed down their scheduler. There are benefits to be had in irregular compute workloads. Getting rid of the hot clock is definitely believable. The other option is that the hotclock is there but it's so low that somebody assumed it was the base clock.
 
I would be really surprised if nVidia dumbed down their scheduler.
The current one is quite dumb already. It doesn't allow result forwarding, for instance, and the reading and writing of registers counts toward the latency of an operation. One could get rid of those things (even AMD's VLIW architectures didn't suffer from it, and GCN doesn't either) together with a slightly simpler scheduling scheme.
And if nVidia really makes their GPUs quite a bit broader (which will happen without a hotclock), they almost need simpler scheduling, as they will have quite a few more schedulers.

Actually, this would help to remedy their current efficiency problem with matrix multiplications too (something they promised for Kepler), as they will have more register files and more L1 caches providing bandwidth (the limiter right now). Going a step further towards a GCN clone (;)) would be to also use distributed register files, i.e. each vector ALU lane gets its own separate one (like AMD's GPUs; the VLIW architectures had a slightly more convoluted scheme, but each of the 16 VLIW groups in a SIMD engine also had its own regfile, closely integrated with the ALUs). This would help power efficiency quite a bit, as the distance between register file and ALU gets lowered significantly.

You see, there are some reasons for it. ;)
 
The current one is quite dumb already.

Perhaps but that's not a very good reason for making it even dumber :)

It doesn't allow result forwarding, for instance, and the reading and writing of registers counts toward the latency of an operation. One could get rid of those things (even AMD's VLIW architectures didn't suffer from it, and GCN doesn't either) together with a slightly simpler scheduling scheme.

AMD's result forwarding is a benefit of the static dependencies baked into the precompiled instruction clauses. I don't think you're suggesting nVidia will go down that road, so how would they achieve result forwarding by simplifying their architecture?

And if nVidia really makes their GPUs quite a bit broader (which will happen without a hotclock), they almost need simpler scheduling, as they will have quite a few more schedulers.

Or they could expand upon Fermi and dispatch multiple instructions per clock from the same scheduler ala GF104 (or CPUs for that matter). The current architecture can support a 3-4 wide instruction window per scheduler. With sufficient register bandwidth they can drive a much wider machine without dramatically increasing the number of schedulers.

Actually, this would help to remedy their current efficiency problem with matrix multiplications too (something they promised for Kepler), as they will have more register files and more L1 caches providing bandwidth (the limiter right now).

What would help remedy the matrix multiplication efficiency problem? GCN and preceding architectures are pretty much driven by static, precompiled instruction dependencies. Are you suggesting nVidia will adopt something similar?

Going a step further towards a GCN clone (;)) would be to also use distributed register files, i.e. each vector ALU lane gets its own separate one (like AMD's GPUs; the VLIW architectures had a slightly more convoluted scheme, but each of the 16 VLIW groups in a SIMD engine also had its own regfile, closely integrated with the ALUs). This would help power efficiency quite a bit, as the distance between register file and ALU gets lowered significantly.

IIRC nVidia's patents (operand collector et al) already point to a separate register file per SIMD lane. Potentially going back all the way to G80. What makes you think it's any different?
 
AMD's result forwarding is a benefit of the static dependencies baked into the precompiled instruction clauses. I don't think you're suggesting nVidia will go down that road, so how would they achieve result forwarding by simplifying their architecture?
How is Larrabee doing it? It has just a 4-cycle latency for all the simple arithmetic instructions and is afaik an in-order design (while Fermi with its scoreboarding scheme can dispatch out-of-order but has an 18+ cycle latency [18 cycles + 2 cycles per register conflict]).

By the way, GCN will feature a 4 cycle vector-to-vector (and scalar to vector) instruction latency (vector-to-scalar most probably 8 cycles though). Being a strictly in-order design it doesn't need to take into account any dependencies between vector instructions as throughput equals latency (both 4 cycles). So it can get quite a bit simpler.
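To put numbers on why throughput = latency makes the scheduling simpler, a minimal sketch (the Fermi side assumes the ~18 cycle figure from above and a 32-wide warp occupying a 16-wide pipe for 2 hot clocks):

[code]
import math

# Threads in flight (warps/wavefronts) a pipeline needs so that chains of
# dependent ALU instructions never leave it idle.
def threads_to_hide(dep_latency, issue_interval):
    return math.ceil(dep_latency / issue_interval)

# GCN: 4-cycle dependent latency, 4 cycles per wavefront on a 16-wide SIMD
print("GCN  :", threads_to_hide(4, 4), "wavefront")   # 1 -> no dependency tracking needed
# Fermi: ~18-cycle dependent latency, 2 hot clocks per warp per 16-wide pipe
print("Fermi:", threads_to_hide(18, 2), "warps")      # 9 -> the scoreboard has to juggle them
[/code]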
Or they could expand upon Fermi and dispatch multiple instructions per clock from the same scheduler ala GF104 (or CPUs for that matter). The current architecture can support a 3-4 wide instruction window per scheduler. With sufficient register bandwidth they can drive a much wider machine without dramatically increasing the number of schedulers.
Using that 4 instruction window with 4 issue ports would be a quite expensive way to make it a pseudo-dynamically scheduled VLIW4 machine. :LOL:
What would help remedy the matrix multiplication efficiency problem? GCN and preceding architectures are pretty much driven by static, precompiled instruction dependencies. Are you suggesting nVidia will adopt something similiar?
Fermi is limited by the amount of data you can get to the ALUs. You can remedy this by (i) significantly raising the size and bandwidth of the caches (or reducing the arithmetic throughput per cache), (ii) putting larger register files into the SMs (enabling more data per thread and more data reuse, reducing the needed cache bandwidth; that's how it works on Radeons, together with the next point), or (iii) reducing the arithmetic latencies, so a lower occupancy suffices for full utilization, which again allows more registers per thread and more data reuse.
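A toy model for point (ii): if each thread keeps an r x r block of the result in registers, the cache traffic per FLOP falls roughly as 1/r (illustrative arithmetic only, not a claim about any particular kernel):

[code]
# Bandwidth-per-FLOP model for register-blocked matrix multiplication.
# Per k-step a thread holding an r x r tile of C in registers loads r values of A
# and r values of B (2*r words) and performs r*r FMAs (2*r*r FLOPs).
def bytes_per_flop(r, word_bytes=4):
    loads = 2 * r * word_bytes
    flops = 2 * r * r
    return loads / flops

for r in (1, 2, 4, 8):
    print(f"r={r}: {bytes_per_flop(r):.2f} bytes of cache traffic per FLOP")
# Bigger r needs r*r accumulator registers per thread, i.e. larger register files
# (or lower latency, so fewer threads are needed and each can use more registers).
[/code]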
IIRC nVidia's patents (operand collector et al) already point to a separate register file per SIMD lane. Potentially going back all the way to G80. What makes you think it's any different?
That's simply not feasible. I'm with Jawed here.
Or how is the data getting from the register files embedded into each SIMD lane to another SIMD/vecALU within the same SM (think GF104 ;)) or to the SFUs?
Edit: Or have a look at a die shot ;)
 
How is Larrabee doing it? It has just a 4-cycle latency for all the simple arithmetic instructions and is afaik an in-order design (while Fermi with its scoreboarding scheme can dispatch out-of-order but has an 18+ cycle latency [18 cycles + 2 cycles per register conflict]).

I'm not that familiar with Larrabee but doesn't it interleave "hardware threads" to hide its pipeline latency? If I understand you right you're basically saying Fermi has a deep pipeline and GCN has a shallow one. What does that particular distinction have to do with the scheduler implementation?

By the way, GCN will feature a 4 cycle vector-to-vector (and scalar to vector) instruction latency (vector-to-scalar most probably 8 cycles though). Being a strictly in-order design it doesn't need to take into account any dependencies between vector instructions as throughput equals latency (both 4 cycles). So it can get quite a bit simpler.

Is Fermi's ALU pipeline 18 clocks because the scheduler is complex or simply because the pipeline is 18 clocks?

Using that 4 instruction window with 4 issue ports would be a quite expensive way to make it a pseudo-dynamically scheduled VLIW4 machine. :LOL:

Perhaps, but what's the alternative that you're suggesting exactly? That nVidia do away with scoreboarding and execute precompiled instruction clauses? That's the "GCN clone" point I'm trying to get clarified. Is it the static compilation, round robin instruction issue, short pipeline, large register files, all of the above? :)

That's simply not feasible. I'm with Jawed here.
Or how is the data getting from the register files embedded into each SIMD lane to another SIMD/vecALU within the same SM (think GF104 ;)) or to the SFUs?
Edit: Or have a look at a die shot ;)

Not embedded within the SIMD, dedicated per SIMD lane. At least that's how US patent 7,834,881 reads to me. Data moves from register files to the SIMDs via the operand collector.
 
I'm not that familiar with Larrabee but doesn't it interleave "hardware threads" to hide its pipeline latency?
All GPUs do it too. No difference there, only that Larrabee is limited to 4 hardware threads (matching the latency) while GPUs can do more.
If I understand you right you're basically saying Fermi has a deep pipeline and GCN has a shallow one. What does that particular distinction have to do with the scheduler implementation?

Is Fermi's ALU pipeline 18 clocks because the scheduler is complex or simply because the pipeline is 18 clocks?
As I said, the complexity (or dumbness) of the scheduling is a major factor. With Larrabee and the Radeons, reading and writing registers is outside of the critical loop that defines the latency. Fermi's scheduling does not allow that, so the time for reading and writing the regs adds to the time the ALUs actually need for the computation (a dependent operation is issued and starts to read its operands only after the preceding instruction has finished writing the register file; it is not allowed to overlap [only independent operations can]). The hotclock does not help to keep the latencies down either, but from the additional latency in the case of a register conflict, one can deduce that Fermi probably needs at least 4 (probably more) of the 18 cycles of latency for the register access alone.
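In numbers (my reading of the deduction; the only hard data points are the 18 cycles and the 2 cycles per conflict):

[code]
# Dependent-instruction latency on Fermi, per the figures quoted earlier.
BASE_LATENCY     = 18   # hot clocks, conflict-free case
CONFLICT_PENALTY = 2    # extra hot clocks per register bank conflict

def dep_latency(conflicts=0):
    return BASE_LATENCY + CONFLICT_PENALTY * conflicts

print(dep_latency(0), dep_latency(1), dep_latency(2))   # 18 20 22

# If an extra read from a busy bank costs 2 cycles, an ordinary operand read
# plausibly takes ~2 cycles too, and so does the result write-back; that puts
# at least ~4 of the 18 cycles into register access that Larrabee/GCN keep
# outside the dependent-latency loop.
[/code]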
Perhaps, but what's the alternative that you're suggesting exactly? That nVidia do away with scoreboarding and execute precompiled instruction clauses? That's the "GCN clone" point I'm trying to get clarified. Is it the static compilation, round robin instruction issue, short pipeline, large register files, all of the above? :)
Or a subset, or a combination with some other clever tricks. Who knows outside of nVidia? All nVidia has said so far is that programming gets easier, that the GPUs get significantly broader (which they probably wouldn't mention if the factor were <=2; it also makes a hotclock less likely), that the matrix multiplication efficiency will rise, and that performance/W will be significantly improved (>=2.5x) not only by the process change but also by reducing the distance data has to be driven across the die.
Make your own guesses! ;)
Not embedded within the SIMD, dedicated per SIMD lane. At least that's how US patent 7,834,881 reads to me. Data moves from register files to the SIMDs via the operand collector.
So that means each SM has (at least) an 8192 bit*) bus running from the register file over the 2 to 3 Vec16 ALUs, the L/S units and the SFUs to supply them with data (and take the results back). Sounds quite expensive and energy consuming to me (and it adds to the latency). ;)

Putting a register file at each vector ALU, you can get rid of this massive bus. GCN probably has only a 2048 bit bus to connect the vALUs to the local memory and the L1/TMUs (plus the narrow broadcast connections from the scalar regfiles and the local memory to the vALUs, which the vALUs can use directly as operands), as the reg files are embedded in each ALU and the operands of the arithmetic instructions don't have to travel over that bus at all.

NVidia said that reading the operands from the register file costs more energy than the multiply-add itself. Is nVidia taking their own claim of vastly increased energy efficiency really seriously? :rolleyes:

*): Take GF104/114:
3 vec16 ALUs + L/S + SFUs, 4 instructions every 2 clocks issued over 2 cycles, results in:
64 individual operations * (3 source operands + 1 result) * 32 bit = 8192 bit/clock
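The same footnote as code, with the GCN estimate from above for comparison (the 2048 bit decomposition as 4 SIMDs * 16 lanes * 32 bit is my assumption):

[code]
# GF104/114 SM: 4 warp instructions issued every 2 clocks, each a 32-wide warp,
# i.e. 4 * 32 / 2 = 64 scalar operations per clock crossing the operand bus.
ops_per_clock = 4 * 32 // 2
bits_per_op   = (3 + 1) * 32   # 3 source operands + 1 result, 32 bit each
print("GF104 SM operand bus:", ops_per_clock * bits_per_op, "bit/clock")   # 8192

# GCN CU: only lane-to-LDS/L1 traffic would cross a shared bus; ALU operands
# stay in the per-lane register files.
print("GCN CU shared bus   :", 4 * 16 * 32, "bit/clock")                   # 2048
[/code]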
 
GK104 is the successor to GF104/114 (GTX 460/560), so it's the performance part. If the rumors about 768 SPs are correct, that means a hotclock above 1.3 GHz (I don't buy that the whole chip will be clocked that high).

This would mean performance well above the GTX 560 Ti—raw FLOPS really don't matter much.
3DCenter also speculates/claims that GK104 will have ~1000 CCs at ~1 GHz clock, and ~1500 CCs for the GK100.
 
3DCenter also speculates/claims that GK104 will have ~1000 CCs at ~1 GHz clock, and ~1500 CCs for the GK100.

Whoa, that's new. Last time it was 1000 CCs for GK100, and even that would make it hard not to get Fermi round 2 on the new process, unless they've managed to shrink the chip in other areas a LOT.
 
Curious, I was expecting something like 768/64 for GK104 and 1024/64 for GK100, but I'm always surprised one way or another....
 
Curious, I was expecting something like 768/64 for GK104 and 1024/64 for GK100, but I'm always surprised one way or another....

Those are the more realistic expectations IMO, and they would still mean an extremely high risk of a Fermi pt. 2 for GK100.
 
3DCenter also speculates/claims that GK104 will have ~1000 CCs at ~1 GHz clock, and ~1500 CCs for the GK100.

I guess Kepler is a family of 22nm chips, then. That would explain the delays.

No but seriously, that's impossible unless NVIDIA made really drastic changes, like removing TMUs and ROPs and doing it all with the shaders, or maybe doing away with the hotclock domain. Neither option sounds very likely.
 