Is the GK104 the low end piece? 2TFLOPs is pretty weak these days isn't it? The 7970 is rumoured to be 3.5-7TFLOPs...
I wonder, could this mean Kepler will represent a move to a more compact ALU architecture, with emphasis on wider instruction/thread parallelism? GF104 already hinted at something in this direction with its expanded dual-issue warp scheduling.
I'm almost inclined to think that Kepler will look somewhat similar to GCN!
> I would be really surprised if nVidia dumbed down their scheduler. There are benefits to be had in irregular compute workloads.

Ooh, what are they?
> I would be really surprised if nVidia dumbed down their scheduler.
The current one is quite dumb already.
It doesn't allow result forwarding, for instance, and the reading and writing of registers count toward the latency of an operation. One could get rid of those things (even AMD's VLIW architectures didn't suffer from them, and neither does GCN) together with a slightly simpler scheduling scheme.
And if nVidia really makes their GPUs quite a bit broader (which will happen without a hotclock), they almost need simpler scheduling, as they will have quite a few more schedulers.
Actually, this would help to remedy their current efficiency problem with matrix multiplications too (something they promised for Kepler), as there will be more register files and more L1 caches providing bandwidth (the limiter right now).
Going a step further towards a GCN clone would be to also use distributed register files, i.e. each vector ALU lane gets its own separate one (like AMD's GPUs; the VLIW architectures had a slightly more convoluted scheme, but each of the 16 VLIW groups in a SIMD engine also had its own regfile, closely integrated with the ALUs). This would help power efficiency quite a bit, as the distance between register file and ALU gets lowered significantly.
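To put rough numbers on that distance argument, here is a sketch comparing the energy of one FMA against the energy of hauling its operands across an SM. The per-bit wire energy and per-FLOP energy are order-of-magnitude ballparks for a ~40 nm-era process, i.e. my assumptions, not vendor figures:

```python
# Ballpark energies, ~40 nm era; order of magnitude only (assumed, not vendor data).
PJ_PER_BIT_MM = 0.1   # ~0.1 pJ to move one bit 1 mm across the die
FLOP_PJ = 10.0        # ~10 pJ for one 32-bit fused multiply-add

def fma_transfer_pj(distance_mm):
    """Energy to move one FMA's register traffic (3 reads + 1 write, 32 bit each)."""
    return 4 * 32 * distance_mm * PJ_PER_BIT_MM

# Shared regfile a couple of mm away vs. a per-lane regfile right next to the ALU:
for d_mm in (2.0, 0.2):
    print(f"{d_mm} mm: {fma_transfer_pj(d_mm):.1f} pJ transfer vs {FLOP_PJ} pJ compute")
```

Even with these crude numbers the operand transport dwarfs the arithmetic once the register file sits millimeters away, which is presumably why per-lane register files pay off.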
> IIRC nVidia's patents (operand collector et al.) already point to a separate register file per SIMD lane.

You really need to go re-read them.
> AMD's result forwarding is a benefit of the static dependencies baked into the precompiled instruction clauses. I don't think you're suggesting nVidia will go down that road, so how would they achieve register forwarding by simplifying their architecture?

The way Larrabee is doing it: it has just a 4-cycle latency for all the simple arithmetic instructions and is AFAIK an in-order design (while Fermi with its scoreboarding scheme can dispatch out of order, but has an 18+ cycle latency [18 cycles plus 2 cycles per register conflict]).

By the way, GCN will feature a 4-cycle vector-to-vector (and scalar-to-vector) instruction latency (vector-to-scalar most probably 8 cycles, though). Being a strictly in-order design, it doesn't need to take into account any dependencies between vector instructions, as throughput equals latency (both 4 cycles). So it can get quite a bit simpler.

> Or they could expand upon Fermi and dispatch multiple instructions per clock from the same scheduler a la GF104 (or CPUs, for that matter). The current architecture can support a 3-4 wide instruction window per scheduler. With sufficient register bandwidth they can drive a much wider machine without dramatically increasing the number of schedulers.

Using that 4-instruction window with 4 issue ports would be a quite expensive way to make it a pseudo-dynamically scheduled VLIW4 machine.

> What would help remedy the matrix multiplication efficiency problem? GCN and preceding architectures are pretty much driven by static, precompiled instruction dependencies. Are you suggesting nVidia will adopt something similar?

Fermi is limited by the amount of data you can get to the ALUs. You can remedy this by (i) significantly raising the size and bandwidth of the caches (or reducing the arithmetic throughput per cache), (ii) putting larger register files into the SMs (enabling more data per thread and more data reuse, reducing the needed cache bandwidth; that's how it works on the Radeons, together with the next point), or (iii) reducing the arithmetic latencies, so that a lower occupancy suffices to attain full utilization, again allowing more registers per thread and more data reuse.

> IIRC nVidia's patents (operand collector et al.) already point to a separate register file per SIMD lane. Potentially going back all the way to G80. What makes you think it's any different?

That's simply not feasible. I'm with Jawed here. Or how is the data getting from the register files embedded into each SIMD lane to another SIMD/vecALU within the same SM (think GF104) or to the SFUs?

Edit: Or have a look at a die shot.
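The latency/occupancy trade-off can be made concrete with a bit of Little's-law arithmetic: how many warps per scheduler are needed to cover the dependent-instruction latency, and what that leaves per thread out of a Fermi-sized 128 KiB register file. The 18- and 4-cycle figures are the ones quoted above; one instruction per cycle per scheduler, 2 schedulers per SM, and 32 lanes per warp are my assumptions:

```python
# Sketch: lower dependent-op latency -> fewer warps needed -> more registers
# per thread (i.e. more data reuse). Assumed: 1 instruction/cycle/scheduler,
# 2 schedulers per SM, 32 lanes/warp, 128 KiB regfile (32768 x 32-bit) per SM.
REGFILE_REGS = 32768
LANES = 32

def warps_to_hide(dep_latency_cycles, issue_per_cycle=1):
    # Little's law: instructions in flight = latency x issue rate.
    return dep_latency_cycles * issue_per_cycle

def regs_per_thread(dep_latency_cycles, schedulers=2):
    warps = warps_to_hide(dep_latency_cycles) * schedulers
    return REGFILE_REGS // (warps * LANES)

print(regs_per_thread(18))  # Fermi-like 18-cycle latency -> 28 regs/thread
print(regs_per_thread(4))   # Larrabee/GCN-like 4-cycle latency -> 128 regs/thread
```

With the same register file, cutting the latency from 18 to 4 cycles more than quadruples the registers each thread can hold while still keeping the ALUs busy.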
Is Fermi's ALU pipeline 18 clocks because the scheduler is complex or simply because the pipeline is 18 clocks?
> I'm not that familiar with Larrabee, but doesn't it interleave "hardware threads" to hide its pipeline latency?

All GPUs do it too. No difference there, only that Larrabee is limited to 4 hardware threads (matching the latency) while GPUs can do more.
> If I understand you right, you're basically saying Fermi has a deep pipeline and GCN has a shallow one. What does that particular distinction have to do with the scheduler implementation? Is Fermi's ALU pipeline 18 clocks because the scheduler is complex or simply because the pipeline is 18 clocks?

As I said, the complexity (or dumbness) of the scheduling is a major factor. With Larrabee and the Radeons, reading and writing registers is outside of the critical loop defining the latency. Fermi's scheduling does not allow that, so the time for reading and writing the registers adds to the time the ALUs actually need for the computation (a dependent operation is issued and starts to read its operands only after the preceding instruction has finished writing the register file; it is not allowed to overlap [only independent operations can]). The hotclock does not help to keep the latencies down either, but from the additional latency in case of a register conflict one can deduce that Fermi probably needs at least 4 (probably more) of its 18 cycles for the register access alone.
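That "at least 4 of the 18 cycles" deduction, spelled out: the 18-cycle base latency and the +2 cycles per register conflict are the figures from the thread; treating one serialized register access as costing those same 2 cycles is the assumption:

```python
# Fermi dependent-op latency: 18 cycles base, +2 per register bank conflict.
# A conflict is just one more serialized register access, so one access ~ 2 cycles.
BASE_LATENCY = 18
CYCLES_PER_ACCESS = 2   # from the +2-cycles-per-conflict figure

def latency(conflicts):
    return BASE_LATENCY + CYCLES_PER_ACCESS * conflicts

# Even conflict-free, at least one operand read and the result writeback sit on
# the critical path, giving a lower bound on register cycles inside the 18:
min_reg_cycles = 2 * CYCLES_PER_ACCESS
print(latency(0), latency(2), min_reg_cycles)  # 18 22 4
```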
> Perhaps, but what's the alternative that you're suggesting exactly? That nVidia do away with scoreboarding and execute precompiled instruction clauses? That's the "GCN clone" point I'm trying to get clarified. Is it the static compilation, round-robin instruction issue, short pipeline, large register files, all of the above?

Or a subset, or a combination with some other clever tricks. Who knows outside of nVidia? All nV has said so far is that programming gets easier, the GPUs get significantly broader (which they probably wouldn't mention if the factor were <=2, which makes a hotclock less likely), the matrix multiplication efficiency will rise, and that performance/W will be significantly improved (>=2.5x), not only by the process change but also by reducing the distance data has to be driven across the die.
> Not embedded within the SIMD, dedicated per SIMD lane. At least that's how US patent 7,834,881 reads to me. Data moves from register files to the SIMDs via the operand collector.

So that means each SM has (at least) an 8192 bit*) bus running from the register file over the 2 to 3 Vec16 ALUs, the L/S units and the SFUs to supply them with data (and take the results back). Sounds quite expensive and energy-consuming to me (and it adds to the latency).
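For what it's worth, the 8192-bit figure falls out of simple port arithmetic. Assumptions here: 32-bit operands, FMA traffic of 3 source reads plus 1 result write per lane per cycle, and the L/S units plus SFUs modeled as one extra Vec16-wide consumer (my simplification):

```python
# Operand bus width a central register file would need to feed a Fermi-style SM.
BITS = 32
FMA_PORTS = 3 + 1   # 3 operand reads + 1 result writeback per lane per cycle

def bus_bits(vec16_alus, extra_vec16_clients=1):
    # extra_vec16_clients: L/S units and SFUs lumped into one Vec16-wide consumer
    lanes = 16 * (vec16_alus + extra_vec16_clients)
    return lanes * BITS * FMA_PORTS

print(bus_bits(2))  # GF100-style SM, 2 Vec16 ALUs -> 6144 bits
print(bus_bits(3))  # GF104-style SM, 3 Vec16 ALUs -> 8192 bits
```

With a GF104-style third Vec16 ALU the naive total lands exactly on 8192 bits per clock, which is the point of the objection: that much centralized bandwidth is expensive in area, energy, and latency.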
GK104 is the successor to GF104/114 (GTX 460/560), so it's the performance part. If the rumors about 768 SPs are correct, that means a hotclock above 1.3 GHz (I don't buy that the whole chip will be clocked that high).
This would mean performance well above the GTX 560 Ti; raw FLOPS really don't matter much.
3DCenter also speculates/claims that GK104 will have ~1000 CCs at ~1 GHz clock, and ~1500 CCs for the GK100.
Curious, I was expecting something like 768/64 for GK104 and 1024/64 for GK100, but I'm always surprised one way or another...
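A quick sanity check of the rumored configurations against the "2 TFLOPs" from the start of the thread, assuming 2 FLOPs per CC per cycle (i.e. single-precision FMA):

```python
# Peak single-precision throughput: 2 FLOPs (FMA) per ALU per clock.
def peak_tflops(alus, clock_ghz):
    return 2 * alus * clock_ghz / 1000.0

print(round(peak_tflops(768, 1.3), 2))    # 768 SPs on a ~1.3 GHz hotclock -> 2.0
print(round(peak_tflops(1000, 1.0), 2))   # 3DCenter's GK104 guess -> 2.0
print(round(peak_tflops(1500, 1.0), 2))   # 3DCenter's GK100 guess -> 3.0
```

Both GK104 rumors land on roughly the same ~2 TFLOPS despite the very different ALU counts, which is exactly the hotclock-versus-width trade-off being debated above.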