> 1:3 is impossible

Impossible? Wow, strong words here.
The SPP ratio clearly does not have to be a power of two. Cell, for example, has a 1:7 ratio. I'm sure there are a bunch of others too.
> Wasn't Cell 1:8, but one was disabled to increase yields? Besides, general practice seems to be power of two for various good reasons.

Correct: if you bought the IBM BladeCenters with Cell, you got CPUs with all 8 SPEs enabled.
> Impossible? Wow, strong words here.
> SPP ratio clearly does not have to be a power of two. Cell, for example, has a 1:7 ratio. I'm sure there are a bunch of others too.
Going by the Maxwell SMM diagram at http://www.hardware.fr/articles/928-23/gm204-smm-nouveautes.html, one way to get to 1:3 would be to share a 32-wide DP execution unit between three of the single-precision SMM partitions. I have no idea whether that's a good idea, but it seems possible at least.
> while Pascal should go more for stuff like proper Async Support

There is no indicator of this. Async execution of pure compute loads has worked since Kepler (when using CUDA, not the 3D queue), but mixed dispatch and draw-call loads aren't supported by either the frontend or the SM/X/M.
Page 7 of this NVIDIA presentation has a DP performance and bandwidth roadmap for Tesla GPUs.
Pascal: ~4000 DP GFLOPS, ~1000 GB/s
Volta: ~7000 DP GFLOPS, ~1200 GB/s
(GFLOPS and bandwidth seem to be accurate to 2 and 3 significant figures respectively)
My guess is 1:2 DP for the relevant chips for ~8000 SP GFLOPS on Pascal, which would give ~32 (enabled) "SMP"s at ~980 MHz, ~36 SMPs at ~870 MHz, and ~40 SMPs at ~780 MHz (assuming 128 SP CCs per "SMP," this also counts enabled SMPs only).
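For what it's worth, that guess is just peak-FLOPS arithmetic. A quick sketch (the 128 cores per SMP and 2 FLOPs per core per cycle via FMA are assumptions carried over from the post above, not anything NVIDIA has confirmed):

```python
# Back-of-envelope peak-FLOPS check for the ~8000 SP GFLOPS guess.
# Assumptions (hypothetical, per the post above): 128 SP cores per "SMP",
# 2 FLOPs per core per cycle (FMA), only enabled SMPs counted.

def sp_gflops(smps, mhz, cores_per_smp=128, flops_per_core=2):
    """Peak single-precision GFLOPS for a given SMP count and core clock."""
    return smps * cores_per_smp * flops_per_core * mhz / 1000

# The three configurations from the post all land near 8000 SP GFLOPS:
for smps, mhz in [(32, 980), (36, 870), (40, 780)]:
    print(f"{smps} SMPs @ {mhz} MHz -> ~{round(sp_gflops(smps, mhz))} SP GFLOPS")
```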
> I think your interpretation of the chart is wrong. The captions (Pascal, Volta, ...) in the chart indicate the computing power and memory bandwidth, not the squares. The squares are markers for the years.

That interpretation doesn't make sense for the existing chips:
Part    DP GFLOPS   DP GFLOPS   DP GFLOPS
        (Square)    (Caption)   (Actual)
K80     1900        2200        1864 (base), 2912 (max boost)
K40     1400        2000        1430 (base), 1680 (max boost)
K20     1200        1500        1173
M2090    600        1100         666

Part    GB/s        GB/s        GB/s
        (Square)    (Caption)   (Actual)
K40      290         360         288
K20      210         270         208
M2090    170         230         177
Square and Caption estimates are from rough pixel counting.
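As a quick sanity check on those readings (a throwaway sketch using the numbers from the table above; K80 is left out because its actual figure spans base and boost clocks), the Square estimates track the published specs much more closely than the Caption estimates:

```python
# Relative error of the two chart readings against published specs.
# Numbers copied from the table above; K80 omitted (base vs boost ambiguity).
readings = {  # part: (square_est, caption_est, actual)
    "K40 GFLOPS":   (1400, 2000, 1430),
    "K20 GFLOPS":   (1200, 1500, 1173),
    "M2090 GFLOPS": (600, 1100, 666),
    "K40 GB/s":     (290, 360, 288),
    "K20 GB/s":     (210, 270, 208),
    "M2090 GB/s":   (170, 230, 177),
}
for part, (square, caption, actual) in readings.items():
    sq_err = abs(square - actual) / actual
    cap_err = abs(caption - actual) / actual
    print(f"{part}: square off by {sq_err:.0%}, caption off by {cap_err:.0%}")
```

In every row the square lands far closer to the real spec than the caption does, which is what you'd expect if the squares are the data points and the captions are just labels.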
> Not a power of two, but you can't divide 128 by 3. You can have a bit over 1:3 or a bit under 1:3, just not the exact ratio, as long as you stay with 128 SP SMMs.

Yeah, sure! But you could theoretically have 128 SMMs that each produce a DP result in 3 cycles instead of 1, right...?
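The throughput argument checks out arithmetically: you can't carve 128 lanes into thirds, but having each lane deliver one DP result every 3 cycles gives an exact 1:3 rate anyway. A trivial sketch:

```python
# Hypothetical SMM: 128 SP lanes, one SP result per lane per cycle.
# If a DP result instead takes 3 cycles per lane, the DP:SP throughput
# ratio is exactly 1:3 even though 128 is not divisible by 3.
sp_lanes = 128
assert sp_lanes % 3 != 0          # can't split the lanes into thirds...

sp_per_cycle = sp_lanes           # 1 SP result per lane per cycle
dp_per_cycle = sp_lanes / 3       # 1 DP result per lane per 3 cycles
print(dp_per_cycle / sp_per_cycle)  # ...yet the rate ratio is exactly 1/3
```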
> There would be no point in that. If you use the same ALUs for the job, you should either be doing 1/4 (waste half the internal bandwidth when doing fp64, use the existing MUL resources) or 1/2 (use bandwidth efficiently, have twice the MUL resources you'd need for fp32).

In fact, there would be. Think 1:2 in terms of width and MUL-array utilization, but count in that the MUL array still has a higher latency when operating on fp64, same as all the other adders needed later in the FPU.
> In fact, there would be. Think 1:2 in terms of width and MUL-array utilization, but count in that the MUL array still has a higher latency when operating on fp64, same as all the other adders needed later in the FPU.

OK, I feel stupid now.
> I still suspect this is finally the generation where NVIDIA makes an HPC-only chip without a rasteriser. They've already gone in that direction with K80 by making a new chip with a bigger register file etc...

Is it possible GK210 already removed the raster/video-specific units and used the die savings for the doubled shared memory and register files? That would explain why the compute-only K80 is the only SKU that uses GK210.