[...] I think Nvidia is playing it rather safe with most of their bets.
Maybe this is the real reason?
Without major stunts (actually I cannot think of anything that would work), you would lose half your potential FP32 throughput as soon as you can no longer find instructions to pair, on top of the more complex instruction routing.
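Rough back-of-the-envelope, using my own toy model of how such pairing would behave (the pairing-fraction parameter and the function are just illustrative assumptions, not anything NVidia has described):

```python
# Toy model (an assumption, not a description of any real scheduler): the FP32
# peak assumes two FP32 ops co-issued per wide lane per cycle; in a fraction p
# of cycles a partner instruction is found, in the rest a single op issues and
# half the lane sits idle.
def fp32_utilization(p):
    """Fraction of nominal FP32 peak reached for a pairing fraction p in [0, 1]."""
    return (1.0 + p) / 2.0

for p in (0.0, 0.5, 0.9, 1.0):
    print(f"pairing fraction {p:.1f} -> {fp32_utilization(p):.0%} of peak")
# p = 0.0 is the "lose half your potential FP32 throughput" worst case.
```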
That's only if you think of SP as being implemented like the actual HP case. Intel uses ganging instead: SP is two real lanes, and DP is those two lanes working in concert.
Of course there's a good chance I'm wrong, but for a couple of generations now power, not area, seems to have been the main concern.
No doubt, power has been a tight constraint for quite a while now: but NVidia keeps telling us that computation is not the power hog, it's routing data into and out of the ALUs. Routing and area must interact. It seems likely to me that routing operands either to SP or to DP ALUs and then routing results back hurts power efficiency (the data has to span a larger overall area).
Having dedicated SP and DP ALUs allows one set to be turned off while the other is working. On the other hand, multipliers built from repeating blocks of functionality and used for both SP and DP can power down the blocks that are only needed for DP while doing SP.
GP102 is probably Maxwell-like in the quantity of DP it offers. Does GP102 have more SP ALU capability than GP100?
Is GP102 power limited in its SP capability?
So you would have to have a whole line of your GPUs totally dedicated to HPC,
Isn't that precisely what GP100 is?
or your other chips (think about power) would have to carry the more complex multipliers (and adders and muxes) in their guts as well. Correct me if I'm wrong, but even 53x53 MULs would be OK for DP, right? For iterating over the SP ALUs, you'd need 27x27, which is a ~26.5% increase over the 24-bit MULs. I am not sure if you can effectively mask the additional bits out so they no longer use any energy.
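For what it's worth, here is a quick sanity check of that arithmetic in plain Python (just an illustration of the splitting idea, not anyone's actual datapath; the 27-bit split and the width-squared area estimate are my own assumptions):

```python
import random

def mul53_via_27x27(a, b):
    """Multiply two 53-bit mantissas using only 27x27-bit partial products."""
    assert a < (1 << 53) and b < (1 << 53)
    mask = (1 << 27) - 1
    a_lo, a_hi = a & mask, a >> 27   # 27-bit low half, up-to-26-bit high half
    b_lo, b_hi = b & mask, b >> 27
    # Four partial products, each small enough for a 27x27 multiplier
    ll = a_lo * b_lo
    lh = a_lo * b_hi
    hl = a_hi * b_lo
    hh = a_hi * b_hi
    return ll + ((lh + hl) << 27) + (hh << 54)

# Check against full-width multiplication
for _ in range(10_000):
    a, b = random.getrandbits(53), random.getrandbits(53)
    assert mul53_via_27x27(a, b) == a * b

# Rough area estimate: array-multiplier area scales roughly with width squared,
# so 27x27 vs the 24x24 needed for SP alone:
print(f"27x27 vs 24x24 area ratio: {27**2 / 24**2:.3f}")  # ~1.266, i.e. the ~26.5% above
```

So four passes over 27x27 units cover the 53x53 DP product, at the cost of each SP multiplier being roughly a quarter larger than it strictly needs to be.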
I would expect a modern design to switch off the paths that aren't required in SP mode. Intel's design (being multi-precision) is the obvious place where this should be the case. But does anyone know if that's what's happening?
Intel already has these AVX-based ALUs and plans on using them in their regular Xeon processors as well. They need the code compatibility, which has been touted as (one of) the big advantages of going Intel from day one.
Absolutely. I talked about this earlier (NVidia doing the noble thing, Intel buying bums on seats.)
My feeling is that maybe even with Volta we might see a completely separate lineup for HPC (FP64+FP32 focus), deep learning (FP16+INT8 focus), and other uses such as gaming (FP32 focus plus whatever you can cram in for free, INT8?).
Volta isn't that far away it seems, so that all sounds reasonable.
But x86 or Phi coupled to on-package or on-die FPGA functionality shouldn't be more than 5 years away. And that, in my view, marks the end of GPUs in HPC of any kind.