One of the great gifts of hardware.fr is this page, where all recent SMs are placed next to each other with their internal architecture, and you can immediately see the changes as you mouse over the different links. Compare a GK110 against a GK104 and it's obvious that, in broad strokes, there isn't much difference in architecture. There is no reason to believe that something similar can't be done for the Maxwell SM. In other words, I do think that it's mostly a matter of tacking on more FP64 units.
I don't see why the cache should be in any way different. Neither do I see why the register file and the way the array is fed would have to change significantly. At best, FP64 will have half the performance of FP32, so the logical way to go about it is to simply use 2 adjacent 32-bit registers and fetch them sequentially. And if Pascal implements FP16 the way it's done in Tegra X1 (see this Anandtech article), it doesn't require major architectural plumbing either.
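To make the register-pairing idea a bit more concrete, here is a rough host-side sketch in C++. It is only a toy model, not a description of how any actual SM is wired: `RegisterFile`, `write_fp64` and `read_fp64` are names I made up for illustration, the 8-entry size is arbitrary, and putting the low word in the lower-numbered register is an assumption. It also assumes `double` is a 64-bit IEEE 754 type on the host.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Assumption: the host double is a 64-bit IEEE 754 value.
static_assert(sizeof(double) == 8, "this sketch assumes a 64-bit double");

// Hypothetical toy model of one thread's registers: eight 32-bit entries.
using RegisterFile = std::array<std::uint32_t, 8>;

// Store one FP64 value across two adjacent 32-bit registers:
// register r holds the low word, register r + 1 the high word (assumed order).
void write_fp64(RegisterFile& rf, std::size_t r, double value) {
    assert(r + 1 < rf.size());
    std::uint64_t bits;
    std::memcpy(&bits, &value, sizeof bits);  // reinterpret without aliasing issues
    rf[r]     = static_cast<std::uint32_t>(bits);
    rf[r + 1] = static_cast<std::uint32_t>(bits >> 32);
}

// Read the operand back with two sequential 32-bit fetches, then recombine.
double read_fp64(const RegisterFile& rf, std::size_t r) {
    assert(r + 1 < rf.size());
    std::uint64_t lo = rf[r];      // first fetch
    std::uint64_t hi = rf[r + 1];  // second fetch
    std::uint64_t bits = (hi << 32) | lo;
    double value;
    std::memcpy(&value, &bits, sizeof value);
    return value;
}
```

Writing a value with `write_fp64(rf, 2, x)` and reading it back with `read_fp64(rf, 2)` returns `x` unchanged, and nothing about the register file itself had to get wider. The cost is the second fetch per operand, which fits with FP64 running at half the FP32 rate at best.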
Can they be done simultaneously in the same pipeline? (without the huge performance penalty we see right now)
This is what I was talking about; it took me a while to find the document:
http://docs.nvidia.com/gameworks/co...daexperiments/kernellevel/pipeutilization.htm