got hold of some maxwell plans from nvidia. quite interesting stuff. there will be two lines; maybe the second one lines up with the finfet process, i don't know. anyway, this is what i gathered:
the smx structure changes slightly. nv optimized the dp alus so they can now also be used for sp: they support all sp instructions and can run in parallel with the sp alus. that means an smx now looks like it has 256 alus. technically that lowers maxwell's dp rate to 1:4, but in reality it just boosts sp performance compared to kepler. nv also found a way to gate off the unused parts of the dp alus to keep power down when doing sp work.
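to make the 256 and the 1:4 concrete, here is a tiny python sketch. it assumes the smx keeps kepler's (gk110) layout of 192 sp alus plus 64 dp alus, which is my reading of where the 256 comes from, not something the plans spell out. the function name smx_lanes is made up for illustration.

```python
KEPLER_SP_ALUS = 192  # sp alus per kepler smx
KEPLER_DP_ALUS = 64   # dp alus per gk110 smx

def smx_lanes(sp_alus, dp_alus, dp_can_do_sp):
    """return (sp lanes, dp lanes, sp lanes per dp lane) for one smx."""
    sp_lanes = sp_alus + (dp_alus if dp_can_do_sp else 0)
    return sp_lanes, dp_alus, sp_lanes // dp_alus

# kepler: dp alus sit idle during sp work
print(smx_lanes(KEPLER_SP_ALUS, KEPLER_DP_ALUS, dp_can_do_sp=False))
# rumored maxwell (assumption: same alu counts, dp alus also issue sp)
print(smx_lanes(KEPLER_SP_ALUS, KEPLER_DP_ALUS, dp_can_do_sp=True))
```

so the dp throughput per smx per clock stays the same; the ratio only moves from 1:3 to 1:4 because the sp side grows.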
but the real changes are in the cache area. that will boost the efficiency big time.
first off, the registers are doubled per smx. more threads that each use a lot of registers can now run in parallel and hide latencies better. the caches got bigger as well. the l1 cache, which doubles as shared memory, is now 128kb (doubled) and can be split between cache and shared memory as 32/96, 64/64, or 96/32. maxwell keeps the 16 tmus per smx.
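a rough sketch of why the doubled registers matter for register-heavy kernels. it starts from kepler's 65536 32-bit registers and 2048 max resident threads per smx, and assumes (my assumption, not from the plans) that the thread limit stays at 2048 and that allocation granularity can be ignored. resident_threads is a made-up helper name.

```python
KEPLER_REGS_PER_SMX = 65536  # 32-bit registers per kepler smx
MAX_THREADS_PER_SMX = 2048   # kepler limit, assumed unchanged here

def resident_threads(regs_per_thread, regs_per_smx,
                     max_threads=MAX_THREADS_PER_SMX):
    """threads that fit in the register file, capped by the thread limit."""
    return min(max_threads, regs_per_smx // regs_per_thread)

for regs in (32, 64, 128):
    kepler = resident_threads(regs, KEPLER_REGS_PER_SMX)
    doubled = resident_threads(regs, 2 * KEPLER_REGS_PER_SMX)
    print(f"{regs} regs/thread: kepler {kepler}, doubled {doubled}")
```

under these assumptions a kernel using 64 registers per thread goes from half occupancy to full, which is where the better latency hiding would come from.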
a gpc usually consists of 3 smx, but it got changed quite a bit. there is still the geometry engine and so on, but each gpc now includes 768kb of l2 cache, which backs the r/w l1 as well as the read-only texture l1 in the smxs and also serves as instruction cache for the smx. all of this is topped off with a much larger l3 cache than in kepler. now some numbers for the first line.
gm100:
8 gpc (8 triangles per clock), 24 smx, 384 tmus, 6144 alu, 8mb l3 (and there are also 8 l2s in the gpcs!), 64 rops, 512 bit interface, up to 8 gb @ 6+ ghz
target frequency for geforce: 930mhz, boost 1ghz
target frequency for tesla: 850mhz, which gives 2.61 dp tflops, double that of kepler; comes with 16gb
gm104:
5 gpc, 15 smx, 240 tmu, 3840 alu, 4mb l3, 40 rops, 320 bit interface (7 ghz), 2.5gb for cheap models, probably a lot of asymmetric 3gb or (symmetric again) 5gb models, target 1+ ghz, does dp only at a 1:16 rate
gm106:
3 gpc, 9 smx, 144 tmu, 2304 alu, 4mb l3, 24 rops, 192 bit interface, 7ghz, 3gb ram
gm108:
2 gpc, 4 smx, 64 tmu, 1024 alu, 2mb l3, 16 rops, 128bit interface, 2 gb ram
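quick sanity check on the numbers above, as a rough python sketch. it assumes 256 alus and 16 tmus per smx and 768kb l2 per gpc as described, one fma (2 flops) per dp lane per clock, and it treats the memory "ghz" figures as effective per-pin data rates in gbps (that last part is my reading, not something the plans state). CHIPS, derived and dp_tflops are made-up names.

```python
ALUS_PER_SMX = 256
TMUS_PER_SMX = 16
L2_KB_PER_GPC = 768

# gbps = memory data rate per pin; None where no speed was given
CHIPS = {
    "gm100": dict(gpcs=8, smx=24, bus_bits=512, gbps=6.0),  # "6+", lower bound
    "gm104": dict(gpcs=5, smx=15, bus_bits=320, gbps=7.0),
    "gm106": dict(gpcs=3, smx=9, bus_bits=192, gbps=7.0),
    "gm108": dict(gpcs=2, smx=4, bus_bits=128, gbps=None),
}

def derived(chip):
    """alu/tmu counts, total l2, and memory bandwidth from the base specs."""
    out = {
        "alus": chip["smx"] * ALUS_PER_SMX,
        "tmus": chip["smx"] * TMUS_PER_SMX,
        "l2_kb": chip["gpcs"] * L2_KB_PER_GPC,
    }
    if chip["gbps"] is not None:
        # bus width in bytes times data rate per pin gives gb/s
        out["bandwidth_gbs"] = chip["bus_bits"] / 8 * chip["gbps"]
    return out

def dp_tflops(alus, clock_mhz, dp_ratio):
    """peak dp tflops: dp lanes * 2 flops (fma) * clock."""
    return alus / dp_ratio * 2 * clock_mhz * 1e6 / 1e12

for name, chip in CHIPS.items():
    print(name, derived(chip))

# tesla gm100 at 850mhz with the 1:4 dp rate
print(round(dp_tflops(6144, 850, 4), 2))
```

the alu and tmu counts and the 2.61 dp tflops all fall out of the smx counts, so the list is at least internally consistent.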
but the really interesting part is the refresh, which is probably waiting for tsmc's finfets. 64 bit arm cores developed by nv then get integrated on the same die. they can coherently access the common l3 cache. the big thing is that the graphics driver will use them to offload some heavy lifting from the system cpu. basically most of the driver will be running on the gpu itself! nvidia expects this to give them at least the same speedup amd gets from mantle, but without a new api, just with straight dx11 or opengl code! it will also help with the new cuda version for maxwell, where one can access both gpu and cpu cores seamlessly.
the specs are planned to stay almost the same for gm110/114/116, except that the 110 gets the full 8 ARMv8 cores and a doubled l3 (16mb!) compared to the gm100. the finfets may also allow a further speed boost. the 8 arm core version is actually called gm110soc, so maybe nv will start to market it as a standalone processor for hpc. the consumer version is likely cut down to 4 arm cores, the same as the gm114 gets (which also gets a doubled l3 of 8mb). the gm116 will only get 2 cpu cores on die. i have not seen a gm118 mentioned.