NVIDIA Maxwell Speculation Thread

My problem remains that although GK210 gets only 2.9 TFLOPs DP on paper, its real-world efficiency (K80 vs. K40, see NV's own graph above) leaves a lot to be desired, and it's not particularly convincing with its 300W TDP either. From a hypothetical GM200 you should get at least 2.5 TFLOPs at the usual 225W TDP for Teslas.
Do we know for sure that the maximum DP rate of Maxwell is 1:2? If it is lower for whatever reason then that could explain a lot. (There is this post but I can't find the doc either, the closest I got is this.)
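As a rough sanity check on the efficiency argument, here is a minimal sketch comparing DP GFLOPS per watt, assuming the 2.9 TFLOPs/300W and 2.5 TFLOPs/225W figures above (speculative discussion numbers, not official specs):

```python
# Rough DP perf-per-watt comparison using the figures quoted above
# (speculative discussion numbers, not official specs).
def dp_gflops_per_watt(peak_dp_tflops: float, tdp_watts: float) -> float:
    """Peak double-precision GFLOPS delivered per watt of TDP."""
    return peak_dp_tflops * 1000.0 / tdp_watts

# GK210-based K80: ~2.9 DP TFLOPs at a 300W board TDP
print(f"K80 (2x GK210): {dp_gflops_per_watt(2.9, 300):.1f} GFLOPS/W")
# Hypothetical GM200 Tesla: >= 2.5 DP TFLOPs at the usual 225W
print(f"GM200 (hyp.):   {dp_gflops_per_watt(2.5, 225):.1f} GFLOPS/W")
```

That comes out to roughly 9.7 vs. 11.1 GFLOPS/W, i.e. even the modest hypothetical GM200 figure would beat K80 on peak perf/W.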
 
Do we know for sure that the maximum DP rate of Maxwell is 1:2? If it is lower for whatever reason then that could explain a lot. (There is this post but I can't find the doc either, the closest I got is this.)

The problem today is that rumors point to them basically not adding more FP64 units compared to GM204 (1 FP64 unit / SMM). I would maybe bet on a 1:8 DP rate anyway...

AMD doesn't use additional FP units (well, they use additional hardware and scalar units), but their FP32 units are able to do the FP64 work. I don't understand why NVIDIA doesn't go this route instead of using additional units for it.
 
Do we know for sure that the maximum DP rate of Maxwell is 1:2? If it is lower for whatever reason then that could explain a lot. (There is this post but I can't find the doc either, the closest I got is this.)
We don't know anything at all. 1:3 is unlikely because 128 isn't divisible by 3, but it could be 1:2 or 1:4 or 1:8, etc. And if it's not intended to be used for Tesla, then it will likely stay at 1:32.
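A quick illustration of that divisibility argument, as a sketch assuming a GM204-style SMM with 128 FP32 lanes: a 1:N DP rate only maps cleanly onto the SMM if 128 divides by N.

```python
# Which 1:N DP rates map cleanly onto a 128-lane SMM?
FP32_LANES_PER_SMM = 128  # GM204-style SMM

for n in (2, 3, 4, 8, 32):
    dp_units, remainder = divmod(FP32_LANES_PER_SMM, n)
    if remainder == 0:
        print(f"1:{n:<2} -> {dp_units} DP units/SMM")
    else:
        print(f"1:{n:<2} -> doesn't divide evenly")
```

This prints 64, 32, 16, and 4 DP units per SMM for 1:2, 1:4, 1:8, and 1:32 respectively, while 1:3 fails the divisibility check.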
 
AMD doesn't use additional FP units (well, they use additional hardware and scalar units), but their FP32 units are able to do the FP64 work. I don't understand why NVIDIA doesn't go this route instead of using additional units for it.
A hybrid ALU that can do both 32 and 64 bit operations saves transistors but costs power due to keeping more of the transistors active during compute. There are also likely pipeline bubble losses if you're mixing 32 and 64 and the ALU has to change modes.

A separate unit costs transistors, but can be idled when unneeded, saving a lot of power. This is very common for 64 bit units which may never be fired up in the whole lifetime of a gamer's GPU use.

GPUs are now more power constrained than transistor limited, so spending the die area on separate 64 bit ALUs is worthwhile.
 
AMD doesn't use additional FP units (well, they use additional hardware and scalar units), but their FP32 units are able to do the FP64 work. I don't understand why NVIDIA doesn't go this route instead of using additional units for it.

AMD's traditional method for doing this is to have the SIMD loop the same operation over the hardware multiple times, which has a throughput and power cost. Area-wise, it would be smaller than having an FP64 unit, although it would probably be incrementally larger than having FP32-only units, since some things like wider intermediate values need to be maintained.
A unit that can properly run at 2 SP or 1 DP for all operations would be larger and less efficient than one that could only do DP, but I believe it should be more area-efficient than having to tote around 2 SP and 1 DP units to offer the same functionality. With clock and power gating, however, the split units can be more efficient when the unit that isn't needed is off, which I believe was one of the reasons cited for IMG's latest architecture having separate FP16 and FP32 units.

I unfortunately can't seem to find a good comparison of the various solutions.

The 1:2 rate for Hawaii could indicate a more hardware-intensive method is being employed than previously.
 
A hybrid ALU that can do both 32 and 64 bit operations saves transistors but costs power due to keeping more of the transistors active during compute. There are also likely pipeline bubble losses if you're mixing 32 and 64 and the ALU has to change modes.

A separate unit costs transistors, but can be idled when unneeded, saving a lot of power. This is very common for 64 bit units which may never be fired up in the whole lifetime of a gamer's GPU use.

GPUs are now more power constrained than transistor limited, so spending the die area on separate 64 bit ALUs is worthwhile.

OK, but that is exactly what the GCN architectures like: they need to be filled (not to say over-filled); they don't like to burn power for nothing. Basically, the scalar units and the other units take over the instructions that can be handled alongside the "hybrid" FP32/64 cores (it's just a question of additional instruction support).

They are two different approaches to the problem, and I admit both have their losses and wins... but let's be honest, this reminds me of the situation of AMD's 64-bit instructions vs. Intel's Itanium. I see a lot more wins on AMD's side; they were the first to bring a 1:2 DP rate to the table, after all, where NVIDIA was really constrained: their 1:3 DP rate was really theoretical, and when the software wasn't aligned, we were closer to 1:4 or even less on average. GCN uses scalar units to do the complementary work and dispatches branches to allow high occupancy, and this works really well. With all their wavefront configurations, you get a truly capable parallel system.

I will be honest: when I say I don't understand why "NVIDIA doesn't take the same road", I'm lying a bit. Pascal and Volta will use the same scalar approach as GCN. The system NVIDIA uses costs them too many transistors to be viable, and if it works well for AMD, I don't see why they won't use it.

I am 100% sure they will go the same road... but I should have started by saying I understand why they haven't done it for Maxwell already.

@Dilletante, you forget the scalar units. That said, you are right: with a GCN chip you want it over-filled; occupancy should be at its maximum, and then you see all its power (but isn't that what we want when computing?)...
 
OK, but that is exactly what the GCN architectures like: they need to be filled (not to say over-filled); they don't like to burn power for nothing.

They are two different approaches to the problem, and I admit both have their losses and wins... but let's be honest, this reminds me of the situation of AMD's 64-bit instructions vs. Intel's Itanium. I see a lot more wins on AMD's side; they were the first to bring a 1:2 DP rate to the table, after all, where NVIDIA was really constrained. GCN uses scalar units to do the complementary work and dispatches branches to allow high occupancy, and this works really well.

I will be honest: when I say I don't understand why "NVIDIA doesn't take the same road", I'm lying a bit. Pascal and Volta will use the same scalar approach as GCN.
GF100 had 1/2 DP back in 2009.
 
GF100 had 1/2 DP back in 2009.

Not a real 1:2 DP... it was 1:4. I would need real proof of it (not that I'm saying you lie, but on Fermi it was impossible to attain a real 1:2 DP). I could be wrong, but I would be surprised. As far as I remember, Fermi had a 1:4 DP rate (but maybe I'm wrong, and I apologize in advance).
 
There is always a difference between peak and sustained performance. But this discussion has been about architecture, and it's important to get the facts correct.

You are right, but NVIDIA was just doing its marketing thing. Sustained or not, they never approached the 1:2 DP rate; even on the hardware side it was impossible (today it is easy to calculate). 1:3 was the best case in theory, and on average with Fermi it was around 1:4 or worse... That's why they now use DGEMM as the reference for DP in their presentations.

That said, this was already excellent for the period... it was more than we were waiting for and more than we expected.
 
The problem today is that rumors point to them basically not adding more FP64 units compared to GM204 (1 FP64 unit / SMM). I would maybe bet on a 1:8 DP rate anyway...
I was thinking more along the lines of whether a Maxwell SMM can hypothetically support a 1:2 DP rate or whether the ceiling is lower (due to the architecture). If the limit results in the DP rate being 1:4 (for example), then Ailuros's hypothetical GM200 wouldn't have 2.5+ DP TFLOPS, but closer to 1.25-1.5 DP TFLOPS, which is in the neighborhood of big Kepler.
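To put rough numbers on that, here is a minimal sketch assuming ~5 SP TFLOPS for the hypothetical GM200, chosen so that 1:2 yields the 2.5 DP TFLOPS figure from earlier in the thread (the SP figure itself is speculative):

```python
# Peak DP TFLOPS of the hypothetical GM200 under different DP:SP ratios.
# Assumes ~5 SP TFLOPS so that 1:2 yields the 2.5 DP TFLOPS figure above.
PEAK_SP_TFLOPS = 5.0

for ratio in (2, 4, 8, 32):
    print(f"1:{ratio:<2} -> {PEAK_SP_TFLOPS / ratio:.2f} DP TFLOPS")
```

This gives 2.50, 1.25, 0.62, and 0.16 DP TFLOPS for 1:2, 1:4, 1:8, and 1:32, matching the 1.25-1.5 "big Kepler neighborhood" for the 1:4 case.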
 
I was thinking more along the lines of can a Maxwell SMM hypothetically support 1:2 DP rate or is the ceiling lower (due to the architecture).
Maxwell's simpler ALU topology makes it easier to support higher DP ratios than Kepler, mostly by removing Kepler's extra FP32 ALUs which were frequently starved.

The important limitation in feeding the DP ALUs is register bandwidth. Kepler and Maxwell's operand collectors, while not identical, can both read three registers per clock. This is enough to feed 16 FMAD DP ALUs since their three arguments are 64 bit, two registers each. Maxwell and Kepler both have 4 schedulers and 4 operand collectors, so a Maxwell SMM could support the same number of DP units as a Kepler SMX, giving Maxwell a 1:2 DP rate. A less aggressive design (saving transistors) could drop that to a lower ratio like 1:4, or even down to the current GM204's 1:32.
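The register-bandwidth arithmetic in that argument can be written out explicitly. A minimal sketch using the figures given above; interpreting "three registers per clock" as warp-wide 32-bit register reads per operand collector is my assumption:

```python
# Register-bandwidth ceiling on DP FMA throughput per SMM, using the
# figures from the post above. Interpreting "three registers per clock"
# as warp-wide 32-bit register reads is an assumption.
WARP_REG_READS_PER_CLOCK = 3   # per operand collector
SCHEDULERS_PER_SM = 4          # Kepler SMX and Maxwell SMM alike
WARP_WIDTH = 32                # lanes per warp
FP32_LANES_PER_SMM = 128       # GM204-style SMM

# One warp-wide DP FMA needs 3 operands x 2 registers = 6 reads, so a
# collector sustains one 32-wide DP FMA every 2 clocks -> 16 lanes/clock.
reads_per_dp_fma = 3 * 2
dp_lanes_per_scheduler = WARP_WIDTH * WARP_REG_READS_PER_CLOCK // reads_per_dp_fma
dp_lanes_per_smm = dp_lanes_per_scheduler * SCHEDULERS_PER_SM

print(f"DP FMA lanes per scheduler: {dp_lanes_per_scheduler}")   # 16
print(f"DP FMA lanes per SMM:       {dp_lanes_per_smm}")         # 64
print(f"Implied DP:SP ratio:        1:{FP32_LANES_PER_SMM // dp_lanes_per_smm}")  # 1:2
```

So under these assumptions the operand-collector bandwidth alone would permit 64 DP lanes per SMM, i.e. a 1:2 rate, with 1:4 or 1:32 being deliberate cost savings rather than bandwidth limits.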
 
Do we know for sure that the maximum DP rate of Maxwell is 1:2? If it is lower for whatever reason then that could explain a lot. (There is this post but I can't find the doc either, the closest I got is this.)

GM107 and GM204 have 4 FP64 SPs/SMM. Some of us were assuming that GM200 would sport 64 FP64 SPs/SMM (like Kepler), which in theory would mean a 1:2 DP:SP ratio.
 
Damn, that's a crapload of ROPs, if not a fake... Would a GPU be able to utilize that many though with less than 320GB/s available to it? Caching framebuffer pixels in L2 would help, but would it be enough to make the chip efficient?

I believe the same question was asked about GM204 and we all know the results there. I think the key lies in the increased L2 amount relative to Kepler, as well as the delta color compression technology. Also, 320GB/s is a ton of bandwidth.
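For a back-of-the-envelope feel for the ROP question, here is a minimal sketch; the ROP count, clock, pixel format, and compression ratio below are illustrative assumptions, not leaked specs:

```python
# Back-of-the-envelope: raw ROP write demand vs. the 320 GB/s figure.
# ROP count, clock, and pixel format are illustrative assumptions only.
ROPS = 96                # a hypothetical "crapload of ROPs"
CORE_CLOCK_GHZ = 1.0     # assumed core clock
BYTES_PER_PIXEL = 4      # RGBA8, uncompressed
BANDWIDTH_GBPS = 320.0   # figure from the post above

raw_demand = ROPS * CORE_CLOCK_GHZ * BYTES_PER_PIXEL  # GB/s if every ROP writes each clock
print(f"Raw write demand: {raw_demand:.0f} GB/s vs. {BANDWIDTH_GBPS:.0f} GB/s available")

# Delta color compression stretches the budget; e.g. a ~1.3x average ratio:
print(f"Effective budget with ~1.3x compression: {BANDWIDTH_GBPS * 1.3:.0f} GB/s")
```

With those assumed numbers, raw demand (384 GB/s) slightly exceeds the bus, which is exactly where the larger L2 and color compression would make the difference.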
 
Good fillrate is something that NVidia has prioritised successfully many times in the past. I can't think of a major chip introduction where NVidia was stingy with fillrate and I can't think of a time when NVidia was not competitive because it didn't have enough of (the right kinds of) fillrate.
 