NVIDIA Maxwell Speculation Thread

Fast DP would mean FP64 arithmetic runs at a quarter or half the speed of FP32 arithmetic, similar to Tesla, GTX Titan, the Radeon 7970 (one third), and the Hawaii-based FirePros. Slow DP would mean 1/32 of the FP32 rate, as on GM204 (GK104 runs at 1/24, most AMD consumer cards at 1/16 I believe, and the 290/290X at 1/8).

I'm of the opinion that it is a slow DP GPU, because we seem to know GM200 is not a Tesla GPU.
A GM200 Quadro that diverges from Tesla would be pretty inconsistent (why buy a Tesla then?).
A fast DP GM200 would then exist just for a new GTX Titan card, and that again sends mixed signals (the $1000 card better than the $5000 one?).
 
Fast DP would mean FP64 arithmetic runs at a quarter or half the speed of FP32 arithmetic, similar to Tesla, GTX Titan, the Radeon 7970 (one third), and the Hawaii-based FirePros. Slow DP would mean 1/32 of the FP32 rate, as on GM204 (GK104 runs at 1/24, most AMD consumer cards at 1/16 I believe, and the 290/290X at 1/8).

If my speculative math isn't completely off base, it should have as many FP64 SPs as GK104 has FP32 SPs. GM204 and GM107 have only 4 FP64 SPs per SMM.
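A quick sanity check of that count, purely under my own assumptions of 24 SMMs and 64 FP64 SPs per SMM (neither is confirmed):

```python
# Speculative sanity check: does a hypothetical 24-SMM GM200 with
# 64 FP64 SPs per SMM match GK104's FP32 SP count?
gm200_smms = 24            # assumed, not confirmed
fp64_per_smm = 64          # assumed, not confirmed
gk104_fp32_sps = 8 * 192   # GK104: 8 SMX x 192 FP32 SPs = 1536

print(gm200_smms * fp64_per_smm)  # 1536
print(gk104_fp32_sps)             # 1536
```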

I'm of the opinion that it is a slow DP GPU, because we seem to know GM200 is not a Tesla GPU.
A GM200 Quadro that diverges from Tesla would be pretty inconsistent (why buy a Tesla then?).
A fast DP GM200 would then exist just for a new GTX Titan card, and that again sends mixed signals (the $1000 card better than the $5000 one?).

NV has been using its biggest core for enthusiast GeForces, high-end Quadros and Tesla GPUs for several years/generations now. Why should they change that now? Quadros and Teslas are low volume/high margin. If revenue from the latter two alone could cover the R&D costs of the biggest core, they would have considered designing and selling dedicated cores only for the workstation and/or HPC markets by now.
 
NV has been using its biggest core for enthusiast GeForces, high-end Quadros and Tesla GPUs for several years/generations now. Why should they change that now? Quadros and Teslas are low volume/high margin. If revenue from the latter two alone could cover the R&D costs of the biggest core, they would have considered designing and selling dedicated cores only for the workstation and/or HPC markets by now.

Because this time they couldn't count on the benefits a node shrink usually brings, a fact that, together with GK110 already being close to the reticle size limit, limited their options. They may have had to choose between boosting DP compute or graphics-oriented functions. The DP compute boost might be reserved for a next iteration of Maxwell on FinFET.

To answer your question directly, it would not be the result of a strategic decision by nVIDIA but something "forced" by the circumstances. The later DP-compute-heavy chip on FinFET would still be used for graphics :)
 
I'd like to stand corrected, but an estimated 24 SMMs, even with 64 FP64 SPs/SMM, should fit into roughly 560mm2 on 28nm if they've pushed transistor density to Hawaii heights. Good luck waiting for TSMC to get 16FF yields under control even for next year.
 
If NVidia has decided to build pure compute chips, then there's no reason for fast double precision in gamer chips. Ever again.

If NVidia can keep within respectable power limits and reticle limits by building compute monsters that have no gaming functionality and make a profit, why not?

With gaming/compute chip introductions slowing to a snail's pace because of each new node's very high prices early in its life, it seems to make sense to build dedicated compute chips as a node matures. The cost of getting a chip with the gaming functionality deleted and other stuff printed a few extra times really shouldn't be excessive, compared with the cost of getting the gaming chip to market originally or with the profits that a compute chip should generate.
 
I'd like to stand corrected, but an estimated 24 SMMs, even with 64 FP64 SPs/SMM, should fit into roughly 560mm2 on 28nm if they've pushed transistor density to Hawaii heights. Good luck waiting for TSMC to get 16FF yields under control even for next year.

You might be right, I'm just speculating ;)
Well, if TSMC doesn't get its act together soon, nVIDIA might be in a world of hurt if AMD really goes for 20nm GloFo.
 
If NVidia has decided to build pure compute chips, then there's no reason for fast double precision in gamer chips. Ever again.
The big question is: have they? Previous "flagship" chips have been both gaming & compute chips.
 
You might be right, I'm just speculating ;)
Well, if TSMC doesn't get its act together soon, nVIDIA might be in a world of hurt if AMD really goes for 20nm GloFo.

Look at it this way: a single FP64 unit at 1GHz on 28nm is about 0.025mm2 for the synthesis alone. For 1536 of them you're at 38.4mm2, plus all the additional logic you'd need; let's say 50mm2 for the entire enchilada. How many clusters do you think they could add if they skipped, say, 45mm2 worth of extra FP64 units?
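Spelled out, that back-of-the-envelope area math looks like this (the 0.025mm2 figure is a synthesis estimate only, and the 50mm2 total is my own guess):

```python
# Rough 28nm area estimate for a big pool of dedicated FP64 units.
area_per_fp64_unit = 0.025   # mm^2 at ~1 GHz, synthesis estimate only
fp64_units = 1536            # hypothetical count from the post

raw_area = fp64_units * area_per_fp64_unit   # 38.4 mm^2 for the units alone
total_with_logic = 50.0                      # guess incl. additional logic

print(raw_area, total_with_logic)
```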

As for AMD, is there even anything besides an LP process available at GloFo for 20nm? I wouldn't be surprised if AMD moves to GloFo starting with 14nm for high-complexity chips, but for now it sounds rather questionable.
 
Rumor is that it's slow.

Any sources/links? I haven't come across anything about it so far.
I'd like to stand corrected, but an estimated 24 SMMs, even with 64 FP64 SPs/SMM, should fit into roughly 560mm2 on 28nm if they've pushed transistor density to Hawaii heights. Good luck waiting for TSMC to get 16FF yields under control even for next year.

+1, I was thinking along similar lines. My only doubts were the ROPs and MC. If they stick with the same architecture as GM204 (16 ROPs per 64-bit memory controller) and go for a 384-bit bus, it will have 96 ROPs, double that of GK110. Seems a bit excessive, and with die area so limited I wonder if they have gone this way.
The big question is: have they? Previous "flagship" chips have been both gaming & compute chips.

Well, GK210 is proof enough. If they taped out a >500 mm2 GPU just for the compute market, it seems the business is either large enough and/or the margins are high enough that it is worth doing.

Makes sense really, and if they follow that strategy then for future chips they can cut down a lot of the gaming-specific features and TMUs/ROPs and make a fully compute-oriented GPU, which would have much higher performance and/or efficiency than a chip designed for both markets.
 
The only thing I could think of to escape the 96 ROPs would have been to decouple the ROPs from the MC, but I severely doubt it and I don't even think they'll do it anytime soon.
Maybe you shouldn't be thinking of it as twice the ROPs compared to GK110, but as 32 more compared to GM204?

GK104 --> GM204 = 100% more clusters (yes, smaller clusters), same number of FP64 units, twice the ROPs, same number of TMUs, 3.54b --> 5.2b transistors
GK110 --> GM200 = 60% more clusters (again smaller clusters), 60% more FP64 units, twice the ROPs, 240 vs. 192 TMUs, 7.1b --> ~8.0b transistors?

As I said, I'd like to stand corrected, since as a layman this kind of stuff is pure and quite uneducated speculation on my part, but it's also my understanding that TMUs are quite expensive units. Having 20% fewer in the latter hypothetical case isn't exactly something to sneeze at either.
 
I'd like to stand corrected, but an estimated 24 SMMs, even with 64 FP64 SPs/SMM, should fit into roughly 560mm2 on 28nm if they've pushed transistor density to Hawaii heights. Good luck waiting for TSMC to get 16FF yields under control even for next year.
It's one thing to increase density when changing architecture, but I don't understand why a gm200 would have better transistor density than gm204. If they can push gm200 to Hawaii densities, why didn't they do so for gm204 as well?

For actual size: a gm204 is around 400mm2. For gm200, a 384-bit MC is a given, and 20 SMs wouldn't make a lot of sense, so let's say 24 SMs as well. That's +50% in both cases. The leaked cache size shows 3MB: another +50%. A naive pure +50% scaling of gm204 results in a die size of 600mm2. That's the number you have to start with. If it's 560mm2, one way or the other they're going to have to find 40mm2 by scaling the texture and ROP units by less than +50%, yet still add extra FP64 units as well?

I just don't see it.
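For what it's worth, the naive scaling argument above, written out with the same assumed numbers (~400mm2 gm204, +50% across the board):

```python
# Naive +50% scaling of gm204 to a hypothetical 24-SMM, 384-bit gm200.
gm204_die_mm2 = 400.0    # approximate gm204 die size
scale = 1.5              # 24/16 SMMs, 384/256-bit bus, 3/2 MB L2

naive_gm200 = gm204_die_mm2 * scale   # 600 mm^2
rumored = 560.0                       # the 560 mm^2 estimate in question

print(naive_gm200, naive_gm200 - rumored)   # 600.0, ~40 mm^2 to be found
```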
 
Well, GK210 is proof enough. If they taped out a >500 mm2 GPU just for the compute market, it seems the business is either large enough and/or the margins are high enough that it is worth doing.

And then I believe they'll jump to Pascal (GP100) for that market (followed by Volta), rather than use a 16FF Maxwell refresh.
Pascal would probably be used on really high-end gaming products (Titan, Titan Z) but possibly not on a "GeForce GTX 1080 Ti" kind of card. I speculate "regular high end" gaming cards will still use GDDR5 rather than HBM (or whichever it is). If Pascal can only use stacked-memory dies and not GDDR5 (no idea; flexible memory controllers are theoretically possible), then you need a GPU such as GM200, possibly followed by a similar GM300.

(That "GM300" for a 16FF Maxwell is just a putative name, maybe it'd be about a Pascal with less features and branded as a Pascal, giving you a situation like GK104 vs GK110. Maybe that last parenthesis of mine is useless : if "GM300" comes out before Pascal then it shall be named "GM300", "3rd generation Maxwell")
 
Assuming GM200 is meant for compute: GM204 runs circles around GK110 (even GM107 beats GK104) in Luxmark, BTC mining and so on; the only place GM204 does a bad job is double precision. If NV focused on getting 1/2-rate DP and kept 16 SMMs, that would theoretically be 2.3 TF/s DP, more than the 1.7 TF/s DP of GK110. If the efficiency of Maxwell is preserved, it would be twice as fast in compute benchmarks (float and double) compared to GK110.

GM200 chips with broken DP units could then still be sold as GTX 980-class parts to increase the margins; if it really came with a 384-bit bus, it could be a neat GTX 980 Ti.

How much bigger would GM204 get if they just added that DP focus (does someone know the Kepler unit sizes as a reference)? If it ended up smaller than GK110, then NV could keep the higher clocks (if they went for 220W like Tesla).
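The rough math behind those TF/s numbers, with clocks that are my own assumptions (~1.1 GHz for a GM204-style part, ~875 MHz boost for a GK110 Tesla):

```python
# Back-of-the-envelope throughput: TFLOPS = SPs x 2 (FMA) x clock / 1e12.
gm204_sps = 16 * 128        # 16 SMMs x 128 FP32 SPs = 2048
gm204_clock = 1.126e9       # ~1.1 GHz, assumed

gm204_fp32 = gm204_sps * 2 * gm204_clock / 1e12   # ~4.6 TF/s
gm204_dp_half_rate = gm204_fp32 / 2                # ~2.3 TF/s at 1/2 rate

gk110_sps = 15 * 192        # full GK110: 2880 FP32 SPs
gk110_clock = 0.875e9       # K40 boost clock, assumed
gk110_dp = gk110_sps * 2 * gk110_clock / 3 / 1e12  # ~1.7 TF/s at 1/3 rate

print(round(gm204_dp_half_rate, 1), round(gk110_dp, 1))  # 2.3 1.7
```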
 
Assuming GM200 is meant for compute: GM204 runs circles around GK110 (even GM107 beats GK104) in Luxmark, BTC mining and so on; the only place GM204 does a bad job is double precision. If NV focused on getting 1/2-rate DP and kept 16 SMMs, that would theoretically be 2.3 TF/s DP, more than the 1.7 TF/s DP of GK110. If the efficiency of Maxwell is preserved, it would be twice as fast in compute benchmarks (float and double) compared to GK110.

GM200 chips with broken DP units could then still be sold as GTX 980-class parts to increase the margins; if it really came with a 384-bit bus, it could be a neat GTX 980 Ti.

How much bigger would GM204 get if they just added that DP focus (does someone know the Kepler unit sizes as a reference)? If it ended up smaller than GK110, then NV could keep the higher clocks (if they went for 220W like Tesla).

https://forum.beyond3d.com/posts/1814236/

I'm pretty sure the synthesis die-area figure for the FP64 units is accurate; the rest is an educated guess on my part.
 
I speculate "regular high end" gaming cards will still use GDDR5 rather than HBM (or whichever it is).
I think we can confidently say that NVidia will have no choice but to abandon GDDR5 for the high-end gaming market. It's a dead end. It sucks up way too much power for a start (tens of watts on current cards once GPU interfacing and memory power are all added up), and it can't remotely compete in terms of raw bandwidth. It's just a matter of time.

The more interesting question, to me, is whether AMD will fuck it up on the first go, R600 re-run style.
 
I think we can confidently say that NVidia will have no choice but to abandon GDDR5 for the high-end gaming market. It's a dead end. It sucks up way too much power for a start (tens of watts on current cards once GPU interfacing and memory power are all added up), and it can't remotely compete in terms of raw bandwidth. It's just a matter of time.

The more interesting question, to me, is whether AMD will fuck it up on the first go, R600 re-run style.
R600 re-run style? R600 didn't bring a new memory type.
The X1950 already used GDDR4, and GDDR5 didn't come until RV7xx.
 
It was a new architecture (ring bus) and had vastly more bandwidth than was usable by the rest of the chip.
 