NVIDIA Maxwell Speculation Thread

I think they're targeting GM107/GM108 at the notebook refresh cycle where power efficiency is a huge advantage. We'll get GM200/204/206 in H2 which are more targeted at the desktop - I wonder what the timeframe will be for GM200 vs GM204 given process maturity (unless it's still on 28nm which seems unlikely at this point but interesting).

I think the launch of GM107 @28nm makes a lot of sense. With rising wafer prices @20nm, it makes sense to launch the value card on an older process, gives you time to work out the kinks with the new architecture, and bigger cards with higher margins can launch on a newer process.

What I've heard from a source is that there are no 20nm GPUs coming at all (from NV at least, no idea about AMD). Nvidia will skip 20nm (there is apparently only one 20nm process at TSMC, 20SoC) and go straight to 16nm (16FF). I was told that 16FF is an optical shrink of 20nm, so it will follow 20SoC rather quickly, sort of like the 65nm->55nm transition.

As far as reasons go, I can only speculate and add on to the points raised by both of you. Cost per transistor on 20nm is likely quite high at the moment and the process will probably not be mature enough for a GM200 sized chip in H2'14. In addition, I was told that there is virtually no performance increase with 20SoC vs 28HP. 16FF is supposed to bring a significant performance increase over 20SoC and hence would be a better process for GPUs.

So given how much more efficient Maxwell seems to be, I suppose GM200 on a mature 28HP process is probably a safer bet, and the architectural improvements along with the process improvements over the last 2 years will provide enough of a performance increase compared to Kepler.

Even assuming they could get a big 20nm GPU out by Q1'15, since 16FF follows so quickly, does it make sense to release 20nm chips and then quickly follow up with a supposedly much better 16nm process? So to maintain the release cadence and refresh their lineup this year, they release Maxwell on 28nm and then maybe have 16nm GPUs out by H2'15?
 
There's one FMA32-block less contending for register space.

Actually the extra two SIMD blocks nearly never worked in Kepler in the first place due to register bank constraints; that's why the FMA-32 compute output is only 2/3 of the "peak" on Kepler.

So in theory it's not hard for Maxwell to achieve comparable performance per SM with only 128 "CUDA cores". If all of its cores can actually work together, the register pressure should be comparable for a given compute output.
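To make the 2/3 figure concrete, here is a minimal sketch of the arithmetic. It assumes a Kepler SMX has 192 FP32 lanes organised as six blocks of 32, and takes the post's claim at face value that only four of those blocks are reliably kept busy. The constant names are made up for illustration.

```python
from fractions import Fraction

# Assumption: 32-wide SIMD blocks, six per Kepler SMX (192 lanes),
# four per Maxwell SMM (the 128 "CUDA cores" mentioned above).
LANES_PER_BLOCK = 32
KEPLER_BLOCKS = 6
KEPLER_BLOCKS_FED = 4   # the post's claim: the extra two rarely get work
MAXWELL_BLOCKS = 4

kepler_peak = KEPLER_BLOCKS * LANES_PER_BLOCK
kepler_sustained = KEPLER_BLOCKS_FED * LANES_PER_BLOCK
maxwell_peak = MAXWELL_BLOCKS * LANES_PER_BLOCK

# Fraction of the paper peak that Kepler would reach under this assumption.
kepler_fraction_of_peak = Fraction(kepler_sustained, kepler_peak)

print("Kepler SMX lanes (peak):     ", kepler_peak)
print("Kepler SMX lanes (sustained):", kepler_sustained)
print("Maxwell SMM lanes:           ", maxwell_peak)
print("Kepler sustained / peak:     ", kepler_fraction_of_peak)
```

Under these assumptions the sustained Kepler figure and the Maxwell figure come out the same per SM, which is the point being made. Real throughput depends on the workload.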
 
What I've heard from a source is that there are no 20nm GPU's coming at all (from NV at least, no idea about AMD). Nvidia will skip 20nm(there is only one 20nm process at TSMC apparently, 20SoC) and go straight to 16nm(16FF). I was told that 16FF is an optical shrink of 20nm, so it will follow 20SoC rather quickly, sort of like a 65nm->55nm transition.

As far as reasons go, I can only speculate and add on to the points raised by both of you. Cost per transistor on 20nm is likely quite high at the moment and the process will probably not be mature enough for a GM200 sized chip in H2'14. In addition I was also told that there is virtually no performance increase with 20SoC vs 28HP. 16FF is supposed to bring a significant performance increase over 20SoC and hence would be a better process for GPU's.

So given how much more efficient Maxwell seems to be, I suppose GM200 on a mature 28HP process is probably a safer bet and the architectural improvements along with the process improvements over the last 2 years will provide enough of a performance increase compared to Kepler.
Suppose the GM206, GM204, and GM200 will be on 28 nm. The GK107 to GM107 transition resulted in a die size increase of 25%, and I wouldn't be surprised if the GM206 and GM204 have somewhat similar size increases over GK106 and GK104. Since these aren't large chips I would expect that the size increase would not be very problematic—+25% on a GK104 would be around GF104-size. Given the above speculation, I think that GM206 and GM204 on 28 nm could work, at least in terms of prices (although if we see a 680-type flagship with Maxwell then GM204 on 20 nm wouldn't be a bad idea).

GK110 on the other hand is a big chip and there's little room to enlarge it. In practice that might be less of a problem than I thought since I expect a 28 nm GM200 to come out of the gate with all SMMs enabled (unlike with GF100 and GK110) and with clock speeds closer to those of the smaller chips. The downside would be the initial part based on GM200 probably has less room to (figuratively) grow compared to earlier chips (no SMMs to enable, etc.).

I'm also curious as to how the DP rate of Maxwell factors into this discussion. Kepler has 1/(3·2^n) DP rate and GM107 has 1/32 DP rate. If Maxwell is limited to 1/4 DP then GM200 would need 33% more SP FLOPS than GK110 just to reach DP parity, so even with architectural improvements the DP increase may not be large. If Maxwell can reach 1/2 DP then it's a different story.
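A quick check of the parity arithmetic, as a sketch only. It assumes DP throughput is simply SP throughput times the DP rate, with GK110 at 1/3, and the function name is made up for illustration.

```python
from fractions import Fraction

def sp_ratio_for_dp_parity(old_dp_rate, new_dp_rate):
    """SP FLOPS multiple the new chip needs to match the old chip's DP FLOPS.

    Assumes DP FLOPS = SP FLOPS * DP rate for both chips.
    """
    return Fraction(old_dp_rate) / Fraction(new_dp_rate)

GK110_DP_RATE = Fraction(1, 3)

# Hypothetical DP rates for a big Maxwell, as discussed above.
for maxwell_rate in (Fraction(1, 4), Fraction(1, 2)):
    ratio = sp_ratio_for_dp_parity(GK110_DP_RATE, maxwell_rate)
    change = float(ratio - 1)
    print(f"DP rate {maxwell_rate}: SP FLOPS needed vs GK110 = {ratio} ({change:+.0%})")
```

At 1/4 the required multiple is 4/3, which is the 33% figure. At 1/2 the chip could have fewer SP FLOPS than GK110 and still match it in DP.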

Even assuming they could get a big 20nm GPU out by Q1'15, since 16FF follows so quickly, does it make sense to release 20nm chips and quickly follow up with a supposedly much better 16nm process. So to maintain the release cadence and refresh their lineup this year, they release Maxwell on 28nm and then maybe have 16nm GPU's out by H2'15?
2015 H2 isn't too far away from Volta assuming a 2-year cadence (they didn't give any date for Volta on the NVIDIA roadmap slide). If so, then we might see some sort of "big Volta" in 2016 H2 after a year of 16FF maturing. Given that even a disabled version of GK110 took nearly a year after Tahiti to come out, a 20 nm GM200 could be released in 2015 H2 and last all of one year before Volta. That being said, the alternative is 28 nm for two more years. Also, big Volta could be a 2017 (or later) product.

EDIT: fixed math fail
 
Suppose the GM206, GM204, and GM200 will be on 28 nm. The GK107 to GM107 transition resulted in a die size increase of 25%, and I wouldn't be surprised if the GM206 and GM204 have somewhat similar size increases over GK106 and GK104. Since these aren't large chips I would expect that the size increase would not be very problematic—+25% on a GK104 would be around GF104-size. Given the above speculation, I think that GM206 and GM204 on 28 nm could work, at least in terms of prices (although if we see a 680-type flagship with Maxwell then GM204 on 20 nm wouldn't be a bad idea).
I think the problem with that idea is that the +25% area resulted in such a large performance increase only because GK107 was rather underspecced in ALUs compared to the rest of the chip. I'm not sure how big an SMX really was and how large an SMM now is, but if the SMX was around 20mm² and the SMM is around 15mm², 5 SMMs instead of 2 SMXs will only add ~35mm². And since GK107 wasn't memory bandwidth limited at all (well, not the GDDR5 version), this translated into a very large performance improvement.
But for other chips, the increase in ALUs has to be smaller (percentage-wise), as they weren't "too low" on ALU count. NVIDIA could probably build a 15 SMM GM204 on 28nm with that 25% die size increase, but I'm not sure it would really look all that impressive (even considering that Maxwell is more bandwidth efficient). I guess they've got some other improvements in store...
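The area estimate above can be sketched as follows. The per-SM areas are the guesses from the post, not measured values, and the 192 lanes per SMX is an assumption added here to show the ALU growth.

```python
# Guessed per-SM areas from the post (not measured values).
SMX_AREA_MM2 = 20
SMM_AREA_MM2 = 15

# Assumption: FP32 lanes per SM (192 for Kepler SMX, 128 for Maxwell SMM).
SMX_LANES = 192
SMM_LANES = 128

def added_area_mm2(n_smm, n_smx):
    """Extra die area from swapping n_smx SMXs for n_smm SMMs."""
    return n_smm * SMM_AREA_MM2 - n_smx * SMX_AREA_MM2

def alu_growth(n_smm, n_smx):
    """Relative growth in FP32 lane count for the same swap."""
    return (n_smm * SMM_LANES) / (n_smx * SMX_LANES) - 1

# GK107 (2 SMX) -> GM107 (5 SMM)
gm107_added = added_area_mm2(5, 2)
gm107_alu_growth = alu_growth(5, 2)

print(f"Added SM area:  ~{gm107_added} mm^2")
print(f"ALU count grows by {gm107_alu_growth:.0%}")
```

With these guesses, roughly 35 mm² buys about two thirds more ALUs. That is why the same relative area increase on a chip that already has plenty of ALUs should buy less.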
 
Is the L2 still per memory partition?
The whitepaper says this:

ROPs are still aligned with L2 cache slices and Memory Controllers. Internally, all the units and crossbar structures have been redesigned, data flows optimized, power management significantly improved, and so on.

Later on, it also says:
On-chip memory system bandwidth was increased along with improvements in efficiency of the design.
 
I'm also curious as to how the DP rate of Maxwell factors into this discussion. Kepler has 1/3^n DP rate and GM107 has 1/32 DP rate. If Maxwell is limited to 1/4 DP then GM200 would need 33% more SP FLOPS than GK110 just to reach DP parity, so even with architectural improvements the DP increase may not be large. If Maxwell can reach 1/2 DP then it's a different story.

Careful: only GK110 has a 1:3 ratio, while it's 1:24 on anything GK10x. On the former you have 64 FP64 SPs/SMX, while on all the latter chips you have 8 FP64 SPs/SMX.

Now on GM107 you have only 4 FP64 units per SMM instead of 8, but you also have 5 clusters compared to just 2 on GK107. GM107 has 20 FP64 units altogether, while GK107 has 16.

Only the top dog, i.e. GM200, would need a high number of FP64 units per SMM; whether that will again be 64 units/cluster or more is something we'll have to find out.
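The unit counts above can be tabulated with a short script. This is a sketch: the FP32 lane counts per SM (192 for Kepler, 128 for Maxwell) and the 15 SMX count for a full GK110 are assumptions not stated in this post.

```python
from fractions import Fraction

# sms: SM count; fp32/fp64: units per SM.
# Assumptions: 192 FP32 lanes per SMX, 128 per SMM, full GK110 = 15 SMX.
chips = {
    "GK107": {"sms": 2, "fp32": 192, "fp64": 8},
    "GM107": {"sms": 5, "fp32": 128, "fp64": 4},
    "GK110": {"sms": 15, "fp32": 192, "fp64": 64},
}

def dp_ratio(chip):
    """FP64:FP32 unit ratio per SM, as an exact fraction."""
    return Fraction(chip["fp64"], chip["fp32"])

def total_fp64(chip):
    """Total FP64 units on the whole chip."""
    return chip["sms"] * chip["fp64"]

for name, chip in chips.items():
    print(f"{name}: DP rate {dp_ratio(chip)}, {total_fp64(chip)} FP64 units total")
```

This reproduces the 1:24, 1:32 and 1:3 rates and the 16 vs 20 unit totals for GK107 and GM107.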

2015 H2 isn't too far away from Volta assuming a 2-year cadence (they didn't have any date for Volta on the NVIDIA roadmap slide). If so, then we might see some sort of "big Volta" in 2016 H2 after a year of 16 FF maturing. Given that even a disabled version of GK110 took nearly a year after Tahiti to come out, a 20 nm GM200 could be released in 2015 H2 and last all of one year before Volta. That being said, the alternative is 28 nm for two more years, also big Volta could be a 2017 (or later) product.

TSMC has 10FF projected for 2016 which might or might not change in the meantime ;)
 
This is TSMC, though. I can't remember the last time a process shrink was on time. If they claim 2015 for 16nm, I wouldn't expect it before 2017, to be quite frank, and that may be me being nice.
 
Actually the extra two SIMD blocks nearly never worked in Kepler in the first place due to register bank constraints, thats why the computing output of FMA-32 is only 2/3 of the "peak" on Kepler.

So in theory its not hard for Maxwell to achieve comparable performance per SM with only 128 "CUDA cores", if all their cores can actually work together, the reg pressure should be comparable for a given computing output.

"Nearly never" seems to be depending on the workload, but since they could be scheduled with work, their Warps counted toward register contention, I'd say.
 
This is TSMC I can't remember the last time a process shrink was on time though. If they claim 2015 for 16nm I wouldn't expect it before 2017 to be quite frank and that may be me being nice.

Wasn't the official saying that 20nm comes fscking long, and then 16nm quite rapidly after it?
Or they said that 28nm was really late, but 20nm was on schedule and thus shortly available after 28nm :p. I think it was the latter. Same pattern could be applied at each generation.

But joke aside, "16nm" is 20nm with FINFET stuff applied to a subset of the features.
Who knows, maybe the FinFET stuff is going okay (it's sort of a hack that makes some of the smaller stuff easier to do) but the 20nm part is holding it back.
 
"Nearly never" seems to be depending on the workload, but since they could be scheduled with work, their Warps counted toward register contention, I'd say.
As Warps are not tied to SIMD blocks but to schedulers (also in Kepler) it doesn't change register allocation at all, especially as the register file size and the maximum number of Warps per SMX/M didn't change. The additional ALUs of Kepler could only lead to a (most of the time) marginally faster execution, that's all.
 
do we have same bench under Kepler ?
Strange number, but ~68 shifts per SM and clock should be higher than Kepler, especially as all shifts are serially dependent (which could limit throughput).

Edit to clarify:
While according to nV's description one should be able to achieve peak throughput with serially dependent instructions, I would think it is better to test whether varying amounts of ILP affect throughput. The same applies to the number of thread blocks. The code uses the maximum size (1024), meaning just two blocks can run on one SMM. Without knowing more about the internals of Maxwell and how this works exactly with the four subunits, I would feel safer with more and smaller blocks (at least 4) before drawing definitive conclusions.
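To illustrate the block-size point, here is a rough sketch of how many blocks fit on one SMM for a few block sizes. It assumes a limit of 2048 resident threads per SMM (implied by "two blocks of 1024" above) and a limit of 32 resident blocks per SMM, and it ignores register and shared-memory limits entirely. The function name is made up for illustration.

```python
# Assumed per-SMM limits; register and shared-memory limits are ignored.
MAX_THREADS_PER_SMM = 2048   # implied by "two blocks of 1024" in the post
MAX_BLOCKS_PER_SMM = 32      # assumption for Maxwell-class parts
WARP_SIZE = 32
SUBUNITS_PER_SMM = 4         # the four subunits mentioned above

def resident_blocks(block_size):
    """Blocks that fit on one SMM, limited only by thread and block counts."""
    return min(MAX_THREADS_PER_SMM // block_size, MAX_BLOCKS_PER_SMM)

for block_size in (1024, 512, 256, 128):
    blocks = resident_blocks(block_size)
    warps = blocks * block_size // WARP_SIZE
    print(f"block size {block_size:4d}: {blocks:2d} blocks/SMM, "
          f"{warps} warps, {blocks / SUBUNITS_PER_SMM:.1f} blocks per subunit")
```

With 1024-thread blocks there are fewer blocks than subunits, so any imbalance in how a block's warps are spread over the subunits cannot be averaged out. Block sizes of 512 or less give at least one block per subunit in this simple model.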
 
Suppose the GM206, GM204, and GM200 will be on 28 nm. The GK107 to GM107 transition resulted in a die size increase of 25%, and I wouldn't be surprised if the GM206 and GM204 have somewhat similar size increases over GK106 and GK104. Since these aren't large chips I would expect that the size increase would not be very problematic—+25% on a GK104 would be around GF104-size. Given the above speculation, I think that GM206 and GM204 on 28 nm could work, at least in terms of prices (although if we see a 680-type flagship with Maxwell then GM204 on 20 nm wouldn't be a bad idea).

Yes, I would not be surprised to see a similar increase in die size for GM204 and GM206. Note that while there is a 25% increase in die size from GK107 to GM107, the transistor count increased by 44%. This has a lot to do with how fast GM107 is compared to GK107. Density has also increased substantially, from about 11M to 12.64M transistors per mm². This would partly be due to process maturity, but another big reason for the increase in density was the huge increase in L2 cache from 256 KB to 2 MB (an 8x increase). I do not think we will see a similar increase in cache size for all members of the Maxwell family, and hence density may be a tad lower, although more transistors could be spent on the rest of the GPU instead of cache. I would not be surprised if we see a 680-type flagship with GM204, but I think it will remain on 28nm. (See the end of my post for some info on why.)
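The density figures are consistent with the two growth percentages, as this small sketch shows. It uses only the numbers quoted above; the variable names are made up for illustration.

```python
# Figures quoted in the post.
AREA_GROWTH = 1.25          # +25% die size, GK107 -> GM107
TRANSISTOR_GROWTH = 1.44    # +44% transistors
GK107_DENSITY = 11.0        # million transistors per mm^2
GM107_DENSITY_QUOTED = 12.64

L2_OLD_KB = 256
L2_NEW_KB = 2 * 1024

# Density scales as transistors / area.
density_growth = TRANSISTOR_GROWTH / AREA_GROWTH
gm107_density_predicted = GK107_DENSITY * density_growth
l2_growth = L2_NEW_KB / L2_OLD_KB

print(f"Density growth:            {density_growth - 1:.1%}")
print(f"Predicted GM107 density:   {gm107_density_predicted:.2f} M/mm^2")
print(f"Quoted GM107 density:      {GM107_DENSITY_QUOTED:.2f} M/mm^2")
print(f"L2 cache growth:           {l2_growth:.0f}x")
```

The predicted and quoted densities differ only by rounding in the input percentages.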
GK110 on the other hand is a big chip and there's little room to enlarge it. In practice that might be less of a problem than I thought since I expect a 28 nm GM200 to come out of the gate with all SMMs enabled (unlike with GF100 and GK110) and with clock speeds closer to those of the smaller chips. The downside would be the initial part based on GM200 probably has less room to (figuratively) grow compared to earlier chips (no SMMs to enable, etc.).

Exactly. Titan Black has shown us how much progress they have made with the 28nm process, and a fully enabled GM200 on 28nm with high clocks could pack quite a punch. I don't think that is a downside as such; I'd say it's actually an upside.
2015 H2 isn't too far away from Volta assuming a 2-year cadence (they didn't have any date for Volta on the NVIDIA roadmap slide). If so, then we might see some sort of "big Volta" in 2016 H2 after a year of 16 FF maturing. Given that even a disabled version of GK110 took nearly a year after Tahiti to come out, a 20 nm GM200 could be released in 2015 H2 and last all of one year before Volta. That being said, the alternative is 28 nm for two more years, also big Volta could be a 2017 (or later) product.

Yes, H2'15 seems like a likely release date for Volta, though I wouldn't be surprised to see a Maxwell shrink on 16FF as a test chip before they go for an all-new architecture on a comparatively new process. Apart from the info I heard regarding no 20nm GPUs, a 20nm GM200 that late would seem unlikely IMO. And 2017 is too late for a big Volta; it should be out in 2016.
I think the problem with that idea is that the +25% area only resulted in a rather large increase because gk107 was rather underspecced in the ALU area compared to the rest of the chip. I'm not sure how big a SMX really was and how large a SMM now is but if the SMX was around 20mm² and the SMM around 15mm² 5 SMM instead of 2 SMX will only add ~35mm². And since gk107 wasn't memory bandwidth limited at all (well not the gddr5 version) this translated into a very large performance improvement.
But for other chips, the increase in ALUs has to be smaller (percentage-wise), as they weren't "too low" on ALU count. nvidia could probably build a 15 SMM GM204 on 28nm with that 25% die size increase but I'm not sure it would really look all that impressive (even considering that Maxwell is more bandwidth efficient). I guess they've got some other improvements in store...

Good points. And as I mentioned above, remember that the cache size went up substantially and also accounts for a significant part of the increase in transistor count/die size.

Well, given that from GK107 to GM107 performance went up by around 70-80% (as per the performance summary from Techpowerup), a GM204 with a similar increase in die size could bring a substantial improvement in performance as well. Maybe not 70-80%, but 40-50% should be possible, don't you think? And since it's on the same process, that seems impressive enough IMO.
This is TSMC I can't remember the last time a process shrink was on time though. If they claim 2015 for 16nm I wouldn't expect it before 2017 to be quite frank and that may be me being nice.

Agreed, but I think TSMC learnt their lesson with 40nm, and they haven't been as aggressive with 28nm and 20nm (especially marketing-wise). But remember that the last two optical shrinks, i.e. 80nm and 55nm, were either on time or ahead of schedule. So as 16FF is based on 20nm, I would expect that 16FF could be ready by next year. Risk production on 16FF started in Q4'13, and TSMC announced that volume production is ahead of schedule and would be pulled in from Q1'15 to Q4'14.

Source - http://www.electronicsweekly.com/ne...6nm-finfet-and-20nm-planar-processes-2013-12/
Wasn't the official saying that 20nm comes fscking long, and then 16nm quite rapidly after it?
Or they said that 28nm was really late, but 20nm was on schedule and thus shortly available after 28nm :p. I think it was the latter. Same pattern could be applied at each generation.

But joke aside, "16nm" is 20nm with FINFET stuff applied to a subset of the features.
Who knows, maybe the FinFET stuff is going okay (it's sort of a hack that allows to do some smaller stuff easier) but the 20nm is holding it back.

Yep, that is what I remember as well, and my source also said the same thing: 16nm will follow 20nm rather quickly.

With regards to your second point, see my reply to eastmen above.



I found some more good info on the 20/16nm transition in a Kaveri briefing by AMD product CTO Joe Macri. I'm quoting a few paragraphs from pg3 of the article and have highlighted the important parts in bold. Source - http://www.theregister.co.uk/2014/01/14/amd_unveils_kaveri_hsa_enabled_apu/?page=1
When is 28 nanometers faster than 22?

Kaveri is baked in a 28-nanometer, planar, bulk silicon process, which is nowhere near as efficient as state-of-the-art FinFET (what Intel calls "Tri-Gate") or even the less-than-TriGate, more-than-bulk – and somewhat expensive – silicon-on-insulator (SOI) process that was used in Kaveri's predecessor.

There were reasons to go with 28nm rather than 22nm, Macri told us, that were discovered during the design process. That process was run by what he identified as a "cross-functional team" composed of "CPU guys, graphics guys, mixed-signal folks, our process team, the backend, layout team."

That cross-functional crew identified a boatload of process variants, and members of the team each ran tests based on their areas of interest, examining such factors as power curves and die-area needs.

"What we found was with the CPU with planar transistors, when we went from 28 to 22, we actually started to slow down," he said, "because the pitch of the transistor had to become much finer, and basically we couldn't get as much oomph through the transistor."

The problem, he said, was that "our IDsat was unpleasant" at 22nm, referring to gate drain saturation current*. In addition, the chip's metal system needed to be scaled down to fit within the 22nm process, which increased resistance.

"So what we saw was the frequency just fall off the cliff," he said. "This is why it's so important to get to FinFET."
This again substantiates some of the points I made in my earlier post regarding 20SoC and 16FF with respect to GPUs. Maybe some of our more technically qualified members could elaborate further and add some more information.
 
What I've heard from a source is that there are no 20nm GPU's coming at all (from NV at least, no idea about AMD). Nvidia will skip 20nm(there is only one 20nm process at TSMC apparently, 20SoC) and go straight to 16nm(16FF). I was told that 16FF is an optical shrink of 20nm, so it will follow 20SoC rather quickly, sort of like a 65nm->55nm transition.

As far as reasons go, I can only speculate and add on to the points raised by both of you. Cost per transistor on 20nm is likely quite high at the moment and the process will probably not be mature enough for a GM200 sized chip in H2'14. In addition I was also told that there is virtually no performance increase with 20SoC vs 28HP. 16FF is supposed to bring a significant performance increase over 20SoC and hence would be a better process for GPU's.

So given how much more efficient Maxwell seems to be, I suppose GM200 on a mature 28HP process is probably a safer bet and the architectural improvements along with the process improvements over the last 2 years will provide enough of a performance increase compared to Kepler.

Even assuming they could get a big 20nm GPU out by Q1'15, since 16FF follows so quickly, does it make sense to release 20nm chips and quickly follow up with a supposedly much better 16nm process. So to maintain the release cadence and refresh their lineup this year, they release Maxwell on 28nm and then maybe have 16nm GPU's out by H2'15?

So, is something like this "GTX 765 Ti" possible, with a chip size similar to the GTX 760 (296 mm² (148x2) vs 294 mm²) and around 75% of the power use?

If I am not wrong, I have read that the L2 cache is always 2 MB. What I do not know is whether 2 memory controllers take more or less die area than 2 MB of cache.

Power use is based on average/peak/maximum from here
http://www.techpowerup.com/reviews/NVIDIA/GeForce_GTX_750_Ti/23.html.

I count a 4Gb 750 Ti, but I think there will be a loss of efficiency in a bigger chip.

[Attached image: 14022612552516268.jpg]

I am just an aficionado ;)
 
Tridam's excellent GTX 750 Ti review is finally ready: http://www.hardware.fr/articles/916-1/nvidia-geforce-gtx-750-ti-gtx-750-maxwell-fait-ses-debuts.html
Interestingly, the fillrate test there does not mirror the extremely high bandwidth efficiency seen in the 3DMark fillrate test (as run by AnandTech). (Though Tridam's conclusions there are wrong, as he wrongly assumed the number of ROPs was doubled. FP32 blending especially is definitely just very slow, completely ROP bound and not bandwidth limited.)
 