NVIDIA Maxwell Speculation Thread

I think we can confidently say that NVIDIA will have no choice but to abandon GDDR5 for the high-end gaming market. It's a dead end. For a start, it draws far too much power (tens of watts on current cards once GPU interfacing and memory power are all added up), and it can't remotely compete in terms of raw bandwidth. It's just a matter of time.

The more interesting question, to me, is whether AMD will fuck it up on the first go, R600 re-run style.


You're saying GDDR5 is a dead end for high-end cards because it sucks too much power?
I think power consumption is usually not the highest concern when they're building a high-end graphics card...
 
The only thing I could think of to escape the 96 ROPs would have been to decouple the ROPs from the MCs, but I severely doubt it and I don't even think they'll do it anytime soon.
Maybe you shouldn't be thinking that it has twice the ROPs compared to GK110, but 32 more compared to GM204?

GK104 --> GM204 = 100% more clusters (yes, smaller clusters, blah blah), same number of FP64 units, twice the ROPs, same number of TMUs, 3.54B --> 5.2B transistors
GK110 --> GM200 = 60% more clusters (again smaller clusters), 60% more FP64 units, twice the ROPs, 240 vs. 192 TMUs, 7.1B --> ~8.0B?

As I said, I'd be happy to stand corrected, since as a layman this kind of thing is pure and quite uneducated speculation, but it's also my understanding that TMUs are quite expensive units. Having 20% fewer in the latter hypothetical case isn't exactly something to sneeze at either.

I don't see them decoupling the ROPs and MCs either. I speculated that they could reduce the number of ROPs per MC, but that doesn't seem too feasible either. True, if you look at it that way, it has 50% more ROPs than GM204. But from what I understand, ROPs are heavily bottlenecked by memory bandwidth, so increasing the number of ROPs without increasing bandwidth will not always yield a performance benefit. So I wonder if the large increase in ROPs is justified.

Well, GM204 beats GK110 handily despite having only about half the TMUs. GM200 would have 50% more TMUs than GM204, so it should be plenty. And yes, given that it's 20% fewer than GK110, that's a decent saving in die area.
It's one thing to increase density when changing architecture, but I don't understand why a gm200 would have better transistor density than gm204. If they could push gm200 to Hawaii densities, why didn't they do so for gm204 as well?

Historically, transistor density has always increased with chip size. If we take Kepler, the densities for GK107, GK104 and GK110 were 11.02 M/mm2, 12.04 M/mm2 and 12.89 M/mm2. I suppose it's because the bigger chips have a higher proportion of ALUs and cache as a percentage of the overall die, and these are denser.

The figures for Maxwell are GM107 with 12.63 M/mm2 and GM204 with 13.06 M/mm2. If we apply the same scaling as GK110 vs GK104 for GM200 vs GM204, GM200's density should be ~13.98 M/mm2.

This would still be shy of Hawaii, btw, which has a density of 14.15 M/mm2. Just as a comparison, Tahiti was only 11.81 M/mm2, so you can see just how much AMD was able to increase their density.
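As a quick sanity check on the arithmetic above (purely the densities quoted in this post, nothing official), a few lines of Python reproduce the extrapolation:

```python
# Transistor densities in M transistors / mm^2, as quoted above.
gk104, gk110 = 12.04, 12.89
gm204, hawaii = 13.06, 14.15

# Apply the GK104 -> GK110 scaling factor to GM204.
scaling = gk110 / gk104
gm200_est = gm204 * scaling

print(f"GK104 -> GK110 scaling: {scaling:.3f}x")
print(f"Estimated GM200 density: {gm200_est:.2f} M/mm2")  # ~13.98
print(f"Shortfall vs Hawaii: {hawaii - gm200_est:.2f} M/mm2")
```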
For actual size: a gm204 is around 400mm2. For gm200, a 384-bit MC is a given, and 20 SMs wouldn't make a lot of sense, so let's say 24 SMs as well. That's +50% in both cases. The leaked cache size shows 3MB: another +50%. A naive, pure +50% scaling of gm204 gives a die size of 600mm2. That's the number you have to start with. If it's 560mm2, one way or another, they're going to have to find 40mm2 by scaling texture and ROP units by less than +50%, yet still add extra FP64 units as well?

I just don't see it.
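For what it's worth, the naive scaling argument above boils down to this (using the poster's round numbers, which are estimates, not official specs):

```python
# Naive +50% scaling of GM204, per the argument above.
gm204_area = 400.0   # mm^2, rough figure for GM204
scale = 1.5          # 24 vs 16 SMs, 384- vs 256-bit MC, 3 vs 2 MB L2

naive_gm200 = gm204_area * scale
rumoured = 560.0     # mm^2, the figure being questioned

print(f"Naive GM200 size: {naive_gm200:.0f} mm^2")                   # 600
print(f"Area to find elsewhere: {naive_gm200 - rumoured:.0f} mm^2")  # 40
```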

Let's remember that you can't simply scale by percentages, as the entire chip is not duplicated. Take GM107 vs GM204, for example. GM204 has an additional MC, 4x the ROPs, 3.2x the SMMs (where each SMM also has 96KB of shared memory versus 64KB on GM107), and some additional DX12 features, but is only 2.69x the die size.
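To put numbers on that, here is the per-unit scaling next to the die growth (the die sizes of roughly 148 and 398 mm^2 are approximate published figures, not from this thread):

```python
# Per-unit scaling from GM107 to GM204 vs overall die growth.
ratios = {
    "MCs (128 -> 256-bit)": 2.0,
    "ROPs (16 -> 64)": 64 / 16,
    "SMMs (5 -> 16)": 16 / 5,
}
die_ratio = 398 / 148  # ~2.69x, matching the figure in the post

for unit, r in ratios.items():
    print(f"{unit}: {r:.2f}x (die grew {die_ratio:.2f}x)")
```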
And then I believe they'll jump to Pascal (GP100) for that market (followed by Volta), rather than use a 16FF Maxwell refresh.
Pascal would probably be used on really high-end gaming products (Titan, Titan Z) but possibly not on a "GeForce GTX 1080 Ti" kind of card. I speculate that "regular high-end" gaming cards will still use GDDR5 rather than HBM (or whichever it is). If Pascal can only use stacked-memory dies and not GDDR5 (no idea; flexible memory controllers are theoretically possible), then you need a GPU such as GM200, possibly followed by a similar GM300.

(That "GM300" for a 16FF Maxwell is just a putative name; maybe it would really be a Pascal with fewer features, branded as a Pascal, giving you a situation like GK104 vs GK110. Or maybe that last aside is useless: if "GM300" comes out before Pascal, then it would be named "GM300", a 3rd-generation Maxwell.)

Yeah, I don't see a Maxwell refresh on 16FF either. From what I've heard, 16FF will be Pascal only. But given how long it took for GK210 to come out, I don't see another compute-only part for another ~2 years, or at least a year after the first Pascal chip is released.

Indeed, the type of memory each chip ends up using will be very interesting to see.
 
I think power consumption is usually not the highest concern when they're building a high-end graphics card...
NVIDIA's high-end GPUs have all stayed within 250 W (listed) TDP so I presume that power savings from HBM can be transferred to higher core clocks or more SMs.

I don't see them decoupling the ROPs and MCs either. I speculated that they could reduce the number of ROPs per MC, but that doesn't seem too feasible either. True, if you look at it that way, it has 50% more ROPs than GM204. But from what I understand, ROPs are heavily bottlenecked by memory bandwidth, so increasing the number of ROPs without increasing bandwidth will not always yield a performance benefit. So I wonder if the large increase in ROPs is justified.
If 64 ROPs are justified for GM204, then I don't see why 96 ROPs aren't justified for a chip with 50% more bandwidth.

Also, since I'm curious… if the GM206 has a 192-bit bus, do you think it will have 24 or 48 ROPs?
 
Historically, transistor density has always increased with chip size. If we take Kepler, the densities for GK107, GK104 and GK110 were 11.02 M/mm2, 12.04 M/mm2 and 12.89 M/mm2. I suppose it's because the bigger chips have a higher proportion of ALUs and cache as a percentage of the overall die, and these are denser.
Yes, density is determined by random logic vs cache vs analog blocks (PLLs, unique IOs, etc.). (And by the standard cell library and metal stackup, but let's assume those are constant within the same family.)

The analog part is why very small chips (like gm107) always have worse density than bigger ones. I say 'unique' IO because IOs that scale with functional units don't count. (So MC IOs don't count.)

When you look at gm107 vs 204, the MC doubles, the L2 cache doubles, the SMs triple, and the ROPs quadruple. But the register files also increase (more or less compensating for the L2 doubling?). The small density increase (compared to the equivalent gk step) is probably mostly due to the analog stuff.

With gk104 to gk110, the SMs less than double, the register files get bigger, and the L2 cache triples: there is a major shift, relatively speaking, from less dense random logic to denser RAM.

For our speculated gm200, everything simply goes up by 50%: SMs, cache, MC. And we're adding less dense FP64. There is a shift from dense to less dense! And since you're already dealing with a very large chip (both 204 and 200), the unique stuff is already small, so it won't have a big impact.

The figures for Maxwell are GM107 with 12.63 M/mm2 and GM204 with 13.06 M/mm2. If we apply the same scaling as GK110 vs GK104 for GM200 vs GM204, GM200's density should be ~13.98 M/mm2.
Even disregarding what I said earlier, notice that gm107 to 204 only increases by ~0.5 where it was ~1 for the equivalent Kepler step. So a naive conclusion for Maxwell would be +0.5 going from 204 to 200 as well.

None of this is surprising: as you get denser and more optimal, you'll run into a wall eventually.
 
NVIDIA's high-end GPUs have all stayed within 250 W (listed) TDP so I presume that power savings from HBM can be transferred to higher core clocks or more SMs.

If 64 ROPs are justified for GM204, then I don't see why 96 ROPs aren't justified for a chip with 50% more bandwidth.
Yes, that sounds right. And while 64 ROPs may sound a bit like overkill, don't forget the chip has the 4x16-pixel rasterizer (and enough shader export) to go along with it. As for the typically bandwidth-limited nature of ROPs, the improved framebuffer compression certainly helped too. Plus, some operations are very slow on nvidia's ROPs and never even close to being bandwidth-limited (fp32 rgba blend), so doubling up would help there too (instead of having more complex ROPs).

Also, since I'm curious… if the GM206 has a 192-bit bus, do you think it will have 24 or 48 ROPs?
A good question... I suspect it will have the same arrangement as the other GM2xx chips (hence 48). Assuming 2 GPCs, that would be more ROPs than really needed, but not too bad (certainly not as much overkill as some older chips). All things considered, that's probably a better tradeoff than having "not enough" ROPs (and more in line with nvidia's past chips).
 
You're saying GDDR5 is a dead end for high-end cards because it sucks too much power?
I think power consumption is usually not the highest concern when they're building a high-end graphics card...

Turbo functionality raises clocks and ramps power consumption to target values for a reason. They are power-limited in all but the most non-standard situations, so getting more done with less power means more performance.
The chips that go into high-end graphics cards are not designed to cater exclusively to niche cooling markets or standard-breaking power delivery systems. There isn't the market to support a chip like that.

This means that even the craziest high-end cards, the ones willing to ignore board standards or require non-standard setups, are going to get chips whose peak capabilities are primarily determined by what can be provided for the power-limited "mere mortals" portion of the market. Raising the ceiling of the more pedestrian tiers means a higher floor for the more extreme high end to start from.
 
Looks like what I and others here have been thinking is happening: no Maxwell HPC chip. So GM200 should indeed be a graphics-oriented chip...

http://wccftech.com/nvidia-planning...e-tesla-line-pascal-2016-volta-arriving-2017/

The reason behind not deploying Maxwell in Tesla Accelerators is said to be the lack of FP64 Floating Point Units and additional double precision hardware. And the reason behind not including DP FPUs in Maxwell might have to do with the extremely efficient design that NVIDIA was aiming for. This however means that NVIDIA’s upcoming Maxwell core, the GM200 which is the flagship core of the series might just remain the GeForce only offering unlike Kepler which was aimed for the HPC market first with the Titan supercomputer and launched a year later after the arrival of the initial Kepler cores as the GeForce GTX Titan.

Since the DP FP64 FPU hardware blocks will be removed from the top-tier cards that are rumored to arrive next year, they will include several more FP32 FPUs to make use of the remaining core space and that means better performance for the GeForce parts since games have little to do with Double precision performance.

This fits exactly my expectations....
 
Wccftech? Seriously? :runaway: As a close second, let's assume GM200 has just 4 FP64 SPs per SMM like all the other GM2xx cores. Now be bold enough and do the math on how many extra SMMs they could add by skipping something like 1440 FP64 SPs. One more?
 
Ailuros, read the article -.-
The source is Kenichi Hayashi (NVIDIA platform business headquarters Director), based on a presentation made by him at a conference. Not some rumour from Chiphell...
 
Whoa! I wonder if it has anything to do with the introduction of the GK210? I'm not sure why the GK210 would be introduced at all if it were going to be succeeded by GM200 around half a year later.

EDIT (to avoid double posting): In other news, WCCFTech cites DG Lee who claims the existence of three different GTX 960 variants.

[image: Nvidia-Geforce-GTX-960-GTX-960-Ti-GTX-960-Ti-Ultra.jpg]
 
Ailuros, read the article -.-
The source is Kenichi Hayashi (NVIDIA platform business headquarters Director), based on a presentation made by him at a conference. Not some rumour from Chiphell...

I've read it already; after that I'd be a fool to deny the possibility of a GM200 3D-only chip. However, the above still stands. What I still find hard to believe is wccftech rubbish like this:

Since the DP FP64 FPU hardware blocks will be removed from the top-tier cards that are rumored to arrive next year, they will include several more FP32 FPUs to make use of the remaining core space and that means better performance for the GeForce parts since games have little to do with Double precision performance.

Maybe it's just me, but FP32 SPs don't get randomly added into an architecture like Maxwell. They come in clusters, i.e. SMMs, and those clusters come with area-heavy units like TMUs.

imacmatician is right; the GK210 was the first sign of it, but I chose to ignore it. My gut feeling had told me a long time ago that they might go for mGPU solutions for HPC, but I was not expecting it so soon, or done so clumsily, to be honest. And to avoid misunderstandings, clumsy stands, among other things, for:

[image: k80-accelerator-performace.jpg]
 
Maybe it's just me, but FP32 SPs don't get randomly added into an architecture like Maxwell. They come in clusters, i.e. SMMs, and those clusters come with area-heavy units like TMUs.

Hey, GF100 and GF104 had different SM cluster sizes: 32 vs 48. Why can't they do that again? It would be in the opposite direction, sure, but it's not like nvidia is shy about making unexpected changes, quite the contrary (first 384-bit memory bus in G80, double-pumped SPs for three generations followed by their removal, an odd number of clusters in GK110, etc.). nVIDIA does whatever is needed to achieve their performance goals, so I would not be surprised to see quite a few changes in GM200 rather than just an upscaled GM204. It should support exactly the same features though.
 
If GM200 is indeed lacking most of the FP64 units, then surely the most likely scenario is that NV has just chosen to add more of the same SMMs we see in GM204, instead of changing the SMM itself.

I do have to say, though, that personally I'd be thrilled to see a big NV chip that's not also meant for Teslas etc. If NV thinks it's cost-effective to support two different big dies for different markets, then that's great news. I have my doubts though.
 
If GM200 is indeed lacking most of the FP64 units, then surely the most likely scenario is that NV has just chosen to add more of the same SMMs we see in GM204, instead of changing the SMM itself.

I do have to say, though, that personally I'd be thrilled to see a big NV chip that's not also meant for Teslas etc. If NV thinks it's cost-effective to support two different big dies for different markets, then that's great news. I have my doubts though.

The consensus seems to be that if it's indeed not made for FP64 support, they will not do one before the next architecture.

That said, it seems everything is based on the lack of information about GM200 in this Tesla presentation.

If the leap is not big enough, maybe they just chose not to talk about Maxwell in this presentation and to talk about the future ones instead.

Hardware-wise, these companies don't change their minds at the last minute, so whatever is planned has been planned for a long time.
 
If GM200 is indeed lacking most of the FP64 units, then surely the most likely scenario is that NV has just chosen to add more of the same SMMs we see in GM204, instead of changing the SMM itself.

I do have to say, though, that personally I'd be thrilled to see a big NV chip that's not also meant for Teslas etc. If NV thinks it's cost-effective to support two different big dies for different markets, then that's great news. I have my doubts though.

Yes. I was going by Ailuros' assertion that the removal of the FP64 units is not enough to add new SMs.
 
The consensus seems to be that if it's indeed not made for FP64 support, they will not do one before the next architecture.

That said, it seems everything is based on the lack of information about GM200 in this Tesla presentation.

If the leap is not big enough, maybe they just chose not to talk about Maxwell in this presentation and to talk about the future ones instead.

Hardware-wise, these companies don't change their minds at the last minute, so whatever is planned has been planned for a long time.

Yes, but the elephant in the room is GK210. Why do it if you had a great GM200 coming?

They might have been constrained by 28nm to the point where the increase in DP compute power would be less than what would be possible with an optimised Kepler core (GK210) in a dual-chip card. They could have compared the numbers in advance, and since Kepler is already a mature architecture, it could have made more sense, and been safer, to do that rather than push for a larger GM200 chip on a new architecture.
 
Hey, GF100 and GF104 had different SM cluster sizes: 32 vs 48. Why can't they do that again? It would be in the opposite direction, sure, but it's not like nvidia is shy about making unexpected changes, quite the contrary (first 384-bit memory bus in G80, double-pumped SPs for three generations followed by their removal, an odd number of clusters in GK110, etc.). nVIDIA does whatever is needed to achieve their performance goals, so I would not be surprised to see quite a few changes in GM200 rather than just an upscaled GM204. It should support exactly the same features though.


Because without any deeper architectural changes they'd move back to Kepler efficiency per cluster. How did they gain up to 40% more efficiency again, going from SMX to SMM?


Yes, but the elephant in the room is GK210. Why do it if you had a great GM200 coming?

They might have been constrained by 28nm to the point where the increase in DP compute power would be less than what would be possible with an optimised Kepler core (GK210) in a dual-chip card. They could have compared the numbers in advance, and since Kepler is already a mature architecture, it could have made more sense, and been safer, to do that rather than push for a larger GM200 chip on a new architecture.

My problem remains that although GK210 only gets 2.9 TFLOPs DP on paper, its real-world efficiency (K80 vs. K40, see NV's own graph above) leaves a lot to be desired, and it's not particularly convincing with its 300W TDP either. From a hypothetical GM200 you should get at least 2.5 TFLOPs at the usual 225W Tesla TDP.

Not that I care much, since as a consumer DP is worthless to me, but this hardly sounds like a slaughter of AMD and Intel.
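A rough perf-per-watt comparison using the figures from this post (the GM200 Tesla numbers are the hypothetical ones above, not a real product):

```python
# DP GFLOPs per watt, using the post's figures.
k80 = 2.9e3 / 300          # 2.9 TFLOPs DP at 300 W (K80)
gm200_hypo = 2.5e3 / 225   # hypothetical 2.5 TFLOPs DP at 225 W

print(f"K80: {k80:.1f} DP GFLOPs/W")                        # ~9.7
print(f"Hypothetical GM200: {gm200_hypo:.1f} DP GFLOPs/W")  # ~11.1
```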
 