NVIDIA Maxwell Speculation Thread

Makes sense. What if the difference between Tesla and commodity is Denver? If the area difference between 64 DP and 64 SP ALUs per SMX is small enough, there'd be no reason to castrate 64-bit performance. Aren't they at a competitive disadvantage to AMD as things currently stand (wrt DP)?

Kepler at the moment is:

GK10x = 192 FP32 SPs + 8 FP64 SPs (24:1)
GK110 = 192 FP32 SPs + 64 FP64 SPs (3:1)

Off the top of my head, for synthesis alone you need about 0.025mm2 under 28nm for each FP64 unit at 1GHz. At the 960 total FP64 SPs of a GK110 you're at ~24mm2. That's not the final die area, but it gives a rough idea how "big" those FP64 units really are in the end, or better, that the FP64 percentage even of a GK110 is quite small.
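As a back-of-the-envelope sketch of that estimate (the ~550mm2 GK110 die size is my assumption for scale, not from the post):

```python
# Per-unit synthesis area implied by the ~24mm2 total for 960 FP64 SPs.
fp64_units = 15 * 64          # 15 SMX * 64 FP64 SPs each
area_per_unit_mm2 = 0.025     # 28nm, 1GHz synthesis figure
fp64_total_mm2 = fp64_units * area_per_unit_mm2

gk110_die_mm2 = 550           # rough published figure, for scale only
print(f"FP64 units: ~{fp64_total_mm2:.0f}mm2, "
      f"~{fp64_total_mm2 / gk110_die_mm2:.1%} of the die")
```

Even on the biggest Kepler, the dedicated DP units are a single-digit percentage of the die.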

Why would they have a competitive disadvantage to AMD regarding DP?
 
Off the top of my head, for synthesis alone you need about 0.025mm2 under 28nm for each FP64 unit at 1GHz.

Yeah, so: remove 64 SP ALUs and add (64 - 8) DP ALUs, issue SP through DP, and the size of the SMX grows by a small amount. Mind you, I never understood why they didn't do that in the first place; maybe SP issue through DP wasn't finished/optimized in time for Kepler?

Why would they have a competitive disadvantage to AMD regarding DP?

1/24 vs 1/4 issue rate?
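For scale, that issue-rate gap works out roughly like this at the top of the current lineups (the GTX 680 and HD 7970 clocks/ALU counts are my figures for illustration, not from the post):

```python
# Peak throughput = ALUs * clock * 2 (an FMA counts as two flops),
# then divided by the DP issue ratio.
def peak_gflops(alus, clock_ghz):
    return alus * clock_ghz * 2

gtx680_sp = peak_gflops(1536, 1.006)   # Kepler GK104
hd7970_sp = peak_gflops(2048, 0.925)   # GCN Tahiti
print(f"GTX 680 DP: ~{gtx680_sp / 24:.0f} GFLOPs (1/24 rate)")
print(f"HD 7970 DP: ~{hd7970_sp / 4:.0f} GFLOPs (1/4 rate)")
```

The SP peaks are in the same ballpark, so the issue ratio alone puts the DP peaks roughly 7x apart.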
 
If that is truly the case, then I would be really surprised, because I didn't expect any heavily rearchitected Maxwell GPUs to be announced until GTC 2014 at the end of March. Note that each rearchitected CUDA core in the 750 Ti would need to be capable of much more work (at least 50% more) than a Kepler CUDA core for that GPU to be a worthy successor to the 650 Ti.

Who gives a rat's snout what the marketing name of the final product is? GM107 should be compared to GK107, because that is where the chip will fall in the hierarchy when the rest of the Maxwell family comes. If the leaked benches are true, it will end up 65-75% faster than GK107 on the same node. Accounting for TSMC's projection of a 30% performance improvement at the same power consumption when moving to 20nm, that puts GM107 at ~100+% faster than GK107 when it's all said and done.
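The compounding arithmetic behind that estimate, as a quick sketch:

```python
# 65-75% faster at 28nm, then a further ~30% from TSMC's projected
# 20nm gain at the same power, multiplied together.
low, high, node_gain = 1.65, 1.75, 1.30
lo_total = low * node_gain    # lands a bit above 2x GK107
hi_total = high * node_gain
print(f"combined: {lo_total:.2f}x to {hi_total:.2f}x GK107")
```

That is where the "~100+% faster" figure comes from: the two gains multiply rather than add.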
 
I just wonder if NVIDIA is interested in continuing their Titan product line (a.k.a. the desktop Tesla) in the Maxwell era.

It may hurt sales of the more expensive Tesla lines, but it would put Intel's MIC in a very unhappy position as well; tough choice, I guess.
 
Didn't Anandtech have a rumor that part of Maxwell was that the DP units could also be used for SP instructions?

Making a hybrid ALU that can compute both 32 and 64 bit IEEE FP math is quite possible.
Such shared designs save significant transistors compared to two independent dedicated units, but at the expense of extra power use to handle the switching between modes. GPUs are power constrained already, so hybrid ALUs are not an attractive design.
 
Yeah, so: remove 64 SP ALUs and add (64 - 8) DP ALUs, issue SP through DP, and the size of the SMX grows by a small amount. Mind you, I never understood why they didn't do that in the first place; maybe SP issue through DP wasn't finished/optimized in time for Kepler?

To save quite a bit of power for double precision.

1/24 vs 1/4 issue rate?
I honestly doubt any IHV gains or loses a worth-mentioning amount of sales over a few pissant GFLOPs of double precision on mainstream desktop GPUs.

Making a hybrid ALU that can compute both 32 and 64 bit IEEE FP math is quite possible.
Such shared designs save significant transistors compared to two independent dedicated units, but at the expense of extra power use to handle the switching between modes. GPUs are power constrained already, so hybrid ALUs are not an attractive design.

What do you mean, possible? NV used to have ALUs capable of both FP32 and FP64, and AMD still does. Scroll up and re-read what each FP64 unit roughly costs in die area; yes, it's far better to dedicate a few dozen mm2 more in order to save power. If you don't understand why dedicated units save power, you might want to have a look at the exact same reasoning in ULP SoCs.
 
What do you mean, possible? NV used to have ALUs capable of both FP32 and FP64, and AMD still does. Scroll up and re-read what each FP64 unit roughly costs in die area; yes, it's far better to dedicate a few dozen mm2 more in order to save power. If you don't understand why dedicated units save power, you might want to have a look at the exact same reasoning in ULP SoCs.
Why the aggressiveness? The way I read it, you completely agree with his statement...
 
It's still wild season and there's still not really any reliable information that I'd personally trust anywhere, which means it's the perfect time for me to do my traditional "make random guesses that turn out horribly wrong" post!

- 128 ALU/SMX, 8 TMU/SMX, Hierarchical RF & Scheduling
-- 4xDispatch 3xIssue (vs Kepler 8xDispatch 2xIssue) in NVIDIA Speak.
-- 64KB L1/Shared Memory (higher effective bandwidth / fewer dispatchers).
-- Advantages: Better locality for power efficiency, better GPGPU performance.
-- Disadvantages: 3xIssue efficiency but fundamentally synergistic with hierarchical RF
--> Overall only needs 2 MADDs to be co-issued with other port for everything else (potentially allows decoder savings rather than full duplication as well). Absolutely not a problem *IF* you have the register file throughput for it (which Hierarchical RF should allow in typical use-cases).

- Multiple parts on 28nm but full family will wait for 16nm FinFET.
-- Most chips except low-end will include 1+ Denver core to push developer adoption.
-- 20nm is not sufficiently cost efficient for some time and not a big power improvement.
-- 16nm obviously won't be either, but it will have a significant power advantage they can't miss.
--> Obviously the big question is whether Big Maxwell will be on 28nm, 20nm, or 16nm. Given the new Titan SKU I'm betting it'll be on 16nm but a bit earlier in the lifecycle of the node than GK110.
 
It's still wild season and there's still not really any reliable information that I'd personally trust anywhere, which means it's the perfect time for me to do my traditional "make random guesses that turn out horribly wrong" post!
Can I do one too?

GM108 / GM107
Maxwell architecture
28 nm HPM
128 CCs per SMX
3 / 5 SMX
64-bit / 128-bit memory interface
CUDA compute capability 3.7
No ARM cores

GTX 750 / GTX 750 Ti
GM107 with only 1 SMX disabled / GM107 with nothing disabled
Core clock in the range [950, 1050) / [1000, 1050)
Memory speed 5.0 Gbps / 5.4 Gbps
50 W / 60 W TDP
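Just for fun, turning the guessed numbers above into peak figures (pure speculation, same assumptions as the post: 5 SMX of 128 CCs at ~1.0-1.05 GHz, 128-bit bus at 5.4 Gbps):

```python
# Peak single-precision throughput: cores * clock * 2 (FMA = 2 flops).
cores = 5 * 128                               # 640 CUDA cores
for clk_ghz in (1.00, 1.05):
    tflops = cores * clk_ghz * 2 / 1000
    print(f"{clk_ghz:.2f} GHz -> {tflops:.2f} TFLOPs SP")

# Peak memory bandwidth: bus width in bytes * effective data rate.
bus_bytes = 128 / 8
print(f"bandwidth: {bus_bytes * 5.4:.1f} GB/s")
```

That would put the full chip around 1.3 TFLOPs SP on well under 90 GB/s of bandwidth, if these guesses hold.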
 
Who gives a rat's snout what the marketing name of the final product is? GM107 should be compared to GK107, because that is where the chip will fall in the hierarchy when the rest of the Maxwell family comes.

Umm, no. The codename of a GPU is irrelevant to a consumer. What matters is market positioning relative to current or near-future products. Like it or not, GM107 in the 750 Ti will be compared, and should be compared, to GK106 in the 650 Ti, because the implication here is that the former will replace the latter.
 
It's still wild season and there's still not really any reliable information that I'd personally trust anywhere, which means it's the perfect time for me to do my traditional "make random guesses that turn out horribly wrong" post!

- 128 ALU/SMX, 8 TMU/SMX, Hierarchical RF & Scheduling
-- 4xDispatch 3xIssue (vs Kepler 8xDispatch 2xIssue) in NVIDIA Speak.
-- 64KB L1/Shared Memory (higher effective bandwidth / fewer dispatchers).
-- Advantages: Better locality for power efficiency, better GPGPU performance.
-- Disadvantages: 3xIssue efficiency but fundamentally synergistic with hierarchical RF
--> Overall only needs 2 MADDs to be co-issued with other port for everything else (potentially allows decoder savings rather than full duplication as well). Absolutely not a problem *IF* you have the register file throughput for it (which Hierarchical RF should allow in typical use-cases).

- Multiple parts on 28nm but full family will wait for 16nm FinFET.
-- Most chips except low-end will include 1+ Denver core to push developer adoption.
-- 20nm is not sufficiently cost efficient for some time and not a big power improvement.
-- 16nm obviously won't be either, but it will have a significant power advantage they can't miss.
--> Obviously the big question is whether Big Maxwell will be on 28nm, 20nm, or 16nm. Given the new Titan SKU I'm betting it'll be on 16nm but a bit earlier in the lifecycle of the node than GK110.

Where's the Uttergram from hell? :runaway:
 
[image: yFmzSjx.jpg]


http://www.chinadiy.com.cn/html/48/n-13048.html
 
To save quite a bit of power for double precision ... If you don't understand why it saves power to have dedicated units, you might want to have a look at the exact same reasoning in ULP SoCs.

More than happy to read any paper you (or spworley [thanks for your kind explanation!] or anyone else, for that matter) want to point me at, but Dally's presentation had numbers for DP on 28nm for an unnamed NVIDIA product: 20pJ for a DP operation, 50pJ for register reads and 26pJ for local bus costs. I have no doubt that DP op costs far exceed SP op costs (an int op was quoted earlier at 0.5pJ against a theoretical 50pJ DP MAD), but I'm also assuming that SP memory-read and transportation costs scale linearly, which means the energy cost of reading and transporting the arguments remains higher than that of running the wider unit. Which isn't to say that the op costs are insignificant, but the whole argument of the presentation seems to indicate that op cost isn't the cost NVIDIA is focused on. It certainly did not leave me with the impression that the power cost of those units is as onerous as you are suggesting.

The point is further underlined by noting that in scaling from 40nm to 10nm, DP op costs are forecast to improve by a factor of 8, while transport is only expected to improve by a factor of 2. Maxwell was supposed to be a 20nm design; scaling benefits should be tipping in favor of optimizing for local access rather than ALU sizing.
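Putting the quoted numbers side by side (the 10nm line naively applies the full 40nm-to-10nm scaling factors to the 28nm figures, so treat it as directional only):

```python
# 28nm figures from Dally's presentation, as quoted above (picojoules).
op_pj = 20             # DP operation
movement_pj = 50 + 26  # register reads + local bus
print(f"op: {op_pj} pJ vs operand movement: {movement_pj} pJ "
      f"(~{movement_pj / op_pj:.1f}x the op)")

# Forecast 40nm -> 10nm: op energy improves ~8x, transport only ~2x,
# so the movement-to-op ratio grows by another ~4x.
print(f"projected movement-to-op ratio at 10nm: ~{movement_pj / op_pj * 8 / 2:.0f}x")
```

Moving the operands already costs nearly 4x the op itself at 28nm, and the gap only widens with scaling.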

I honestly doubt any IHV gains or loses a worth-mentioning amount of sales over a few pissant GFLOPs of double precision on mainstream desktop GPUs.

...today.
Today, you can still sell desktop GPUs. I'd argue we're moving into a world where the number of desktops that aren't workstations is minimal. Tablets are eating laptops and desktops ( http://www.computerworld.com/s/arti...ments_will_surpass_desktops_and_laptops_in_Q4 ). I agree with you, though, that the issue here is a business decision. I would argue that it is nearly 100% a business decision. Nvidia needs to maintain margin by delivering a tiered product set. The question I wonder about is whether dp op-rate is the right feature to focus on. The question nvidia should be asking is, how do they best preserve their workstation market. My argument is that they are vulnerable to competition at the workstation level that charges less, not that they are likely to lose desktop gpu sales based on dp rate (which, I agree, would be silly).

[Edit: and note, it appears that we are looking at 128-wide/640 sp alus, so I'm happy to take any links you want to offer and shut up in the Maxwell thread and wait for the next iteration :>]
 
Making a hybrid ALU that can compute both 32 and 64 bit IEEE FP math is quite possible.
Such shared designs save significant transistors compared to two independent dedicated units, but at the expense of extra power use to handle the switching between modes. GPUs are power constrained already, so hybrid ALUs are not an attractive design.
Unless you can power-gate those dedicated units, a hybrid design that uses fewer transistors is likely to be better.
 
They can be clock gated as well.
You can probably use fine-grained clock gating in hybrid designs, so coarse clock gating on top of that isn't going to buy much unless there's a lot of added area.

Since there are companies shipping both hybrid and separate units, we can be sure hybrid designs can be attractive and there's no clear winner.
 
http://videocardz.com/49557/exclusive-nvidia-maxwell-gm107-architecture-unveiled

[image: T33Q8h3.png]


GM107 has a TDP of 60W.
The GM107 will not even utilize the full power delivered by the PCI-E slot (75W). While operating at default frequencies it won't need any additional power source, although manufacturers will still add a power connector for the sake of stability or to increase the overclocking headroom.

Larger L2 cache.
This is the main difference between Kepler and Maxwell. The larger L2 cache will reduce the number of queries going out to memory. GM107's L2 cache is 2MB; GK107's is 256KB.
Workload balancing and compiler-based scheduling have been improved.
The number of instructions per clock cycle has been increased.
SM has been redesigned into four processing blocks (as explained above).
Maxwell introduces even faster H.264 encoding and decoding with improved NVENC (which is used, for instance, in ShadowPlay).
New GC5 power state (low sleep state).

As opposed to previous leaks, the die size of GM107 is even smaller: not 156 but 148mm2. Compared to GK107, the density of CUDA cores per mm2 has increased by roughly 30%, and transistor density by 15%. Remember, this is all on the same fabrication process.
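A quick sanity check on those density claims, using rough published GK107 figures (384 CUDA cores, ~1.3B transistors, ~118mm2 die) and a GM107 transistor count of ~1.87B; these reference numbers are my assumptions, not from the article:

```python
gk107_cores, gk107_mtrans, gk107_mm2 = 384, 1300, 118   # assumed GK107 figures
gm107_cores, gm107_mtrans, gm107_mm2 = 640, 1870, 148   # per the leak + assumed transistors

core_gain = (gm107_cores / gm107_mm2) / (gk107_cores / gk107_mm2) - 1
trans_gain = (gm107_mtrans / gm107_mm2) / (gk107_mtrans / gk107_mm2) - 1
print(f"CUDA cores/mm2: +{core_gain:.0%}, transistors/mm2: +{trans_gain:.0%}")
```

Both land close to the quoted ~30% and ~15%, which suggests the leak's figures are at least internally consistent.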
 