NVIDIA Maxwell Speculation Thread

At this stage it's just a theory, but I can't help thinking that they did not go for dedicated DP units in Maxwell; I'd love to stand corrected, but revamping clusters with smaller and more efficient datapaths and theoretically going for hybrid units (which burn more power) doesn't sound like enough to reach twice the perf/W.

I agree -- that's where I would place my bets as well. The numbers don't work favorably for dp anything. If we assume 40W for the GPU, we wind up with ~10W for dp, and for the alu section, a LOT less. Full-rate dp is 20pJ/op, which would be ~13W. Area-wise, we'd be looking at ~64mm^2 for 640 units, which also seems too large. From a business perspective, this is a scaled-up mobile design, and dp alus in mobile are pointless. I don't see room for full-rate dp at all.
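For anyone who wants to check the arithmetic in that post, here is a rough back-of-envelope sketch. Only the 20pJ/op, 640-unit, ~13W and ~64mm^2 figures come from the post. The clock is not stated, so the sketch derives the clock those numbers imply, assuming each unit retires one dp op per cycle (my assumption, not the poster's). The function names and the 1 GHz comparison point are purely illustrative.

```python
# Back-of-envelope check of the full-rate DP figures quoted in the post.
# From the post: 20 pJ/op, 640 units, ~13 W, ~64 mm^2.
# ASSUMPTION (not in the post): each unit retires one DP op per cycle.

ENERGY_PER_DP_OP_J = 20e-12   # 20 pJ per DP op (from the post)
DP_UNITS = 640                # unit count (from the post)
DP_POWER_W = 13.0             # quoted full-rate DP power (from the post)
DP_AREA_MM2 = 64.0            # quoted area for 640 units (from the post)


def implied_clock_hz(power_w, energy_per_op_j, units):
    """Clock at which `units` ALUs, one op per cycle each, draw `power_w`."""
    ops_per_second = power_w / energy_per_op_j
    return ops_per_second / units


def dp_power_w(clock_hz, energy_per_op_j, units):
    """Power drawn by `units` ALUs issuing one op per cycle at `clock_hz`."""
    return clock_hz * units * energy_per_op_j


clock = implied_clock_hz(DP_POWER_W, ENERGY_PER_DP_OP_J, DP_UNITS)
power_at_1ghz = dp_power_w(1e9, ENERGY_PER_DP_OP_J, DP_UNITS)  # hypothetical 1 GHz
area_per_unit = DP_AREA_MM2 / DP_UNITS

print(f"implied clock: {clock / 1e9:.2f} GHz")
print(f"power at a hypothetical 1 GHz: {power_at_1ghz:.1f} W")
print(f"area per unit: {area_per_unit:.2f} mm^2")
```

Under those assumptions the quoted ~13W corresponds to a clock of roughly 1 GHz, and the ~64mm^2 works out to about 0.1mm^2 per unit. If an FMA is counted as two ops instead of one, the implied clock halves.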

In theory, quarter-rate dp mul is pretty cheap on top of sp mads, but it's a pain to optimize for power using that kind of design for HPC (hence the dedicated units in Kepler, presumably). NV's two optimization problems are at the extremes, and they have two different market needs -- one would expect the designs in the middle borrow from one side or the other and don't have their own market-specific optimizations. Partial-rate dp seems unlikely from that perspective.

One is left to wonder what the marketing guy was smoking when they used a GK110 block diagram instead of a GK107 one. The Anand article is interesting as well, because I don't see how HPC is really a scaled mobile design given the different market needs. There's a tension between optimizing for mobile and hpc, and reusing work across the product line. Given they're stuck on 28nm for a while, I'd be surprised if they pushed aggressively on the reuse side at this point. Pleasantly surprised, but surprised nonetheless.
 
dnavas said:
because I don't see how HPC is really a scaled mobile design given the different market needs. There's a tension between optimizing for mobile and hpc, and reusing work across the product line.
Eh, not really.... Perf/Watt is your sun, moon, and stars in either case.
 
Eh, not really.... Perf/Watt is your sun, moon, and stars in either case.

Yes, but your measurement of perf is different in the two different scenarios. All of NV's slides up to this point were talking about dp-ops/w, which makes sense for hpc. The optimal dp rate on mobile is epsilon of zero....
Mobile cares very much about extremely low idle, hpc not so much. HPC could care less about better nvenc implementations, while mobile would benefit from a more complex, inclusive system for video taking and photo enhancement. Once you get rid of the low-hanging fruit, optimization becomes highly domain specific.

The stated low-hanging fruit was operation gathering, and presumably the register file has moved closer to the alus with the re-partitioning. But there has to be a path beyond the low-hanging fruit....
 
dnavas said:
The optimal dp rate on mobile is epsilon of zero....
This doesn't really impact architectural design.

dnavas said:
HPC could care less about better nvenc implementations,
It isn't as if this is particularly costly (especially when it isn't being used) and will only become less significant over time.

You seem to be focusing on things which are order(s) of magnitude beneath primary concerns.
 
This doesn't really impact architectural design.

It's rather critical for the part of architectural design under debate -- whether or not dp is implemented.

You seem to be focusing on things which are order(s) of magnitude beneath primary concerns.

Oh, certainly possible. You're right, from a power budget standpoint, nvenc is de minimis in hpc. From an implementation (worker resource) budget standpoint, better camera support is pretty important in phone design, and people hired to work on that are not working on (say) hpc-specific problems (maybe >64bit fp support, which would be a boon for EM/plasma physics, at least). That said, low idle is absolutely critical for mobile, and I wouldn't consider that to be a non-primary concern at all. In fact, it kind of demonstrates my point -- the items of primary concern in mobile are not the same as the items of primary concern in hpc.
 
dnavas said:
people hired to work on that are not working on (say) hpc-specific problems (maybe >64bit fp support, which would be a boon for EM/plasma physics, at least).
They wouldn't be anyway...

dnavas said:
It's rather critical for the part of architectural design under debate -- whether or not dp is implemented.
Well, we will have to agree to disagree on this one.

I will say I find your perception of the importance of DP in mobile applications to be rather short-sighted.
 
I think I'd go with Cayman. 40nm. A lot different. Initially planned for 32nm.


I'm really curious about all this talk of significant perf/watt improvement... Can't wait to see a review.
 
http://images.anandtech.com/doci/5699/GeForce_GTX_680_SM_Diagram_FINAL.png

Compared to Kepler, the L1 is no longer shared with the shared memory? There's a separate 64KB shared memory. The interconnect network & uniform cache aren't in the Maxwell SMM either.

The interconnect network is obviously there, just not shown.

The absence of uniform/constant cache and the decoupling of L1 and shared memory is interesting. Perhaps the biggest changes would be on the memory side.

I am disappointed at the meagre increase in shared memory. Maybe they'll surprise us with the L1 sizes.
 
Hmm, so instead of ditching the SFUs they actually increased the number of them :). (Or rather, given how things look on the diagrams, it would be more accurate to say that the number of "normal ALU cores" decreased compared to everything else.)
Is there something bad about having SFUs?
 
nvenc is most needed for GRID. The current GRID product (http://www.techpowerup.com/gpudb/1699/grid-k1.html) uses four GK107 chips. It stands to reason that GM107 will be used for this in the future and that this is the main reason for improving nvenc.

DDR4 support on GM107: would that be likely? It would give a useful bandwidth increase.
Then in 2015, when availability is better, DDR4 would go on low-end consumer cards.
 
Is there something bad about having SFUs?
Not inherently, it's just worth noting that AMD integrated the special function handling into the main ALUs back with VLIW4 (it was never fully separate with VLIW5 either, but clearly you could consider the 5th ALU lane there an SFU), whereas they apparently remain separate in NVIDIA's design. The reasons could be similar to why DP units aren't integrated, though maybe that has changed with Maxwell.
 
Not inherently, it's just worth noting that AMD integrated the special function handling into the main ALUs back with VLIW4 (it was never fully separate with VLIW5 either, but clearly you could consider the 5th ALU lane there an SFU), whereas they apparently remain separate in NVIDIA's design.
I always wondered why NVIDIA never added SFUs to their ALU count. The only reason I can think of is that they are very limited in function.

I also wonder if AMD is counting SFUs in their GCN lineup.
 