NVIDIA Maxwell Speculation Thread

By crippling the 970, they may have reduced the available connections for the crossbar(s), so that after a certain amount of RAM is consumed/addressed, the crossbar(s) don't have enough free connections to connect SMMs with L2, causing stalls/delays.
 
On the 12-SMM, 2MiB-L2 (per SiSoft), 256-bit GTX 980M you get a full 4/8GB memory pool. It's probably this additional deactivation of L2 cache on the GTX 970's GM204-200 that causes the problem. SiSoft, reading through CUDA, has shown from day one that there is only 1.8MB of L2 on the GTX 970.
This is more of a mystery to me: since L2 cache is tied to MCs (and ROPs), how can they disable L2 but not ROPs/MCs? It's true that the GTX 970 has always shown some odd results compared to the GTX 980 in some synthetic tests, which is likely related to that (check hardware.fr's fillrate results, for instance), but I still fail to understand how this actually works. I guess the nice diagrams need some more details to make sense there...
 
It now seems likely GM200's die is 600+ mm^2. That very large size gives room for extra pins, possibly enough to allow a 512-bit memory bus. NVidia hasn't used such a wide bus since GT200's similarly sized 572 mm^2 die. Fermi's and Kepler's largest chips were both smaller (529 and 561 mm^2 respectively) and only had room for 384-bit buses. A 512-bit bus would give big Maxwell a rather dramatic and welcome memory bandwidth boost.

A 512-bit wide bus would give big Maxwell memory capacities of 4, 8, or even 16 GB. A 384-bit bus gives 3, 6, or 12.

GT200 was using GDDR3, not GDDR5, but I doubt that change limited bus widths.

I never understood why bandwidth and memory capacity are linked to each other. Can somebody explain it to me?
 
I never understood why bandwidth and memory capacity are linked to each other. Can somebody explain it to me?

Bus width is tied to the number of memory chips, because a typical GDDR5 chip has a bus width of 32 bits. So a 512-bit bus would lead to 512/32 = 16 memory chips, while a 384-bit bus would lead to 12 chips.

Memory chips usually (always?) have a capacity that is a power of two, e.g. 256MB or 512MB. So if the number of chips is a power of two, you'll have a memory capacity that's a power of two as well. And since 32 = 2^5, it really depends on whether the bus width is a power of two.

Some GPUs support mixed capacities, e.g. with 12 chips (384-bit bus) you might have 8 256MB chips and 4 512MB chips, for a total of 8×256 + 4×512 = 4096MB or 4GiB, so it's not a hard rule. But memory controllers usually can't handle this optimally, so it may lead to reduced effective memory throughput.
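
To put the arithmetic above in one place, here's a minimal Python sketch, assuming 32-bit GDDR5 chips and power-of-two chip capacities (the 256MB/512MB/1GB options are illustrative, not an exhaustive list):

```python
# Sketch: how bus width constrains memory capacity, assuming each GDDR5
# chip presents a 32-bit interface and comes in power-of-two capacities.

CHIP_BUS_BITS = 32

def capacity_options(bus_bits, chip_mb_options=(256, 512, 1024)):
    """Uniform-capacity configurations for a given total bus width."""
    chips = bus_bits // CHIP_BUS_BITS
    return {chip_mb: chips * chip_mb for chip_mb in chip_mb_options}

print(capacity_options(512))  # 16 chips -> 4096/8192/16384 MB (4/8/16 GB)
print(capacity_options(384))  # 12 chips -> 3072/6144/12288 MB (3/6/12 GB)

# Mixed-capacity example from above: 12 chips on a 384-bit bus,
# 8x256MB + 4x512MB = 4096MB (4GiB), at the cost of uneven interleaving.
print(8 * 256 + 4 * 512)  # 4096
```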
 
@fellix
That explanation doesn't jibe with what has been said by NV officially. Also, the guy doesn't seem to be an NV employee, so his credentials are unknown to me - why should I trust what he says? Has he really any idea at all what he's talking about? ;)
 
AnandTech also explores the memory of the GTX 970.

I found the following note on memory bandwidth particularly interesting:

(2nd page) said:
This in turn is why the 224GB/sec memory bandwidth number for the GTX 970 is technically correct and yet still not entirely useful as we move past the memory controllers, as it is not possible to actually get that much bandwidth at once on the read side. GTX 970 can read the 3.5GB segment at 196GB/sec (7GHz * 7 ports * 32-bits), or it can read the 512MB segment at 28GB/sec, but not both at once; it is a true XOR situation. Furthermore because the 512MB segment cannot be read at the same time as the 3.5GB segment, reading this segment blocks accessing the 3.5GB segment for that cycle, further reducing the effective memory bandwidth of the card. The larger the percentage of the time the crossbar is reading the 512MB segment, the lower the effective memory bandwidth from the 3.5GB segment.
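
The quoted numbers follow from simple per-port math, and the XOR behavior can be modeled with one parameter. A back-of-envelope sketch (the fraction-of-time parameter is my own illustration, not an NVIDIA figure):

```python
# Model of the XOR read behavior described above: in a given cycle the
# crossbar reads either the 3.5GB segment (7 ports) or the 0.5GB segment
# (1 port), never both. 7 Gbps/pin * 32-bit port = 28 GB/s per port.

GB_PER_SEC_PER_PORT = 7e9 * 32 / 8 / 1e9  # = 28.0

def effective_read_bw(f_slow):
    """f_slow = fraction of cycles spent reading the 0.5GB segment."""
    fast = (1 - f_slow) * 7 * GB_PER_SEC_PER_PORT  # 3.5GB segment, 7 ports
    slow = f_slow * 1 * GB_PER_SEC_PER_PORT        # 0.5GB segment, 1 port
    return fast + slow

for f in (0.0, 0.1, 0.5, 1.0):
    print(f"{f:.0%} on 0.5GB segment -> {effective_read_bw(f):.0f} GB/s")
# Ranges from 196 GB/s (never touching the slow segment) down to 28 GB/s,
# so the advertised 224 GB/s is never reachable on the read side.
```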
 
http://www.pcper.com/reviews/Graphi...Full-Memory-Structure-and-Limitations-GTX-970

There's more on it:
The GTX 970 was advertised as having 64 ROPs and 2048KB of L2 (in the reviewer's guide, at least); in reality it has 56 ROPs and 1792KB of L2 (and of those 56 ROPs, only 52 are effectively in use due to limitations in the SMM department).
Now this makes sense (AnandTech's article is pretty good too). So L2 cache is still directly tied to ROPs, but instead of the 4x16 ROP arrangement that was believed initially, it is really still octo-ROP like previous generations (8x8). And of course the new ability to have only one L2/ROP partition active per 2x32-bit MC channel is pretty interesting.
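
The 980-vs-970 numbers fall out of simple slice arithmetic, assuming eight L2/ROP slices of 256KB and 8 ROPs each on a full GM204 (my reading of the articles, not an official floorplan):

```python
# GM204 slice arithmetic: eight L2/ROP slices, one disabled on GTX 970.
SLICES = 8
L2_KB_PER_SLICE = 256
ROPS_PER_SLICE = 8

print(SLICES * L2_KB_PER_SLICE, SLICES * ROPS_PER_SLICE)
# 2048 KB, 64 ROPs (GTX 980)
print((SLICES - 1) * L2_KB_PER_SLICE, (SLICES - 1) * ROPS_PER_SLICE)
# 1792 KB, 56 ROPs (GTX 970)
```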
 
So, does that mean the access to the smaller segment is essentially "uncached" (or partially cached)?

The AnandTech article states that the 7th crossbar port and its L2/ROP slice can access both memory controllers. That seems to indicate the ability to cache from the 8th channel.
The way accesses to the 8th channel block the rest of the crossbar is interesting.
Some kind of hash would be used to stripe addresses across the clients, but it's as if some portion of the pipeline is wedded to the assumption that the memory partitions are an all-or-nothing affair.

I wonder what would happen if the 7th channel were not included in the high-performance stride. If there's a routing concern at the partition level, perhaps the serialization penalty goes away if a strided access doesn't run into an ambiguous situation where a request needs to be interpreted with one hash function versus the other.
There would be a larger hit to capacity (an unintended consequence of being able to get away with a narrower bus is that yield measures become more expensive in capacity and bandwidth), and possibly the ROPs in that partition may not be flexible enough to coordinate with the main stride.
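
To illustrate the two-hash ambiguity being speculated about, here is a purely hypothetical sketch: the 3.5GB segment striped over 7 channels, and the last 0.5GB living entirely on the 8th. The stripe granularity and the modulo hash are illustrative assumptions, not anything NVIDIA has disclosed:

```python
# Hypothetical address-to-channel mapping with two striding regimes.
STRIPE_BYTES = 256                # assumed interleave granularity
FAST_SEGMENT = int(3.5 * 2**30)   # 3.5GB striped across channels 0..6

def channel_for(addr):
    if addr < FAST_SEGMENT:
        return (addr // STRIPE_BYTES) % 7  # hash for the 7-wide stride
    return 7                               # slow segment: channel 8 only

# Requests near the boundary fall into different regimes, which is the
# sort of ambiguity the post guesses the pipeline may resolve serially.
for addr in (0, 3 * STRIPE_BYTES, FAST_SEGMENT - 1, FAST_SEGMENT):
    print(hex(addr), "-> channel", channel_for(addr))
```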

It's curious as well because AMD's Tahiti added a secondary crossbar between ROPs and memory channels (gone with Hawaii, if I recall correctly), but it may be the case that it was easier because those did not tie so tightly to the L2, and the assumed 1:1 link between L2 and controller was not broken.
 
Looks like the disabling of an L2 partition is the reason for the split memory pools. That would mean the particular L2 partition in this case can't be shared between two channels with the short link interface at the same time, so there goes the XOR access switch.
 
So, what it boils down to is that NVidia engineering found a way to make a chip with a less-than-256-bit bus, after deactivation, that could still be advertised as a 256-bit chip. Pretty useful for the marketing department, eh? "Same bus width as the GTX 980."
 
So, what it boils down to is that NVidia engineering found a way to make a chip with a less-than-256-bit bus, after deactivation, that could still be advertised as a 256-bit chip. Pretty useful for the marketing department, eh? "Same bus width as the GTX 980."

3GB is straddling an awkward threshold for available memory capacity before resorting to bus transfers, so there is a decent benefit to keeping the last 25% of capacity, even if half of that remainder is significantly behind primary bandwidth.

It does seem like tiptoeing around the limits of the secondary link is a fair amount of hassle for what appears to be, of all things, a cache yield recovery measure. It might not have so much benefit if the bus widths hadn't been so narrow and kept the card right at the cusp of capacity issues.
A more flexible memory system may have been deferred for after Nvidia starts transitioning to newer memory types.
 
That AnandTech article was good and seems to have solved most questions.
But what I take away from this is that NVidia has a lot of options for releasing new products based on GM204 in the future.
For example a GTX970Ti and GTX960Ti...

GTX980 2048cores, 64rops, 256bit, 224GB/s, 4GB, $549
GTX970Ti 1792cores, 64rops, 256bit, 224GB/s, 4GB, $399-449
GTX970 1664cores, 56rops, 224+32bit, 196+28GB/s, 3.5+0.5GB, $329
GTX960Ti 1536cores, 56rops, 224bit, 196GB/s, 3.5GB, $249-279
 
Wonderful! They (NVidia) did such a marvel with the GTX 970's memory subsystem from a tech point of view! But what about the fact that they are damn liars (and persisted with the lies for months!), falsely advertising and misleading their own customers and even the IT press? Isn't that marvelous too?
 
That AnandTech article was good and seems to have solved most questions.
But what I take away from this is that NVidia has a lot of options for releasing new products based on GM204 in the future.
For example a GTX970Ti and GTX960Ti...

GTX980 2048cores, 64rops, 256bit, 224GB/s, 4GB, $549
GTX970Ti 1792cores, 64rops, 256bit, 224GB/s, 4GB, $399-449
GTX970 1664cores, 56rops, 224+32bit, 196+28GB/s, 3.5+0.5GB, $329
GTX960Ti 1536cores, 56rops, 224bit, 196GB/s, 3.5GB, $249-279
What about a chip where half the L2s are turned off?... Still 256-bit...

Why is L2 so prone to failure that a SKU arises with one slice broken? Surely L2 should be easy to keep working. E.g., with Bulldozer, a 6-core processor has the full 8MB of cache.

I wonder if this is NVidia's strategy to keep people/AIBs from overclocking 970 so that it exceeds 980? Hobble an L2 arbitrarily (it isn't actually broken) and the chips will always be slower than 980...
 