NVIDIA Maxwell Speculation Thread

Instead, L1 is just a local reordering buffer for both SHFL commands (shuffling, swizzling, and copying words within a warp) and memory reads (a 32-word line is pulled from L2, and different threads within the warp select the words they want within the line).


One minor correction: SHFL doesn't use any L1 space. If it did use L1, SHFL would have been implemented on Fermi (which has an L1, but no SHFL.)
 
Maxwell's L1 doesn't have the same design or function as Fermi's or Kepler's. In Maxwell, it's limited to transient reordering, one line at a time.
 

I understand Maxwell's L1 is quite different than Fermi or Kepler L1. I'm just pointing out that the SHFL functionality does not use the L1.
 
So Maxwell's L1's only function is reordering transient 32-word L2 cache reads? It's surprising that SHFL wouldn't use that same machinery, since SHFL gives nearly identical functionality: reordering 32 words.
 

I don't know exactly how Maxwell's L1 cache is intended to be used. SHFL uses the crossbar connecting the register file to shared memory to redistribute data. However, it doesn't require allocating space in either L1 or shared memory.
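
To make the "no scratch space" point concrete, here is a toy Python model of a warp-wide butterfly sum built only from lane-to-lane exchanges. It is a sketch of the semantics, not of the hardware: `shfl_xor` and `warp_reduce_sum` are hypothetical names chosen for illustration, and the model says nothing about how the crossbar is actually implemented.

```python
WARP_SIZE = 32  # lanes per warp

def shfl_xor(values, lane_mask):
    """Toy model of an XOR shuffle: lane i reads the value held by lane i ^ lane_mask."""
    return [values[lane ^ lane_mask] for lane in range(WARP_SIZE)]

def warp_reduce_sum(values):
    """Butterfly reduction: after log2(32) = 5 exchange steps every lane holds the sum.

    Only per-lane values are exchanged; no shared buffer is ever allocated.
    """
    assert len(values) == WARP_SIZE
    vals = list(values)
    offset = WARP_SIZE // 2
    while offset > 0:
        partner = shfl_xor(vals, offset)          # each lane fetches its partner's value
        vals = [a + b for a, b in zip(vals, partner)]
        offset //= 2
    return vals

lanes = list(range(WARP_SIZE))   # lane i starts with the value i
result = warp_reduce_sum(lanes)
print(result[0])                 # every lane ends up with the same total
```

Each step only needs every lane to read one value from one other lane, which is why a lane-to-lane exchange path is enough and no L1 or shared memory allocation is required in this model.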
 
After reading the reviews I bought one of these for my wife's/daughter's machine. They don't play highly stressful games like the latest FPS, more like online games and tower defence, and for my daughter Minecraft. I did have an ATI HD4850 in that machine, but recently it has been showing flickering triangles in games, which I took to mean the video memory was perhaps on the way out. I then looked up when this ATI card was released and it was 2008! So I have been sadly neglecting my family GPU-wise.

What struck me about the new NVIDIA card is the low power rating, plus it is nice and quiet. So I bought this:

http://www.scan.co.uk/products/2gb-...189mhz-boost-1268mhz-cores-640-dport-dvi-hdmi

That's a good increase on stock clocks for not much extra money. Interestingly, this comes with a 6-pin power connector as well; perhaps it needs the extra juice at that speed?

You know what I am about to say next: having bought it for something quiet with low power use, I decided to be an overclocker and see how far it would go. Even in Furmark the temps only hit 48C, so obviously this was not an issue with Boost 2, but the power target is stuck at 100%. I guess that is because most cards don't have the extra 6-pin power connector? So I couldn't get to 1400MHz stable and left it as it was. (It's not my machine after all.)

Having said that, if someone cracks the power limitation it could really fly; we will have to see what those folks manage to do.

It all bodes well for the bigger, faster, stronger Maxwell cards to come, I feel.

I was puzzled by NVIDIA releasing the slower card first, but maybe interest at the top end had peaked; as people have mentioned, a lot of folk and OEMs will be upgrading their systems with these, I feel. I know I did for a machine that was not my own.

Interesting year, I think. It's quite a comparison: here is NVIDIA doing a powerful, cool, quiet card whilst AMD does a powerful, hot and noisy card. In the past I seem to recall it being the other way around... :)
 
If L1 is a true rw cache, why wouldn't it hold the spilled variables?
Working backwards logically: we know Maxwell's L1 does not hold spilled variables. Therefore it is likely not a true rw cache.

I suspect that Maxwell's "L1" is really a misnomer and the transient reordering buffer (used for coalescing non-aligned global memory reads) is just called L1.
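
A toy Python model of what such a reordering buffer would have to do, assuming the 32-word line size mentioned earlier in the thread. The function names are made up for illustration, and this is only a sketch of the bookkeeping, not a claim about the actual hardware.

```python
WORDS_PER_LINE = 32  # assumption taken from the thread: one line = 32 words

def lines_touched(word_indices):
    """Return the sorted set of line IDs a warp's 32 word addresses fall into."""
    return sorted({idx // WORDS_PER_LINE for idx in word_indices})

def select_words(line_id, word_indices):
    """For one fetched line, map each lane that wants it to its word offset in the line."""
    return {
        lane: idx % WORDS_PER_LINE
        for lane, idx in enumerate(word_indices)
        if idx // WORDS_PER_LINE == line_id
    }

aligned = [lane for lane in range(32)]         # warp reads words 0..31
misaligned = [5 + lane for lane in range(32)]  # warp reads words 5..36

print(lines_touched(aligned))       # one line covers the whole warp
print(lines_touched(misaligned))    # the same access shifted by 5 words spans two lines
print(select_words(1, misaligned))  # lanes served by the second line, with their offsets
```

In this model the buffer only ever needs to hold the line currently being picked apart, which matches the "one line at a time" description and needs no persistent cache state.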
 
There would be no benefit to spilling (2x)128KB of RF into a (2x)12KB cache. So that doesn't mean it's not a true RW cache. But hopefully we'll know soon enough.
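
A back-of-the-envelope version of that size argument in Python. The 128KB and 12KB figures are the ones quoted above; the resident thread count is purely an assumption picked for illustration, as is the helper name.

```python
KB = 1024
WORD_BYTES = 4  # one 32-bit register or spilled word

def words_per_thread(capacity_bytes, resident_threads):
    """How many 32-bit words of a structure each resident thread gets, on average."""
    return capacity_bytes // (resident_threads * WORD_BYTES)

RF_BYTES = 128 * KB       # register file size quoted in the post
CACHE_BYTES = 12 * KB     # cache size quoted in the post
RESIDENT_THREADS = 1024   # assumption for illustration, not taken from the thread

rf_words = words_per_thread(RF_BYTES, RESIDENT_THREADS)
cache_words = words_per_thread(CACHE_BYTES, RESIDENT_THREADS)

print(rf_words, cache_words)              # registers vs. spill words per thread
print(round(RF_BYTES / CACHE_BYTES, 1))   # how many times larger the RF is
```

Under that assumed thread count, the cache would add only a few words of spill space per thread on top of the registers each thread already has, which is the gist of the "no benefit" argument.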
 
I was puzzled by NVIDIA releasing the slower card first, but maybe interest at the top end had peaked; as people have mentioned, a lot of folk and OEMs will be upgrading their systems with these, I feel. I know I did for a machine that was not my own.
I think they're targeting GM107/GM108 at the notebook refresh cycle where power efficiency is a huge advantage. We'll get GM200/204/206 in H2 which are more targeted at the desktop - I wonder what the timeframe will be for GM200 vs GM204 given process maturity (unless it's still on 28nm which seems unlikely at this point but interesting).
 
RW cache of what? Maxwell first generation L1 does not cache register spills, or local/stack data, or global memory, or shared memory, or instructions. What data is it caching? There's nothing left.

It's an odd fact I keep thinking over and can't resolve, and I keep digging through the clues NVIDIA has left in their documentation. My current feeling is that Maxwell (1st gen) really has no L1 cache, just the transient reorder buffer for coalescing L2 reads.
 
I must have missed the fact it doesn't cache global memory, sorry - do you have a source for that?
 
Neither Kepler nor Maxwell 1st Gen caches global memory in L1.

From the Kepler Tuning Guide:
1.4.4.2. L1 Cache
L1 caching in Kepler GPUs is reserved only for local memory accesses, such as register spills and stack data. Global loads are cached in L2 only (or in the Read-Only Data Cache).
Maxwell 1st Gen removes even the caching of local memory.

From the Maxwell Tuning Guide:
As with Kepler, global loads in first-generation Maxwell are cached in L2 only, unless using the LDG read-only data cache mechanism introduced in Kepler. Local loads also are cached in L2 only


Both GK110 (sm_35 Kepler) and Maxwell sm_50 allow using the texture cache as a manually hinted read-only cache, but this is not L1. (See section 1.4.4.3 in the Kepler tuning guide for info.)
 
[strike]Is that maybe just a confusion of the terminology?[/strike] I mean, OpenCL local memory is shared memory in CUDA, and CUDA local memory is basically memory private to a work-item, physically sitting either in registers or in global memory. So there is a mismatch. But it would hint at a more limited use of the L1 cache.

Edit: Oops, somehow I missed that Maxwell is supposed to cache local loads only in L2 as well. That would basically mean that the L1 is really just a read-only (texture) cache.
 
I was puzzled by NVIDIA releasing the slower card first, but maybe interest at the top end had peaked; as people have mentioned, a lot of folk and OEMs will be upgrading their systems with these, I feel. I know I did for a machine that was not my own.
I think the launch of GM107 @28nm makes a lot of sense. With rising wafer prices @20nm, it makes sense to launch the value card on an older process, gives you time to work out the kinks with the new architecture, and bigger cards with higher margins can launch on a newer process.
 
It's an odd fact I keep thinking over and can't resolve, and I keep digging through the clues NVIDIA has left in their documentation. My current feeling is that Maxwell (1st gen) really has no L1 cache, just the transient reorder buffer for coalescing L2 reads.

I don't know about the transient nature of L1, but moving all rw cachelines to L2 is interesting as it dramatically simplifies coherency at the cost of some latency. Latency hiding shouldn't be much of a problem with GPU scale threading, and having a big L2 means good hit rates anyway.

KNL and Maxwell seem to be converging quite a bit now, compared to KNC and Kepler.
 
If this is indeed the case, it could have some problems with applications with high reg pressure.

On the other hand you'll be able to spill a lot more to the L2 cache.

While that might introduce more latency, the number of potential active blocks has now doubled, meaning the opportunities for latency hiding have increased.

Of course if you've already eaten through your registers then this won't matter :)
 
Thanks for the links!

Is the L2 still per memory partition? If so, does that mean there's no caching even for local memory without potentially going to the other end of the chip? That seems slightly inefficient in terms of data locality given what Bill Dally has been talking about for Echelon. I wonder how they'll evolve the cache hierarchy in 2nd Generation Maxwell if at all...
 