NVIDIA Maxwell Speculation Thread

http://research.nvidia.com/sites/default/files/publications/IEEE_Micro_2011.pdf

It's only a research architecture, but LOTS of good ideas in there. Fingers crossed.
Variable vector length (minimum is 1), so it can indeed be configured as MIMD (and the moniker SIMT may start to make sense after all ;)).

And they do static scheduling by the compiler into "LIW" instruction groups (2 arithmetic instructions + 1 L/S). SP is twice as fast as DP (each arithmetic instruction slot can operate on two SP values, it's basically a mini 64bit SIMD).
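To make the grouping concrete, here's roughly how I picture one such instruction group; this is purely my own sketch (names and field layout invented), not anything lifted from the paper:

Code:
/* Rough model of one Echelon-style LIW group as I read the paper: two
   arithmetic slots plus one load/store slot, each arithmetic slot working
   on 64-bit registers that hold either one DP value or two packed SP
   values. All names and fields here are my own invention. */

typedef union {
    double dp;       /* one 64-bit DP operand ...            */
    float  sp[2];    /* ... or two packed 32-bit SP operands */
} reg64_t;

typedef struct {
    struct { int opcode; int dst, src0, src1, src2; } alu[2];    /* 2 arithmetic slots (e.g. FMA) */
    struct { int opcode; int data_reg, addr_reg, offset; } ldst; /* 1 load/store slot             */
} liw_group_t;

With SP packed two to a register like that, the 2:1 SP:DP rate falls out basically for free.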
 
As AMD abandons (V)LIW, NVIDIA embraces it; I always said it was the way to go for MIMD.
 
It's only a research architecture, but LOTS of good ideas in there. Fingers crossed.
It seems noteworthy how many of these ideas bear some resemblance to the PowerVR SGX/Rogue shader core. While there are clear differences and some things (once again) being proposed like the configurable cache hierarchy are truly novel and exciting, I don't think IMG could be asking for a better validation of their design choices than this. Anyway, back on subject...

Variable vector length (minimum is 1), so it can indeed be configured as MIMD (and the moniker SIMT may start to make sense after all ;)).
Obviously I've been hoping for MIMD shader cores to go mainstream (partly for selfish reasons) for a very long time so there's not much I can say here but a simple "yay" :)

Their basic argument for MIMD seems to be that while it may increase area, it won't hurt power efficiency by anywhere near as much if you're clever about it, so if you're power-limited anyway the greater programming flexibility and performance is very much worth the cost.

That certainly makes a lot of sense although it's interesting that they hardly mention leakage at all in their entire paper (only as a limiter to Vdd scaling). Yes, you can save wire energy and improve locality by coalescing on a MIMD architecture, but you've still got some extra leakage to contend with. Hmm... I know some companies are (over?)optimistic over extremely fine-grained power gating, but I've never seen any indication that NVIDIA is one of them.

And they do static scheduling by the compiler into "LIW" instruction groups (2 arithmetic instructions + 1 L/S). SP is twice as fast as DP (each arithmetic instruction slot can operate on two SP values, it's basically a mini 64bit SIMD).
Do you have any idea how the 'mini 64bit SIMD' would work? Surely it's not effectively true Vec2 for SP?!
 
Do you have any idea how the 'mini 64bit SIMD' would work? Surely it's not effectively true Vec2 for SP?!
I don't get the question. Each of the two arithmetic LIW slots is Vec2 for SP. It should work like the obsolete 3DNow! in AMD CPUs. It will probably have the same problem getting the 32-bit components into one 64-bit location as CPUs have with their vector extensions. So you need to vectorize a bit (or you need a clever compiler for more than some simple cases); the instruction packing into the LIWs for SP is less flexible than in AMD's VLIW architectures, for instance. I guess it is done to reduce the (power) overhead that would be necessary to individually address 16 operands of 32 bits in the register files (which is what AMD's VLIW architectures are doing). In Einstein/Echelon this will be reduced to 8 operands of 64 bits per clock cycle.
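To illustrate the vectorization point in today's CUDA terms (just an analogy on current hardware, nothing Echelon-specific): if the two SP values already sit together as a float2, one vec2 slot's worth of work falls out naturally, whereas two unrelated scalars would first need extra packing moves.

Code:
// Illustrative analogy only: data already laid out as float2 maps directly
// onto a "two SP values per 64-bit slot" operation; two unrelated scalars
// would first have to be packed together (extra instructions).
__device__ float2 packed_fma(float2 a, float2 b, float2 c)
{
    // one 64-bit SP slot's worth of work: two FMAs on the packed halves
    return make_float2(fmaf(a.x, b.x, c.x), fmaf(a.y, b.y, c.y));
}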
 
I don't get the question. Each of the two arithmetic LIW slots is Vec2 for SP. It should work like the obsolete 3DNow! in AMD CPUs.
That was also my initial assumption, but I can't seriously believe that's what NVIDIA is going to do here given how important FP32 will remain for graphics going forward. This looks like the key sentence to me: "Second, within a lane, threads can execute independently, or coalesced threads can execute in a SIMT fashion with instructions from different threads sharing the issue slots".

So my guess is they must have found a way to do 4 scalar FP32 instructions per clock per shader when using a warp size of 2 or more (at least in certain cases). This is not a bad solution at all, especially if they continue focusing on quads for pixel shaders. I'm really not sure how the whole "instructions from different threads sharing the issue slots" thing would work in practice though...
 
So my guess is they must have found a way to do 4 scalar FP32 instructions per clock per shader when using a warp size of 2 or more (at least in certain cases). This is not a bad solution at all, especially if they continue focusing on quads for pixel shaders. I'm really not sure how the whole "instructions from different threads sharing the issue slots" thing would work in practice though...
It could be as simple as the compiler packing 2 or 4 work items ("threads") into one lane (one needs some kind of "slot masking" in that case). If it were 4, this could be the traditional quad. The effective warp size would be 4 for SP in this case. Doing it dynamically at runtime would defeat the purpose of the whole LIW decision.
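Just to spell out what I mean by "slot masking", something along these lines (a pure sketch; the paper doesn't describe such a mechanism):

Code:
/* Sketch of "slot masking": a pixel quad packed into one lane, four SP FMAs
   per instruction group (2 arithmetic slots x 2 SP values each), with a
   4-bit mask that suppresses the writeback for inactive/helper pixels.
   Purely illustrative, not something taken from the paper. */
void quad_fma(float dst[4], const float a[4], const float b[4],
              const float c[4], unsigned active_mask /* bits 0..3 */)
{
    for (int pixel = 0; pixel < 4; ++pixel) {
        /* pixels 0/1 share arithmetic slot 0, pixels 2/3 share slot 1 */
        float result = a[pixel] * b[pixel] + c[pixel];
        if (active_mask & (1u << pixel))   /* masked pixels keep their old value */
            dst[pixel] = result;
    }
}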
 
Gipsel - given that you can run 4 SP FMAs per clock, running a quad per thread makes perfect sense to me. But it's also not clear to me that the exact same lane configuration would get used for graphics SKUs. Maybe for HPC, SP is much less important than DP.
 
Where does this lead us in terms of software and the 3D pipeline? Lots of parallelism + short vectors + MIMD seems like a good time to seriously rethink the rasterization process. Somehow I doubt graphics APIs will change dramatically in the next five years though.
 
Lots of parallelism + short vectors + MIMD (+ cache) = ray tracing ;)

What about the mess of incoherent memory access? Ray tracing never really had lots of problems with wide vectors or SIMD compared to its insatiable need for cache size.
 
It seems noteworthy how many of these ideas bear some resemblance to the PowerVR SGX/Rogue shader core. While there are clear differences and some things (once again) being proposed like the configurable cache hierarchy are truly novel and exciting, I don't think IMG could be asking for a better validation of their design choices than this. Anyway, back on subject...
Speaking of validating competitor's design choices, I think the most important change is the possibility to configure an SRAM pool as registers, cache or scratchpad. For all of Larrabee's flaws, it got this bit exactly right, IMO.
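For reference, Fermi already gives you a very coarse version of this today: the 64KB of SRAM per SM can be split between shared memory and L1 on a per-kernel basis. Echelon would just push the idea much further (registers included, finer-grained allocation). Something like:

Code:
// Fermi's existing, coarse take on a configurable SRAM pool: the 64KB per SM
// can be split 48KB shared / 16KB L1 or 16KB shared / 48KB L1, per kernel.
// The kernel below is just a placeholder to have something to configure.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void pointer_chase(const int *next, int *out, int steps)
{
    int i = threadIdx.x;
    for (int s = 0; s < steps; ++s)
        i = next[i];               // irregular loads: the case that wants more L1
    out[threadIdx.x] = i;
}

int main()
{
    // Ask for the L1-heavy split for this kernel (the default split favours shared memory).
    cudaError_t err = cudaFuncSetCacheConfig(pointer_chase, cudaFuncCachePreferL1);
    printf("cudaFuncSetCacheConfig: %s\n", cudaGetErrorString(err));
    return 0;
}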
Obviously I've been hoping for MIMD shader cores to go mainstream (partly for selfish reasons) for a very long time so there's not much I can say here but a simple "yay" :)

Their basic argument for MIMD seems to be that while it may increase area, it won't hurt power efficiency by anywhere near as much if you're clever about it, so if you're power-limited anyway the greater programming flexibility and performance is very much worth the cost.

That certainly makes a lot of sense although it's interesting that they hardly mention leakage at all in their entire paper (only as a limiter to Vdd scaling). Yes, you can save wire energy and improve locality by coalescing on a MIMD architecture, but you've still got some extra leakage to contend with. Hmm... I know some companies are (over?)optimistic over extremely fine-grained power gating, but I've never seen any indication that NVIDIA is one of them.

The real big deal is just letting tools allocate storage for arbitrary data structures.

Do you have any idea how the 'mini 64bit SIMD' would work? Surely it's not effectively true Vec2 for SP?!
My guess is that it will be able to do 2 SP ops, but it will not require them to use contiguous registers. IOW, conventional static dual issue of SP ops and just sharing the ALU datapath.

With the focus on handling control incoherence, I don't think they are about to mandate manual or compiler-driven work-item packing for performance.
 
Where does this lead us in terms of software and the 3D pipeline? Lots of parallelism + short vectors + MIMD seems like a good time to seriously rethink the rasterization process. Somehow I doubt graphics APIs will change dramatically in the next five years though.

I hope we get efficient micropolygons though. That is something that doesn't need radical change and looks great on screen. Though chances are, at this point, MS doesn't care where DX goes as long as xbox3 is nice at compute.
 
Speaking of validating competitor's design choices, I think the most important change is the possibility to configure an SRAM pool as registers, cache or scratchpad. For all of Larrabee's flaws, it got this bit exactly right, IMO.
How did Larrabee do that?
 
What about the mess of incoherent memory access? Ray tracing never really had lots of problems with wide vectors or SIMD compared to its insatiable need for cache size.
With proper acceleration structures that's not really a problem on a GPU, where you can easily hide latency by throwing a ton of threads at it. The much bigger problem is still in the algorithms, namely parallelization of building that acceleration structure so that you could have dynamic worlds.
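To illustrate the latency-hiding part: with a one-thread-per-ray traversal along these lines, an incoherent node fetch only stalls that one ray, and with tens of thousands of rays resident the scheduler always has other warps ready to issue. (A toy sketch with my own node layout; leaf intersection and any real ray setup are omitted.)

Code:
// Toy one-thread-per-ray BVH traversal; the node layout and everything else
// is made up for illustration, not taken from any particular renderer.
#include <cuda_runtime.h>

struct BvhNode {
    float3 bmin, bmax;      // AABB of this node
    int    left;            // index of first child (children adjacent), -1 for a leaf
    int    first_tri;       // leaf payload (triangle intersection omitted here)
};

__device__ bool hit_aabb(float3 o, float3 inv_d, float3 bmin, float3 bmax)
{
    // slab test (toy version, ignores degenerate directions)
    float t0 = 0.0f, t1 = 1e30f;
    t0 = fmaxf(t0, fminf((bmin.x - o.x) * inv_d.x, (bmax.x - o.x) * inv_d.x));
    t1 = fminf(t1, fmaxf((bmin.x - o.x) * inv_d.x, (bmax.x - o.x) * inv_d.x));
    t0 = fmaxf(t0, fminf((bmin.y - o.y) * inv_d.y, (bmax.y - o.y) * inv_d.y));
    t1 = fminf(t1, fmaxf((bmin.y - o.y) * inv_d.y, (bmax.y - o.y) * inv_d.y));
    t0 = fmaxf(t0, fminf((bmin.z - o.z) * inv_d.z, (bmax.z - o.z) * inv_d.z));
    t1 = fminf(t1, fmaxf((bmin.z - o.z) * inv_d.z, (bmax.z - o.z) * inv_d.z));
    return t0 <= t1;
}

__global__ void trace(const BvhNode *nodes, const float3 *ray_o,
                      const float3 *ray_d, int *hit_leaf, int n_rays)
{
    int r = blockIdx.x * blockDim.x + threadIdx.x;
    if (r >= n_rays) return;

    float3 o = ray_o[r];
    float3 d = ray_d[r];
    float3 inv_d = make_float3(1.0f / d.x, 1.0f / d.y, 1.0f / d.z);

    int stack[64];                       // per-thread traversal stack
    int sp = 0;
    stack[sp++] = 0;                     // start at the root
    int found = -1;

    while (sp > 0) {
        BvhNode n = nodes[stack[--sp]];  // the incoherent fetch to be hidden
        if (!hit_aabb(o, inv_d, n.bmin, n.bmax))
            continue;
        if (n.left < 0) {                // leaf: a real tracer would test triangles here
            found = n.first_tri;
        } else {                         // interior node: visit both children
            stack[sp++] = n.left;
            stack[sp++] = n.left + 1;
        }
    }
    hit_leaf[r] = found;
}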
 
With proper acceleration structures that's not really a problem on a GPU, where you can easily hide latency by throwing a ton of threads at it.
Hm, is that why Cayman performs so well in LuxMark, despite the lack of coherent caching? Large register file and properly built acceleration structures.
 
In LuxBall HDR, even a 5870 is almost as fast. Other tests within LuxMark seem to favor the Radeons a little less, though.
 
With proper acceleration structures that's not really a problem on a GPU, where you can easily hide latency by throwing a ton of threads at it. The much bigger problem is still in the algorithms, namely parallelization of building that acceleration structure so that you could have dynamic worlds.

The primary context storage in a GPU today is in registers, and you can't really cache an acceleration structure in registers, no matter how well it is built.

Since a GPU needs a ton of threads and has tiny caches, effective cache per thread is really small. Larrabee would have been better balanced in this regard.
 
How did Larrabee do that?

As I understood Larrabee's implementation, they were allocating all the registers and shared memory in cache lines, and then marking those lines as "don't replace" using special cache management instructions in the driver. Whatever was left of the L2, after subtracting the code and the per-core runtime, was available for caching.
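Back-of-the-envelope, per core, it works out something like this (the 256KB L2 per core and 64-byte lines are from the Larrabee paper, if I remember it right; the pinned sizes are made-up placeholders):

Code:
/* Back-of-envelope model of the scheme described above: pin "register" and
   scratchpad storage in L2 lines, plus code and the per-core runtime, and
   whatever is left over behaves as a plain cache. The 256KB/64B figures are
   Larrabee's published ones; the pinned sizes are invented for illustration. */
#include <stdio.h>

int main(void)
{
    const int l2_bytes   = 256 * 1024;  /* L2 slice per core              */
    const int line_bytes = 64;          /* cache line size                */

    int pinned_regs    = 32 * 1024;     /* hypothetical pinned register backing  */
    int pinned_scratch = 32 * 1024;     /* hypothetical pinned scratchpad        */
    int code_and_rt    = 16 * 1024;     /* hypothetical code + per-core runtime  */

    int total_lines  = l2_bytes / line_bytes;
    int pinned_lines = (pinned_regs + pinned_scratch + code_and_rt) / line_bytes;
    int free_lines   = total_lines - pinned_lines;

    printf("%d of %d lines (%d KB) left for plain caching\n",
           free_lines, total_lines, free_lines * line_bytes / 1024);
    return 0;
}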
 