MfA said:
> Almost like J.P. Grossman suggested over 5 years ago.

Almost like all multiple outstanding in-order completion memory interfaces.
> http://research.nvidia.com/sites/default/files/publications/IEEE_Micro_2011.pdf

Variable vector length (minimum is 1), so it can indeed be configured as MIMD (and the moniker SIMT may start to make sense after all).
It's only a research architecture, but LOTS of good ideas in there. Fingers crossed.
> It's only a research architecture, but LOTS of good ideas in there. Fingers crossed.

It seems noteworthy how many of these ideas bear some resemblance to the PowerVR SGX/Rogue shader core. While there are clear differences, and some of the things being proposed (once again), like the configurable cache hierarchy, are truly novel and exciting, I don't think IMG could be asking for a better validation of their design choices than this. Anyway, back on subject...
> Variable vector length (minimum is 1), so it can indeed be configured as MIMD (and the moniker SIMT may start to make sense after all).

Obviously I've been hoping for MIMD shader cores to go mainstream (partly for selfish reasons) for a very long time, so there's not much I can say here but a simple "yay".
> And they do static scheduling by the compiler into "LIW" instruction groups (2 arithmetic instructions + 1 L/S). SP is twice as fast as DP (each arithmetic instruction slot can operate on two SP values, it's basically a mini 64bit SIMD).

Do you have any idea how the 'mini 64bit SIMD' would work? Surely it's not effectively true Vec2 for SP?!
> Do you have any idea how the 'mini 64bit SIMD' would work? Surely it's not effectively true Vec2 for SP?!

I don't get the question. Each of the two arithmetic LIW slots is Vec2 for SP. It should work like the obsolete 3DNow! in AMD CPUs. It will probably have the same problem getting the 32-bit components into one 64-bit location as CPUs have with their vector extensions. So you need to vectorize a bit (or you need a clever compiler for more than some simple cases); the instruction packing into the LIWs is less flexible for SP than in AMD's VLIW architectures, for instance. I guess it is done to reduce the (power) overhead that would be necessary to individually address 16 operands of 32 bits in the register files (AMD's VLIW architectures are doing exactly that). In Einstein/Echelon this will be reduced to 8 operands of 64 bits per clock cycle.
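To make that concrete, here's a minimal sketch of the 3DNow!-style pairing being described, assuming a 64-bit register slot that holds either one FP64 value or two packed FP32 values (all names are mine, and this models only the data layout, not Echelon's actual ISA):

```cuda
#include <stdio.h>

/* Hypothetical 64-bit register slot: one DP value, or two packed SP values
   (the "mini 64bit SIMD", analogous to 3DNow!'s paired singles). */
typedef union {
    double dp;     /* 1 x FP64 per slot per cycle */
    float  sp[2];  /* 2 x FP32 per slot per cycle -> SP at twice the DP rate */
} slot64;

/* One packed-SP add; hardware would process both halves in parallel. Note the
   catch discussed above: both operands must already sit packed in one slot. */
static slot64 padd(slot64 a, slot64 b) {
    slot64 r;
    r.sp[0] = a.sp[0] + b.sp[0];
    r.sp[1] = a.sp[1] + b.sp[1];
    return r;
}

int main(void) {
    slot64 a, b;
    a.sp[0] = 1.0f; a.sp[1] = 2.0f;
    b.sp[0] = 3.0f; b.sp[1] = 4.0f;
    slot64 r = padd(a, b);
    printf("%g %g\n", r.sp[0], r.sp[1]);  /* prints: 4 6 */
    return 0;
}
```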
> I don't get the question. Each of the two arithmetic LIW slots is Vec2 for SP. It should work like the obsolete 3DNow! in AMD CPUs.

That was also my initial assumption, but I can't seriously believe that's what NVIDIA is going to do here given how important FP32 will remain for graphics going forward. This looks like the key sentence to me: "Second, within a lane, threads can execute independently, or coalesced threads can execute in a SIMT fashion with instructions from different threads sharing the issue slots".
> So my guess is they must have found a way to do 4 scalar FP32 instructions per clock per shader when using a warp size of 2 or more (at least in certain cases). This is not a bad solution at all, especially if they continue focusing on quads for pixel shaders. I'm really not sure how the whole "instructions from different threads sharing the issue slots" thing would work in practice though...

It could be as simple as the compiler packing 2 or 4 work items ("threads") into one lane (one needs some kind of "slot masking" in that case). If it were 4, this could be the traditional quad; the effective warp size would be 4 for SP in this case. Doing it dynamically at runtime would defeat the purpose of the whole LIW decision.
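A hedged sketch of that packing, assuming 2 work items per slot and a per-pair active mask (again, purely hypothetical names): divergence between the packed items masks off a half-slot rather than triggering any runtime repacking.

```cuda
#include <stdio.h>

/* Hypothetical: two work items ("threads") statically packed by the compiler
   into the two SP halves of one 64-bit arithmetic slot. */
typedef struct { float half[2]; } packed_slot;

/* Per-pair active mask: bit i set = work item i takes this instruction.
   Divergence between the packed items is handled by masking a half-slot,
   not by repacking at runtime (which would defeat static LIW scheduling). */
static void masked_mul(packed_slot *d, packed_slot a, packed_slot b,
                       unsigned mask) {
    for (int i = 0; i < 2; ++i)
        if (mask & (1u << i))
            d->half[i] = a.half[i] * b.half[i];
}

int main(void) {
    packed_slot a, b, d;
    a.half[0] = 2.0f; a.half[1] = 3.0f;
    b.half[0] = 4.0f; b.half[1] = 5.0f;
    d.half[0] = d.half[1] = 0.0f;
    masked_mul(&d, a, b, 0x1u);  /* item 1 diverged: only half 0 executes */
    printf("%g %g\n", d.half[0], d.half[1]);  /* prints: 8 0 */
    return 0;
}
```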
Lots of parallelism + short vectors + MIMD (+ cache) = ray tracing
> It seems noteworthy how many of these ideas bear some resemblance to the PowerVR SGX/Rogue shader core. While there are clear differences, and some of the things being proposed (once again), like the configurable cache hierarchy, are truly novel and exciting, I don't think IMG could be asking for a better validation of their design choices than this. Anyway, back on subject...

Speaking of validating competitors' design choices, I think the most important change is the possibility to configure an SRAM pool as registers, cache or scratchpad. For all of Larrabee's flaws, it got this bit exactly right, IMO.
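For what it's worth, NVIDIA's shipping Fermi parts already do a much coarser version of this: the CUDA runtime lets each kernel choose how a single on-chip SRAM pool is split between L1 cache and scratchpad. Echelon's proposal extends the same pool to registers; this snippet shows only the existing coarse-grained mechanism.

```cuda
#include <cuda_runtime.h>

__global__ void reverse256(float *out) {
    __shared__ float tile[256];  /* scratchpad carved from the SRAM pool */
    tile[threadIdx.x] = (float)threadIdx.x;
    __syncthreads();
    out[threadIdx.x] = tile[255 - threadIdx.x];
}

int main(void) {
    /* Per-kernel split of the same on-chip SRAM: favor scratchpad over L1.
       The opposite preference is cudaFuncCachePreferL1. */
    cudaFuncSetCacheConfig(reverse256, cudaFuncCachePreferShared);

    float *out;
    cudaMalloc(&out, 256 * sizeof(float));
    reverse256<<<1, 256>>>(out);
    cudaDeviceSynchronize();
    cudaFree(out);
    return 0;
}
```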
Their basic argument for MIMD seems to be that while it may increase area, it won't hurt power efficiency by anywhere near as much if you're clever about it, so if you're power-limited anyway the greater programming flexibility and performance is very much worth the cost.
That certainly makes a lot of sense, although it's interesting that they hardly mention leakage at all in their entire paper (only as a limiter to Vdd scaling). Yes, you can save wire energy and improve locality by coalescing on a MIMD architecture, but you've still got some extra leakage to contend with. Hmm... I know some companies are (over?)optimistic about extremely fine-grained power gating, but I've never seen any indication that NVIDIA is one of them.
> Do you have any idea how the 'mini 64bit SIMD' would work? Surely it's not effectively true Vec2 for SP?!

My guess is that it will be able to do 2 SP ops, but it will not require them to use contiguous registers. IOW, conventional static dual issue of SP ops, just sharing the ALU datapath.
Where does this lead us in terms of software and the 3D pipeline? Lots of parallelism + short vectors + MIMD seems like a good time to seriously rethink the rasterization process. Somehow I doubt graphics APIs will change dramatically in the next five years though.
> Speaking of validating competitors' design choices, I think the most important change is the possibility to configure an SRAM pool as registers, cache or scratchpad. For all of Larrabee's flaws, it got this bit exactly right, IMO.

How did Larrabee do that?
> What about the mess of incoherent memory access? Ray tracing never really had lots of problems with wide vectors or SIMD compared to its insatiable need for cache size.

With proper acceleration structures that's not really a problem on a GPU, where you can easily hide latency by throwing a ton of threads at it. The much bigger problem is still algorithmic, namely parallelizing the build of that acceleration structure so that you can have dynamic worlds.
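To illustrate the latency-hiding argument, a minimal CUDA-style traversal sketch (toy BVH layout of my own, not any particular renderer): the bvh[] fetches are as incoherent as it gets, but with thousands of rays resident the scheduler simply runs other warps while each miss is in flight.

```cuda
#include <cuda_runtime.h>

/* Toy BVH node: left < 0 marks a leaf, encoding the primitive index as ~left. */
struct Node { float lo[3], hi[3]; int left, right; };

/* Slab test: does the ray (origin o, inverse direction inv_d) hit the box? */
__device__ bool hit_box(const Node &n, const float *o, const float *inv_d) {
    float t0 = 0.0f, t1 = 1e30f;
    for (int a = 0; a < 3; ++a) {
        float ta = (n.lo[a] - o[a]) * inv_d[a];
        float tb = (n.hi[a] - o[a]) * inv_d[a];
        t0 = fmaxf(t0, fminf(ta, tb));
        t1 = fminf(t1, fmaxf(ta, tb));
    }
    return t0 <= t1;
}

/* One thread per ray. The bvh[] loads are the incoherent, latency-bound part;
   they are hidden by the sheer number of threads in flight, not by locality. */
__global__ void trace(const Node *bvh, const float *o, const float *inv_d,
                      int *hit, int n_rays) {
    int r = blockIdx.x * blockDim.x + threadIdx.x;
    if (r >= n_rays) return;

    int stack[64], sp = 0, best = -1;
    stack[sp++] = 0;                                   /* root node */
    while (sp > 0) {
        Node n = bvh[stack[--sp]];                     /* likely cache miss */
        if (!hit_box(n, o + 3 * r, inv_d + 3 * r)) continue;
        if (n.left < 0) { best = ~n.left; continue; }  /* leaf: record prim */
        stack[sp++] = n.left;
        stack[sp++] = n.right;
    }
    hit[r] = best;
}

int main(void) {
    /* Single-leaf "BVH" covering all of space, one ray, just to exercise it. */
    Node root;
    for (int a = 0; a < 3; ++a) { root.lo[a] = -1e30f; root.hi[a] = 1e30f; }
    root.left = ~0; root.right = 0;
    float o[3] = {0, 0, 0}, inv[3] = {1, 1, 1};

    Node *d_bvh; float *d_o, *d_inv; int *d_hit;
    cudaMalloc(&d_bvh, sizeof(Node));
    cudaMalloc(&d_o, sizeof o);
    cudaMalloc(&d_inv, sizeof inv);
    cudaMalloc(&d_hit, sizeof(int));
    cudaMemcpy(d_bvh, &root, sizeof(Node), cudaMemcpyHostToDevice);
    cudaMemcpy(d_o, o, sizeof o, cudaMemcpyHostToDevice);
    cudaMemcpy(d_inv, inv, sizeof inv, cudaMemcpyHostToDevice);
    trace<<<1, 1>>>(d_bvh, d_o, d_inv, d_hit, 1);
    cudaDeviceSynchronize();
    return 0;
}
```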
> With proper acceleration structures that's not really a problem on a GPU, where you can easily hide latency by throwing a ton of threads at it.

Hm, is that why Cayman performs so well in LuxMark despite the lack of coherent caching? A large register file and proper acceleration structures.