Intel Gen Architecture Discussion

Discussion in 'Architecture and Products' started by Rys, Jul 14, 2015.

  1. Xmas

    Xmas Porous
    Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    3,298
    Likes Received:
    137
    Location:
    On the path to wisdom
    Yes, but then it makes no difference to the number of registers available as the compiler has to use the register limit for the widest case.
     
  2. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,288
    Location:
    Helsinki, Finland
    If the pixel shader code needs the maximum amount of registers (128), you shouldn't even be able to compile 16 and 8 lane versions of it. Unless of course you are willing to spill registers to memory (which is a HUGE slowdown). Andrew didn't say how Intel handles this case.

    Speaking of Intel cross lane operations and various SIMD execution modes, I found this interesting forum thread:
    https://software.intel.com/en-us/forums/topic/541632

    The Intel cross lane swizzle OpenCL extensions discussed here are exactly what I want to have in HLSL compute shaders. Nvidia and AMD also support cross lane swizzles. If nothing else, HLSL should expose the quad swizzles (as the quad swizzle doesn't require exposing hardware wave width, because all GPUs have wave sizes divisible by 4).
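A minimal Python sketch of what a quad swizzle does, simulated on a plain list of per-lane values. The function name `quad_swizzle` and the four-index pattern encoding are assumptions for illustration only; they are not the HLSL, OpenCL, or GCN syntax.

```python
def quad_swizzle(lane_values, pattern):
    """Simulate a quad swizzle: every lane reads from a lane in its own quad.

    lane_values: one value per SIMD lane; length must be a multiple of 4.
    pattern: four source indices (0..3); lane q*4+i reads lane q*4+pattern[i].
    """
    if len(lane_values) % 4 != 0:
        raise ValueError("wave width must be a multiple of 4")
    if len(pattern) != 4 or any(not 0 <= p < 4 for p in pattern):
        raise ValueError("pattern must hold four indices in 0..3")
    out = []
    for lane in range(len(lane_values)):
        quad_base = lane - lane % 4  # first lane of this lane's quad
        out.append(lane_values[quad_base + pattern[lane % 4]])
    return out


wave = list(range(8))                          # a hypothetical 8-wide wave
swapped = quad_swizzle(wave, (1, 0, 3, 2))     # swap neighbouring pairs
broadcast = quad_swizzle(wave, (0, 0, 0, 0))   # broadcast lane 0 of each quad
```

Because the pattern only ever indexes within a group of four lanes, the same call works unchanged for 8, 16, 32 or 64 wide waves, which is why the wave width does not need to be exposed.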

    I was supposed to ask Andrew whether there is an "HLSL extension" (hack) to compile a compute shader with SIMD4x2. The use case for this is complex shaders with lots of ILP, but not as much TLP (not many SIMT threads). Examples: Real time DXT compression (each 4x4 tile is a single SIMT thread, meaning that compressing a 128x128 virtual texture page is only a total of 1024 SIMT threads = hard to utilize the whole GPU). The Xbox 360 GPU (and the old AMD Terascale VLIW GPUs) were actually faster in real time DXT compression than modern scalar architectures. Also coarse tiled algorithms (light/particle tile culling) and low res passes (coarse shadow mask) would benefit from SIMD4x2 execution. The same is true for low res conservative rasterized stuff in DX12.1. Intel would get big gains if there was an "HLSL extension" for the developer to tell the shader compiler to use SIMD4x2 mode for these shaders.
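The thread count arithmetic above can be sketched as follows. The helper names are hypothetical; the only inputs taken from the post are the 128x128 page and the 4x4 block per thread.

```python
def dxt_thread_count(page_width, page_height, block_size=4):
    """One SIMT thread per block_size x block_size tile of the page."""
    blocks_x = page_width // block_size
    blocks_y = page_height // block_size
    return blocks_x * blocks_y


def waves_needed(threads, wave_width):
    """Number of waves required to cover all threads (ceiling division)."""
    return -(-threads // wave_width)


threads = dxt_thread_count(128, 128)
waves = {width: waves_needed(threads, width) for width in (8, 16, 32, 64)}
```

With one thread per block, the wider the wave, the fewer independent waves there are to spread across the GPU, which is the utilization problem described above.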

    I have been toying around with the idea of using GCN quad swizzles to emulate vec4 execution for real time DXT compression. Quad swizzle executes on the LDS port, meaning that it doesn't use ALU slots (it is co-issued with ALU). I wonder if the new Fiji swizzles (and the new register indexing support) allow the future GCN 1.2 compiler to exploit similar optimizations for some shaders. GCN load/sample instructions are still loading 64 items from memory (unless the execution mask is abused). Andrew, how does Intel handle this? Do you have separate load/fetch instructions for 8/16/32 lanes? SIMD4x2 would also benefit from load/fetch that only sends two UVs and fetches two items of data. Is there a full ISA reference document available for Intel GPUs? I would like to check the available memory instructions and the instruction latencies, to better understand the advantages/disadvantages of different lane sizes.
     
    #22 sebbbi, Jul 31, 2015
    Last edited: Jul 31, 2015
  3. Xmas

    Xmas Porous
    Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    3,298
    Likes Received:
    137
    Location:
    On the path to wisdom
    I assume you meant you shouldn't be able to compile 16 and 32 wide versions.

    But there's a trade-off between register use and other optimisations. My question was really about whether "shaders used with small triangles intrinsically get a lot of registers" means anything other than "shaders used with small triangles often leave a lot of register space unused", and you seem to imply that the answer is no.
     
  4. Kaarlisk

    Regular Newcomer Subscriber

    Joined:
    Mar 22, 2010
    Messages:
    293
    Likes Received:
    49
  5. Kaarlisk

    Regular Newcomer Subscriber

    Joined:
    Mar 22, 2010
    Messages:
    293
    Likes Received:
    49
    Is graphics L3 on the GPU in Atom CPUs and a dedicated part of the LLC on Core CPUs, or is graphics L3 always a part of the GPU, separate from the LLC?
    Also (and this may be a silly question), how come Gen GPUs have an L3, while in the case of AMD and Nvidia, L2 is the large cache mentioned recently?
     
  6. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,288
    Location:
    Helsinki, Finland
    Obviously it is the other way around (face palm) :)

    ... and this means that every shader compiles fine at 8 wide (as 8 wide has the full 128 registers available). Scaling up to 16 and 32 lanes only works if the register count is <=64 or <=32, respectively. So in the end, this doesn't require any extra effort. You just use the same shader code, but if it needs more registers, you always run it at 8 wide (or at 16 wide), reducing the occupancy (like other GPUs do). No need to compile multiple shader versions.

    An 8 wide wave has 128 registers, a 16 wide wave has 64 registers and a 32 wide wave has 32 registers. The wider you go, the fewer registers you have. There is no waste for wide waves (as it goes as wide as the register count allows). It is true that simple shaders executed for small triangles waste registers (part of the register file for the EU thread is unused), but these waves also should finish fast, freeing the registers fast for other uses. This is a limitation of Intel's register file allocation scheme (but that scheme otherwise seems to be very good, so the compromise isn't a big one). And I am sure that Intel shuts down the power of the leftover EU thread register hardware if it executes a simple shader with 8 lanes.
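The width versus register trade-off described above can be sketched in a few lines of Python. This is a simplified model built only on the 128/64/32 figures from the post; the function names are hypothetical and a real compiler weighs more than the register count.

```python
GRF_REGISTERS_PER_THREAD = 128  # registers available to an 8 wide wave
NARROWEST_WIDTH = 8


def registers_available(simd_width):
    """Per-lane registers at a given width: 128 at 8, 64 at 16, 32 at 32."""
    return GRF_REGISTERS_PER_THREAD * NARROWEST_WIDTH // simd_width


def widest_simd_width(registers_needed, widths=(32, 16, 8)):
    """Pick the widest mode whose register budget covers the shader."""
    for width in widths:
        if registers_needed <= registers_available(width):
            return width
    return None  # does not fit even at the narrowest width: would need spilling
```

For example, `widest_simd_width(50)` selects 16 wide, since 50 registers exceed the 32 available at 32 wide but fit within 64. Passing `widths=(16, 8)` models a shader stage that has no 32 wide mode.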
     
    #26 sebbbi, Jul 31, 2015
    Last edited: Jul 31, 2015
  7. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,288
    Location:
    Helsinki, Finland
    If I understood correctly the tiny subslice L1 texture cache contains uncompressed data (ready for filtering) and the subslice L2 texture cache contains (DXT/ASTC) compressed data. This makes the third cache level L3. Also the L3 is the third cache level for the CPU (so that same name suits both GPU and CPU fine).

    Old AMD/ATI and NVIDIA L1 texture caches were uncompressed (meaning that they were practically 4x smaller when storing DXT data). Nowadays AMD and NVIDIA L1 texture caches keep data in compressed format. Intel's design is a hybrid between the old and new. Most likely the tiny uncompressed L1 makes filtering hardware simpler and more efficient, while the bigger L2 ensures that the subslice texture cache capacity is practically 4x larger for DXT compressed data (and even more for ASTC).
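A rough sketch of the capacity argument, assuming 32 bits per texel for uncompressed RGBA8 and the standard block sizes for DXT1 (8 bytes per 4x4 block) and DXT5 (16 bytes per 4x4 block). The 16 KiB cache size is a made-up figure for illustration, not an Intel specification.

```python
RGBA8_BITS_PER_TEXEL = 32
DXT5_BITS_PER_TEXEL = 8   # 16 bytes per 4x4 block
DXT1_BITS_PER_TEXEL = 4   # 8 bytes per 4x4 block


def texels_held(cache_bytes, bits_per_texel):
    """How many texels fit in a cache when stored at the given bit rate."""
    return cache_bytes * 8 // bits_per_texel


cache_bytes = 16 * 1024  # hypothetical cache size
uncompressed = texels_held(cache_bytes, RGBA8_BITS_PER_TEXEL)
as_dxt5 = texels_held(cache_bytes, DXT5_BITS_PER_TEXEL)
as_dxt1 = texels_held(cache_bytes, DXT1_BITS_PER_TEXEL)
```

Under these assumptions the 4x figure corresponds to 8 bits per texel formats such as DXT5 measured against 32 bit RGBA; DXT1 would come out at 8x.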
     
  8. Kaarlisk

    Regular Newcomer Subscriber

    Joined:
    Mar 22, 2010
    Messages:
    293
    Likes Received:
    49
    Thank you for this explanation, especially the context :)

    I finally remember where I read about graphics L3 being separate from LLC:
    "Intel also added a graphics-specific L3 cache within Ivy Bridge. Despite being able to share the CPU's L3 cache, a smaller cache located within the graphics core allows frequently accessed data to be accessed without firing up the ring bus." (Anandtech).
    http://images.anandtech.com/reviews/cpu/intel/IvyBridge/IDFarchitecture/ivbgpu5.jpg
    http://images.anandtech.com/reviews/cpu/intel/IvyBridge/IDFarchitecture/gpupower.jpg (speaks of L3$ as distinct from LL$)
    http://meseec.ce.rit.edu/551-projects/spring2012/1-1.pdf (page 8)
    http://www.realworldtech.com/ivy-bridge-gpu/6/
     
  9. Kaarlisk

    Regular Newcomer Subscriber

    Joined:
    Mar 22, 2010
    Messages:
    293
    Likes Received:
    49
  10. Andrew Lauritzen

    Moderator Veteran

    Joined:
    May 21, 2004
    Messages:
    2,526
    Likes Received:
    454
    Location:
    British Columbia, Canada
    Unfortunately no, it's not nearly powerful enough if you're still stuck inside the same execution model. Pretty much anything you can do in SPIR-V you could do by compiling to GLSL as well to be honest, it does not introduce any real fundamentally new capabilities and they basically punted on anything interesting what-so-ever (i.e. real shared IL between compute/graphics, pointers, etc).

    Probably nothing... I have one of each and they both seem to operate the same (same frequencies, etc). Personally while the Haswell chips could rarely sit at max turbo (1.3Ghz) for very long unless there was almost no CPU wall but it happens far less than on Haswell.

    All in all they are very nice chips for development. Despite lower frequencies they are not noticeably slower than the 4790K (the EDRAM does seem to help the CPU in more cases than I originally anticipated from benchmarks + some IPC gains on BDW) and you get a pretty decent iGPU to play with. Great CPU choice for DX12 multiadapter work as well.

    Aside, but in terms of fast compiles it was recently pointed out to me that the new 45W 8 core Xeon D (Broadwell cores @ 2 GHz) crushes Haswell Xeons while using half the power... pretty awesome performers and with the D being for "density" you can fit a pile of them in a small space if desired.

    Yes, if provided.

    Up to 2 versions - there's no SIMD32 for pixel shaders. If the SIMD16 version spills too many registers the compiler may elect to not provide it to the hardware though, in which case everything will run SIMD8.
     
    BRiT likes this.
  11. Andrew Lauritzen

    Moderator Veteran

    Joined:
    May 21, 2004
    Messages:
    2,526
    Likes Received:
    454
    Location:
    British Columbia, Canada
    The compiler/EU can - and does - freely mix instructions of different SIMD widths in the same kernel. Given the flexible register addressing modes this is pretty straightforward and theoretically much more powerful than the current shading languages expose. It's also very useful when doing mixed precision (fp16, fp32) stuff.
     
  12. pixelio

    Newcomer

    Joined:
    Feb 17, 2014
    Messages:
    47
    Likes Received:
    75
    Location:
    Seattle, WA
    Yes, subgroups are excellent. :)

    I spent a few solid weeks porting some CUDA code to GEN. This work included working with the Intel subgroups extension.

    There is a write-up here: HotSort 2.0 – Kernel Generation and Autotuning

    A few other observations:
    • The OpenCL compiler still has nasty bugs. The worst are on Broadwell. (Bug+repro reported months ago but apparently the Intel OpenCL team doesn't fix bugs in the summer time)
    • Not being able to launch a CTA that can entirely utilize a subslice is disappointing. Is there a rationale?
    • Almost all my mid-to-high register count kernels were compiled to SIMD8 and only a few to SIMD16. (See clGetKernelSubGroupInfoKHR).
    • GEN's sheer amount of resources—GRF and SLM—really ease development of kernels.
    There is a lot to love about the GEN arch!
     
    BRiT and Rys like this.
  13. Andrew Lauritzen

    Moderator Veteran

    Joined:
    May 21, 2004
    Messages:
    2,526
    Likes Received:
    454
    Location:
    British Columbia, Canada
    SIMD4x2 compute shaders are not something that is "natively" supported by the hardware per se. Conceptually the compiler could pack things differently and execute stuff that way, but there's no support for that in the current shader compiler to my knowledge and it would likely require a bit of cleverness when interacting with the dispatch hardware.

    I think someone linked the ISA docs, but you want to look at the "data port" stuff. There's quite a myriad of modes from 8/16-wide (I don't think there are 32-wide memory ops, the compiler just emits two 16-wide ones in SIMD32 compute shaders) and also stuff like loading 4-wide chunks and so on. The hardware will collapse stuff into the minimum number of cache line accesses as you'd expect and so on, but for the details you'll have to look in the spec.
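A toy model of the two ideas in the paragraph above: splitting a 32 lane access into 16 wide messages, and collapsing each message into the distinct cache lines it touches. The 64 byte line size and both helper names are assumptions for illustration; the real data port behaviour is in the spec.

```python
def split_into_messages(lane_addresses, max_width=16):
    """Split per-lane addresses into messages of at most max_width lanes."""
    return [lane_addresses[i:i + max_width]
            for i in range(0, len(lane_addresses), max_width)]


def cache_lines_touched(byte_addresses, access_bytes=4, line_bytes=64):
    """Count the distinct cache lines covered by a set of per-lane accesses."""
    lines = set()
    for addr in byte_addresses:
        first = addr // line_bytes
        last = (addr + access_bytes - 1) // line_bytes
        lines.update(range(first, last + 1))
    return len(lines)


# 32 lanes reading consecutive 4 byte values: two messages, one line each.
contiguous = [lane * 4 for lane in range(32)]
messages = split_into_messages(contiguous)
lines_per_message = [cache_lines_touched(m) for m in messages]

# 16 lanes with a 64 byte stride: every lane lands on its own line.
strided = [lane * 64 for lane in range(16)]
strided_lines = cache_lines_touched(strided)
```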

    The answer is yes. It's weird, but like I said it simplifies a bunch of other stuff so it's just a different design point.

    It's separate. The graphics L3 is part of the GPU and it's the main chunk of memory through which everything goes on that side. Once you go through the GTI onto the ring bus (on the big core CPUs) you can then get backed by LLC/eLLC, or even snoop data from CPU caches and all that regular stuff.

    Where you start and how you count is kind of arbitrary (hence the proper "LLC/eLLC" vs the sometimes confusingly-used "L4 cache" that tech sites use to describe the EDRAM), so it's not really comparable. The "L3" in Gen is roughly equivalent to the L2s on AMD/NVIDIA.
     
  14. Andrew Lauritzen

    Moderator Veteran

    Joined:
    May 21, 2004
    Messages:
    2,526
    Likes Received:
    454
    Location:
    British Columbia, Canada
    Very cool stuff, thanks for the link! I always get somewhat depressed when I see those sorts of optimization space graphs as you know that 90% of non-auto-tuned code is off in some random valley somewhere most of the time and it's basically impossible to write performance portable code without auto-tuning. Part of the reason why I'm not convinced that the current compute models and in fact GPU hardware design point with registers/occupancy represent an ideal end point.
     
    Kaarlisk likes this.
  15. pixelio

    Newcomer

    Joined:
    Feb 17, 2014
    Messages:
    47
    Likes Received:
    75
    Location:
    Seattle, WA
    I also don't like the idea of auto-tuning or hunting for a magic configuration but the CL compiler doesn't give you any feedback (yet) on register usage and spillage so this was just an easy job to kick off and let run for hours.

    I just wanted to see how close I could get to fully utilizing the GRF without spilling.

    I didn't actually find anything surprising. The kernel configurations that I expected to run well actually did run well.

    So kudos to the architecture and documentation... it was close to WYCIWYG—"What You Code Is What You Get". :)
     
  16. pixelio

    Newcomer

    Joined:
    Feb 17, 2014
    Messages:
    47
    Likes Received:
    75
    Location:
    Seattle, WA
    I've been fascinated by GEN's region addressing and register indirect modes (Section 3.3.5) for a long time.

    Check it out because I think it's unique (is it?) and could enable some very efficient algorithm implementations.

    The open question is whether current models/compilers/developers can actually harness these capabilities.
     
  17. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,288
    Location:
    Helsinki, Finland
    That's a bummer. So... no good IL exists that would allow cross vendor compilation from a custom interesting shader language. Since SPIR-V is brand new, there will be nothing new coming from Khronos in a decade or so :(
    Yes, these desktop CPUs are perfect for multiadapter development, and in general for optimizing/benchmarking/validating rendering code with Intel. I am getting tired of borrowing laptops from our test department and installing builds on them. It's much easier to change a command line parameter to select a different GPU on the same computer.
    Xeon D is awesome. Intel has packed a huge amount of multithreaded performance inside that tiny 45W envelope. It has 12 MB of L3 too. The price is very reasonable (581$ is roughly half of the higher clocked 999$ Haswell i7 extreme with 8 cores), making this the most affordable Intel 8 core CPU. It would be practically impossible for AMD to match Xeon D in perf/watt with the forthcoming 8 core Zen.

    Let me introduce you to the Xeon D v2:
    - 8 Skylake cores at 2.0 GHz
    - Dual channel memory controller (just like the current Xeon D)
    - 72 EU iGPU (DX 12.0 compatible. Easy to fit into the thermal budget as the CPU part needs only ~45W)
    - 128 MB of EDRAM (solves the memory bandwidth issue)

    Pretty please :)
    Sorting is a prime example why cross lane swizzles are important. It is impossible to write a fast GPU sorter without them. All fast CUDA implementations heavily use subgroup / cross lane operations. Your research shows that OpenCL sorters are starting to do the same. Unfortunately with PC DirectX 12 compute shaders we still need to stick with the slow and inefficient ancient sorting methods (consoles obviously support cross lane operations as well) :(
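To show why cross lane operations matter for sorting, here is a small Python simulation of a bitonic sort across the lanes of one subgroup, where the only data exchange is an XOR shuffle between lanes. This is a generic textbook bitonic network, not the HotSort implementation or any vendor's code; `shuffle_xor` is a stand-in written here for the kind of cross lane primitive the post refers to.

```python
def shuffle_xor(lane_values, mask):
    """Each lane reads the value held by lane (lane_id XOR mask)."""
    return [lane_values[lane ^ mask] for lane in range(len(lane_values))]


def subgroup_bitonic_sort(lane_values):
    """Sort one value per lane in ascending order using only XOR shuffles."""
    n = len(lane_values)
    if n == 0 or n & (n - 1):
        raise ValueError("lane count must be a power of two")
    vals = list(lane_values)
    k = 2
    while k <= n:                 # size of the bitonic sequences being merged
        j = k // 2
        while j >= 1:             # distance to the partner lane
            partner = shuffle_xor(vals, j)
            merged = []
            for lane in range(n):
                ascending = (lane & k) == 0   # direction of this lane's block
                lower = (lane & j) == 0       # lower lane of the pair
                if ascending == lower:
                    merged.append(min(vals[lane], partner[lane]))
                else:
                    merged.append(max(vals[lane], partner[lane]))
            vals = merged
            j //= 2
        k *= 2
    return vals


keys = [7, 2, 9, 4, 1, 8, 3, 6]
sorted_keys = subgroup_bitonic_sort(keys)
```

Every compare-exchange step reads only a register from another lane, with no round trip through shared or global memory. Without such a primitive, each of those exchanges has to go through memory instead.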
    If I understood everything correctly, SIMD32 should provide better latency hiding, as one instruction from one running wave is enough to saturate one of the SIMD4 execution units for 8 cycles. As long as you have two waves running (not waiting for a memory stall), you should be OK. With smaller waves (SIMD8 or SIMD16), a single instruction from a single wave is split to fewer SIMD4 operations, meaning that the perceived latency hiding is worse.

    I am also wondering why you chose SIMD4 execution units instead of SIMD8, since the narrowest wave width is 8. With two SIMD8 execution units, the EU could achieve IPC of 1.0 instructions (from a single SIMT thread's POV). With two SIMD4 units the IPC is just 0.5 (as you cannot issue an instruction from the same wave on the same cycle to both SIMD4 units). Of course the downside of SIMD8 execution units would be that the perceived instruction latency increases by 2x. Now a new SIMD8 instruction (from a single wave) only starts every other cycle (it takes two cycles to issue it to SIMD4 execution units). This obviously means that the shader compiler can more freely arrange the instructions, and there are fewer cases where nops need to be added in between instructions (if ILP is not available).

    8 wide execution units would give more gains from SIMD16 and SIMD32 shaders (right now SIMD32 seems to be mainly there for compatibility reasons). These modes would hide the instruction latency better than SIMD8 with 8 wide units (SIMD16 would issue one instruction per 2 cycles and SIMD32 would issue one instruction per 4 cycles). 8 wide execution units would obviously increase the need for separate shader code for different SIMD widths (as shader instruction reordering would be more important for narrow SIMDs), but as your hardware and compiler already support this, it shouldn't be a problem at all. I am just wondering why this hasn't been done, since it doubles the ALU performance of an EU (with a small extra transistor cost and a small added complexity to shader compiling). Of course it would increase the register pressure a bit (as wider waves would be more common) and would likely cause bottlenecks elsewhere (as the EU sampler / memory ports wouldn't get any faster).
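The cycle counts used in this argument follow from dividing the wave width by the execution unit width. A minimal sketch of that arithmetic, using a hypothetical helper name and ignoring everything else that affects real issue rates:

```python
def issue_cycles(wave_width, unit_width):
    """Cycles one instruction of a wave occupies one execution unit."""
    return -(-wave_width // unit_width)  # ceiling division


# Keys are (wave width, execution unit width).
cycles = {(wave, unit): issue_cycles(wave, unit)
          for unit in (4, 8)
          for wave in (8, 16, 32)}
```

With SIMD4 units this gives 2, 4 and 8 cycles for SIMD8, SIMD16 and SIMD32; with the hypothetical SIMD8 units it gives 1, 2 and 4 cycles.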
    Seems very nice. I doubt the current compilers exploit these features much (except for indexing local arrays).

    Two extra questions to Andrew:

    Did I understand correctly that vertex shaders run always in SIMD4x2 mode? Does this mean that VS branching granularity is 2 vertices?

    I didn't find any instruction latency charts in the OpenSource documents, but I found this:
    "If none of the two instructions is send, there CANNOT be any destination hazard. This is because instructions within a thread are dispatched in order (single-issued) and the execution pipeline is inorder and has a fixed latency."
    Does this mean that all instructions have identical fixed latency? (I assume this is 2 cycles and fully hidden by the SIMD4 execution of the SIMD8+ lanes)

    I did browse through some examples in the OpenSource PRM and I really like how Intel's flexible register files allow nice optimizations, such as storing wave invariant data (such as constant buffer loads) using just a single 32 byte register lane. Nice hardware indeed :)
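A back-of-the-envelope sketch of why storing wave invariant data once is attractive, assuming 32 byte registers as mentioned above and 4 byte values. The helper names are hypothetical and the model ignores alignment and regioning details.

```python
REGISTER_BYTES = 32  # one register, per the post above


def registers_for_per_lane(simd_width, element_bytes=4):
    """Registers needed to hold one value per lane."""
    return -(-simd_width * element_bytes // REGISTER_BYTES)


def registers_for_uniform(value_count, element_bytes=4):
    """Registers needed to hold wave invariant values stored once, packed."""
    return -(-value_count * element_bytes // REGISTER_BYTES)


# Eight 32 bit constants in a SIMD16 shader:
replicated = 8 * registers_for_per_lane(16)   # one copy per lane
packed = registers_for_uniform(8)             # stored once for the whole wave
```

In this model the replicated layout costs 16 registers and the packed layout costs 1, which matters a lot when a 16 wide wave only has 64 registers to work with.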
     
    #37 sebbbi, Aug 1, 2015
    Last edited: Aug 1, 2015
  18. Infinisearch

    Veteran Regular

    Joined:
    Jul 22, 2004
    Messages:
    739
    Likes Received:
    139
    Location:
    USA
    Are you sure about the 8 cycles part? The document I provided mentions something called IMT (interleaved multi-threading) in addition to SMT, and I interpreted this as being a fixed multiplexing of threads... could be wrong though. Here's a quote:

    "The architecture of an EU is combination of Simultaneous Multi-Threading (SMT) and fine grained Interleaved Multi-Threading (IMT). These are compute processors that drive multiple issue Single Instruction Multiple Data Arithmetic Logic Units (SIMD, ALUs) pipelined across multiple threads, for high-throughput floating-point and integer compute. The fine grain threaded nature of the EUs ensures continuous streams of ready to execute instructions, while also enabling latency hiding of longer operations such as memory scatter/gather, sampler requests, or other system communication."
     
  19. Infinisearch

    Veteran Regular

    Joined:
    Jul 22, 2004
    Messages:
    739
    Likes Received:
    139
    Location:
    USA
    Sorry, should've put this with the above post, forgot. Doesn't the document mention that one of the two SIMD4 units is "beefier" than the other? I think it mentions transcendental functions and the like being handled by the beefier of the two units. Found the quote:

    "Finally, one of the FPUs provides extended math capability to support high-throughput transcendental math functions and double precision 64-bit floating-point."

    I'm not saying it fully explains why but maybe it's a part of the reason.
     
  20. Andrew Lauritzen

    Moderator Veteran

    Joined:
    May 21, 2004
    Messages:
    2,526
    Likes Received:
    454
    Location:
    British Columbia, Canada
    I don't decide these things but I'd love one of those too :) I'll pass it along in any case, for what it's worth!

    The more "stuff" you have in flight the better latency hiding you can get if all else is equal. And really, it's best to think of SIMD32 as just some syntactic sugar in the EUs - mainly wider mask registers and so on for branching. Otherwise it's not very different than the compiler simply emitting 2x interleaved SIMD16 instructions (and indeed it does that a lot still since not all instructions and messages have SIMD32 versions - ex. there's no SIMD32 texture sampling messages).

    I will say that in my experience latency hiding is not super commonly an issue in Gen due to a pretty deep and efficient cache hierarchy and a fairly ample number of hardware threads. Obviously never say never but for instance I don't think I ran into any regressions at all from Gen7.5->Gen8 with fewer HW threads/EU.

    The decision making there is a bit beyond my knowledge or expertise to be honest... those sorts of things get into the very broad but interconnected design space of GPUs. If you look at AMD and NVIDIA they've gone back and forth on this sort of thing a fair number of times too. Gen has the additional trade-offs of SIMD4x2 shaders and not packing multiple primitives into the same PS invocation as well, both of which affect the trade-offs.

    What Infinisearch mentioned is true as well though - the second SIMD4 "pipe" does not support all of the same operations. In fact I think beyond Ivy Bridge it didn't even support MAD, so you had more of a MAD + MUL/ADD setup in terms of ALUs. Transcendental stuff is only supported on one of them as well IIRC.

    It's heavily used in media kernels as you might imagine. 3D shader stuff does occasionally use it (again, you can imagine how it might sometimes be useful for different data sizes, fp16, etc) but not heavily to my knowledge.

    Gen7.5 (Haswell) always ran vertex shaders in SIMD4x2. Gen8 runs them SIMD8 (the hardware can still run them 4x2 if desired but by default they all run SIMD8). I believe some of the tessellation stages actually still run 4x2 though even in Gen8.

    So yes the branch granularity in VS *was* 2 vertices, but now it's 8.
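A small sketch of what branch granularity means in practice: if the vertices grouped together disagree on a branch condition, the whole group has to step through both sides. The function name and the sample conditions are made up for illustration.

```python
def vertices_running_both_paths(conditions, granularity):
    """Count vertices that sit in a group whose members disagree on a branch."""
    affected = 0
    for start in range(0, len(conditions), granularity):
        group = conditions[start:start + granularity]
        if any(group) and not all(group):
            affected += len(group)
    return affected


# Branch condition for eight consecutive vertices (hypothetical data).
conditions = [True, True, False, False, True, False, True, True]
at_4x2 = vertices_running_both_paths(conditions, 2)    # granularity of 2
at_simd8 = vertices_running_both_paths(conditions, 8)  # granularity of 8
```

For this input only one pair disagrees at a granularity of 2, so 2 vertices pay for both paths, while at a granularity of 8 all 8 vertices do.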

    I'd have to dig into this but off the top of my head I think all the math instructions do all have fixed latency and you'll never hit it if you are running 2 HW threads of SIMD8 which simplifies the scheduling a lot of course. Transcendentals might be longer latency, or maybe just multi-instruction sequences, I don't remember off the top of my head.

    It's pretty crazy... some might go as far as to say it's over-engineered for what it is typically doing in 3D which probably has a grain of truth to it. Still, we built it so might as well use it to its potential where possible :)
     