Unfortunately no, it's not nearly powerful enough if you're still stuck inside the same execution model. To be honest, pretty much anything you can do in SPIR-V you could also do by compiling to GLSL; it doesn't introduce any fundamentally new capabilities, and they basically punted on anything interesting whatsoever (e.g. a real shared IL between compute/graphics, pointers, etc).
That's a bummer. So... no good IL exists that would allow cross-vendor compilation from a custom, interesting shader language. And since SPIR-V is brand new, nothing new will be coming from Khronos for a decade or so.
All in all they are very nice chips for development. Despite lower frequencies they are not noticeably slower than the 4790K (the EDRAM does seem to help the CPU in more cases than I originally anticipated from benchmarks + some IPC gains on BDW) and you get a pretty decent iGPU to play with. Great CPU choice for DX12 multiadapter work as well.
Yes, these desktop CPUs are perfect for multiadapter development, and in general for optimizing/benchmarking/validating rendering code on Intel. I am getting tired of borrowing laptops from our test department and installing builds onto them. It's much easier to change a command line parameter to select a different GPU on the same computer.
Aside, but in terms of fast compiles it was recently pointed out to me that the new 45W 8-core Xeon D (Broadwell cores @ 2 GHz) crushes Haswell Xeons while using half the power... a pretty awesome performer, and with the D standing for "density" you can fit a pile of them in a small space if desired.
Xeon D is awesome. Intel has packed a huge amount of multithreaded performance inside that tiny 45W envelope. It has 12 MB of L3 too. The price is very reasonable ($581 is roughly half of the higher-clocked $999 8-core Haswell i7 Extreme), making this the most affordable Intel 8-core CPU. It would be practically impossible for AMD to match Xeon D in perf/watt with the forthcoming 8-core Zen.
Let me introduce you to the Xeon D v2:
- 8 Skylake cores at 2.0 GHz
- Dual channel memory controller (just like the current Xeon D)
- 72 EU iGPU (DX 12.0 compatible. Easy to fit into the thermal budget as the CPU part needs only ~45W)
- 128 MB of EDRAM (solves the memory bandwidth issue)
Pretty please
Yes, subgroups are excellent.
I spent a few solid weeks porting some CUDA code to GEN, which included working with the Intel subgroups extension.
There is a write-up here:
HotSort 2.0 – Kernel Generation and Autotuning
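To make the subgroup angle concrete, here is a minimal Python sketch (my own emulation, not the actual HotSort code) of a Kogge-Stone inclusive prefix sum built entirely from shuffle-up style cross-lane reads, the kind of primitive that CUDA's `__shfl_up` and the Intel subgroups extension expose in hardware. Lanes are just list elements here; the lane count of 8 is an illustrative choice.

```python
# A staple pattern when porting CUDA subgroup code: an inclusive
# prefix sum done with shuffle-up style cross-lane reads (Kogge-Stone).
# Emulated on a plain Python list standing in for an 8-lane wave.
def shuffle_up(lanes, delta):
    """Lane i reads the value of lane i - delta; out-of-range lanes read 0."""
    return [lanes[i - delta] if i >= delta else 0 for i in range(len(lanes))]

def inclusive_scan(lanes):
    """log2(width) shuffle/add steps instead of a shared-memory scan."""
    offset = 1
    while offset < len(lanes):
        shifted = shuffle_up(lanes, offset)
        lanes = [a + b for a, b in zip(lanes, shifted)]
        offset *= 2
    return lanes

print(inclusive_scan([1, 2, 3, 4, 5, 6, 7, 8]))  # → [1, 3, 6, 10, 15, 21, 28, 36]
```

On real hardware each shuffle step is a single cross-lane instruction, which is exactly why the subgroup extension matters for this class of kernel.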
Sorting is a prime example of why cross-lane swizzles are important. It is impossible to write a fast GPU sorter without them. All fast CUDA implementations heavily use subgroup / cross-lane operations, and your research shows that OpenCL sorters are starting to do the same. Unfortunately, with PC DirectX 12 compute shaders we are still stuck with slow, inefficient, ancient sorting methods (consoles obviously support cross-lane operations as well).
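To illustrate how deeply a fast sorter depends on cross-lane traffic, here is a hedged sketch: a bitonic sort over an 8-lane "subgroup" where the only communication primitive is a `shfl_xor`-style exchange. This is plain Python emulating lanes as list elements, not shader code, but the exchange pattern is the same one the CUDA implementations use.

```python
# Subgroup-wide bitonic sort built on a shfl_xor-style cross-lane
# exchange. On real hardware every "lane" runs the same code in
# lockstep; here the lanes are list elements. Width 8 ~ a SIMD8 wave.
WIDTH = 8  # illustrative subgroup size

def shuffle_xor(lanes, mask):
    """Every lane reads the value held by lane (lane_id XOR mask)."""
    return [lanes[i ^ mask] for i in range(len(lanes))]

def bitonic_sort(lanes):
    """In-register bitonic sort: log2(W)^2-ish exchange steps, no LDS."""
    n = len(lanes)
    stage = 1
    while stage < n:
        step = stage
        while step >= 1:
            partner = shuffle_xor(lanes, step)
            merged = []
            for i in range(n):
                ascending = (i & (stage * 2)) == 0  # direction of this run
                lower_lane = (i & step) == 0        # i < its partner lane
                if lower_lane == ascending:
                    merged.append(min(lanes[i], partner[i]))
                else:
                    merged.append(max(lanes[i], partner[i]))
            lanes = merged
            step //= 2
        stage *= 2
    return lanes

print(bitonic_sort([5, 1, 7, 2, 8, 0, 6, 3]))  # → [0, 1, 2, 3, 5, 6, 7, 8]
```

Every compare-exchange needs the partner lane's value, so without a cross-lane instruction each step round-trips through shared memory, which is exactly the slowdown being complained about.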
Up to 2 versions - there's no SIMD32 for pixel shaders. If the SIMD16 version spills too many registers, though, the compiler may elect not to provide it to the hardware, in which case everything will run SIMD8.
If I understood everything correctly, SIMD32 should provide better latency hiding, as one instruction from one running wave is enough to saturate one of the SIMD4 execution units for 8 cycles. As long as you have two waves running (not waiting for a memory stall), you should be OK. With smaller waves (SIMD8 or SIMD16), a single instruction from a single wave is split to fewer SIMD4 operations, meaning that the perceived latency hiding is worse.
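The arithmetic behind that claim can be sketched as a quick back-of-envelope model. The 16-cycle instruction latency below is my own assumption for illustration, not a documented GEN figure; the point is only the ratio between issue cycles and latency.

```python
# Back-of-envelope check of the latency-hiding claim: a wider wave
# occupies a SIMD4 execution unit for more cycles per instruction, so
# fewer concurrent waves are needed to cover a fixed instruction
# latency. All numbers are illustrative assumptions, not measured.
EXEC_WIDTH = 4  # physical SIMD4 ALU lanes

def issue_cycles(wave_width):
    """Cycles one instruction of a wave occupies the SIMD4 unit."""
    return wave_width // EXEC_WIDTH

def waves_to_hide(latency_cycles, wave_width):
    """Waves needed so the unit never idles while a result is pending."""
    cycles = issue_cycles(wave_width)
    return -(-latency_cycles // cycles)  # ceiling division

for width in (8, 16, 32):
    print(width, issue_cycles(width), waves_to_hide(16, width))
```

With the assumed 16-cycle latency, a SIMD32 wave (8 issue cycles per instruction) needs only 2 resident waves, while SIMD8 (2 issue cycles) needs 8, matching the "two waves running" observation above.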
I am also wondering why you chose SIMD4 execution units instead of SIMD8, since the narrowest wave width is 8. With two SIMD8 execution units, the EU could achieve an IPC of 1.0 instructions (from a single SIMT thread's POV). With two SIMD4 units the IPC is just 0.5 (as you cannot issue an instruction from the same wave on the same cycle to both SIMD4 units). Of course, the downside of SIMD8 execution units would be that the perceived instruction latency increases by 2x. Right now a new SIMD8 instruction (from a single wave) only starts every other cycle (it takes two cycles to issue it to the SIMD4 execution units). This obviously means that the shader compiler can more freely arrange the instructions, and there are fewer cases where nops need to be added in between instructions (if ILP is not available).
8-wide execution units would give more gains from SIMD16 and SIMD32 shaders (right now SIMD32 seems to be mainly there for compatibility reasons). These modes would hide the instruction latency better than SIMD8 with 8-wide units (SIMD16 would issue one instruction per 2 cycles and SIMD32 one instruction per 4 cycles). 8-wide execution units would obviously increase the need for separate shader code for different SIMD widths (as shader instruction reordering would be more important for narrow SIMDs), but as your hardware and compiler already support this, it shouldn't be a problem at all. I am just wondering why this hasn't been done, since it doubles the ALU performance of an EU (with a small extra transistor cost and a small added complexity to shader compiling). Of course it would increase the register pressure a bit (as wider waves would be more common) and would likely cause bottlenecks elsewhere (as the EU sampler / memory ports wouldn't get any faster).
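The IPC argument above reduces to simple arithmetic; here is a minimal sketch, assuming (as stated) that a single wave can feed only one of the two co-issue ports per cycle.

```python
# Sketch of the per-wave IPC argument: instruction throughput seen by
# one wave on one execution port, as a function of physical unit width
# and wave width. Purely arithmetic; the unit widths are the scenarios
# discussed, not a statement about shipping hardware.
def per_wave_ipc(wave_width, unit_width):
    """Instructions a single wave can retire per cycle on one port."""
    cycles_per_instr = max(1, wave_width // unit_width)
    return 1.0 / cycles_per_instr

# SIMD8 wave on SIMD4 units: a new instruction only every 2 cycles.
print(per_wave_ipc(8, 4))   # → 0.5
# SIMD8 wave on hypothetical SIMD8 units: one instruction per cycle.
print(per_wave_ipc(8, 8))   # → 1.0
# SIMD32 wave on SIMD8 units: one instruction per 4 cycles.
print(per_wave_ipc(32, 8))  # → 0.25
```

The trade-off is visible directly: wider units double single-wave IPC, while wider waves on the same units lengthen the per-instruction issue window, giving the compiler more slack to hide latency.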
I've been fascinated by GEN's region addressing and register indirect modes (Section 3.3.5) for a long time.
Check it out because I think it's unique (is it?) and could enable some very efficient algorithm implementations.
The open question is whether current models/compilers/developers can actually harness these capabilities.
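For readers without the PRM at hand, here is a toy model of what region addressing does. The function is my own emulation; the parameter names follow the `<VertStride; Width, HorzStride>` notation from the docs, and the element values are made up.

```python
# Toy model of GEN register regioning <VertStride; Width, HorzStride>:
# an operand gathers ExecSize elements from the register file in a 2D
# pattern - rows of 'width' elements spaced 'horz' apart, consecutive
# rows starting 'vert' elements apart.
def region(regfile, origin, exec_size, vert, width, horz):
    out = []
    for i in range(exec_size):
        row, col = divmod(i, width)
        out.append(regfile[origin + row * vert + col * horz])
    return out

regs = list(range(100, 132))  # 32 dword elements standing in for GRFs

# <8;8,1>: a plain contiguous SIMD8 read.
print(region(regs, 0, 8, 8, 8, 1))
# <0;1,0>: broadcast one scalar to all 8 lanes (wave-invariant data).
print(region(regs, 5, 8, 0, 1, 0))
# <16;8,2>: strided gather, e.g. picking every other element of a pair.
print(region(regs, 0, 8, 16, 8, 2))
```

The broadcast and strided cases are the interesting ones: patterns that cost an explicit shuffle elsewhere are just operand encodings here, which is why it could enable unusually efficient algorithm implementations.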
Seems very nice. I doubt the current compilers exploit these features much (except for indexing local arrays).
Two extra questions to Andrew:
Did I understand correctly that vertex shaders always run in SIMD4x2 mode? Does this mean that VS branching granularity is 2 vertices?
I didn't find any instruction latency charts in the OpenSource documents, but I found this:
"If none of the two instructions is send, there CANNOT be any destination hazard. This is because instructions within a thread are dispatched in order (single-issued) and the execution pipeline is inorder and has a fixed latency."
Does this mean that all instructions have identical fixed latency? (I assume this is 2 cycles and fully hidden by the SIMD4 execution of the SIMD8+ lanes)
I did browse through some examples in the OpenSource PRM, and I really like how Intel's flexible register files allow nice optimizations, such as storing wave-invariant data (e.g. constant buffer loads) in just a single 32-byte register lane. Nice hardware indeed.