Yup... we have some good prototypes of much nicer execution models that support stuff like producer/consumer and work stealing efficiently, but like I said the industry is kind of set in its ways at the moment, despite us having known about these issues for many years now.
I am always interested in new programming models for GPUs. I was kind of disappointed with the DX11 tessellator design, as it added two new hardcoded shader stages (hull and domain) instead of allowing us to flexibly configure shader stages and the communication between them. It seemed like a big kludge. The scalar unit in AMD GPUs is nice as it allows performing operations at coarser (1/64) granularity, but you could achieve the same thing (in a much more flexible way) if you could spawn multiple kernels (of different shaders and thread counts) with fine-grained synchronization (to the same CU) and communicate between them through LDS (or through the big caches or the register file in Intel's design).
Yeah, a somewhat related advantage is that Gen handles dynamic branching extremely efficiently in my testing. For instance, it's basically always best to branch around texture or memory operations whose results are going to be discarded, whereas doing that on other GPUs can cause you more harm than good (in terms of the shader compiler's ability to move stuff around, register pressure and so on). A common case is texture compositing, where on Gen you should always branch around each sampling request based on weight == 0, but few people do.
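For illustration, here's a minimal HLSL sketch of that compositing pattern (texture/sampler names and the two-layer setup are made up); the derivatives are computed outside the branches so the samples stay valid under divergent control flow:

    Texture2D tex0 : register(t0);
    Texture2D tex1 : register(t1);
    SamplerState samp : register(s0);

    float4 CompositeTwoLayers(float2 uv, float2 weights)
    {
        // Derivatives computed up front so SampleGrad is legal inside divergent branches.
        float2 dx = ddx(uv), dy = ddy(uv);

        float4 result = 0;
        // Skip the whole texture fetch when the blend weight is zero.
        [branch]
        if (weights.x != 0)
            result += weights.x * tex0.SampleGrad(samp, uv, dx, dy);
        [branch]
        if (weights.y != 0)
            result += weights.y * tex1.SampleGrad(samp, uv, dx, dy);
        return result;
    }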
Good to know. I would assume our vertex skinning would at least benefit from this (we always have 4 bone indices + 4 weights, but some weights might be zero).
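As a rough sketch of what I mean (buffer and function names are made up), the per-bone branch in the vertex shader would look something like this; whether skipping the matrix fetches actually pays off presumably depends on the hardware, as discussed above:

    StructuredBuffer<float4x4> boneMatrices : register(t0);

    float3 SkinPosition(float3 position, uint4 boneIndices, float4 boneWeights)
    {
        float3 skinned = 0;
        [unroll]
        for (uint i = 0; i < 4; ++i)
        {
            // Skip the matrix load and multiply entirely for zero-weight bones.
            [branch]
            if (boneWeights[i] != 0)
                skinned += boneWeights[i] * mul(boneMatrices[boneIndices[i]], float4(position, 1)).xyz;
        }
        return skinned;
    }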
You shoulda been around ~3-4 years ago when we were going through all this stuff and pushing it with folks. At the time everyone was still stuck on last generation consoles, hadn't looked at compute much and thus didn't really see the need for execution model changes, swizzling, etc., despite us showing cases of ~2x or greater improvements to simple screen-space operations even with just a 2x2 pixel shader swizzle. Alas, now it's kind of too late for this round of APIs.
I have been actively lobbying for our needs behind the scenes as well. I am really happy that four of the top priority features for us got included in DX12: multidraw, async compute, GS bypass and typed UAV loads. I just hope that soon every vendor has typed UAV load capable hardware, allowing us to drop all the hacks and unnecessary data copies from the code base.
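To show what I mean by that feature (resource and function names here are made up), a typed UAV load is simply reading a multi-component typed UAV in place; without the optional capability, only single-component 32-bit formats are guaranteed to be readable, which is what forces the extra copies and packing hacks:

    RWTexture2D<float4> accumTex : register(u0);

    [numthreads(8, 8, 1)]
    void CSMain(uint3 id : SV_DispatchThreadID)
    {
        // Reading a float4 typed UAV like this requires typed UAV load support;
        // writes to typed UAVs are always allowed.
        float4 value = accumTex[id.xy];
        accumTex[id.xy] = value * 0.5f;
    }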
The full rewrite of the DirectX 12 API (CPU side) was awesome news. Now we don't need fully separate resource management code for PC and consoles, and we can optimize the CPU side better, as we don't need to guess what the driver might be doing. This was the best API upgrade in DirectX history (I have been along for the ride since DirectX 5.0).
However, the GPU side API (= HLSL) received almost zero changes (except for the binding changes related to the CPU side API). DirectX 12 focused almost solely on rasterizer/graphics improvements (ROV, conservative raster, programmable stencil output, GS bypass, multidraw). Our new engine is heavily compute shader based, and there were no new features in the compute shader language. With this many awesome changes to the CPU API and the rasterizer/graphics features, I would have expected at least lane swizzles and GPU-side kernel enqueue for compute shaders. It is also a bummer that we didn't get ordered atomics for compute shaders, as we got ROV for pixel shaders (basically these are the same feature).
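For reference, this is roughly what the pixel shader side of that feature looks like (a hedged sketch: the resource names and the custom blend are made up, and it assumes typed UAV load support for the float4 format). The ROV serializes the read-modify-write between overlapping invocations of the same pixel, which is exactly the kind of ordering guarantee compute shaders didn't get:

    RasterizerOrderedTexture2D<float4> blendTarget : register(u0);

    void PSMain(float4 pos : SV_Position, float4 color : COLOR0)
    {
        uint2 pixel = uint2(pos.xy);
        // The ROV guarantees that overlapping invocations for this pixel execute
        // this read-modify-write section in primitive order (programmable blending).
        float4 dst = blendTarget[pixel];
        blendTarget[pixel] = color + dst * (1.0f - color.a);
    }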
I hope that the next update to DirectX focuses mainly on compute shader improvements. New CUDA versions and OpenCL 2.0/2.1 added many critical features that are still missing from compute shaders, and the new consoles allowed lower level hardware access, letting developers do interesting compute stuff with the GCN GPU that is not possible with PC DirectCompute. A quick stopgap solution (for DirectX 12.3 compute shaders) would be to implement some of the missing CUDA and OpenCL 2.0/2.1 features and adapt some of the console GPU features (implement cross-platform versions).
A completely rewritten shader language should be the end goal, as the current one was designed for pixel and vertex shaders (VS/PS need no communication between threads). The SPIR-V low-level intermediate language makes it practical for third-party developers to write their own shading languages on top of Vulkan and OpenCL 2.1.
If Intel implements a good new shading language that outputs SPIR-V, I would be highly interested in it. However, I don't know whether SPIR-V is flexible enough to implement completely new computational models.
I'm personally not going to be at SIGGRAPH this year (presenting at IDF the following week instead on Skylake graphics stuff), but there will be a few folks from my team there, and one of them was heavily involved in the aforementioned prototypes. I'll definitely fire you an e-mail to set up a meeting with them.
I have been looking at new desktop processors with an iGPU, and I recently noticed that the new Broadwell desktop flagship (i7-5775C) finally has EDRAM. I am wondering what the difference is between the i7 and the equivalent Xeon GPUs (Iris Pro 6200 vs Iris Pro P6300).
http://ark.intel.com/products/88046/Intel-Xeon-Processor-E3-1285-v4-6M-Cache-3_50-GHz
http://ark.intel.com/products/88040/Intel-Core-i7-5775C-Processor-6M-Cache-up-to-3_70-GHz
The Xeon has a 30W higher TDP. Its CPU has a 200 MHz higher base clock, but the GPU clocks are identical. I didn't find any GPU benchmarks comparing the Iris Pro 6200 vs the Iris Pro P6300. Andrew, do you know whether the higher TDP of the Xeon allows the GPU to run faster? And if so, is the difference noticeable?
I suppose I should wait until next week to see the official launch of the rumoured GT4e 72 EU Skylake with EDRAM + the DX 12.0 feature level. That 95W TDP suggests that the GPU can run at max clocks for long periods, making it almost a perfect rendering development CPU. The only thing missing is four additional CPU cores to speed up compile times (the new architecture and the EDRAM alone boost compile times only slightly, meaning that the older 8-core Haswells should still finish faster).
edit - also, is Gen8 graphics Haswell or Broadwell? And what about Gen7.5?
Gen8 = Broadwell. Gen7.5 = Haswell. Both are DirectX 12 compatible.