Yup... we have some good prototypes of much nicer execution models that support stuff like producer/consumer and work stealing efficiently, but like I said the industry is kind of set in its ways at the moment, despite us having known about these issues for many years now.
I am always interested in new programming models for GPUs. I was kind of disappointed with the DX11 tessellator design, as it added two new hardcoded shader stages (hull and domain) instead of allowing us to flexibly configure shader stages and the communication between them. It seemed like a big kludge. The scalar unit in AMD GPUs is nice as it allows performing operations at coarser (1/64) granularity, but you could achieve the same thing (in a much more flexible way) if you could spawn multiple kernels (of different shaders and thread counts) with fine-grained synchronization (to the same CU) and communicate between them through LDS (or through the big caches or the register file in Intel's design).
Yeah, a somewhat related advantage is that Gen handles dynamic branching extremely efficiently in my testing. For instance, it's basically always best to branch around texture or memory operations whose results are going to be discarded, whereas doing that on other GPUs can cause you more harm than good (in terms of the shader compiler's ability to move stuff around, register pressure and so on). A common case is texture compositing, where on Gen you should always branch around each sampling request based on weight == 0, but few people do.
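For illustration, here's a minimal HLSL sketch of that compositing pattern (texture/sampler names and the two-layer setup are made up); the derivatives are computed outside the branches so the samples stay valid under divergent control flow:

    Texture2D tex0 : register(t0);
    Texture2D tex1 : register(t1);
    SamplerState samp : register(s0);

    float4 CompositeTwoLayers(float2 uv, float2 weights)
    {
        // Derivatives computed up front so SampleGrad is legal inside divergent branches.
        float2 dx = ddx(uv), dy = ddy(uv);

        float4 result = 0;
        // Skip the whole texture fetch when the blend weight is zero.
        [branch]
        if (weights.x != 0)
            result += weights.x * tex0.SampleGrad(samp, uv, dx, dy);
        [branch]
        if (weights.y != 0)
            result += weights.y * tex1.SampleGrad(samp, uv, dx, dy);
        return result;
    }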
Good to know. I would assume our vertex skinning would at least benefit from this (we always have 4 bone indices + 4 weights, but some weights might be zero).
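As a rough sketch of what I mean (buffer and function names are made up), the per-bone branch in the vertex shader would look something like this; whether skipping the matrix fetches actually pays off presumably depends on the hardware, as discussed above:

    StructuredBuffer<float4x4> boneMatrices : register(t0);

    float3 SkinPosition(float3 position, uint4 boneIndices, float4 boneWeights)
    {
        float3 skinned = 0;
        [unroll]
        for (uint i = 0; i < 4; ++i)
        {
            // Skip the matrix load and multiply entirely for zero-weight bones.
            [branch]
            if (boneWeights[i] != 0)
                skinned += boneWeights[i] * mul(boneMatrices[boneIndices[i]], float4(position, 1)).xyz;
        }
        return skinned;
    }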
You shoulda been around ~3-4 years ago when we were going through all this stuff and pushing it with folks. At the time everyone was still stuck on last generation consoles, hadn't looked at compute much and thus didn't really see the need for execution model changes, swizzling, etc., despite us showing cases of ~2x or greater improvements to simple screen-space operations even with just a 2x2 pixel shader swizzle. Alas, now it's kind of too late for this round of APIs.
I have been actively lobbying for our needs behind the scenes as well. I am really happy that four of the top priority features for us got included in DX12: multidraw, async compute, GS bypass and typed UAV loads. I just hope that soon every vendor has typed UAV load capable hardware, allowing us to drop all the hacks and unnecessary data copies from the code base.
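To show what I mean by that feature (resource and function names here are made up), a typed UAV load is simply reading a multi-component typed UAV in place; without the optional capability, only single-component 32-bit formats are guaranteed to be readable, which is what forces the extra copies and packing hacks:

    RWTexture2D<float4> accumTex : register(u0);

    [numthreads(8, 8, 1)]
    void CSMain(uint3 id : SV_DispatchThreadID)
    {
        // Reading a float4 typed UAV like this requires typed UAV load support;
        // writes to typed UAVs are always allowed.
        float4 value = accumTex[id.xy];
        accumTex[id.xy] = value * 0.5f;
    }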
The full rewrite of the DirectX 12 API (CPU side) was awesome news. Now we don't need fully separate resource management code for PC and consoles, and we can optimize the CPU side better, as we don't need to guess what the driver might be doing. This was the best API upgrade in DirectX history (I have been along for the ride since DirectX 5.0).
However, the GPU side API (= HLSL) received almost zero changes (except for the binding changes related to the CPU side API). DirectX 12 focused almost solely on rasterizer/graphics improvements (ROV, conservative raster, programmable stencil output, GS bypass, multidraw). Our new engine is heavily compute shader based, and there were no new features in the compute shader language. With this many awesome changes to the CPU API and the rasterizer/graphics features, I would have expected at least lane swizzles and GPU-side kernel enqueue for compute shaders. It is also a bummer that we didn't get ordered atomics for compute shaders, as we got ROV for pixel shaders (basically these are the same feature).
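For reference, this is roughly what the pixel shader side of that feature looks like (a hedged sketch: the resource names and the custom blend are made up, and it assumes typed UAV load support for the float4 format). The ROV serializes the read-modify-write between overlapping invocations of the same pixel, which is exactly the kind of ordering guarantee compute shaders didn't get:

    RasterizerOrderedTexture2D<float4> blendTarget : register(u0);

    void PSMain(float4 pos : SV_Position, float4 color : COLOR0)
    {
        uint2 pixel = uint2(pos.xy);
        // The ROV guarantees that overlapping invocations for this pixel execute
        // this read-modify-write section in primitive order (programmable blending).
        float4 dst = blendTarget[pixel];
        blendTarget[pixel] = color + dst * (1.0f - color.a);
    }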
I hope that the next update to DirectX focuses mainly on compute shader improvements. New CUDA versions and OpenCL 2.0/2.1 added many critical features that are still missing from compute shaders, and the new consoles allowed lower level hardware access, letting developers do interesting compute stuff with the GCN GPU that is not possible with PC DirectCompute. A quick stopgap solution (for DirectX 12.3 compute shaders) would be to implement some of the missing CUDA and OpenCL 2.0/2.1 features and adapt some of the console GPU features (implement cross-platform versions).
A completely rewritten shader language should be the end goal, as the current one was designed for pixel and vertex shaders (VS/PS need no communication between threads). The SPIR-V low-level intermediate language makes it practical for third-party developers to write their own shading languages on top of Vulkan and OpenCL 2.1.
If Intel implements a good new shading language that outputs SPIR-V, I would be highly interested in it. However, I don't know whether SPIR-V is flexible enough to implement completely new computational models.
I'm personally not going to be at SIGGRAPH this year (presenting at IDF the following week instead on Skylake graphics stuff), but there will be a few folks from my team there, and one of them was heavily involved in the aforementioned prototypes. I'll definitely fire you an e-mail to set up a meeting with them.
I have been looking at new desktop processors with an iGPU, and I recently noticed that the new Broadwell desktop flagship (i7-5775C) finally has EDRAM. I am wondering what the difference is between the i7 and the equivalent Xeon GPUs (Iris Pro 6200 vs Iris Pro P6300).
http://ark.intel.com/products/88046/Intel-Xeon-Processor-E3-1285-v4-6M-Cache-3_50-GHz
http://ark.intel.com/products/88040/Intel-Core-i7-5775C-Processor-6M-Cache-up-to-3_70-GHz
The Xeon has a 30W higher TDP. Its CPU has a 200 MHz higher base clock, but the GPU clocks are identical. I didn't find any GPU benchmarks comparing the Iris Pro 6200 vs the Iris Pro P6300. Andrew, do you know whether the higher TDP of the Xeon allows the GPU to run faster? And if so, is the difference noticeable?
I suppose I should wait until next week to see the official launch of the rumoured GT4e 72 EU Skylake with EDRAM + the DX 12.0 feature level. That 95W TDP suggests that the GPU can run at max clocks for long periods, making it almost a perfect rendering development CPU. The only thing missing is four additional CPU cores to speed up compile times (the new architecture and the EDRAM alone boost compile times only slightly, meaning that the older 8-core Haswells should still finish faster).
edit - also, is Gen8 graphics Haswell or Broadwell? And what about Gen7.5?
Gen8 = Broadwell. Gen7.5 = Haswell. Both are DirectX 12 compatible.