Direct3D feature levels discussion

trinibwoy · Jun 13, 2015

sebbbi said:
Nothing really "happens" on GCN hardware. Simplified: When you "bind" stuff on CPU side, the driver puts a "pointer" (resource descriptor) to an array. Later when a wave starts running in a CU, it will issue (scalar) instructions to load this array from the memory to scalar registers. Buffer load / texture sample instruction takes a resource descriptor (scalar register(s)) and 64 offsets/UVs as input and returns 64 results (filtered texels or loaded values from a buffer). Texture sampling has higher latency than buffer loads, as the texture filtering hardware is further away from the execution units (buffer loads have low latency as those get the data directly from the CU L1 cache).

Thanks but I didn't quite follow that. Are you describing what happens currently or in a bindless scenario? For bindless I understand the shader gets a descriptor list and textures are loaded on the fly.

I was asking about the current situation with limits on maximum textures bindable in a shader. I thought those limits were initially determined by the number of available physical texture mapping units on chip. I'm trying to understand why the number of physical units is relevant.

Alessio1989 · Jun 13, 2015

pharma said:
June 9, 2015

https://scalibq.wordpress.com/2015/06/09/no-dx12_1-for-upcoming-radeon-3xx-series/

I still do not see what is the big deal. Some hardware capabilities features are not sufficiently spread enough to be mandatory features for games in the following years, others have a share of 0%. If we look at the steam surveys the most popular DX12 capable GPUS are based on NV Fermi architecture. No-one will cut the 90-95% (or even more!) of DX12 capable system just because they want mandatory ROVs, Conservative Rasterization (tier1) and 3D tiled resources.

People blamed Microsoft and IHVs with because before D3D12 every major of Direct3D required strong and straight mandatory hardware support. Now people blame Microsoft and IHVs because D3D12 support a really wide range of hardware (probably the widest of the entire DirectX Graphics history!) because the API is not able to create by magic new hardware capabilities on GPUs architectures that are more then 4 years old.

MDolenc · Jun 13, 2015

Alessio1989 said:
People blamed Microsoft and IHVs with because before D3D12 every major of Direct3D required strong and straight mandatory hardware support.

That's not true. You could create for example D3D8 device on say GeForce 256. The only DX that broke this was DirectX 10.

sebbbi · Jun 13, 2015

trinibwoy said:
Thanks but I didn't quite follow that. Are you describing what happens currently or in a bindless scenario? For bindless I understand the shader gets a descriptor list and textures are loaded on the fly.

I was asking about the current situation with limits on maximum textures bindable in a shader. I thought those limits were initially determined by the number of available physical texture mapping units on chip. I'm trying to understand why the number of physical units is relevant.

(I updated my original post, added some info to make it more clear).

GCN doesn't have global binding state. All resource descriptors are loaded in general purpose scalar registers.

GCN execution simplified:.Each wave (64 threads) reserves predetermined amount of vector registers and scalar registers (based on shader complexity) when it starts executing and frees them when all those 64 threads are finished. Each CU has 256 KB room for vector registers and 8 KB room for scalar registers. The number of waves that can run concurrently on a single CU is determined by the vector register and scalar register needs of a shader (compile time determined numbers). Multiple different shaders can also run on the same CU. All shaders share the same register pools.

GCN can sample infinitely many textures by the same shader. You can write a loop in a shader that loads a new resource descriptor and samples one texel from that. The compiler will reuse the same scalar register for each loop iteration. To hide latency you likely want to unroll the loop N times and load N resource descriptors at the same time and sample them (where N is a small number, likely less than 10). In this case you need more scalar registers. This is not a problem. I have never seem a shader that was limited by the scalar register count.

Recommended reading:
https://www.amd.com/Documents/GCN_Architecture_whitepaper.pdf

3dcgi · Jun 13, 2015

trinibwoy said:
Thanks but I didn't quite follow that. Are you describing what happens currently or in a bindless scenario? For bindless I understand the shader gets a descriptor list and textures are loaded on the fly.

I was asking about the current situation with limits on maximum textures bindable in a shader. I thought those limits were initially determined by the number of available physical texture mapping units on chip. I'm trying to understand why the number of physical units is relevant.

The limit of non-bindless architectures wasn't necessarily tied to the number of physical texture mapping units. There were a limited number of texture resources though and these resources described the texture or vertex buffer data formats, etc.

Alessio1989 · Jun 13, 2015

MDolenc said:
That's not true. You could create for example D3D8 device on say GeForce 256. The only DX that broke this was DirectX 10.

Yes, but you were still limited to T&L fixed functions, no Shader Model 1.x.

trinibwoy · Jun 13, 2015

sebbbi said:
GCN can sample infinitely many textures by the same shader.

Understood. I was referring to older architectures/APIs with texture limits. Trying to understand what determined those limits.

3dcgi said:
The limit of non-bindless architectures wasn't necessarily tied to the number of physical texture mapping units. There were a limited number of texture resources though and these resources described the texture or vertex buffer data formats, etc.

Ok, can't find the link now but I recall reading that OpenGL limits were tied to the number of physical samplers.

sebbbi · Jun 13, 2015

Andrew Lauritzen said:
Yes and I'm surprised they didn't go this route in Mantle, but I would actually disagree with the notion that this represents the clear trend in hardware design. If you want to pass the full descriptor data to samplers then you need very wide SIMD to amortize the cost. GCN obviously has wide SIMD, but that's a fundamental architecture tradeoff that has as many negatives as positives, and I believe GCN is the only architecture that works that way.

Yes, the GCN is unique. The scalar unit + using scalar registers as descriptors differs from other GPUs. The scalar unit also handles branching and control flow and wave-invariant integer math / loads, offloading work from the wide vector SIMDs. With the new rumors about Fiji's improved scalar unit (memory stores, full instruction set, one scalar unit per CU), the GCN architecture seems to be moving even closer to throughput oriented in-order CPUs with wide SIMD.

Knights Landing has a simple in-order scalar pipeline that handles branching and control flow (and uniform integer math and uniform loads/stores can be offloaded to it), 512 bit AVX (16 wide for 32 bit float), 4 way hyperthreading (GCN is 10 way). Similarities are striking.

Andrew Lauritzen said:
Fundamentally the data-paths between the execution units and samplers do have to handle a lot of bandwidth already and "compression" of the data passed between these two is highly desirable. In this case, the compression takes the form of passing an offset/index instead of the full descriptor data (usually ~128-256 bit) or even a 64-bit pointer. While it's convenient from a programming standpoint to have "just data" and "just pointers", I don't necessarily buy the argument that in this particular case it's functionally important enough to scatter descriptors around memory that it's worth constraining hardware designs or hampering performance for.

256 bit (32 bytes) resource descriptor is significant amount of data. However, you also need to send 64 UVs, 64*2*sizeof(float) = 512 bytes. The resource descriptor is only 6.25% of the data. Some sampling instructions also need mip level or gradient (further reducing the resource descriptor percentage). It is not that bad design call. It gives AMD lots of flexibility in the future.

Andrew Lauritzen · Jun 13, 2015

sebbbi said:
256 bit (32 bytes) resource descriptor is significant amount of data. However, you also need to send 64 UVs, 64*2*sizeof(float) = 512 bytes. The resource descriptor is only 6.25% of the data. Some sampling instructions also need mip level or gradient (further reducing the resource descriptor percentage). It is not that bad design call. It gives AMD lots of flexibility in the future.

Yep that's what I mean by it's a reasonable decision if you have wide SIMD, but that has a range of other consequences (most notably in the case of GCN, heavy register/occupancy pressure). There are two real design points in this space... narrower SIMD + indexed descriptors (~20 bits seems to be where most folks are converging), or wide SIMD with descriptor data passed itself. It's hard to argue that one is globally better or the clear trend. I have no complaints about what GCN does here but neither would I argue that everyone else should do the same thing.

(As an aside regarding the "what limits previous hardware descriptor table sizes", it's very often either the number of index bits into the descriptor heap, or in some architectures there is a fixed amount of hardware storage for descriptors in general and they are updated via GPU commands. There are hybrids and variations of course but that's the general idea.)

I honestly don't think it's a big deal in practice either way - Tier 2/3 binding in DX12 is pretty much all you need. There are very few cases where you need more than ~1 million descriptors, especially now that you only need descriptors for textures basically (not buffers), and it's not really much overhead to collect them in a contiguous array, even if you just end up ring buffering a lot of it. i.e. it's not a lot of data for the CPU to reorganize per frame (or more realistically, swap a few in and out), but it is a lot of data for narrower SIMD GPUs to be passing around for every pixel (massive amplification of course).

sebbbi · Jun 14, 2015

Andrew Lauritzen said:
Yep that's what I mean by it's a reasonable decision if you have wide SIMD, but that has a range of other consequences (most notably in the case of GCN, heavy register/occupancy pressure). There are two real design points in this space... narrower SIMD + indexed descriptors (~20 bits seems to be where most folks are converging), or wide SIMD with descriptor data passed itself. It's hard to argue that one is globally better or the clear trend.

Wide SIMD has some benefits as well. Wide SIMD has bigger benefit from a scalar unit. The scalar unit can be used to offload register pressure of wave invariant data and ALU of wave invariant math. Wave invariant data stored in scalar register uses 64x less register storage space. This more than compensates the slightly increased register pressure of the wide SIMD. Occupancy can be improved (even with the current gen GCN hardware) by offloading registers to the scalar unit. Unfortunately the AMD shader compiler is still not very good at doing this automatically, meaning that the programmer must do it manually for full benefit (and unfortunately writing manual scalar loads/math is not possible on PC). Good automatic scalarization is not an impossible problem to solve. There are several good papers about this topic. Of course, if the shader language helped with this (for example OpenGL 2.0 subgroup operations), it would be much easier for the compiler to do a perfect job.

On GPUs that do not have scalar registers (scalar units), the wave (SIMD lane) invariant stuff needs to be loaded and kept in vector registers. This means that the data is replicated N times (where N is the SIMD width). Math also needs to be replicated. Power consumption is increased and occupancy is lowered (leading to worse latency hiding). Keeping resource descriptors in vector registers would be a big hit if the resource descriptor is 256 bits (it would take 8 registers per lane). Even if you can store a resource index to 20 bits, it still needs a vector register per lane (assuming you have no other place to store it). With scalar registers you only need 8 scalar registers (to store a 256 bit descriptor) for 64 lanes.

Traffic to texture unit would of course be reduced if the sampling request used only 20 bits to describe the resource instead of 256 bits. One million descriptors should be enough for most purposes. But unfortunately using an index to memory would mean that the shader cannot create a descriptor out of thin air (programmatically). Thus sampling always needs at least one indirection. This is not a problem with current games, but there is a clear trend towards larger object counts with higher variety of textures, meaning that the resource descriptor loads will miss the cache more in the future.

Silent_Buddha · Jun 14, 2015

pharma said:
June 9, 2015

https://scalibq.wordpress.com/2015/06/09/no-dx12_1-for-upcoming-radeon-3xx-series/

Some of the responses of AMD’s Robert Hallock also point to downplaying the new DX12_1 features, and just pretending that supporting DX12 is supporting DX12, regardless of features.
Clearly that is not the case. But if AMD needs to spread that message, with a new chip being launched in just a few days, I think we know enough.

Does this mean that Nvidia didn't support DX 11 because they couldn't do DX 11.1 or feature level 11_1? Does the fact that Nvidia downplayed the benefits of 11.1/11_1 mean that they didn't support DX 11? That guy just sounds like an idiot/fanboy when he says that even if it may not have been his intention.

Of course if he said the same things about Nvidia back then, well, then he's just an idiot.

Regards,
SB

Alexko · Jun 14, 2015

Silent_Buddha said:
Does this mean that Nvidia didn't support DX 11 because they couldn't do DX 11.1 or feature level 11_1? Does the fact that Nvidia downplayed the benefits of 11.1/11_1 mean that they didn't support DX 11? That guy just sounds like an idiot/fanboy when he says that even if it may not have been his intention.

Of course if he said the same things about Nvidia back then, well, then he's just an idiot.

Regards,
SB

Let's just say that he was banned from these forums a while ago.

Alessio1989 · Jun 14, 2015

Silent_Buddha said:
Does this mean that Nvidia didn't support DX 11 because they couldn't do DX 11.1 or feature level 11_1? Does the fact that Nvidia downplayed the benefits of 11.1/11_1 mean that they didn't support DX 11? That guy just sounds like an idiot/fanboy when he says that even if it may not have been his intention.

Of course if he said the same things about Nvidia back then, well, then he's just an idiot.

Regards,
SB

NVIDIA did not support FL 11.1 for Fermi and Kepler GPUs for what they claimed "non gaming features". One of the "non-gaming features" (lmao) they do not support in FL 11.1 of DX11 family API is UAV support in vertex/geometry/domain/hull shader stages and 64 UAV slots. The funny is that D3D12 requires UAV in all shader stages even on tier 1 of resource binding (ie Fermi) and 64 UAV slot on tier 2 hardware (ie Kepler and Maxell 1.0), this mean NV must support this features in some way. Of course now they are "gaming features"!
Why Microsoft and NVIDIA did not expose such features via caps-bits like all other features of FL 11.1 in the D3D11 API family? I do not know, probably it has something to do with the internal structure and operating principles of D3D 11.* APIs.

pharma · Jun 14, 2015

Alessio1989 said:
NVIDIA did not support FL 11.1 for Fermi and Kepler GPUs for what they claimed "non gaming features".
Why Microsoft and NVIDIA did not expose such features via caps-bits like all other features of FL 11.1 in the D3D11 API family?

I believe Kepler does support 11_1, and most likely Haswell and Broadwell.

Alessio1989 · Jun 14, 2015

pharma said:
I believe Kepler does support 11_1, and most likely Haswell and Broadwell.

And why do not support it from beginning? What are they waiting for? :runaway:

Haswell and broadwell should be FL 11.1 as far I know

pharma · Jun 14, 2015

Alessio1989 said:
And why do not support it from beginning? What are they waiting for?
Haswell and Broadwell should be FL 11.1 as far I know

I misread your original comment. Kepler supports 11_0 and Haswell/Broadwell supports 11_1.

sebbbi · Jun 14, 2015

Fermi and Kepler do not support FL 11_1. There is no such thing as "almost supporting". If you ask DirectX 11 API for 11_1 feature level device it gives you an error on Fermi and Kepler. You must create a FL 11_0 device to make you program work on Fermi and Kepler.

All this talk about DX 11.1 "non-gaming features" and DX 12.1 not being important is pure PR damage control. Obviously hardware manufacturer's marketing departments are going to downplay the importance of features their GPU does not support. Both AMD and NVIDIA are doing the same thing. This started way back with DX8 SM 1.4 (ATI was leading back then). NVIDIA downplayed it. NVIDIA was leading DX9 feature race with SM2.X (Geforce FX 5000), ATI downplayed it. AMD reclaimed the lead with DX10.1 and NVIDIA downplayed it. Surprisingly DX11.1 and (11.2 tiled resources tier 2) continued AMDs feature lead (two times in a row). NVIDIA downplayed these "non-gaming" features. Now NVIDIA will reclaim the feature crown after two DirectX generations with 12.1. And guess what. AMD PR has already said DX 12.1 is not important

This time however there is no clear lead, since AMD is the only one that supports resource binding tier 3. AMDs asyncronous compute implementation is also very good, as the fully bindless nature of their GPU means that the CUs can do very fine grained simultaneous execution of multiple shaders. Don't get fooled by the maximum amount of compute queues (shown by some review sites). Big numbers don't tell anything about the performance. Usually running two tasks simultaneously gives the best performance. Running significantly more just trashes the data and instruction caches.

Rikimaru · Jun 14, 2015

sebbbi said:
NVIDIA was leading DX9 feature race with SM2.X (Geforce FX 5000), ATI downplayed it.

nVidia was leading in DX9 feature race with FX 5000? Too bad you can not actually use it because performance was atrocious

Rikimaru · Jun 14, 2015

Geforce FX 5000 situation got me me thinking. Are maxwell 2 12_1 features even perfomant enough to use in game engines effectively?

pharma · Jun 14, 2015

sebbbi said:
This time however there is no clear lead, since AMD is the only one that supports resource binding tier 3. AMDs asyncronous compute implementation is also very good, as the fully bindless nature of their GPU means that the CUs can do very fine grained simultaneous execution of multiple shaders. Don't get fooled by the maximum amount of compute queues (shown by some review sites). Big numbers don't tell anything about the performance. Usually running two tasks simultaneously gives the best performance. Running significantly more just trashes the data and instruction caches.

As you failed to state in your reply, Nvidia is also currently the only one supporting conservative rasterization and ROV’s as features. Many would find this much more relevant than the differences between resource binding(tier 2 vs tier 3) as a feature, but I guess that depends on which "camp" you sit and who will be providing programming "support" for your game development efforts in the near future.

Direct3D feature levels discussion

trinibwoy

Meh

Alessio1989

MDolenc

sebbbi

3dcgi

Alessio1989

trinibwoy

Meh

sebbbi

Andrew Lauritzen

Moderator

sebbbi

Silent_Buddha

Alexko

Alessio1989

pharma

Alessio1989

pharma

sebbbi

Rikimaru

Rikimaru

pharma