Why is Dynamic Shader Linkage limited to DX11 class hardware?

Novum

Can somebody shed some light on why this is the case? I used AMD's ShaderAnalyzer and all it apparently does is produce a jump table in the GPU assembly.

Or is this jump table op the exact requirement that D3D10-GPUs lack? Still, even without it, it could be compiled to static jumps.

I'm asking because I first thought it was actually some special hardware in the instruction sequencer. Does anybody know what NVIDIA and Intel are doing?
 
I'm not clear on how everyone implements this, but the intention is not necessarily to have the hardware do anything special; it's to sort of "move permutation management" into the driver, which *might* be able to process it more efficiently than the application swapping shaders all the time.

Being tied to DX11 hardware is more because there is a policy not to "back-port" features, which for the most part has been followed (with a few exceptions).

I'll also note, just as a FYI, that practically no one uses this feature. Thus I would not be surprised if it isn't particularly optimized.
 
Those are not jump tables but external linker references. The shader binary object does not contain machine code but IL code. When it is passed to the driver, it is translated from IL to machine code and all references are inlined.
The feature itself isn't limited to any hardware level; it's just that Shader Model 5.0 and the related API functions were not back-ported to DX10 and Shader Model 4.0.
 
I'll also note, just as a FYI, that practically no one uses this feature. Thus I would not be surprised if it isn't particularly optimized.

Who are you calling no one? I used it for the B3D suite. :p It was... interesting.
 
I'll also note, just as a FYI, that practically no one uses this feature. Thus I would not be surprised if it isn't particularly optimized.
Is there a good reason for that? (Besides it not being useful, because you still have to support at least DX10 class hardware)

Those are not jump tables but external linker references.
I'm not convinced. (6970 assembly, sadly there is no GCN support yet):
Code:
; --------  Disassembly --------------------
00 ALU: ADDR(32) CNT(1) 
      0  z: MOV         R1.z,  0.0f      
01 TEX: ADDR(80) CNT(2) 
      1  VFETCH R1.x___, R1.z, fc149  
         FETCH_TYPE(NO_INDEX_OFFSET) 
      2  VFETCH R2.xy__, R1.z, fc148  
         FETCH_TYPE(NO_INDEX_OFFSET) 
02 ALU: ADDR(33) CNT(1) 
      3  x: MOVA_INT    CF_IDX0,  R1.x      
03 JUMPTABLE: ADDR(8)  R6PLUS_CF_JUMPTABLE_SEL_INDEX_0 
04 CJUMP  CND(FALSE) CF_CONST(0) ADDR(8) 
05 CJUMP  CND(FALSE) CF_CONST(0) ADDR(12) 
06 CJUMP  CND(FALSE) CF_CONST(0) ADDR(16) 
07 CJUMP  CND(FALSE) CF_CONST(0) ADDR(21) 
08 ALU: ADDR(34) CNT(1) 
...
It's possible, though, that this gets optimized further by the driver in a real-world scenario; I agree with that. I'll ask AMD the next time I get the chance.
 
Is there a good reason for that? (Besides it not being useful, because you still have to support at least DX10 class hardware)

Even if there were ubiquitous support for it, I don't think I'd use it. Like a lot of other studios we already have a full-featured shader permutation system, and it doesn't require me to use some wacky class/interface/constant buffer setup.
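For readers unfamiliar with what such a permutation system looks like at its core, here is a minimal sketch: a bitmask of enabled features keys a cache of compiled shader variants. All names and the fake "compile" step are illustrative, not from any real engine.

```cpp
#include <cstdint>
#include <map>
#include <string>

// Hypothetical feature flags that select a shader variant.
enum ShaderFeature : uint32_t {
    FEAT_NORMAL_MAP = 1u << 0,
    FEAT_SHADOWS    = 1u << 1,
    FEAT_SKINNING   = 1u << 2,
};

// A permutation system at its core: the feature mask keys a cache of
// compiled shader blobs (stand-in strings here).
struct PermutationCache {
    std::map<uint32_t, std::string> compiled;

    const std::string& get(uint32_t mask) {
        auto it = compiled.find(mask);
        if (it == compiled.end()) {
            // A real engine would invoke the HLSL compiler with one
            // #define per set bit; we fake the result.
            std::string blob = "shader";
            if (mask & FEAT_NORMAL_MAP) blob += "+normalmap";
            if (mask & FEAT_SHADOWS)    blob += "+shadows";
            if (mask & FEAT_SKINNING)   blob += "+skinning";
            it = compiled.emplace(mask, blob).first;
        }
        return it->second;
    }
};
```

Requesting `get(FEAT_SHADOWS | FEAT_SKINNING)` compiles that variant once and returns the cached blob on every later request with the same mask.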
 
You don't really need that: declare your functions first, append their bodies at the end, done.
It's sad to see so many programmers unable to do something that simple, relying instead on shitty ifdefs everywhere...
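A minimal sketch of what this could mean, assuming a source generator on the CPU side: the main shader calls through a fixed forward declaration, and the generator appends whichever implementation body was selected. The function names and shader text are purely illustrative.

```cpp
#include <string>

// Assemble shader source without #ifdefs: declare first, append the
// chosen body at the end. "SampleShadow" and the bodies are made up.
std::string buildShaderSource(bool usePCF) {
    std::string src;
    src += "float SampleShadow(float2 uv);\n";          // forward declaration
    src += "float4 main(float2 uv : TEXCOORD) : SV_Target\n"
           "{ return SampleShadow(uv).xxxx; }\n";       // main written once
    // Append the body of the selected implementation.
    if (usePCF)
        src += "float SampleShadow(float2 uv) { /* PCF filtering */ return 1; }\n";
    else
        src += "float SampleShadow(float2 uv) { /* single tap */ return 1; }\n";
    return src;
}
```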
 
You don't really need that: declare your functions first, append their bodies at the end, done.
It's sad to see so many programmers unable to do something that simple, relying instead on shitty ifdefs everywhere...
Sorry, what? That has exactly nothing to do with the problem of how to handle shader permutations.
 
I'm not convinced. (6970 assembly, sadly there is no GCN support yet):

I think it's mentioned somewhere in the heaps of MS-documentation, can't find it currently.
Anyway, control flow is so restricted under HLSL (no conditions + syncs based on dynamic calculations over memory in DirectCompute, for example) that I couldn't believe you'd have dynamically addressable vtables. Static indices are inlineable and the "vtable" would be removed. Maybe it's kept as a special case for the debug runtime (REF and WARP).
 
I think it's mentioned somewhere in the heaps of MS-documentation, can't find it currently.
Each driver is free to implement the required functionality in any way that it wants. You can certainly do it via "late linking" (basically driver permutation management) but you could also of course do it via a jump table like the AMD ISA code that Novum is showing, regardless of whether or not you could express that code generally in HLSL. The advantage of that method is that it doesn't require multiple shader permutations at all, but there is a (per-invocation) cost to dynamic dispatch, even via a jump table.

There are a lot of problems with it that prevent developers from adopting it heavily. I'm not an expert, but ultimately a big factor is that developers need a permutation system anyway, even on top of this HLSL "syntactic sugar", so the advantages of using it are fairly minimal. There are also industry shifts like deferred shading, physically based shading and increased dynamism in kernels (e.g. looping over light lists) that lessen the need for massive numbers of permutations.

Personally I don't think directly writing HLSL code is an efficient way to build an engine anyway... you really need a layer that understands the full rendering algorithm and can spit out bits of shaders optimized as needed for various stages of it. Thus features like this, which effectively just try to make certain things "easier" (and don't necessarily provide any performance or functionality improvements), are not really that interesting to me.
 
Permutations are not always the answer, especially if you have many ;)

Of course not, which is why we have dynamic branching. :p

I think Andrew's comments are on the mark as far as I'm concerned: it doesn't really offer any functionality not already available to programmers, so it ultimately just boils down to syntactic sugar. So unless the resulting performance is a whole lot better than my own solutions (and in my limited tests, it hasn't been) there's not really any compelling reason for me to use it.
 
Yes, I also agree about the utility for a human HLSL programmer. And I'm not aware of any higher-level HLSL frameworks that could operate like a little OS with dynamic libraries; as Andrew said, that would be the place where the feature makes sense. MS also has this strange pseudo-C++ interface/class system related to this, which can't do much. The whole jump/call instructions are more connected to that than to delayed external functions anyway.
It would help with using LGPL-licensed shader library code, though. :D
 