With a bindless setup, instead of working with slots, you directly provide the GPU with pointers that it can follow to find the texture info. The slide is showing a potential data structure that you could set up yourself, where one piece of memory has a pointer to another piece of memory filled with info for more resources.
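To make that a bit more concrete, here's a rough C++ sketch of the kind of pointer-based layout being described; every name and field here is made up for illustration, not taken from any real API or hardware:

```cpp
#include <cstdint>

// Hypothetical descriptor for a single texture, living somewhere in GPU memory.
struct TextureDescriptor
{
    uint64_t gpuAddress;              // where the texel data lives in GPU memory
    uint32_t format;                  // encoded pixel format
    uint32_t width, height, mipCount; // immutable texture properties
};

// Hypothetical per-material table that a shader could follow directly,
// instead of reading from a fixed slot bound on the context.
struct MaterialTable
{
    uint64_t textureDescriptors;  // device address of a TextureDescriptor array
    uint32_t textureCount;        // how many descriptors live at that address
};
```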
So it's basically like CUDA (or OpenCL), where you just allocate memory on the host for the device and use the device pointer as an argument to your kernel.
What he means is that you could allocate a piece of memory and re-use it for many different purposes. In D3D11 the memory for a resource is tied to the ID3D11Texture2D. When you create that texture, you specify certain immutable properties like the format, the size, the number of mip levels, etc., and the driver allocates the appropriate amount of memory. Now let's say at the beginning of a frame you render to a render target, but then you're done with it for the rest of the frame. Then immediately after, you want to render to a depth buffer. With full memory control, you could say 'use this block of memory for the render target, and then afterwards use it as a depth buffer'. In D3D11, however, you can't do this: you must create both the render target texture and the depth buffer as separate resources. This is also something that's very common on consoles, where you have direct memory access.
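A hedged sketch of what that kind of control could look like; none of these types or functions exist in D3D11, they just illustrate placing two different resources in the same allocation at different points in the frame:

```cpp
#include <cstdint>

// Entirely hypothetical API: an opaque raw GPU allocation and two resources
// that carry metadata only, with no storage of their own.
struct GpuMemoryBlock;
struct RenderTarget;
struct DepthBuffer;

GpuMemoryBlock* AllocateGpuMemory(uint64_t sizeInBytes);
RenderTarget*   PlaceRenderTarget(GpuMemoryBlock* block, uint64_t offset,
                                  uint32_t width, uint32_t height);
DepthBuffer*    PlaceDepthBuffer(GpuMemoryBlock* block, uint64_t offset,
                                 uint32_t width, uint32_t height);
void RenderSomePass(RenderTarget* rt);
void RenderShadowPass(DepthBuffer* depth);

void RenderFrame()
{
    // One raw allocation, sized for the larger of its two uses.
    GpuMemoryBlock* block = AllocateGpuMemory(64ull * 1024 * 1024);

    // Early in the frame the block backs a render target...
    RenderTarget* rt = PlaceRenderTarget(block, 0, 1920, 1080);
    RenderSomePass(rt);

    // ...and once that pass is finished, the same memory backs a depth buffer.
    DepthBuffer* depth = PlaceDepthBuffer(block, 0, 2048, 2048);
    RenderShadowPass(depth);
}
```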
I can now see how this would save a lot of memory by reuse of already created buffers.
The way that it currently works with D3D11 is that you compile your shaders to D3D assembly, which is basically a hardware-agnostic "virtual" ISA. In order to run these shaders on a GPU, the driver needs to compile the D3D assembly into its native ISA. Since developers can't do this conversion ahead of time, the driver has to do a JIT compile when the game loads its shaders. This makes the game take longer to load, and the driver doesn't have a lot of time to try aggressive optimizations. With a hardware-specific API you can instead compile your shaders directly into the hardware's ISA, and avoid the JIT compile entirely.
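For reference, a small sketch of the current D3D11-era path (the entry point name and how the source is obtained are assumptions for the example). The compile to D3D bytecode can happen offline, but the last step to the native ISA still happens inside the driver when the shader object is created:

```cpp
#include <d3dcompiler.h>
#include <wrl/client.h>
#pragma comment(lib, "d3dcompiler.lib")

using Microsoft::WRL::ComPtr;

// Compiles HLSL into hardware-agnostic D3D bytecode.
ComPtr<ID3DBlob> CompilePixelShader(const char* hlslSource, size_t sourceSize)
{
    ComPtr<ID3DBlob> bytecode;
    ComPtr<ID3DBlob> errors;
    HRESULT hr = D3DCompile(hlslSource, sourceSize,
                            "shader.hlsl",      // source name, used for error messages
                            nullptr, nullptr,   // no defines, no include handler
                            "PSMain", "ps_5_0", // entry point and target profile
                            0, 0,               // compile flags
                            &bytecode, &errors);
    if (FAILED(hr))
        return nullptr;

    // device->CreatePixelShader(bytecode->GetBufferPointer(),
    //                           bytecode->GetBufferSize(), nullptr, &ps);
    // ^ this is where the driver JIT-compiles the bytecode into its native ISA,
    //   which is the load-time cost described above.
    return bytecode;
}
```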
Currently it's like: HLSL -> D3D assembly -> IL/PTX -> ISA
There would still be runtime compilation of the shader to the native ISA even if it were pre-compiled to IL or PTX, but certainly a lower compilation time than with D3D.
As for patching, the driver may need to patch shaders in order to support certain functionality available in D3D. As an example, let's say that a hypothetical GPU actually performs its depth test in the pixel shader instead of having extra hardware to do it. This would mean that the driver would have to look at the depth state that is currently bound to the context when a draw call is issued, and patch the shader to use the correct depth-testing code. With a hardware-specific shader compiler you can instead just provide the ability to perform the depth test in the pixel shader, and totally remove the concept of depth states.
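Purely as an illustration of that hypothetical GPU (this is not how any real driver patches shaders), a pixel shader doing its own depth test might look something like the HLSL below, embedded here as a C++ source string; the bindings and comparison function are assumptions for the example:

```cpp
// Illustrative only: if a GPU really did its depth test in the pixel shader,
// the "depth state" could just become shader code like this.
const char* manualDepthTestPS = R"hlsl(
Texture2D<float> DepthBuffer : register(t0);

float4 PSMain(float4 pos : SV_Position, float4 color : COLOR) : SV_Target
{
    // Manual 'less-or-equal' depth test: reject this pixel if it lies behind
    // the value already stored in the depth buffer.
    float storedDepth = DepthBuffer[uint2(pos.xy)];
    clip(storedDepth - pos.z);   // discards the pixel when pos.z > storedDepth
    return color;
}
)hlsl";
```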
I am all for less magic happening inside the driver and more power to the application developer.
The obvious use is the one they mentioned: culling and occlusion testing. Imagine that the CPU says 'draw all of this stuff', and then the GPU goes through that list and performs frustum and occlusion culling for each mesh. The GPU can then alter the command buffer to skip over non-visible meshes, so that when the GPU gets around to executing that part of the command buffer it will only draw on-screen geometry.
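A minimal sketch of the culling step itself, written here as CPU-side C++ for readability; in a real engine this loop would run as a GPU job that rewrites or compacts the command buffer, and all the types below are hypothetical:

```cpp
#include <cstdint>
#include <cstddef>

struct Float3  { float x, y, z; };
struct Sphere  { Float3 center; float radius; };
struct Plane   { Float3 n; float d; };           // n.x*x + n.y*y + n.z*z + d
struct Frustum { Plane planes[6]; };             // plane normals point inward

struct DrawRecord
{
    Sphere   bounds;         // world-space bounding sphere of the mesh
    uint32_t commandOffset;  // where this mesh's draw lives in the command buffer
    bool     visible;
};

static bool SphereInFrustum(const Sphere& s, const Frustum& f)
{
    for (const Plane& p : f.planes)
    {
        float dist = p.n.x * s.center.x + p.n.y * s.center.y +
                     p.n.z * s.center.z + p.d;
        if (dist < -s.radius)
            return false;    // completely behind one plane -> culled
    }
    return true;
}

// Occlusion testing against a depth pyramid would slot in right after the
// frustum test; a real implementation would then patch the command buffer,
// e.g. by turning culled draws into no-ops.
void CullDrawList(DrawRecord* draws, size_t count, const Frustum& frustum)
{
    for (size_t i = 0; i < count; ++i)
        draws[i].visible = SphereInFrustum(draws[i].bounds, frustum);
}
```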
As for occlusion queries, the main problem with them in D3D/GL is that the data is generated by the GPU but can only be read by the CPU. The GPU typically lags behind the CPU by a frame or more so that the CPU has enough time to generate commands for the GPU to consume, which means if the CPU wants to read back GPU results they won't be ready until quite a bit of time after it issued the commands. In practice this generally requires having the CPU wait at least a frame for query results. This means you can't really effectively use them for something like occlusion culling, since by the time the data is usable it's too late.
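A sketch of the usual D3D11 occlusion-query round trip, to show where that latency bites; the draw calls in the middle are assumed to exist elsewhere, and the busy-wait is only there to make the stall explicit:

```cpp
#include <d3d11.h>
#include <wrl/client.h>
#pragma comment(lib, "d3d11.lib")

using Microsoft::WRL::ComPtr;

UINT64 MeasureVisibleSamples(ID3D11Device* device, ID3D11DeviceContext* context)
{
    D3D11_QUERY_DESC desc = {};
    desc.Query = D3D11_QUERY_OCCLUSION;

    ComPtr<ID3D11Query> query;
    device->CreateQuery(&desc, &query);

    context->Begin(query.Get());
    // ... issue the draw calls whose sample coverage we want to measure ...
    context->End(query.Get());

    // GetData() keeps returning S_FALSE until the GPU has actually executed
    // the commands, typically a frame or more later. Spinning here stalls the
    // CPU, which is exactly the problem described above; real code would check
    // the result on a later frame instead.
    UINT64 samplesPassed = 0;
    while (context->GetData(query.Get(), &samplesPassed,
                            sizeof(samplesPassed), 0) == S_FALSE)
    {
        // busy-wait (bad in practice, shown only to illustrate the latency)
    }
    return samplesPassed;
}
```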
So the GPU checks an object's bounding box against the current depth buffer (or Early-Z buffer) and skips it entirely if it's not visible. Sounds much faster than what GL and D3D currently offer.
Thanks a lot for these detailed explanations MJP!