Many draw calls with pulling, bindless, MultiDrawIndirect, etc.

Are these registers not wiped/unreliable between shader invocations? I.e. wouldn't that only be useful for sampling the same texture multiple times within one shader?
Each wave (64 threads) has its own set of scalar registers (and vector registers). Each wave loads its scalars (including resource descriptors) from memory separately (usually at the beginning of the shader code).

The registers of each wave are wiped out after the wave is executed. The GPU can be executing multiple waves from different shaders at the same time. The resource descriptors (just like any other data) are cached by the GPU L1 and L2 caches. If the resource descriptor is in the CU's scalar L1 cache when a wave tries to load it into a scalar register (no matter whether the wave is from the same draw call as the previous one, or from another one), it will be loaded into the register very quickly. If that resource descriptor wasn't recently loaded by any wave on the same CU, it will likely be found in the GPU L2 cache (or fetched from memory).

So basically it doesn't matter how many draw calls you have (or whether the draw calls are big or small). If the next draw call uses the same resource descriptors, the descriptors will likely be in the L1 caches (of each CU), and the waves of the next draw call will load them from there without any stalls (just like the additional waves of the current draw call are loading them).

Constants also use the scalar loads / scalar L1 cache in the same way. If you use the same constants in the next draw call, the scalar loads (of each wave) will hit the L1 and get the constants into registers quickly.

This kind of resource handling is very efficient even if the average batch size is small. The GPU doesn't need to set up lots of state before it can start executing a draw call.
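To make that concrete, here is a tiny host-side C++ sketch (purely illustrative; the struct layout, the cache model and all the names are made up, not actual GCN behavior) of why the next draw call touching the same descriptor is cheap: the descriptor is just data at an address, and a repeated load of the same address hits the cache.

```cpp
#include <array>
#include <cstdint>
#include <cstdio>
#include <unordered_set>

// Hypothetical descriptor blob; real descriptors are opaque hardware formats.
struct Descriptor { uint64_t base; uint32_t meta[6]; };

// One descriptor array shared by all draw calls.
std::array<Descriptor, 256> descriptorHeap{};

// Toy stand-in for a CU's scalar L1: tracks which descriptor addresses
// have been loaded recently.
struct ScalarL1 {
    std::unordered_set<const Descriptor*> lines;
    bool load(const Descriptor* d) {
        bool hit = lines.count(d) != 0;  // already resident?
        lines.insert(d);                 // it is now
        return hit;
    }
};

int main() {
    ScalarL1 sL1;
    const Descriptor* material = &descriptorHeap[42];
    // Two consecutive draw calls using the same descriptor: the first load
    // misses, every following one hits, so there is no per-draw state setup.
    std::printf("draw 0: %s\n", sL1.load(material) ? "hit" : "miss");
    std::printf("draw 1: %s\n", sL1.load(material) ? "hit" : "miss");
}
```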
 
The registers of each wave are wiped out after the wave is executed. ... The resource descriptors (just like any other data) are cached by the GPU L1 and L2 caches.
Right, that's how I thought it worked. The comment about caching in the register file is what confused me. Thanks!

So basically it doesn't matter how many draw calls you have (or whether the draw calls are big or small). If the next draw call uses the same resource descriptors, the descriptors will likely be in the L1 caches (of each CU), and the waves of the next draw call will load them from there without any stalls (just like the additional waves of the current draw call are loading them).
Ah, but there's the rub really... if you rewrite your descriptors to be contiguous for a given draw call, even the ones that have not changed will not hit those caches. The "bindless" handle/pointer approach allows that data to be shared in the caches (tagged either by address or by handle, depending on the architecture).

I imagine Mantle leaves this somewhat "open" for the application to optimize (i.e. the hinting about nested descriptor sets... IMHO too complicated for what it accomplishes, but whatever) but I do wonder in how many cases regular Mantle applications will end up avoiding indirections to find descriptors.
 
It is different in the details though. In an architecture where the TMU takes a handle and caches/reads descriptors itself there is no advantage to having contiguous descriptors like Mantle has (and indeed collecting the data together is some CPU/memory traffic overhead). If there was no advantage on GCN I'm assuming they would not have done it that way (recall I did start this entire conversation by saying I expect it to be very slight).

The likely reasons for this may have been touched on in some comments from AMD about reducing the amount of hidden state in the pipeline with GCN. The texturing pipeline still has some, but the implied end-goal was to someday expose or remove it.

Exposing the process to software incurs the indirection penalties under discussion, but AMD's long-term goal for graphics preemption and QoS likely hinges on the hardware being able to context switch freely.
An unexposed and semi-independent black box of indeterminate latency would get in the way of that.
 
Sure, but *where the lookup happens* is relevant, as is whether the bindless handle is used for other things. Handles do not necessarily have to be full 64-bit addresses (although they are opaque to the user that way for implementation variance) and smaller numbers of bits can make things like cache addressing cheaper.
As the number of individual textures is only limited by the available memory (so one can probably create a few million, if one wants), I would contest that there would be significant savings, especially when weighing it against the overhead of yet another specialized buffer or cache for descriptors compared to just using the general-purpose caches and register files. At least for a GCN-like design (see below).
It is different in the details though. In an architecture where the TMU takes a handle and caches/reads descriptors itself there is no advantage to having contiguous descriptors like Mantle has (and indeed collecting the data together is some CPU/memory traffic overhead).
If there was no advantage on GCN I'm assuming they would not have done it that way (recall I did start this entire conversation by saying I expect it to be very slight).
The advantage actually comes from the possibility to address and use the descriptors directly, i.e. to skip the additional handle. If an architecture doesn't expose this (the TMUs do it in a black box), one always needs one more level of indirection when using a bindless handle, not just on GCN:
index into array of handles => use that handle to access descriptor => use descriptor to access texture/buffer
With GCN (and Mantle) it is possible to index into an array of descriptors directly. The descriptors (and handles, if used) are cached anyway. I would think it doesn't matter whether it is done in some header/descriptor cache in the TMUs or in the sL1/L2 hierarchy. The GCN way actually gives you more control (for example, it is possible to use not only an array of descriptors but really just pointers or an array of 32-bit offsets, and construct [or reuse] the rest of the descriptor in shader code because the format is known).
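A small C++ sketch of the difference (the names are illustrative only, this isn't any real API): the handle route costs an extra dependent read, while the GCN/Mantle route indexes the descriptor array directly.

```cpp
#include <cstdint>
#include <vector>

struct Descriptor { uint64_t base; uint32_t meta[6]; };  // opaque blob

// Handle-based: index -> handle -> descriptor (two dependent reads).
const Descriptor& fetchViaHandle(const std::vector<const Descriptor*>& handles,
                                 uint32_t index) {
    return *handles[index];
}

// GCN/Mantle style: index -> descriptor (one read), and since the format is
// known the shader could also patch or construct parts of it itself.
const Descriptor& fetchDirect(const std::vector<Descriptor>& descriptors,
                              uint32_t index) {
    return descriptors[index];
}
```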

GCN's architecture obviously works fine in either case, but the choices with their texture path were definitely made in the context of fairly wide SIMDs and tightly-coupled TMUs. When thinking about future hardware and API evolution, it's hard to claim that is necessarily where everything is going to go, especially given that GCN has yet to prove itself scalable down to low power envelopes. It may be just fine, but we'll have to see.
That's of course true. Eventually it comes down to a balance: how often one has to traverse this indirection hierarchy (and how often one misses the cached [or stored-in-registers] levels). But it would be perfectly possible to use the Mantle approach also for more loosely coupled TMUs (or small SIMD widths). The shader wouldn't provide a handle, but rather the offset (32 bits is probably enough) into the descriptor array to the TMU (with the TMU handling the access and storage of the descriptor instead of the scalar ALU as with GCN), still saving one level of indirection. It would sacrifice the additional flexibility of GCN (to manipulate the descriptors beyond just accessing an array of them), but I don't know if that is ever going to be used extensively. Basically, one exchanges the handle array for the descriptor array. Shouldn't take that much.
In principle, the 64bits of the handle should be enough to encode everything necessary for a descriptor if one accepts some restrictions to the number of possible data layout and data format combinations (with 48bits for the base address [probably still enough for a few years] there would be 16 bits left for layout and format; AMD's information storage in the descriptors is optimized for ease of use, not size).
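Sketching that 48+16 split as C++ bit packing, just to show the arithmetic (the meaning of the 16 format/layout bits is of course hypothetical):

```cpp
#include <cassert>
#include <cstdint>

constexpr uint64_t kAddressMask = (uint64_t(1) << 48) - 1;

// Pack a 48-bit base address and 16 bits of layout/format into one 64-bit handle.
uint64_t packHandle(uint64_t baseAddress, uint16_t layoutAndFormat) {
    assert((baseAddress & ~kAddressMask) == 0);  // must fit in 48 bits
    return baseAddress | (uint64_t(layoutAndFormat) << 48);
}

uint64_t baseAddress(uint64_t handle)  { return handle & kAddressMask; }
uint16_t layoutFormat(uint64_t handle) { return uint16_t(handle >> 48); }
```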
 
The GCN way actually gives you more control (for example, it is possible to use not only an array of descriptors but really just pointers or an array of 32-bit offsets, and construct [or reuse] the rest of the descriptor in shader code because the format is known).
Right, and constructing descriptor details in the shader can definitely be useful in certain situations. However, while people may or may not be able to do this on the consoles, I haven't seen any indication that this is possible even in Mantle. It would be hard to claim any level of portability (even forward portability on AMD cards) if that level of detail was exposed. As far as I can tell from the hints in the various presentations, descriptors are likely black-box structures with only the size known to user applications.

It's hard to completely separate how much of the Mantle design was motivated by hardware considerations (i.e. is it really that much faster to collect descriptors contiguously in memory? I agree that it's probably minimal) vs. wanting to avoid making a new shading language. Obviously providing a contiguous array of descriptors works well with HLSL shaders that reference legacy binding points.

In principle, the 64bits of the handle should be enough to encode everything necessary for a descriptor if one accepts some restrictions to the number of possible data layout and data format combinations (with 48bits for the base address [probably still enough for a few years] there would be 16 bits left for layout and format; AMD's information storage in the descriptors is optimized for ease of use, not size).
That's an interesting exercise to go through actually. How would you fit all of the extra data into 16 bits? At the very least you need dimensions (even if denormalized in the shader you still need something for wrapping), format and probably at least a bit or two for layout... seems tight :) IIRC most hardware descriptors are somewhere in the range of ~256-512bits. I agree in a lot of cases they could get leaner (although generality is nice for being able to use the same descriptor for multiple uses, such as render targets, UAVs, buffers, textures, etc) but I don't know whether 64-bits would be enough.

I'm curious though; lay it out for me how you'd do it :)
 
That's an interesting exercise to go through actually. How would you fit all of the extra data into 16 bits? At the very least you need dimensions (even if denormalized in the shader you still need something for wrapping), format and probably at least a bit or two for layout... seems tight :) IIRC most hardware descriptors are somewhere in the range of ~256-512bits. I agree in a lot of cases they could get leaner (although generality is nice for being able to use the same descriptor for multiple uses, such as render targets, UAVs, buffers, textures, etc) but I don't know whether 64-bits would be enough.

I'm curious though; lay it out for me how you'd do it :)
After thinking about it, I was too optimistic in that regard. While one could save a bit on the base address (as an example, GCN uses a 256-byte alignment for allocating buffer/image resources, so one would need only 32 bits for a 40-bit virtual address, or 40 bits for a 48-bit address; the latter is the option actually used), it could only work with severe limitations (like limiting texture sizes to squares/cubes) and wouldn't really be useful anymore as too much would be lost.
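The alignment arithmetic spelled out as a C++ sketch (nothing here beyond the 256-byte granularity mentioned above): since the low 8 address bits are always zero, the base can be stored as a block index, which is how a 40-bit virtual address fits into 32 bits and a 48-bit one into 40 bits.

```cpp
#include <cstdint>

constexpr unsigned kAlignBits = 8;  // 256-byte allocation granularity

// Store the base as a block index instead of a byte address:
// a 40-bit address needs 40 - 8 = 32 bits, a 48-bit address needs 40 bits.
uint64_t toBlockIndex(uint64_t byteAddress) { return byteAddress >> kAlignBits; }
uint64_t toByteAddress(uint64_t blockIndex) { return blockIndex << kAlignBits; }
```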

Btw., GCN's buffer descriptors are 128 bits, and image resource descriptors (textures, render targets) are also mostly 128 bits (256 bits for texture arrays, 3D textures, or cubemaps; the second half of a 256-bit descriptor is half empty; but this is merely an optimization to save register space, as in the descriptor array they are stored 256-bit aligned so they can be mixed freely). 512 bits would be on the high side.
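As a size/alignment sketch only (the field contents are deliberately left opaque; this is not the actual GCN bit layout):

```cpp
#include <cstdint>

struct BufferDescriptor { uint32_t dword[4]; };   // 128 bits
struct ImageDescriptor  { uint32_t dword[8]; };   // up to 256 bits

// In the descriptor array everything sits on a 256-bit stride so buffer and
// image descriptors can be mixed freely; a shader that knows it only needs the
// first 128 bits can skip fetching (and storing) the second half.
struct alignas(32) DescriptorSlot { uint32_t raw[8]; };
static_assert(sizeof(DescriptorSlot) == 32, "256-bit slots");
```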
 
Btw., GCN's buffer descriptors are 128 bits, and image resource descriptors (textures, render targets) are also mostly 128 bits (256 bits for texture arrays, 3D textures, or cubemaps; the second half of a 256-bit descriptor is half empty; but this is merely an optimization to save register space, as in the descriptor array they are stored 256-bit aligned so they can be mixed freely).
Sure, but as you note, the sort of "worst case" size is what's relevant when you need to pack them together and take advantage of known offsets. There are some cases in which you may statically know details in the kernel (like whether it is a texture array, 3D, etc.) and thus be able to account/allocate separately for different sizes, but in practice, even if the TMU doesn't need/touch all of the data, the descriptor size that matters is usually the more general case.
 
Yes, I know; that's why I said the descriptor array is 256-bit aligned and the 128-bit image descriptors are only used as a possibility to save register space (the shader usually knows whether it samples a 1D, 2D, or 3D texture, for instance [as it has to provide a matching number of texture coordinates], hence the size of the descriptor is also known, which means it is possible to omit the fetch and storage of an unused second half).

But the second part of my last post was only meant to complement your assumption regarding the usual descriptor sizes (256...512bit) with the exact sizes for GCN (I doubt it is publicly known for nV or intel, but would like to learn otherwise), not as some kind of argument.
 
... regarding the usual descriptor sizes (256...512bit) with the exact sizes for GCN (I doubt it is publicly known for nV or intel, but would like to learn otherwise)...
In the "Beyond Porting" (Kepler) presentation (page 32) McDonald states that "bindless handle" is 64 bits (96 bits total with 32 bit slice index). But this "bindless handle" might be just the pointer to the resource descriptor (not the resource descriptor itself).
 
(I doubt it is publicly known for nV or intel, but would like to learn otherwise)
All the hardware interface docs are available for Intel, for instance at either of these places (there may be more):
http://renderingpipeline.com/graphics-literature/low-level-gpu-documentation/
https://01.org/linuxgraphics/documentation
The Intel Linux OpenGL driver is open source as well if you are feeling adventurous :)

For instance, for Ivy Bridge see the "RENDER_SURFACE_STATE" structure starting on pg 61:
http://files.renderingpipeline.com/gpudocs/intel/hd4000/IHD_OS_Vol4_Part1.pdf
It's 256 bits, although as noted the TMU does not necessarily need to touch all of it in all cases, similar to the possible load/register optimizations on GCN.

In the "Beyond Porting" (Kepler) presentation (page 32) McDonald states that "bindless handle" is 64 bits (96 bits total with 32 bit slice index). But this "bindless handle" might be just the pointer to the resource descriptor (not the resource descriptor itself).
Yeah, it's definitely a reference. The reason they are going through the exercise of using texture arrays + sparse allocation to start with is that they want to avoid thrashing the texture header cache. By doing it this way they only have one actual header in memory for each array. Presumably because of how GL works, they pack both a descriptor reference and a sampler reference into those bits. Elsewhere they have indicated that you can access ~1 million textures with bindless handles, which sort of hints at ~20 bits for the former.

To be clear, I don't know any of this for sure, but that's a story that seems to fit the facts we know so far :)
 
Anybody know where the PDFs, videos, and code are?
http://blogs.nvidia.com/blog/2014/03/20/opengl-gdc2014/

Good to see more people talking about this topic. I have been lobbying for this kind of rendering for quite a while now :)

My comments on the texturing section:

Personally I prefer virtual texturing (with software indirection). Big texture arrays / atlases are good for reducing draw calls somewhat, but they are only patchwork solutions. Streaming data into these structures is not as fine grained as full blown virtual texturing. Streaming whole textures always causes more loading stalls and uses more memory (and has allocation fragmentation problems). Bindless has additional problems with warp/wave coherency (as described in the AMD part of the presentation). Because of this it is mostly useful if you can guarantee (at compile time) that each thread inside a wave/warp samples from the same resource (this is the only way to guarantee performance identical to the "non-bindless" case, for example vs. software virtual texturing). This can be guaranteed when using compute shaders (tiled texturing in the lighting pass, for example), but it's quite hard to do in a pixel shader (as the GPU allocates pixels to warps/waves freely). Bindless also requires you to load more resource descriptors (for small objects covering just a few samples in the distance this can be a big additional cost). Software virtual texturing just uses a single texture (a single resource descriptor) for all objects.
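For readers unfamiliar with the software indirection, a toy C++ model of the lookup (page sizes, table dimensions and names are made up), showing why one physical cache texture plus a page table is enough for all materials:

```cpp
#include <cstdint>

constexpr uint32_t kPageSize     = 128;   // texels per page side (made up)
constexpr uint32_t kPageTableDim = 1024;  // pages per side of the virtual space

// Where each virtual page currently lives in the physical cache texture;
// kept up to date by the streaming system.
struct PageEntry { uint16_t physPageX, physPageY; };
PageEntry pageTable[kPageTableDim][kPageTableDim];

// Translate a virtual texel coordinate into the physical cache texture.
// In the real thing this is a texture fetch plus a little math in the shader.
void virtualToPhysical(uint32_t vx, uint32_t vy, uint32_t& px, uint32_t& py) {
    const PageEntry& e = pageTable[vy / kPageSize][vx / kPageSize];
    px = e.physPageX * kPageSize + vx % kPageSize;
    py = e.physPageY * kPageSize + vy % kPageSize;
}
```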

I also like the fact that in software virtual texturing I can do the indirection to the physical address manually, so I can get hold of the physical addresses and store them (for example in my g-buffer). Physical addresses (into the currently loaded texture cache) can be stored with far fewer bits (~half) than virtual addresses into the huge virtual texture address space. Also, you can't change the hardware virtual texture mappings from a compute shader (you need to do that on the CPU). This can be problematic for GPU-driven rendering, since a CPU round trip has huge latency implications (or requires a synchronization stall = VERY BAD).

I think people are generally too scared of virtual texturing. It solves many problems right now (texture data has a constant memory footprint, artists don't need to follow any texture budgets anymore, decal rendering is dirt cheap, etc.). When people start to use these big-batch rendering techniques, it also solves the texture binding issue elegantly (and very efficiently).
 
Virtual Memory is awesome, once you've tasted it you can only cry when you can't use it.
I want Virtual Memory on the GPU to manage all my resources myself, with either a list of page misses or a small page fault handler being called.
With Virtual Memory the only problems you have are granularity (page size) and the replacement algorithm (LRU, ARC, CART, ...), and you don't care about fragmentation anymore, you just stream data as you need it... Memory management becomes so damn simple it's a shame we haven't gotten it yet!
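For the replacement side of it, a minimal LRU page-cache sketch in C++ (just the bookkeeping; the actual page contents and streaming I/O are left out):

```cpp
#include <cstdint>
#include <list>
#include <unordered_map>

class LruPageCache {
public:
    explicit LruPageCache(std::size_t capacity) : capacity_(capacity) {}

    // Touch a page: returns true on a hit; on a miss the least recently used
    // page is evicted and the caller streams the requested page in.
    bool touch(uint64_t pageId) {
        auto it = index_.find(pageId);
        if (it != index_.end()) {
            order_.splice(order_.begin(), order_, it->second);  // move to front
            return true;
        }
        if (order_.size() == capacity_) {  // full: evict the LRU page
            index_.erase(order_.back());
            order_.pop_back();
        }
        order_.push_front(pageId);
        index_[pageId] = order_.begin();
        return false;
    }

private:
    std::size_t capacity_;
    std::list<uint64_t> order_;  // most recently used at the front
    std::unordered_map<uint64_t, std::list<uint64_t>::iterator> index_;
};
```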
 
Virtual Memory is awesome, once you've tasted it you can only cry when you can't use it.
I want Virtual Memory on the GPU [...] it's a shame we haven't gotten it yet!

I was under the impression that Mantle + AMD Southern/Sea Islands supported it, no? Not just PRT, I mean.
 
Virtual Memory is awesome, once you've tasted it you can only cry when you can't use it.
I want Virtual Memory on the GPU to manage all my resources myself, with either a list of page misses or a small page fault handler being called.
With Virtual Memory the only problems you have are granularity (page size) and the replacement algorithm (LRU, ARC, CART, ...), and you don't care about fragmentation anymore, you just stream data as you need it... Memory management becomes so damn simple it's a shame we haven't gotten it yet!
CUDA 6 supports it at least.
https://www.youtube.com/watch?v=UgLShs8JaTM
 
What he's referring to is explicit page fault handling of the virtual-to-physical memory page mapping. In the OS you can map virtual addresses to files fe. Then, instead of swapping data in from the paging file, it reads the locations directly from disk into memory - if you have mmap() available. This makes the physical memory behave like a cache (if there is less memory than file size), or like very economical I/O for read-only, sparsely touched data (say WAV/RIFF parsing).

On the GPU it is quite similar, except that you won't have the possibility to react to page misses the instant they occur, because that would require being able to pre-empt shaders indefinitely, which you can't. Instead you have to do whatever you think makes sense the moment the page fault occurs - you could walk the mip-chain until you have no fault fe. - in addition to recording the faults to a buffer. At the end of the frame you can then provide the contents of those pages, in the hope they are needed again the next frame. You don't need to if you know it would be wasted.
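A rough sketch of that "record the fault, fall back along the mip chain" idea in C++ (everything here is hypothetical; in the real thing this would live in the shader and the fault buffer would be written with atomics):

```cpp
#include <cstdint>
#include <vector>

constexpr int kMipCount = 12;

std::vector<bool> resident[kMipCount];  // per-mip page residency, kept by the streamer
std::vector<uint32_t> faultBuffer;      // page requests recorded during the frame

bool isResident(int mip, uint32_t page) {
    return page < resident[mip].size() && resident[mip][page];
}

// Pick the finest resident mip for this texel's page chain. If the requested
// level is missing, remember it so its contents can be provided after the frame.
int pickMip(int requestedMip, const uint32_t pagePerMip[kMipCount]) {
    if (!isResident(requestedMip, pagePerMip[requestedMip]))
        faultBuffer.push_back((uint32_t(requestedMip) << 24) | pagePerMip[requestedMip]);
    for (int mip = requestedMip; mip < kMipCount - 1; ++mip)
        if (isResident(mip, pagePerMip[mip]))
            return mip;                 // found something usable
    return kMipCount - 1;               // assume the coarsest mip is always resident
}
```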
 
"for example", it really is just a example of what could be done.
You prefer "for instance", fi.? :) I'm not native english, not sure about some abbrevations.
 