Many draw calls with pulling, bindless, multidrawindirect, etc.

sebbbi · Jan 22, 2014

Andrew Lauritzen said:
This presentation is relevant to the discussion here:
http://www.slideshare.net/CassEveritt/beyond-porting

Seems other folks have been thinking along similar lines. Note that the hinting that Kepler has some limited texture header (aka. "descriptor") cache for bindless is definitely the sort of thing that I've been talking about when thinking about the trade-offs here for a long term solution. Obviously both HW and SW can change but it's not always as simple as "the most generic/dynamic API is the best".

Earlier Kepler bindless papers also talk about GPU efficiency improvements, but only mention them shortly (without much details). This paper is definitely more deep and has more information about Kepler's implementation. I have been studying bindless more from perspective of AMD GCN hardware (I am mainly a console developer after all). Hopefully AMD releases OpenGL 4.4 beta drivers soon (ARB bindless, etc) so we can compare both GPU manufacturer's implementations on identical PCs.

Andrew Lauritzen · Jan 22, 2014

To be clear, I don't doubt Mantle will be slightly better than bindless on GCN (or else they designed the API poorly

) - although bindless is obviously more general and still necessary - but my point is that does not necessarily extend to all hardware or architectures or even the theoretical best power-efficient case. That's an interesting discussion beyond the scope of the practical issues as hardware can change.

Gipsel · Jan 25, 2014

I thought the GCN hardware actually works bindless internally (the shader just needs to know the memory address of the texture, and its format [usually this information is fetched from memory, but it could be also calculated by the shader code itself, same as the parameters for the sampler]), everything else is fitted on top of it.

RecessionCone · Jan 25, 2014

Andrew Lauritzen said:
To be clear, I don't doubt Mantle will be slightly better than bindless on GCN (or else they designed the API poorly ) - although bindless is obviously more general and still necessary - but my point is that does not necessarily extend to all hardware or architectures or even the theoretical best power-efficient case. That's an interesting discussion beyond the scope of the practical issues as hardware can change.

I'm curious - once Mantle is finally "open" enough for us to read the API, will we see something tailored to GCN to the point that it restricts AMD's hardware designs in the future? Standards give portability not only between vendors, but between different designs from the same vendor. Maybe AMD has decided GCN is all the GPU the world needs.

Alexko · Jan 26, 2014

RecessionCone said:
I'm curious - once Mantle is finally "open" enough for us to read the API, will we see something tailored to GCN to the point that it restricts AMD's hardware designs in the future? Standards give portability not only between vendors, but between different designs from the same vendor. Maybe AMD has decided GCN is all the GPU the world needs.

Could we imagine that in a couple of years, a future architecture won't be able to run the first Mantle games? Or did you design Mantle to be forward compatible? There's always this kind of tradeoff with a lower level access…

Raja Koduri - Those are all great questions but… Frankly we'll see how it goes. At the end of the day forward compatibility and backward compatibility are important aspects but if they're getting in the way of solving a problem at a given point of time, if they're getting in the way of exposing something that the new hardware is capable of that makes the game be hundred times more realistic, we have to be practical about it and that’s how we move things forward. We move technology forward and at some point of time we have to say "out with the old compatibility", and move forward. If not you get stuck.

http://www.hardware.fr/focus/89/amd-mantle-interview-raja-koduri.html

pjbliverpool · Jan 26, 2014

Oh dear, so mantle has the potential to break backwards compatibility in PC games? Not a good compromise IMO.

MJP · Jan 26, 2014

Gipsel said:
I thought the GCN hardware actually works bindless internally (the shader just needs to know the memory address of the texture, and its format [usually this information is fetched from memory, but it could be also calculated by the shader code itself, same as the parameters for the sampler]), everything else is fitted on top of it.

Yes, that's how it works. Shaders just need to provide texture/buffer/sampler descriptors in contiguous scalar registers, and how the data gets in those registers is completely up to the shader program. Everything else is essentially just software conventions imposed by the driver.

Andrew Lauritzen · Jan 26, 2014

MJP said:
Shaders just need to provide texture/buffer/sampler descriptors in contiguous scalar registers, and how the data gets in those registers is completely up to the shader program.

Correct, the subtlety being that it's slightly more optimal to lay out/copy descriptors for a given draw call linearly in memory since the shader unit itself has to read them. If it needs to go through an indirection (i.e. a bindless handle) that's some latency to hide*. Likely not a big deal in most cases, but that's why I said they probably did it that way in Mantle because it's "slighty" better for GCN.

You can sort of think of it as dynamically indexing "bind points" is effectively free on GCN, but pointers to descriptors require an indirection. This isn't necessarily the case on other hardware, hence why I keep noting to avoid such assumptions when analyzing the trade-offs.

* This is based on skimming some of the docs - I wouldn't consider myself a GCN expert, but the principal applies in general even if the details are different.

sebbbi · Jan 26, 2014

Andrew Lauritzen said:
If it needs to go through an indirection (i.e. a bindless handle) that's some latency to hide*

Yes, you would likely have an extra cache miss for the first sampled texel, but after that the resource descriptor should be always in a register or in the L1 cache (the first texture access causes two cache misses, one for descriptor and one for texture data instead of just one). This shouldn't be a problem at all for well behaved use cases (every texture is sampled for reasonable amount of texels).

Gipsel · Jan 26, 2014

And it works always the same way, irrespective if one uses an API with bind points or a bindless model. So I still don't get Andrew's original point that "Mantle will be slightly better than bindless on GCN". There is simply no difference besides API restrictions. A resource descriptor is also just a virtual address (plus the format specification), which can come from anywhere.

MJP · Jan 26, 2014

Andrew Lauritzen said:
Correct, the subtlety being that it's slightly more optimal to lay out/copy descriptors for a given draw call linearly in memory since the shader unit itself has to read them. If it needs to go through an indirection (i.e. a bindless handle) that's some latency to hide*. Likely not a big deal in most cases, but that's why I said they probably did it that way in Mantle because it's "slighty" better for GCN.

You can sort of think of it as dynamically indexing "bind points" is effectively free on GCN, but pointers to descriptors require an indirection. This isn't necessarily the case on other hardware, hence why I keep noting to avoid such assumptions when analyzing the trade-offs.

* This is based on skimming some of the docs - I wouldn't consider myself a GCN expert, but the principal applies in general even if the details are different.

Yes that's a fair point. I certainly would lay out my descriptors in flat arrays if I were working on a platform that allowed me to do so. :smile:

There's also the issue of scalar loads and the constant cache, which further changes the specifics with regards to GCN. Although in that Nvidia presentation they mentioned that Kepler has some sort of "texture header" cache, which sounds like it at least partially fills the same role as the constant cache on GCN.

Gipsel said:
And it works always the same way, irrespective if one uses an API with bind points or a bindless model. So I still don't get Andrew's original point that "Mantle will be slightly better than bindless on GCN". There is simply no difference besides API restrictions. A resource descriptor is also just a virtual address (plus the format specification), which can come from anywhere.

His point was just that Mantle seems to let you use an array of texture descriptors, as opposed to bindless which forces you to use to use an array of pointers to descriptors instead which adds an indirection. It's hard to say how much of a performance difference it would cause in practice without extensive profiling, but I would guess that the difference would be pretty small in most cases.

Andrew Lauritzen · Jan 27, 2014

sebbbi said:
Yes, you would likely have an extra cache miss for the first sampled texel, but after that the resource descriptor should be always in a register or in the L1 cache (the first texture access causes two cache misses, one for descriptor and one for texture data instead of just one).

Ignoring the actual texture data part, the two misses for implementing the GL bindless extension for instance would be more 1) on the lookup of the handle itself (from constant buffer or binding table or whatever) and then 2) lookup of the referenced descriptor data. Both should cache fairly well in most use cases of course, but I'm not super-familiar with the details of the caching hierarchy/policy for scalar memory reads from CUs. Descriptors organized contiguously at a known address avoid the first part above, that was my point.

MJP said:
His point was just that Mantle seems to let you use an array of texture descriptors, as opposed to bindless which forces you to use to use an array of pointers to descriptors instead which adds an indirection.

It adds an indirection *on GCN*. Copying descriptors into contiguous memory is not an intrinsically better way to do this, it's just slightly more preferred in GCN's use case where having a known offset at kernel launch allows it to start reading the descriptors.

I will again note that other architectures are equally valid, and I believe GCN is unique in having the data flow through the shader registers before hitting the texture unit.

Most future architectures will be able to accommodate either situation (i.e. dynamically indexed bind points vs. bindless handles) and there is a continuum between the two obviously, but there are details that cause a given architecture to slightly prefer one way or another.

In any case the details of GCN are just one data point to the general architectural discussion, which I think is ultimately more interesting.

Rodéric · Jan 27, 2014

It should be fine with a minimal API anyway, it would either write pointers in the command buffer or copy data instead.
Every time I reach my memory/assets management code in my engine I become sick of the absolute inefficiency of it all simply because API refuse to let me handle virtual memory.
It's horrible code the behavior of which I can only guess, but cannot know for sure since drivers could do anything/everything in my back

Gipsel · Jan 27, 2014

Andrew Lauritzen said:
It adds an indirection *on GCN*. Copying descriptors into contiguous memory is not an intrinsically better way to do this, it's just slightly more preferred in GCN's use case where having a known offset at kernel launch allows it to start reading the descriptors.

As long as you just have a bindless handle (pointer), it will *always* add an indirection of some kind, not only on GCN. That's because a pointer is simply not enough to specify the texture/buffer completely. The TMUs need to connect the handle/pointer to information about the data format and layout (that is what the descriptors actually are in AMD's case, just a pointer + data format and layout information, so you are able to avoid one indirection step), that has to be stored somewhere and an access to that information has to happen at some point. If that is done by GCN's scalar unit or by the TMUs itself, isn't a principal difference.

keldor314 · Jan 27, 2014

Gipsel said:
As long as you just have a bindless handle (pointer), it will *always* add an indirection of some kind, not only on GCN. That's because a pointer is simply not enough to specify the texture/buffer completely. The TMUs need to connect the handle/pointer to information about the data format and layout (that is what the descriptors actually are in AMD's case, just a pointer + data format and layout information, so you are able to avoid one indirection step), that has to be stored somewhere and an access to that information has to happen at some point. If that is done by GCN's scalar unit or by the TMUs itself, isn't a principal difference.

I can't imagine that you wouldn't have a texture header cache/buffer for each TMU. In fact, you would have had this ever since the beginning of multitexturing, since it needs to know information about whatever texture slot is being sampled.

The difference with bindless is that it's a cache instead of a buffer. For a hit, the latency would probably be unaffected, though at worst, it might add a single pipeline stage for the address comparison. For a miss, it would have to read the texture header from memory, which of course would be an indirection. When conventional texture slots are used, it will conveniently map directly into the cache, meaning no misses.

The more interesting question would be about this cache's associativity.

Gipsel · Jan 27, 2014

keldor314 said:
I can't imagine that you wouldn't have a texture header cache/buffer for each TMU. In fact, you would have had this ever since the beginning of multitexturing, since it needs to know information about whatever texture slot is being sampled.

And in case of GCN the scalar L1 caches (backed up by the L2) fulfill the role of this specialised header/descriptor cache.
My point is, that there is no fundamental difference for the indirections, it doesn't matter if the TMUs or the scalar unit is doing it (one could only argue, that the GCN approach may enable a slightly better latency hiding in some cases, and have a smaller directly adressable descriptor buffer [basically the scalar registers] for the traditional bind slot approach; how relevant these points may be in practice is another question).

Andrew Lauritzen · Jan 27, 2014

Gipsel said:
As long as you just have a bindless handle (pointer), it will *always* add an indirection of some kind, not only on GCN.

Sure, but *where the lookup happens* is relevant, as is whether the bindless handle is used for other things. Handles do not necessarily have to be full 64-bit addresses (although they are opaque to the user that way for implementation variance) and smaller numbers of bits can make things like cache addressing cheaper.

Gipsel said:
The TMUs need to connect the handle/pointer to information about the data format and layout ... that has to be stored somewhere and an access to that information has to happen at some point. If that is done by GCN's scalar unit or by the TMUs itself, isn't a principal difference.

It is different in the details though. In an architecture where the TMU takes a handle and caches/reads descriptors itself there is no advantage to having contiguous descriptors like Mantle has (and indeed collecting the data together is some CPU/memory traffic overhead). If there was no advantage on GCN I'm assuming they would not have done it that way (recall I did start this entire conversation by saying I expect it to be very slight).

GCNs architecture obviously works fine in either case, but the choices with their texture path were definitely made in the context of fairly wide SIMD and tightly-coupled TMUs. When thinking about future hardware and API evolution, it's hard to claim that is necessarily where everything is going to go, especially given that GCN has yet to prove itself scalable down to low power envelopes. It may be just fine, but we'll have to see.

Anyways I don't think any of this is a big deal, I was just using it as another example of a place where it's not clear whether generality comes with a power cost (in either implementation). The power efficiency of the vertex pulling case is more relevant but no one has gotten to testing that yet

(And again, these need to be tested on hardware that has fixed function for the specific case... saying something isn't faster on programmable hardware is not useful unless that hardware is provably more power-efficient at the task to start with.)

Andrew Lauritzen · Jan 27, 2014

Gipsel said:
and have a smaller directly adressable descriptor buffer [basically the scalar registers] for the traditional bind slot approach

Are these registers not wiped/unreliable between shader invocations? i.e. that would only be useful for multiple sampling of the same texture in one shader?

Ethatron · Jan 27, 2014

I expect texture-"handles"/"descriptors" are uniforms on GCN, thus cacheable in a variety of ways. For as long as instances of shaders spawn which keep the descriptor in cache there should no reason to evict it. Maybe bubbles of other shaders or cache-pressure moves it down the hierarchy occationally.

sebbbi · Jan 27, 2014

keldor314 said:
I can't imagine that you wouldn't have a texture header cache/buffer for each TMU. In fact, you would have had this ever since the beginning of multitexturing, since it needs to know information about whatever texture slot is being sampled.

The difference with bindless is that it's a cache instead of a buffer. For a hit, the latency would probably be unaffected, though at worst, it might add a single pipeline stage for the address comparison. For a miss, it would have to read the texture header from memory, which of course would be an indirection. When conventional texture slots are used, it will conveniently map directly into the cache, meaning no misses.

The more interesting question would be about this cache's associativity.

That's not how it works on GCN. You need to load the resource descriptor of a texture to a scalar register before you can sample from that texture (bindless or not, it always works like this). Sampling instruction takes the scalar register as an input parameter (along the vector of 64 UVs, and optionally the 64 mips/gradients). Moving the scalar instruction along the texture coordinates to the TMU shouldn't cost much extra (it's much less data than the 64 UVs + optional mips/gradients).

I don't know how exactly the hardware moves the register contents (64xUV and the resource descriptor) to TMU when it executes a sampling instruction (this isn't described in the AMD Sea Islands instruction set guide). Most likely the register files are very close to the TMU hardware, making this as efficient as possible.

Many draw calls with pulling, bindless, multidrawindirect, etc.

sebbbi

Andrew Lauritzen

Moderator

Gipsel

RecessionCone

Alexko

pjbliverpool

B3D Scallywag

MJP

Andrew Lauritzen

Moderator

sebbbi

Gipsel

MJP

Andrew Lauritzen

Moderator

Rodéric

a.k.a. Ingenu

Gipsel

keldor314

Gipsel

Andrew Lauritzen

Moderator

Andrew Lauritzen

Moderator

Ethatron

sebbbi