Many draw calls with pulling, bindless, MultiDrawIndirect, etc.

Andrew Lauritzen

This is a continuation of some twitter conversation and some of the discussion in the Mantle thread. Basically the question is this: given all of the hype around low-overhead "many draw call" submission these days, does having the GPU effectively "pull" state, vertices, and texture descriptors bypass the overhead and mitigate the need for large API changes in the short term?

Some background reading...
http://www.openglsuperbible.com/2013/10/16/the-road-to-one-million-draws/
http://www.g-truc.net/doc/OpenGL 4.4 review.pdf
http://on-demand.gputechconf.com/gt...32-Advanced-Scenegraph-Rendering-Pipeline.pdf
https://twitter.com/g_truc/status/409054369967910913 and followup

I think there's some compelling evidence that this is doable (at least 100k levels, maybe much more), but I'll let some of the other guys fill in their experience.

[Edit] More reading/watching
http://www.youtube.com/watch?v=-bCeNzgiJ8I
http://schedule.gdconf.com/session-id/828316
 
I believe so. I mean, if all I have to do is set up one buffer entry for each draw call on the CPU side, and the shader can pull everything out of this buffer, the number of CPU-side state changes etc. basically drops to next to zero. The biggest problem for me when combining many draw calls together is changing state, buffers, or textures; if this is done inside a shader using a pull model, problem solved. At this point, the CPU is left with preparing & issuing draw calls and managing the resources. In the next step, the issuing of draw calls should also be moved onto the GPU, at which point there's basically no overhead any more. We're not going to be able to prepare all draw calls on the GPU as gameplay still runs on the CPU, but culling, level-of-detail selection, determining which shader to bind, creating shadow maps, can all be done on the GPU then.
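
To make that concrete, here is a minimal sketch of what such a per-draw "pull" entry might look like (the struct layout, names and the appendDraw helper are hypothetical, just to illustrate the idea of one CPU memory write per draw):

```cpp
#include <cstdint>

// Hypothetical per-draw record: the CPU writes one of these per visible
// object, and the shader pulls everything else (transform, material,
// bindless texture handles) through it. No per-draw API calls needed.
struct DrawRecord {
    float    world[16];      // object-to-world matrix
    uint32_t materialIndex;  // index into a material / texture-handle array
    uint32_t vertexOffset;   // where this mesh starts in the shared vertex buffer
    uint32_t lodLevel;
    uint32_t pad;            // keep 16-byte friendly alignment for std430
};

// CPU side: just a memory write into a persistently mapped (or
// glBufferSubData-updated) storage buffer. The shader indexes the array
// with gl_DrawIDARB or a per-draw baseInstance.
inline void appendDraw(DrawRecord* mapped, uint32_t slot, const DrawRecord& rec)
{
    mapped[slot] = rec;
}
```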
 
It's worth pointing out that low overhead drawcalls and GPU pull are not mutually exclusive. If we can have 10x as many draw-calls, or multidraw with GPU pull 10x as many instances, why not combine and get 100x? Sure, we will most likely end up hitting the GPUs own limitations, but still, more tools in the toolbox is a good thing. Sometimes it's easy to batch and multidraw, other times in a full blown game engine it's not so easy. I'm working on some instancing stuff right now and it's a pain. Let's not forget that dynamically managing objects comes with a bunch of overhead too. Filling instance buffers with data, and separate lists for different render passes because of culling, it all adds up. Adding extra indirections on the GPU also adds costs. If I could just brute force through the list instead, it would save a lot of data shuffling, but with D3D today it's not realistic.

But anyway, the thing I find the most exciting with Mantle is not so much the number of draws I can do, but rather some of the new features.
 
It's worth pointing out that low overhead drawcalls and GPU pull are not mutually exclusive.
Sure, of course. This is not meant to be a "one or the other" sort of situation in any case, but more a discussion of the current state of GPU-driven rendering. That's indeed one other advantage of pulling stuff... in the long run it's one way to get to stateless APIs and GPU self-dispatch.

Sometimes it's easy to batch and multidraw, other times in a full blown game engine it's not so easy. I'm working on some instancing stuff right now and it's a pain.
No doubt, but I'd be interested in the discussion around the fundamental cases here, not really what the API makes "easy" per se. Also note that while instancing is obviously fairly limited, with vertex pulling and bindless textures there is very little that you can't do in some manner or another. Changing shaders is slightly annoying (uber-shaders basically) but it's not clear that high frequency shader changes are a great plan in any future anyways (seems like shader variation has lessened if anything with PBR, deferred and loops). Opinions?

Let's not forget that dynamically managing objects comes with a bunch of overhead too.
It does, and let's discuss that :) That said, we get into the murky waters of architecture dependence quickly. Obviously stuff like bindless textures will incur some amount of overhead on GCN (an indirection), but that is not necessarily true on other architectures. Similarly pulling vertex attributes on GCN is no more expensive than using the IA, but that's not true on other architectures.

Constants are an interesting case as some hardware still has special handling for them. In the long run I could certainly imagine general purpose caches taking care of that though too, and I'm not convinced it would even be very bad in the short term. Needs testing.

So obviously in the future the hardware could be designed to be efficient in various cases, but even without that the question is really how much GPU performance we're talking about. I think this needs some more serious testing because for all the folks who are complaining about how fast GPUs are and how much the CPU/API is bottle-necking them, I'm sure they'd be willing to trade a few % of GPU performance to basically wipe out the CPU overhead :)

But anyway, the thing I find the most exciting with Mantle is not so much the number of draws I can do, but rather some of the new features.
Sure, this is not meant to be a Mantle thread, that's just what brought up this conversation in the original tweet. Obviously features and extensions are always cool, but they're sort of a tangential discussion even in the context of Mantle (people have been doing GL and DX "extensions" for years).
 
I'm working on some instancing stuff right now and it's a pain.
I dislike traditional geometry instancing. It ruins depth ordering. If you have good content (each mesh has 5-10 LODs), there aren't actually that many copies of the same identical mesh to render at once.
Let's not forget that dynamically managing objects comes with a bunch of overhead too. Filling instance buffers with data, and separate lists for different render passes because of culling, it all adds up.
Mapping huge amounts of dynamic vertex buffer data per frame (to update instancing data) is not fun in my book either. That's why I prefer to have the whole world as GPU-accessible data, and skip the management/update stuff completely. The GPU can just fetch the data it needs by itself without any CPU-side buffer management.
Adding extra indirections on the GPU also adds costs. If I could just brute force through the list instead, it would save a lot of data shuffling, but with D3D today it's not realistic.
The extra cost heavily depends on the access pattern. Fortunately modern GPUs have general purpose L1 and L2 caches, and as long as you can make your indirections cache friendly, the performance is very good (at least in our use cases).
Constants are an interesting case as some hardware still has special handling for them. In the long run I could certainly imagine general purpose caches taking care of that though too, and I'm not convinced it would even be very bad in the short term. Needs testing.
On GCN, constant buffer access is exactly as fast as accessing a dynamic ("SRV/UAV") buffer... assuming of course that both buffers are fetched using the scalar unit. Unfortunately HLSL doesn't allow writing scalar instructions manually, so you have to lean on the compiler... and unfortunately the compiler isn't always smart enough. I would definitely want to have some kind of extension to write warp/wave wide scalar code in my shaders (just to be sure that everything goes as planned).
Similarly pulling vertex attributes on GCN is no more expensive than using the IA, but that's not true on other architectures.
Storing vertex data in general purpose buffers and fetching it from there in vertex shader isn't slow either. This way you can lay out your vertex data as SoA, which is usually the most efficient memory layout for the GPU, assuming that your access pattern is mostly linear (and you can pretty much ensure that). Unfortunately the post-transform cache of current GPUs doesn't work without index buffers, so you will lose some efficiency there. Index buffers in most hardware are still a completely fixed function feature (with their own data pathways and all, and no programmable access from the shader cores).
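
As a rough illustration of the SoA vertex pulling described here (GLSL 4.3-level SSBOs; the buffer layout and names are just an assumed example):

```cpp
// Vertex shader source as a host-side string: positions and normals live in
// separate tightly packed arrays and are fetched by gl_VertexID, bypassing
// the input assembler. With an indexed draw, gl_VertexID still comes from
// the fixed-function index buffer, so the post-transform cache point above
// still applies.
const char* kVertexPullingVS = R"GLSL(
#version 430
layout(std430, binding = 0) readonly buffer Positions { float pos[]; };
layout(std430, binding = 1) readonly buffer Normals   { float nrm[]; };
uniform mat4 viewProj;
out vec3 vNormal;
void main()
{
    int i = gl_VertexID;
    vec3 p  = vec3(pos[3*i + 0], pos[3*i + 1], pos[3*i + 2]);
    vNormal = vec3(nrm[3*i + 0], nrm[3*i + 1], nrm[3*i + 2]);
    gl_Position = viewProj * vec4(p, 1.0);
}
)GLSL";
```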
Changing shaders is slightly annoying (uber-shaders basically) but it's not clear that high frequency shader changes are a great plan in any future anyways (seems like shader variation has lessened if anything with PBR, deferred and loops). Opinions?
We have been using deferred shading since 2007 and virtual texturing since 2009. Because of these two technologies, both the data outputs and inputs of our g-buffer shaders have been fixed for a long time (12 texture inputs & outputs per pixel). Now that we have also moved to physically based shading, there's no need for artist-controllable fake cube reflections or other odd texture inputs in the g-buffer shaders either. The lighting pass does a few cube lookups for each pixel, because in the real world every material has specular, and reflections = specular. No need to fake these things anymore. Faked things always break in special cases (glowing reflections in the dark, etc.), and as our games are based on user created content, we don't want that.

If you are using all these three techniques (and I'd say you should), then you don't need that many different g-buffer shader variations. If you want to have more flexibility, just write the virtual texture UV and your material ID to the g-buffer. In the lighting shader, you can then bin (cluster) things based on material ID (in addition to lighting parameters), and run any amount of different per pixel lighting/material formulas you want.
glMultiDrawArraysIndirect is fine if you know the draw call count on the CPU side. However that API doesn't support feeding the draw call count from a GPU-generated buffer. In a fully GPU-driven rendering pipeline, the CPU has no clue what's happening, and thus cannot even give a ballpark estimate of the draw call count you need. The CPU might not even know how many viewports you are going to render (the shadow map count varies).

Of course you can still zero out the DrawArraysIndirectCommand struct's triangle counts (by GPU) for the excess draw calls (just assume a million draws every time). This however doesn't save much GPU time (confirmed by Riccio's measurements). The GPU still needs to start a million empty draws, and each eats almost as many cycles as a tiny draw call (and with almost 1M draws you have to assume that the huge majority of your draws are tiny, as you are limited by GPU primitive rate).
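
For reference, this is the indirect command layout being zeroed (as defined for glMultiDrawArraysIndirect in GL 4.3 core); the comments are mine:

```cpp
#include <GL/gl.h>  // or your GL loader of choice

// One record per draw in the GL_DRAW_INDIRECT_BUFFER. A GPU culling pass
// can write count = 0 (or instanceCount = 0) for rejected draws, which is
// exactly the "million empty draws" situation described above.
struct DrawArraysIndirectCommand {
    GLuint count;          // vertex count; zero = the draw does nothing
    GLuint instanceCount;  // instance count
    GLuint first;          // first vertex
    GLuint baseInstance;   // handy for passing a per-draw index to the shader
};

// A single call then submits the whole packed array:
// glMultiDrawArraysIndirect(GL_TRIANGLES, nullptr, maxDraws,
//                           sizeof(DrawArraysIndirectCommand));
```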

OpenGL 4.4 adds a new feature (ARB_indirect_parameters) to fetch the "maximum" draw call count from a GPU buffer, but it doesn't guarantee that the driver will do any fewer draw calls than the parameter set by the CPU. So it's kind of an optimization hint. Also currently only Nvidia supports OpenGL 4.4 (beta driver). Neither AMD nor Intel has announced any plans to support OpenGL 4.4. ARB_indirect_parameters is a very good feature indeed (assuming it actually cuts draw calls), but unfortunately a production quality GPU-driven rendering pipeline cannot be built on top of the promise that sometime in the future we will maybe get broad support for a critical enabling feature.
I wish DirectX 11.3 would support at least everything in chapters 1-3, ARB_compute_variable_group_size, and of course glMultiDrawArraysIndirect from OpenGL 4.3. We need all these features badly.
I think there's some compelling evidence that this is doable (at least 100k levels, maybe much more), but I'll let some of the other guys fill in their experience.
Yes, it is definitely doable. But the reason you cannot push much more than 500k objects is that 520k objects/frame * 64 triangles/object * 60 frames/second = 2000M triangles/second = the maximum triangle rate of a Radeon 7790 / 7970 GE. And this assumes that you have only 64 triangles (on average) in each rendered object. In a realistic scenario, however, your triangle counts per object are going to be higher (unless you have a very long view range and very good LODs). And you need to do lighting and post processing as well; you can't spend 100% of your frame time pushing triangles :)
 
Overall, I think that making each and every API call as cheap as possible and setting things up so that you can run on multiple CPU cores and such is fundamentally not a forward looking approach to high performance graphics. We have these enormously powerful GPUs and in them, relatively powerful microprocessors which are in charge of everything. Having an API call for every little state (even if all it does is put commands in a command buffer and is callable from any thread) is really just micro-management. It's never going to be efficient. This is to draw setup as immediate mode is to vertex buffers.

In my view of things, what we need to be able to do is treat GPUs as data-driven devices, not command-driven devices. One might argue that command buffers are data and we're just building an API to construct them. However, you can think of a command buffer as more of an instruction sequence, decoded and acted upon by the GPU in a quite procedural manner. Each command generally results in a sequence of register writes. We need to get graphics state out of physical registers (which are a limited and expensive hardware resource) and into memory (which is effectively free). This is the trend - GCN has more state in memory than its ancestors, as did they than theirs. Once everything's in memory, "state changes" (in the traditional sense) are simply pointer swaps. The properties of each surface, material, mesh, or whatever you want to render are simply a state vector in memory, and the only parameter to the draw is which state vector you want to access. Once the state vector is in memory, filling it in is just writing data into a GPU-visible buffer, and no API is required at all. The fastest code is the code that never executes.

We're getting there. UBO (constant buffers) enable a pretty much arbitrary amount of data to be used as constants. If you're changing the content of such a buffer right before the draw, you're doing it wrong. Various buffer and image models give shaders the ability to pull their own data from memory which, if your VS is an uber shader of some form, essentially makes vertex layout data driven. With bindless textures, even texture bindings become a thing of the past. The draw indirect thing just makes parameters to draws plain old data as well.
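
A rough sketch of how the bindless part looks with ARB_bindless_texture (API calls from the extension; the buffer layout and names are just an assumed example, and error handling is omitted):

```cpp
// Host side: turn a texture object into a 64-bit handle that is plain data.
// Once resident, the handle can sit in a material record like any other
// value, so there is no glBindTexture / glActiveTexture per draw.
GLuint64 makeResidentHandle(GLuint textureObj)
{
    GLuint64 handle = glGetTextureHandleARB(textureObj);
    glMakeTextureHandleResidentARB(handle);
    return handle;
}

// Shader side (fragment shader snippet kept as a host-side string): the
// sampler is reconstructed from the stored handle at use time.
const char* kBindlessFragmentSnippet = R"GLSL(
#extension GL_ARB_bindless_texture : require
layout(std430, binding = 2) readonly buffer Materials {
    uvec2 albedoHandle[];   // packed 64-bit texture handles, one per material
};
// vec4 albedo = texture(sampler2D(albedoHandle[materialIndex]), uv);
)GLSL";
```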

There's still a fair bit of state that's register-based (talking about GCN here), but there are ways around that. I talked about some of that in the blog post linked at the top of this thread. For example, it's probably fine to leave blending on (GL_ADD, GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA) so long as you can guarantee that opaque fragments have an alpha value of one. Render target (FBO/attachment) switches can be avoided with judicious use of array attachments. UBO and other buffer binds can be eliminated by staging larger amounts of data in memory and then offsetting into it. Bindless textures are also a good thing.
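
For the buffer-offsetting point, a small sketch of the usual pattern (assuming some PerBatchConstants struct and a bigConstantBuffer that was filled up front):

```cpp
// All batches' constants were written into one big buffer ahead of time;
// selecting batch i is then just a range bind (or an offset the shader
// indexes with), not a buffer upload or a new buffer bind per draw.
GLint uboAlign = 256;
glGetIntegerv(GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT, &uboAlign);  // often 256

const GLsizeiptr slice =
    ((GLsizeiptr)sizeof(PerBatchConstants) + uboAlign - 1) / uboAlign * uboAlign;

glBindBufferRange(GL_UNIFORM_BUFFER, /*binding*/ 0, bigConstantBuffer,
                  (GLintptr)batchIndex * slice, sizeof(PerBatchConstants));
```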

Depth and stencil state is still an issue (changing the sense of the depth or stencil tests). The uber-shader approach works well so long as the variations of the shader have similar complexities. If there's a big disparity, the simpler variations may not perform as well as they could. However, this might be a tradeoff that we're willing to make to avoid program switches, which are quite costly. Reconfiguring the full pipeline (turning tessellation and geometry shaders on and off, for example) is also pretty expensive from a GPU execution point of view, and simply leaving all the stages on is cost prohibitive (as much as some will advocate "tessellate everything").

In my experiments on AMD hardware, I've been able to easily hit a range of 10 to 20 million draws per second (enough for at least 200K draws per frame at 60Hz) using stock OpenGL (no extensions that aren't in core). To hit this rate, the draws have to be really pretty light weight, otherwise you end up bumping into other limits (generally in the front end). Once you put real, physical state changes in there (the kind that hits registers), the hardware just doesn't go that fast no matter how you drive it.

The beauty of the data-driven GPU paradigm is that the GPU can produce its own data. The low hanging fruit is simply to walk the scene graph on the CPU, stuff parameters in a buffer and throw it at the GPU. However, the exciting part is moving that scene traversal (or even generation) onto the GPU. Culling in a VS is fun. Generate vertex data in a compute shader, stuff references to it into an indirect draw buffer and then dispatch the list - procedurally generated forest, anyone?

But anyway, the thing I find the most exciting with Mantle is not so much the number of draws I can do, but rather some of the new features.

Which features are most interesting? Not sure if we can talk about them here and that's not what this thread is about, but out of curiosity, what graphics features are you after?

OpenGL 4.4 adds a new feature (ARB_indirect_parameters) to fetch the "maximum" draw call count from a GPU buffer, but it doesn't guarantee that the driver will do any fewer draw calls than the parameter set by the CPU. So it's kind of an optimization hint.

That's not actually true. The exact draw call count does come from the buffer. The CPU-supplied value is the maximum and is a hint. Effectively, it enables implementations that DMA parameters, kicking off the DMA with the 'fixed maximum' and then terminating it early using the loop count.
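
Roughly how that looks from the application's point of view (signatures paraphrased from memory of ARB_indirect_parameters; check the registry text before relying on the exact parameter types):

```cpp
// The GPU-written draw count lives in a buffer bound to GL_PARAMETER_BUFFER_ARB;
// the CPU-supplied maxDraws is only the upper bound / hint discussed above.
glBindBuffer(GL_DRAW_INDIRECT_BUFFER, indirectCmds);  // DrawArraysIndirectCommand[]
glBindBuffer(GL_PARAMETER_BUFFER_ARB, gpuDrawCount);  // single GLuint written by a compute pass

glMultiDrawArraysIndirectCountARB(GL_TRIANGLES,
                                  /*indirect offset*/  nullptr,
                                  /*drawcount offset*/ 0,
                                  /*maxdrawcount*/     maxDraws,
                                  sizeof(DrawArraysIndirectCommand));
```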

Also currently only Nvidia supports OpenGL 4.4 (beta driver). Neither AMD nor Intel has announced any plans to support OpenGL 4.4. ARB_indirect_parameters is a very good feature indeed (assuming it actually cuts draw calls), but unfortunately a production quality GPU-driven rendering pipeline cannot be built on top of the promise that sometime in the future we will maybe get broad support for a critical enabling feature.

We're working on support and will be rolling it out over the next few driver releases. This includes OpenGL 4.4 and the co-released ARB extensions.

I wish DirectX 11.3 would support at least everything in chapters 1-3, ARB_compute_variable_group_size, and of course glMultiDrawArraysIndirect from OpenGL 4.3. We need all these features badly.

If you want access to features that are only available in OpenGL... use OpenGL. :)
 
I wonder if NV_bindless_multi_draw_indirect (http://www.opengl.org/registry/specs/NV/bindless_multi_draw_indirect.txt, July 2013) is compatible with the new ARB_indirect_parameters (http://www.opengl.org/registry/specs/ARB/indirect_parameters.txt, June 2013)? I don't have Nvidia hardware (+ newest beta drivers). These two together would make many great things possible. And it would be even better if we got an ARB version of bindless_multi_draw_indirect (hopefully in OpenGL 5.0). That would mean that all three GPU manufacturers would eventually support it. And it would be even better to have these in DirectX 11.3 (or 12.0), but it seems that Microsoft's upgrade pace has slowed down dramatically from the good old times. It used to be that OpenGL was always lagging behind, but nowadays they get all the new toys first :(
 
That's not actually true. The exact draw call count does come from the buffer. The CPU-supplied value is the maximum and is a hint. Effectively, it enables implementations that DMA parameters, kicking off the DMA with the 'fixed maximum' and then terminating it early using the loop count.
It's good to hear that the actual implementation will be efficient :)
For example, it's probably fine to leave blending on (GL_ADD, GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA) so long as you can guarantee that opaque fragments have an alpha value of one.
And for even better results, you'd want to set the source blend factor to one. This old-school premultiplied alpha trick allows you to render both additive and alpha blended (lerp) transparencies using the same draw call. Good for interleaving smoke particles with fire, for example.
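
A minimal sketch of that blend state (GL calls assumed for illustration):

```cpp
// One blend state covers everything when colors are premultiplied:
//   result = src.rgb + dst.rgb * (1 - src.a)
glEnable(GL_BLEND);
glBlendEquation(GL_FUNC_ADD);
glBlendFunc(GL_ONE, GL_ONE_MINUS_SRC_ALPHA);

// Shader output convention:
//   opaque    -> vec4(rgb, 1.0)      (fully replaces the destination)
//   lerp      -> vec4(rgb * a, a)    (classic alpha blend, premultiplied)
//   additive  -> vec4(rgb, 0.0)      (destination untouched, pure add)
```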
In my experiments on AMD hardware, I've been able to easily hit a range of 10 to 20 million draws per second (enough for at least 200K draws per frame at 60Hz) using stock OpenGL (no extensions that aren't in core). To hit this rate, the draws have to be really pretty light weight, otherwise you end up bumping into other limits (generally in the front end).
At 200K draws you will be primitive bound very easily. However I firmly believe that game draw distances will rise dramatically as GPU-driven rendering pipelines start seeing large scale adoption. As the visible area (usually terrain surface) grows by n^2 (where n is the view distance), the huge majority of the draw calls will actually be very distant ones (very small = low triangle count). Artists need very good (automated) tools for LOD creation (as you'll need 10+ LOD levels for every mesh or you will be badly primitive bound).
The beauty of the data-driven GPU paradigm is that the GPU can produce its own data. The low hanging fruit is simply to walk the scene graph on the CPU, stuff parameters in a buffer and throw it at the GPU. However, the exciting part is moving that scene traversal (or even generation) onto the GPU. Culling in a VS is fun.
Agreed :). However I personally prefer to cull things in a compute shader, since that is a fixed cost operation (with a depth pyramid). And it's pretty much free when combined with other scene setup stages.
Which features are most interesting? Not sure if we can talk about them here and that's not what this thread is about, but out of curiosity, what graphics features are you after?
I would personally want most of the GCN hardware features exposed in a general purpose PC API (preferably next DirectX, but OpenGL is fine as well). It would make porting easier to PC (no need to write slow PC code paths). Things like direct access to GPU depth/color compression data, hi-Z data, AA data, etc. Asynchronous compute would be great as well.
We're working on support and will be rolling it out over the next few driver releases. This includes OpenGL 4.4 and the co-released ARB extensions.
Great news :)
If you want access to features that are only available in OpenGL... use OpenGL. :)
I have thought about entering the dark side, but our DirectX based GPU-driven renderer already pushes 500k visible objects per frame at 60 fps (at 97% of theoretical primitive rate of the AMD hardware). But lately it has been more and more clear that the dark side gets all the new toys, and these new OpenGL 4.4 toys really seem to hit my soft spot :(
 
I have thought about entering the dark side, but our DirectX based GPU-driven renderer already pushes 500k visible objects per frame at 60 fps (at 97% of theoretical primitive rate of the AMD hardware). But lately it has been more and more clear that the dark side gets all the new toys, and these new OpenGL 4.4 toys really seem to hit my soft spot :(

I think for the PC community there's a fairly significant marketing advantage in being able to say "look, we use (API) technology that's not available to all those other developers using old, kludgy DX11.2". Just look at how excited people are about Mantle.

OGL is no Mantle even in its latest incarnation, but if your marketing guys just do a few interviews highlighting "what was possible with your groundbreaking new OGL implementation", I think that would be a pretty big selling point. And if the benefits are genuinely great enough to show the results on screen in a noticeable way, then even better. I think people (and developers) underestimate the importance of graphics hype. People are fond of saying graphics don't make a great game (and they're correct), but they do give a huge marketing advantage.
 
I wonder if NV_bindless_multi_draw_indirect (http://www.opengl.org/registry/specs/NV/bindless_multi_draw_indirect.txt, July 2013) is compatible with the new ARB_indirect_parameters (http://www.opengl.org/registry/specs/ARB/indirect_parameters.txt, June 2013)?

Not yet; improving such mechanisms so that the GPU creates its own work efficiently is work in progress. Currently NV_bindless_multi_draw_indirect provides a slightly optimized path when you set the instanceCount of a draw command to zero.
 
We have been using deferred shading since 2007 and virtual texturing since 2009. Because of these two technologies, both the data outputs and inputs of our g-buffer shaders have been fixed for a long time (12 texture inputs & outputs per pixel). Now that we have also moved to physically based shading, there's no need for artist-controllable fake cube reflections or other odd texture inputs in the g-buffer shaders either. The lighting pass does a few cube lookups for each pixel, because in the real world every material has specular, and reflections = specular. No need to fake these things anymore. Faked things always break in special cases (glowing reflections in the dark, etc.), and as our games are based on user created content, we don't want that.

The problem is that the standard (Blinn-Phong, I believe) specular model is woefully inadequate for many types of materials, namely, just about anything other than plastic.

Some good examples are anisotropic reflectance, such as velvet or brushed metal (pots and pans, anyone?) or hair; retroreflection, as found in ceramics; and Fresnel effects, such as polished wood or asphalt or sand (think mirages in the distance). Even things like metal look wrong since the specular falloff function is incorrect. And of course, let's not even mention human skin.

The problem is that getting complex BRDFs with deferred rendering is very difficult. In addition to heavy divergence caused by many materials with radically different shading, simply storing the various extra channels associated with various materials becomes a tough problem, especially since you often need multiple samples to avoid aliasing, which may not be blendable before shading (normals come to mind, unless you want all that specular flickering. There are other ways to antialias this, but then you need more channels again...)

I suppose you could have every pixel point to a material description structure, then use it to do all the relevant texture samples (you'd absolutely need bindless textures for this!), but by this point, you've already lost most of the performance benefits of going deferred in the first place. I suppose you still avoid overdraw.

Then there's the issue of transparency, as well as volumetrics.
 
The problem is that the standard (Blinn-Phong, I believe) specular model is woefully inadequate for many types of materials, namely, just about anything other than plastic.
It's not woefully inadequate (when it is normalized and used properly with all the other components). The errors are minimal at most (for metals at least). Most game developers are completely happy with a material definition model that has only four inputs (diffuse.rgb, specular.rgb, roughness and normal). This data set can describe most materials seen in the real world. It's also intuitive and easy to use for artists (you need to be productive after all; if your art team can't be productive, the result will look bad no matter how fancy the techniques you use).

This is the problem set of game developers:
http://seblagarde.wordpress.com/2011/08/17/hello-world/
http://seblagarde.wordpress.com/2011/08/17/feeding-a-physical-based-lighting-mode/

A physically based rendering model is already a huge step away from the hacks we had in the past generation. Compared to the old techniques it looks quite real. Of course you'd want to add some minor tweaks in the future to support all kinds of special cases, but let's be realistic here, we have much bigger problems in real time graphics to solve first. We can render 90%+ of the surfaces seen in the real world with plausible quality, and artist tweaks can handle the rest of the cases quite admirably.

Skin of course needs SSS, but that is actually easier to do efficiently with virtual texturing (texture space lighting and light blurring).
... since you often need multiple samples to avoid aliasing, which may not be blendable before shading (normals come to mind, unless you want all that specular flickering. There are other ways to antialias this, but then you need more channels again...)
Toksvig mapping doesn't require extra channels (you can bake it into your roughness). It is just a few extra instructions, and is thus becoming very popular in real time graphics. Some LEAN mapping variants are also quite efficient.
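
For reference, a small sketch of the Toksvig adjustment as I recall it from Toksvig's "Mipmapping Normal Maps" (worth double-checking against the paper; shown here as a bake-time helper):

```cpp
#include <algorithm>

// The length of the mip-filtered normal encodes how much the normals in the
// texel footprint disagree; that is folded straight into the specular power,
// so no extra G-buffer channel is needed -- it can be baked into roughness.
float toksvigPower(float filteredNormalLen,  // |N| after mip filtering, in (0, 1]
                   float specPower)          // original Blinn-Phong exponent
{
    const float len = std::max(filteredNormalLen, 1e-4f);
    const float ft  = len / (len + specPower * (1.0f - len));
    return ft * specPower;                   // anti-aliased exponent
}
```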
I suppose you could have every pixel point to a material description structure, then use it to do all the relevant texture samples (you'd absolutely need bindless textures for this!), but by this point, you've already lost most of the performance benefits of going deferred in the first place. I suppose you still avoid overdraw.
You can cluster pixels requiring different lighting models to different bins, just like you do with lights in "clustered deferred" (http://www.cse.chalmers.se/~uffe/clustered_shading_preprint.pdf). This adds almost no extra cost.

But this thread isn't really about the limitations of deferred rendering. Deferred is still one of the most widely used techniques in real time graphics. GPU-driven processing (or generally pushing huge amount of draw calls without changing state) doesn't add any extra limitations for lighting model handling in deferred rendering.

Then there's the issue of transparency, as well as volumetrics.
These are big problems for all real time rendering approaches. A single full screen transparency in forward rendering doubles your lighting/material cost, and causes a big drop in fps. You'd definitely want to pay a constant cost for your lighting, no matter how many transparencies you might have.
 
Great discussion so far guys!

I would definitely want to have some kind of extension to write warp/wave wide scalar code in my shaders (just to be sure that everything goes as planned).
Not to get too far off track here, but while I agree that expressing different frequencies of execution is important going forward, I don't think bandaging on "warp-wide" commands, "dynamically uniform" concepts and other breaks of the conceptual independent execution mode is the right solution. Something like what ISPC does makes a hell of a lot more sense... allow the shader code to be parameterized on the SIMD width and have the "loops" and synchronization expressed in the user code itself.

Storing vertex data in general purpose buffers and fetching it from there in vertex shader isn't slow either.
I agree it's not "slow", but be careful generalizing the current GCN design to other architectures. There are trade-offs made in latency hiding and such whenever you switch from "push" to "pull", and in vertex cases you have a fair amount of gathers/reorganizing and AoS/SoA conversions going on. In the long run I think pulling vertex data is the way to go, but it's not true to categorically state that it's no worse today.

Of course you can still zero out the DrawArraysIndirectCommand struct's triangle counts (by GPU) for the excess draw calls (just assume a million draws every time). This however doesn't save much GPU time (confirmed by Riccio's measurements).
I saw some curious behavior with this on NVIDIA recently where it seemed like it might actually be reading the buffer back to the CPU anyway and doing the setup/loop there. Zeroing everything was way faster than a GPU frontend would be able to do it, and the runtime debugging stuff noted a read-back of the relevant buffer from GPU -> host on the MultiDrawIndirect call... Not sure it would always do this even if it involved generating a stall (in this case the data was dumped from the CPU anyway), but it was still a little worrying in terms of whether the mechanism is actually doing what it is intended to.

Alas you always have this sort of problem with 3D APIs... driver writers tend to have license to reinterpret whatever API calls they want in the name of "optimization". Hard to fight the market forces behind that one though.

Overall, I think that making each and every API call as cheap as possible and setting things up so that you can run on multiple CPU cores and such is fundamentally not a forward looking approach to high performance graphics.
Welcome to B3D Graham! Glad to have you in the discussion here.

As to your note, I spoke to this a bit in the Mantle thread. While I think it's fine and a worthwhile exercise to lower the overhead of submission as much as possible (remember, every bit helps on power-constrained devices and SoCs), when people start talking about millions of draw calls you do have to start asking some questions about the entire submission model. i.e. why are we creating these systems where you go wide parallel to construct a command buffer that then gets consumed serially by the GPU frontend before it goes wide parallel again? I wonder where the bottleneck is going to be eventually guys... :)

We need to get graphics state out of physical registers (which are a limited and expensive hardware resource) and into memory (which is effectively free).
Yes I think this will be forced by a number of trends, including better pre-emption, shared virtual memory, user space submission, etc. Most GPUs are arguably already more like this than the APIs would have you think right now, but it's worth noting that there's still a bit of a mine-field on which architectures read which things from memory, pipeline which state, etc. Creating a portable API will by necessity involve some compromises in the short term.

Once the state vector is in memory, filling it in is just writing data into a GPU-visible buffer, and no API is required at all. The fastest code is the code that never executes.
It is worth noting that pulling state from memory comes with some additional overhead in terms of latency hiding. You typically need somewhat more hardware threads and registers when you have to launch a shader, *then* go off and look up its inputs, then come back to it. It's not usually a huge deal, but it's more pressure on those aspects than being able to push data into registers before even launching the kernels. We've already done this transition to some extent with constant registers -> constant buffers though and it didn't kill us, so it's certainly doable :) Just want to note it's not completely free.

For example, it's probably fine to leave blending on (GL_ADD, GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA) so long as you can guarantee that opaque fragments have an alpha value of one.
Getting architecture specific there, but yeah :) That said, I don't think blending is a big deal TBH... you almost always want to render all your opaque stuff first, then blended stuff. Two draw indirect calls vs. one = meh :) Same with render targets... these are low frequency changes that should not be a problem even on current APIs.

While it's possible to imagine some uses for high frequency render target changes (rendering multiple shadow cascades, cube faces, voxels, Kniss/Green-style volume half-angle rendering), many of them involve cache thrashing or complex synchronization that wouldn't be that efficient anyway. People have played with GS-driven viewport index stuff to do this today and it usually doesn't end up being great. And no, it's not simply because "GS is slow" :p

The beauty of the data-driven GPU paradigm is that the GPU can produce its own data. The low hanging fruit is simply to walk the scene graph on the CPU, stuff parameters in a buffer and throw it at the GPU. However, the exciting part is moving that scene traversal (or even generation) onto the GPU. Culling in a VS is fun. Generate vertex data in a compute shader, stuff references to it into an indirect draw buffer and then dispatch the list - procedurally generated forest, anyone?
Eliminating state brings us nearer to GPU-driven rendering and I think that's definitely a part of the future. On the other hand though, there's still stuff that a CPU does better (more power-efficiently) and I want to be able to bounce around to where things are the most appropriate. Thus another part of the ideal future IMO is a move away from huge buffering of graphics commands. I don't see a fundamental reason that we can't get graphics submission entirely into user space with a granularity down to the range of <1ms "command buffers" and remain efficient on the GPU side, which opens a lot of doors in terms of heterogeneous computing. This sort of thing is going to be necessary to do a good job of VR anyways, so I think it's generally a reasonable direction to go as well.

I really do want both tools though - GPU fork/join/self-dispatch (for graphics, not just compute) and lower-overhead and latency dispatch from the CPU. All this multi-frame buffering of ~50ms+ needs to go away.

I'll stay out of the deferred shading conversation... sebbbi already covered everything I would have said :) Suffice it to say there are still plenty of reasons to do it and I don't see that changing in the near future.
 
I'm all for having the GPU pull as much data as it can on its own, and avoid the need for the CPU to spoon-feed everything. However if that's the future then I think shader languages and debugging tools will have to advance quite a bit for that model to gain wide adoption. I don't know about you guys, but the thought of traversing complex data structures using HLSL or GLSL doesn't exactly fill my heart with joy.
 
I'm all for having the GPU pull as much data as it can on its own, and avoid the need for the CPU to spoon-feed everything. However if that's the future then I think shader languages and debugging tools will have to advance quite a bit for that model to gain wide adoption. I don't know about you guys, but the thought of traversing complex data structures using HLSL or GLSL doesn't exactly fill my heart with joy.
Visual Studio 2013 GPU debugger is actually very good for compute shader (DirectCompute) development. It supports single stepping and variable inspection, and many other good features. I recommend upgrading to it if you haven't already. Working with compute shaders and not having a good debugger is a HUGE pain in the ass.

But even the VS 2013 debugger has its own limitations and bugs. For anything other than compute shaders, if you are using system value semantics, you cannot single step. That's quite a big problem, since we need vertex ID semantics everywhere. You can still hack around this issue by preparing a vertex buffer that just has 0,1,2,3,4... integers in it. Also, another funny bug is that the debugger shows NaNs as zeros. This has resulted in some hilarious debugging sessions for us :)
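
The workaround described above, sketched (buffer size and names are just placeholders):

```cpp
#include <cstdint>
#include <numeric>
#include <vector>

// Fill a buffer with 0,1,2,3,... and feed it as an ordinary uint vertex
// attribute (e.g. DXGI_FORMAT_R32_UINT in the input layout), so the shader
// can use that instead of the SV_VertexID system value and the graphics
// debugger can single step again.
const size_t kMaxVertexCount = 1 << 20;        // large enough for any mesh
std::vector<uint32_t> vertexIds(kMaxVertexCount);
std::iota(vertexIds.begin(), vertexIds.end(), 0u);
// ...upload vertexIds once as an immutable vertex buffer and bind it
// alongside the regular vertex streams.
```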

One of the remaining problems with DirectCompute and OpenCL (and OpenGL compute shaders) is that we don't have good optimized libraries around. CUDA has lots of nice template libraries that are both efficient (optimized to death) and easy to use (clean and maintainable code).

A question to the OpenGL programmers in this thread: How good are the OpenGL graphics and compute debuggers right now? Do you have single step debugging (for both), and do the available debuggers support all the latest features (and the latest 4.4 extensions)? How good are the profilers (can you see hardware performance counters, etc to figure out the GPU side bottlenecks)?
 
I don't know about you guys, but the thought of traversing complex data structures using HLSL or GLSL doesn't exactly fill my heart with joy.
For sure, but I think that's just more of a general need as GPUs run larger chunks of code. TBH the "data structures" that will most commonly be used for bindless stuff are very simple... arrays, simple arrays of pointers/offsets, etc. Stuff that is coming up in other areas - like acceleration structures, voxels, etc - is where we really need a lot better tools even today :)
 
A question to the OpenGL programmers in this thread: How good are the OpenGL graphics and compute debuggers right now? Do you have single step debugging (for both), and do the available debuggers support all the latest features (and the latest 4.4 extensions)? How good are the profilers (can you see hardware performance counters, etc to figure out the GPU side bottlenecks)?

Debuggers - not so great. There are some options out there but they're not that mature. For AMD, GPUPerfStudio2 works well with OpenGL and supports all OpenGL features up to 4.3 (we haven't shipped 4.4 yet). It does show the hardware performance counters. We're putting a lot of effort into OpenGL support in that tool and we use it internally for performance work.

On a side note, the performance counters in (AMD) OpenGL are exposed through the GL_AMD_performance_monitor extension. This is what GPUPerfStudio2 uses to read them. I hear that Intel have shipped an implementation of that on their Linux stack.
 
Following my experiments, we can already submit loads of draws, and the numbers shown during the Mantle presentation are meaningless. Mantle doesn't solve any actual issue. One actual issue that we do need to resolve:
Shader cross-compilation, by defining a standard shader IL valid for HLSL and GLSL. We need it to be able to fully take advantage of all the OpenGL and Direct3D APIs.

NVIDIA has by far the fastest implementation for submitting draws, reaching >500000 per frame at 60Hz with MultiDrawIndirect. AMD and Intel are not as comfortable with MultiDrawIndirect, but if we use a tight loop instead we can target >150000 draws per frame at 60Hz on Intel and >100000 draws per frame at 60Hz on AMD.

If you want some numbers:
[Image: draws.png - draw submission throughput measurements]

I haven't tried on mobile, but due to the tile-based architectures it's even more important for those GPUs to target the 1 million draws than for desktop GPUs.

1 million draws doesn't increase brute force performance but GPU efficiency. I am expecting 1 or 2 orders of magnitude higher scene complexity / increased efficiency from moving toward the MultiDrawIndirect development mind frame on current GPU architectures, compared to the Xbox 360/PS3 development mind frame.

Mantle is only trying to optimize the Xbox 360/PS3 development mind frame. I am happy for whoever wants to live in the past, but I have a future to build. :p

A modern GPU is no longer an array of ALUs but a cluster of execution units, each capable of addressing different resources and executing different shader code. An interesting number: Kepler is capable of rendering meshes as small as ~50 triangles, and Southern Islands meshes as small as ~300 triangles, before rendering time becomes constant for smaller sets of triangles. So submitting such small sets makes sense. It allows us to do very fine grained culling, hence increasing GPU efficiency drastically. Such fine grained culling enables more clever sorting heuristics and discarding of stuff.

Ultimately, we should be able to render a frame with fewer than 10 draw calls: API calls become much lower frequency updates, and we think in terms of bandwidth instead, with GPU-based resource indexing instead of CPU-based resource switching. The indirect draw buffer becomes a packed in-memory representation of the series of draws that is constantly updated to match the evolution of the scene frame after frame. A cached version of that buffer can even be updated from a separate thread, so that the rendering thread really doesn't have much left to do anymore.

We need better subroutines so that each draw of a multi-draw can execute a different code path, while each draw still sustains maximum efficiency in the execution units' register allocation. We also need some API to handle variations of the vertex shader output interface, for example. Especially in the case of tessellation or geometry shader usage, it's very important to keep those variables packed and only write what we need.

VAOs can get ready to be dropped in that mind frame, replaced by programmable vertex fetching. Shader storage buffers are not flexible enough for my taste here. We should be able to address the memory and reinterpret_cast the data into whatever vertex format is associated with a specific draw. Sorting by vertex format, as advertised by VAOs, is a poor heuristic. Fine grained front-to-back sorting and screen space coherence will harvest better performance by leveraging the cache hierarchy, and push the "parameter buffer full" performance cliff further away on tile-based GPUs. It's also much easier code to write and maintain.

All in all, the stuff that I call programmable vertex pulling puts us in a position with no compromises and only wins. There are so many opportunities to take advantage of in that mind frame.

All this is nice for IHVs to understand how to design future GPUs. For ISVs to move toward such greatness we need to resolve the shader cross-compilation problem by defining a standard shader IL. It limits us because the solutions we currently have are more workarounds than solutions. Hence, we can't take advantage of most of the Direct3D and OpenGL APIs.
 
I spent an hour writing a post detailing so much stuff, but when I submitted it, I guess it timed out. :( A short summary.

Here is some evidence that we can already submit a lot of draws:
[Image: draws.png - draw submission throughput measurements]

Stuff I said in that post: we can't leverage 100% of the OpenGL and Direct3D APIs because of the cross-compilation issue. That's a real issue to solve! We need a standard shader IL for GLSL and HLSL.

Mantle is just an optimization of the programming mind frame used for the Xbox 360 and PS3. GPUs have changed: there are clusters of ALUs, each of them capable of accessing different resources and executing different shader code. With a modern GPU mind frame we don't get better raw performance, but we can increase the scene complexity or the efficiency by 1 or 2 orders of magnitude.

Fine grained culling/sorting with better heuristics, moving updates/binding to much lower update frequencies, replacing CPU-based resource switching with GPU-based resource indexing; a modern GPU mind frame that I like to call programmable vertex pulling.

Interesting numbers: ~50 triangles on NVIDIA and ~300 triangles on AMD. That's the minimum number of triangles we can submit per draw before the rendering time becomes constant. We can submit very small draws!

But to leverage these concepts we need a standard shader IL for GLSL and HLSL, to be able to use all the APIs we could.
 