Quote:
Bindless textures and hardware virtual memory (*) allow rendering in larger batches, thus increasing GPU utilization (the GPU partially idles at the start/end of draw/dispatch calls).

I'm gonna hit this point later, because the discussion of bindless in the developer Q&A here (http://www.brightsideofnews.com/new...ed-with-qa-with-devs-from-oxide-and-dice.aspx) was pretty disappointing. It's clear that you get it, but I'm not sure the folks on the panel had thought about it too much.
Quote:
With Mantle you can run multiple kernels in parallel (or kernel + graphics in parallel) in a controlled way, and thus reduce GPU bottlenecks. For example, render a shadow map (mostly ROP and geometry setup) and an ALU-heavy compute pass (for example, lighting for the previous light source) at the same time.

Not sure if you saw, but I mentioned this in my post as well. The issue is that this is the only example people have given of a case where you're completely not utilizing the ALU array... rendering depth-only. So there's a nice one-time boost we can get during shadow map rendering, but it's not a long-term performance amplifier per se. Also, the more power-constrained a GPU is, the less this will help.
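To make the overlap concrete, here's a toy, CPU-only model of that scheduling pattern (the queue/fence types here are made up, not Mantle's; the real thing obviously submits to hardware queues): the ALU-heavy lighting dispatch for the previous light only depends on an already-finished shadow map, so it can run on a compute queue while the next shadow map rasterizes on the graphics queue.

```cpp
#include <cstdio>

// Hypothetical stand-ins: in a real API these would be hardware queues, command
// buffers and fences. Here they just print, so the sketch runs as a model.
struct CommandBuffer { const char* label; };
struct Fence { bool signaled; };

struct Queue {
    const char* name;
    void Wait(const Fence& f) { std::printf("[%s] wait on fence (signaled=%d)\n", name, f.signaled); }
    void Submit(const CommandBuffer& cb, Fence* signal) {
        std::printf("[%s] execute '%s'\n", name, cb.label);
        if (signal) signal->signaled = true;
    }
};

int main() {
    Queue gfx{"graphics"}, compute{"compute"};
    Fence shadowMapReady{true};   // signaled when the previous shadow map finished

    // Lighting for the *previous* light source is ALU heavy and only needs the
    // shadow map that already exists, so it goes to the compute queue...
    compute.Wait(shadowMapReady);
    compute.Submit({"lighting dispatch (ALU heavy)"}, nullptr);

    // ...while the next shadow map (mostly ROP/geometry setup, ALUs largely idle)
    // rasterizes concurrently on the graphics queue.
    gfx.Submit({"shadow map pass (ROP/geometry heavy)"}, &shadowMapReady);
    return 0;
}
```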
Quote:
Better predicates and storing GPU query results to GPU buffers (without CPU intervention) allow GPU optimization techniques that are not possible with PC DirectX.

They are possible (you can set up arbitrary predicate data), but they are ugly, no doubt. I've asked for this in the past as well.
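For reference, this is roughly the extent of what plain D3D11 gives you today: an occlusion predicate driving conditional rendering. The predicate value stays opaque to shaders, and the query result never lands in a GPU buffer you can read or build on, which is exactly the gap the quote is complaining about. A minimal sketch, assuming a valid device and context (the proxy/object draws are placeholders):

```cpp
#include <d3d11.h>

void DrawIfBoundsVisible(ID3D11Device* device, ID3D11DeviceContext* ctx)
{
    D3D11_QUERY_DESC desc = {};
    desc.Query = D3D11_QUERY_OCCLUSION_PREDICATE;
    desc.MiscFlags = D3D11_QUERY_MISC_PREDICATEHINT;

    ID3D11Predicate* predicate = nullptr;
    device->CreatePredicate(&desc, &predicate);
    if (!predicate)
        return;

    // Fill the predicate by rasterizing a cheap proxy (e.g. a bounding box).
    ctx->Begin(predicate);
    // ... ctx->Draw(...) for the bounding-box proxy ...
    ctx->End(predicate);

    // The GPU skips the following draws if no proxy samples passed the depth test.
    // Note there's no way to read or combine the predicate value in a shader.
    ctx->SetPredication(predicate, FALSE);
    // ... ctx->DrawIndexed(...) for the expensive object ...
    ctx->SetPredication(nullptr, FALSE);

    predicate->Release();
}
```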
Quote:
I was quite worried that SLI/Crossfire would die soon, as many new graphics engines will start doing scene management and rendering decisions on the GPU side.

As much as I agree that SLI is totally dead in the water without explicit separate device enumeration, I'm not sure even that will save it. The market is just too microscopic to justify the quite significant work of making an engine take good advantage of it. I expect a few cases where developers are paid to do it for benchmark wins, but I don't expect much more than slightly-tweaked AFR for most games, and I'm not sure the PC market will tolerate that extra frame of latency going forward. We'll see.
Quote:
Deferred antialiasing will be much more efficient, assuming the "Advanced MSAA features" in the Mantle slides mean that you have direct access to GPU color/depth blocks and MSAA/layer data (including coverage sample index data). With all that data available, tiled (and clustered) deferred renderers can separate pixels (different sample counts) more efficiently and recover geometry edge information (using coverage samples) in a much more precise and efficient way.

Yeah, I spoke to this in my 2010 talk. While it's neat, it's a bit tricky to decide exactly what to expose, and that's why I've been asking Johan for more info on what GPU features will be exposed. Compression is one area of high variance between IHVs, so you couldn't expose the specifics of GCN there and still call it portable. I also highly doubt AMD would be comfortable with pinning down the specifics of their color/depth compression forever (forward compatibility). There are somewhat portable ways to expose a lot of the features that deferred MSAA needs, but we'll see what they do in Mantle when the details are public (next year? early, hopefully?).
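To illustrate the "separate pixels" part: a toy, CPU-side version of the classification a tiled deferred renderer does once it can see per-sample data. The 4x sample count and the "surface id" field are placeholders for whatever the G-buffer/coverage data actually provides; in practice this runs as a compute pass on the GPU.

```cpp
#include <cstdint>
#include <vector>

// Classify pixels so that interior pixels get one lighting evaluation and only
// true geometry-edge pixels pay for per-sample lighting.
struct Pixel { uint32_t surfaceIdPerSample[4]; };   // assuming 4x MSAA

void ClassifyPixels(const std::vector<Pixel>& pixels,
                    std::vector<uint32_t>& litPerPixel,    // 1 lighting evaluation
                    std::vector<uint32_t>& litPerSample)   // 4 lighting evaluations
{
    const uint32_t count = static_cast<uint32_t>(pixels.size());
    for (uint32_t i = 0; i < count; ++i) {
        bool edge = false;
        for (int s = 1; s < 4; ++s)
            edge |= (pixels[i].surfaceIdPerSample[s] != pixels[i].surfaceIdPerSample[0]);
        (edge ? litPerSample : litPerPixel).push_back(i);
    }
}
```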
Quote:
With Mantle, rendering a depth pass to prime the depth buffer in a Forward(+) renderer becomes a completely viable option without running into CPU bottleneck situations.

The GPU side is an issue regardless of the CPU side. You don't want to re-transform/skin/rasterize your entire scene twice, even if the CPU cost were free. Honestly, in the long run pre-z is going to lose out to coarser low-res (and low-latency) conservative depth rasterization (or equivalent ray tracing), and in a similar vein clustered shading (i.e. the GPU pulls data from a light acceleration structure) is going to win for those who want forward rendering engines. There are cases where stuff like tiled forward (aka Forward+) will make sense in the meantime, but mostly because of legacy code. Emil made a pretty compelling argument for why it makes more sense to generate the light acceleration structure before submission, even if submission is cheap.
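Since "light acceleration structure" can sound abstract, here's a rough sketch of the kind of thing meant: a regular cluster grid over the view volume with a compact light list per cluster, built up front so shaders just pull the lights they need. The grid dimensions, fixed view-space volume, and sphere/box test below are arbitrary placeholder choices, not anyone's shipping implementation (a real one would use the camera frustum and typically a logarithmic depth split).

```cpp
#include <cstdint>
#include <vector>

struct Light { float pos[3]; float radius; };   // view-space position + range
struct AABB  { float mn[3], mx[3]; };

struct ClusterGrid {
    static constexpr int X = 16, Y = 8, Z = 24;
    std::vector<uint32_t> lightIndices[X * Y * Z];   // per-cluster light lists
};

// Placeholder bounds: uniform subdivision of a fixed view-space box.
static AABB ClusterBounds(int cx, int cy, int cz) {
    const float extent[3] = { 100.0f, 50.0f, 300.0f };
    const int   dims[3]   = { ClusterGrid::X, ClusterGrid::Y, ClusterGrid::Z };
    const int   idx[3]    = { cx, cy, cz };
    AABB b;
    for (int i = 0; i < 3; ++i) {
        const float step = extent[i] / dims[i];
        b.mn[i] = idx[i] * step;
        b.mx[i] = b.mn[i] + step;
    }
    return b;
}

static bool SphereIntersectsAABB(const Light& l, const AABB& b) {
    float d2 = 0.0f;
    for (int i = 0; i < 3; ++i) {
        if (l.pos[i] < b.mn[i])      { const float d = b.mn[i] - l.pos[i]; d2 += d * d; }
        else if (l.pos[i] > b.mx[i]) { const float d = l.pos[i] - b.mx[i]; d2 += d * d; }
    }
    return d2 <= l.radius * l.radius;
}

// Built once per frame, before (and independently of) geometry submission.
void BuildClusters(const std::vector<Light>& lights, ClusterGrid& grid) {
    for (int cz = 0; cz < ClusterGrid::Z; ++cz)
    for (int cy = 0; cy < ClusterGrid::Y; ++cy)
    for (int cx = 0; cx < ClusterGrid::X; ++cx) {
        const AABB bounds = ClusterBounds(cx, cy, cz);
        auto& list = grid.lightIndices[(cz * ClusterGrid::Y + cy) * ClusterGrid::X + cx];
        for (uint32_t i = 0; i < static_cast<uint32_t>(lights.size()); ++i)
            if (SphereIntersectsAABB(lights[i], bounds))
                list.push_back(i);
    }
}
```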
Quote:
Forward+ is a good solution if your triangle counts are not that high and you don't use heavy vertex animation or tessellation. In low-polygon games (current-gen ports, for example) the depth-only pass is dirt cheap (for the GPU), and quad efficiency is often more than 80% (so you lose less than 20% of your lighting performance). But I just don't see it as a viable technique in the future, especially now that Mantle allows low-level access to the GPU MSAA data on PC as well (this is a huge gain for deferred renderers).

^ What he said. Thanks for going into more detail. Fundamentally I don't think Mantle changes the arguments I made in my 2012 talk on tiled forward vs. deferred (other than potentially making deferred MSAA less expensive); I was making no assumptions about CPU overhead in that analysis.
***
Now back to the bit on bindless... I really wasn't too happy with the answers given in the Q&A. I'm going to write it off to people not having thought a lot about it and being somewhat conditioned in their engine design thinking by the way APIs have always worked up to this point, but they really missed the point of why people are bringing up bindless in this context.
Bindless textures basically remove the last piece of state that changes at high frequency in engines and thus "breaks batches". sebbbi has mentioned this before, but if you want to, you can render at least an entire pass to a set of render targets with one draw call using bindless. Thus the overhead of draw calls is largely irrelevant... even DX today has quite acceptable overhead if you're only talking about tens of draw calls per frame.
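To spell out what "one draw call per pass" looks like on the CPU side (struct names and fields made up for illustration): every per-object binding gets folded into a record in one big buffer, and the shader indexes into that buffer by draw/instance ID instead of anything being rebound between objects.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical per-draw record for a bindless renderer: everything a draw needs
// is addressed by index (or GPU address) and fetched by the shader, rather than
// bound by the CPU between draws.
struct DrawRecord {
    float    worldMatrix[16];   // object transform
    uint32_t albedoTexIndex;    // index into a global bindless texture table
    uint32_t normalTexIndex;
    uint32_t materialIndex;     // index into a material constant array
    uint32_t firstIndex;        // where this object's indices live in the shared index buffer
    uint32_t indexCount;
};

// CPU work for a pass reduces to appending records; uploading 'records' as one
// buffer and issuing a single draw that covers all of them is the whole submission.
void AppendDraw(std::vector<DrawRecord>& records, const DrawRecord& visibleObject) {
    records.push_back(visibleObject);
}
```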
And in doing this, you are giving up no flexibility vs. the low-overhead, small-batch Mantle approach. The results are identical, and arguably it has even lower CPU overhead than Mantle. The only differences are whether you're pushing vs. pulling state and whether the effective command list/buffer is hidden behind an API or just implemented as an arbitrary user-space concept.
Now in theory there can be some additional GPU overhead for bindless, depending on how it is implemented and the GPU architecture. Pushed state is usually easier to deal with in hardware, since pulled state typically requires caches. That said, we're already long past the point where we've transitioned most of the fixed pipelines to generic cached memory accesses (and GCN more than most), so I'm not really concerned about this in the future. You might have to make your constant cache a bit bigger, but you get back the same space from no longer needing the on-chip buffers that used to hold the binding table. It's really not very different.
Basically with pure bindless you can implement the "descriptor set" concept in user space, or however else you want to. In some cases it might make sense to have a simple list of them - such as in Mantle - in others it might make sense for them to be part of some deeper geometry data structure as is necessary if you want to "pull" arbitrary data and do shading more dynamically - such as with ray tracing.
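For example (made-up layouts, purely to illustrate the point): with fully bindless resources a "descriptor set" can be nothing more than an ordinary struct in GPU-visible memory, and where it lives is up to you - a flat per-draw list in the Mantle style, or embedded in a deeper scene structure that a ray tracer walks.

```cpp
#include <cstdint>

// With pure bindless, a "descriptor set" is just a user-space struct living in
// GPU-visible memory; none of these are API objects and the layouts are invented.
struct MaterialDescriptors {          // the "descriptor set": texture handles/indices
    uint32_t albedo, normal, roughness, emissive;
};

struct DrawEntry {                    // option 1: a flat list, one entry per draw
    uint64_t descriptorsGpuAddr;      // pointer to a MaterialDescriptors in GPU memory
    uint32_t firstIndex, indexCount;
};

struct SceneNode {                    // option 2: embedded in a deeper structure,
    float    aabbMin[3], aabbMax[3];  // e.g. a BVH node that a ray tracer walks and
    uint64_t descriptorsGpuAddr;      // "pulls" shading data from on demand
    uint32_t firstChild, childCount;
};
```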
The other advantage of the "pull" bindless model is better distributed scheduling. You effectively parallelize the work of pipelining the relevant state. They touched on this briefly in the Q&A in the question about "moving bottlenecks", but the Mantle model does not scale indefinitely. Even with multithreaded command buffer *creation*, those buffers are still consumed by the GPU front-end in a fairly sequential manner. The front-end at a minimum has to deal with setting up the pipelined state (on GCN I think this takes the form of something like a memcpy of the relevant state vector). Either this is a copy of the "descriptor set" itself - in which case you're talking some number of cycles to copy it - or it's a pointer to the descriptor set. If it's the latter, and later stages "pull" from it, the question arises as to why we even need that concept in the API, since it could as easily be a user-space data structure. I imagine for that reason that some parts of it are indeed pushed through the pipeline.
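A toy illustration of that copy-vs-pointer trade-off (sizes invented, nothing hardware-specific): per draw, the front-end either copies the whole descriptor set into its pipelined state stream, or just forwards an 8-byte pointer and leaves it to later stages to pull the descriptors through the cache hierarchy.

```cpp
#include <cstdint>
#include <vector>

struct DescriptorSet { uint32_t entries[32]; };   // 128 bytes of invented "state"

// "Push": the front-end copies the state vector per draw; cost scales with its size.
void PushState(std::vector<uint8_t>& stream, const DescriptorSet& set) {
    const uint8_t* src = reinterpret_cast<const uint8_t*>(&set);
    stream.insert(stream.end(), src, src + sizeof(set));
}

// "Pull": the front-end forwards only a pointer; shader stages fetch the
// descriptors themselves when (and if) they need them.
void PullState(std::vector<uint8_t>& stream, const DescriptorSet* set) {
    const uint64_t addr = reinterpret_cast<uint64_t>(set);
    const uint8_t* src = reinterpret_cast<const uint8_t*>(&addr);
    stream.insert(stream.end(), src, src + sizeof(addr));
}
```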
But as the AMD guy hinted at, if you want another 10x gain on the 100k batches (Dan mentioned wanting millions in a few years), you're going to be pretty close to the limit of what you can do with the current GPU front-end design (i.e. a small number of cycles per state change/draw)... and parallelizing the front-end would probably start to look a lot more like the pull model anyway, bringing us back to the question of whether we really want this "pushed" state around in the first place.
So yeah, I admit that the question the guy asked was perhaps not specific enough, but I would have liked to hear a better discussion of push vs. pull going forward. I just don't think most game devs have necessarily thought that far out yet. Don't get me wrong, I'll take both lower overhead and bindless, but it's not clear that we really need to go nuts on the former given that we really need the latter regardless.
Sorry for the long post; brain dump