Is everything on one die a good idea?

I think you'd be surprised at how badly serial convergence points actually hurt modern games. Reductions, mip generation, etc. are all measurably becoming larger parts of game frames due to not getting any faster for several generations. This isn't a huge problem yet but it will continue to get worse each year.
Modern games yes, but not the future games that use better APIs than we currently have on PC.

Asynchronous compute helps a lot with that issue, and it also helps with other GPU utilization issues such as fixed function pipelines being the bottleneck (leftover ALU or BW that could be used for compute). As long as you always have more than one task for the GPU to do, convergence points do not matter that much (usually both pipelines are not at a convergence point at the same time). Asynchronous compute also helps with indirect draw calls / indirect dispatch, since you need to set up these stages with a simple compute shader (a single thread + a convergence point).
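As a rough illustration (not our actual code), the host-side DX11 pattern looks something like this; writeArgsCS and workCS are placeholder shader names and error handling is omitted:

```cpp
#include <d3d11.h>

// Sketch of the indirect-dispatch pattern: a single-thread compute shader writes
// the dispatch arguments on the GPU, so the CPU never reads the problem size back.
void dispatchIndirectExample(ID3D11Device* device, ID3D11DeviceContext* context,
                             ID3D11ComputeShader* writeArgsCS,  // placeholder: 1 thread, writes {x,y,z}
                             ID3D11ComputeShader* workCS)       // placeholder: the real pass
{
    // Argument buffer: 3 UINTs, writable as a raw UAV, readable by DispatchIndirect.
    D3D11_BUFFER_DESC desc = {};
    desc.ByteWidth = 3 * sizeof(UINT);
    desc.Usage     = D3D11_USAGE_DEFAULT;
    desc.BindFlags = D3D11_BIND_UNORDERED_ACCESS;
    desc.MiscFlags = D3D11_RESOURCE_MISC_DRAWINDIRECT_ARGS |
                     D3D11_RESOURCE_MISC_BUFFER_ALLOW_RAW_VIEWS;
    ID3D11Buffer* args = nullptr;
    device->CreateBuffer(&desc, nullptr, &args);

    D3D11_UNORDERED_ACCESS_VIEW_DESC uavDesc = {};
    uavDesc.Format             = DXGI_FORMAT_R32_TYPELESS;   // raw view
    uavDesc.ViewDimension      = D3D11_UAV_DIMENSION_BUFFER;
    uavDesc.Buffer.NumElements = 3;
    uavDesc.Buffer.Flags       = D3D11_BUFFER_UAV_FLAG_RAW;
    ID3D11UnorderedAccessView* argsUAV = nullptr;
    device->CreateUnorderedAccessView(args, &uavDesc, &argsUAV);

    // Convergence point: a single-thread shader fills the argument buffer on the GPU.
    context->CSSetShader(writeArgsCS, nullptr, 0);
    context->CSSetUnorderedAccessViews(0, 1, &argsUAV, nullptr);
    context->Dispatch(1, 1, 1);

    // Unbind the UAV, then size the next pass from the GPU-written arguments.
    ID3D11UnorderedAccessView* nullUAV = nullptr;
    context->CSSetUnorderedAccessViews(0, 1, &nullUAV, nullptr);
    context->CSSetShader(workCS, nullptr, 0);
    context->DispatchIndirect(args, 0);
}
```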
When DX11 first came out we went through a similar period of excitement about all the great new stuff you can do with compute... then after a while you start to realize how fundamentally limited the model (and GPUs to some extent) really is.
DX11 didn't have multidraw or asynchronous compute, or other ways to create GPU work using the GPU. A compute pipeline that cannot properly feed the graphics front end is quite limited. Also, first party console developers are only now getting their hands dirty with compute shaders. And cross-platform games could not be designed around a feature that is not available on all the platforms (DirectX 10 PCs and last gen consoles). These facts limited compute shader adoption.

Thus we basically only saw some games add extra optional eye candy with compute shaders. No developer could completely rewrite their engine with compute shaders in mind yet.
Nesting, self-dispatch and work stealing in compute are limited and inefficient even if the hardware is capable.
You can implement these with modern GPUs, but sadly the cross-platform APIs still do not expose all the tools needed.
Memory layouts and access patterns are opaque and impossible to reconcile through compute shaders in a portable way. And so on...
This is true. Access patterns are not easy to describe, and are not portable. You need to optimize the code for a single architecture and to the needs of your data set to get the best performance.
This is probably the biggest issue I see with current GPU *hardware*, and it really does limit what you can do in terms of targeting them with more "grown up" languages. I think it's solvable without massively changing the hardware but I don't see any way around taking *some* GPU efficiency/area hit in the process. It's legitimately more complicated, but it's ultimately necessary.
Kepler can spawn kernels from a kernel (function calls basically). This can be used to avoid the static register allocation problem in some cases. It is a good start, but we need more.
This has yet to be proven in general and fundamentally is pretty architecture-specific. Certainly in the past writing command buffers has been a big overhead but that is changing quickly, and it is far from proven that GPU frontends are as fast/efficient at "gathering" data as CPUs, especially given the frequency disparity.

I think it's fair to say that some element of this stuff will be good to do "on the GPU" in the future, but I think you're making assumptions that aren't warranted for integrated chips where CPUs are very close and memory is fully shared.
I just tested my GPU driven DX11 "asteroid" proto on Haswell HD 4600. It runs slightly faster than your DirectX 12 proto, and renders 5 times as many animating "asteroids" (random objects). My proto also uses less than 0.5 milliseconds of CPU time (from a single CPU thread), so it should give the GPU an even bigger slice of the thermal budget. Just like in your proto, in my proto every "asteroid" can also have its own unique mesh, material and textures (I do not use GPU instancing). In light of this result, I still believe that a GPU driven pipeline is more energy efficient than a traditional CPU driven renderer. All the scene setup steps are highly parallel (viewport and occlusion culling, matrix setup, etc) and the GPU goes through all that super quickly.
As Nick noted, it's available pretty much anywhere and compiles basic obj/lib with optimization targets available for a wide variety of architectures, jaguar included :) It's basically a frontend on LLVM so I don't think compatibility should be a major concern. There's even support for ARM/Neon in there and IIRC an experimental PTX fork too. And of course it's open source so people can add to it and modify it as required.
I need to take another look at it. The last time I looked at it more closely was when it was originally launched. ARM/Neon support is good to have for porting purposes. It's good to know that these platforms are also supported.
With the current consoles having integrated GPUs, it won't take all that long for the resulting characteristics of unified memory and low round-trip latency to become the norm. Discrete GPUs will suffer from this, leaving them no option but to become capable of running scalar sequential code really fast, thus allowing them to run significant portions of the game's code entirely on the GPU. Of course that's just a few steps away from getting rid of the CPU and running everything on the unified GPU. Some might call that a unified CPU though.
Discrete GPUs will eventually be hurt by the unified memory consoles. But first we need the low level APIs (such as DX12 and Mantle) with manual resource management to become popular, before the PC ports can fully exploit the unified nature of integrated GPUs. Obviously the parallel code might also take most of the scene data structures with it and all of that moves to the discrete GPU. So many ways to go...
Meh, you could just as well look at the die shot of a server CPU and proclaim that caches have won, cores are sliding into irrelevance.
Caches won... That's the reality. Most of the optimization work is nowadays done to improve memory access patterns.
 
Asynchronous compute helps a lot with that issue, and it also helps with other GPU utilization issues such as fixed function pipelines being the bottleneck (left over ALU or BW that could be used for compute).
Not convinced. It puts off the issue slightly on some architectures, but it's hardly a holy grail as it does nothing to address the real thread of dependent work through the rendering pipeline. You can't avoid the fact that doing a reduction (for example) is going to have a certain work complexity even with "infinite" parallelism; that's ultimately what Amdahl's law is all about, and hand waving about "oh but we'll just add more data" isn't compelling IMO, even if you now have all this "free" time to do it in. I agree that this is not going to be a major concern on current architectures and renderers but when you're talking architecture in the long run, I haven't met anyone who seriously thinks that you don't need at least one "core" that can rip through this critical dependency chain as fast as possible.

DX11 didn't have multidraw or asynchronous compute, or other ways to create GPU work using the GPU.
Multidraw is mostly irrelevant in this discussion if it just expands to a loop in the GPU frontend; draw indirect has been there since day one. Async compute is also pretty irrelevant to trends... it's a performance optimization for some hardware but does nothing to improve power efficiency (assuming decent power gating on execution units going forward) and honestly beyond the initial "we can fill in a bit of empty time during shadow rendering" if you're getting much of a gain then the hardware wasn't very efficient before.

That's not to say either of these two are bad features... quite the contrary. But they don't change the fundamental CPU/GPU argument at all.

Also first party console developers are only now getting their hands dirty with compute shaders.
Right, but that's really my point... some of us have already been using them for >4 years. I won't claim we know all there is to know, but from a language design perspective there are clearly issues that no one who has been using them from day one really disagrees with. For instance, I think there was a panel at SIGGRAPH last week that pretty much uniformly agreed that shared memory was a bad idea.

Hindsight is 20/20 of course and I absolutely don't blame CUDA for making a few things that ultimately turned out to be mistakes... really I blame the people who blindly copied all of it when some of these things were already becoming clear a few years later (*cough* OpenCL :p).

Anyways, it'll be interesting to see what folks come up with, but I would be surprised if these problems don't come up in GDC presentations in the next few years; I'll just smile and nod and be happy that we can finally get some pressure and attention on improving these languages and hardware :)

Kepler can spawn kernels from a kernel (function calls basically). This can be used to avoid the static register allocation problem in some cases. It is a good start, but we need more.
Indeed and it's a good step in the right direction. As you note though, the way it implements it is not really ideal... the high performance path really is syntactic sugar for dispatch indirect. I give them lots of kudos for actually implementing proper nesting in CUDA though (i.e. including *sync*, unlike OpenCL :S), but by my understanding it can be a rather slow path if you have to spill/fill shared memory and such.

I just tested my GPU driven DX11 "asteroid" proto on Haswell HD 4600. It runs slightly faster than your DirectX 12 proto, and renders 5 times as many animating "asteroids" (random objects). My proto also uses less than 0.5 milliseconds of CPU time (from a single CPU thread), so it should give the GPU even a bigger slice of the thermal budget.
Cool! I'm assuming this is a desktop Haswell though, so "thermals" aren't really a big issue and obviously both the CPU and GPU are running at much higher frequencies than the Surface Pro 3.

Just like in your proto, in my proto every "asteroid" can also have its own unique mesh, material and textures (I do not use GPU instancing).
So what exactly is the inner loop? The point in the demo was not to show the best way to render an asteroid field (which obviously would involve instancing-like-stuff), but rather to show that the most brute force/naive way of doing state and binding changes is actually quite reasonable on DX12 - even on a tablet! The objects are all similar due to art time/budget constraints, not technical issues :) They could of course be completely unique meshes with different vertex/texture/constant formats and data, etc.

Obviously using texture arrays, GPU indexing, etc. will be "faster" in terms of CPU overhead but that's not really the point here. Many game developers have said definitively that they want even complicated command buffers to be cheap(er) to create and that the AZDO-style rendering is not an acceptable replacement for that.

In light of this result, I still believe that a GPU driven pipeline is more energy efficient than a traditional CPU driven renderer.
Heh, have you tried doing ~50k+ indirect draws per frame? (i.e. get the GPU to actually gather all the draw parameters.) You may re-evaluate somewhat after that ;)

Energy efficiency is fairly subtle. You definitely need to measure stuff as apples-to-apples as possible on the same hardware to make any determination.
 
Not convinced. It puts off the issue slightly on some architectures, but it's hardly a holy grail as it does nothing to address the real thread of dependent work through the rendering pipeline. You can't avoid the fact that doing a reduction (for example) is going to have a certain work complexity even with "infinite" parallelism; that's ultimately what Amdahl's law is all about, and hand waving about "oh but we'll just add more data" isn't compelling IMO, even if you now have all this "free" time to do it in. I agree that this is not going to be a major concern on current architectures and renderers but when you're talking architecture in the long run, I haven't met anyone who seriously thinks that you don't need at least one "core" that can rip through this critical dependency chain as fast as possible.
I do not believe in single fast cores. Our CPU work scheduler for example uses the first thread that has completed its parallel execution to schedule more work for all the cores. This way other cores have work already in their queues when they have finished the current task. A single orchestrator core causes other cores to idle during the scheduling (an inherently sequential operation). A GPU can also work like this. The end tail of a reduction only occupies one shader core. Other cores are already working on something else.

If you need more independent tasks, you can for example split the scene rendering into a coarse (for example 3x3) screen space grid. This way you can perform the lighting compute shader (and z-pyramid reduction) simultaneously for the already finished g-buffer tiles while the rasterizer continues rendering the next tiles. The z-pyramid reduction tails become basically free this way. Obviously tiling requires a big enough resolution to present enough parallelism to occupy the whole GPU. 1080p is enough for current GPUs, but future ones will definitely need more.


Multidraw is mostly irrelevant in this discussion if it just expands to a loop in the GPU frontend; draw indirect has been there since day one.
It saves lots of CPU cycles. Try pushing 500k indirect draws per frame (at 60 fps) without multidraw and you will need multiple CPU cores to crunch through that work. Single multidraw has practically zero CPU cost and does the same. Lots of power saved.
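To make the comparison concrete, here is a minimal OpenGL-style sketch (GL 4.3+); buffer contents, VAO and shader setup are assumed to be done elsewhere:

```cpp
#include <GL/glew.h>   // or any GL 4.3+ function loader

// Layout GL expects for indexed indirect draws in the GL_DRAW_INDIRECT_BUFFER.
struct DrawElementsIndirectCommand {
    GLuint count, instanceCount, firstIndex, baseVertex, baseInstance;
};

// 'cmdBuf' is assumed to hold 'drawCount' tightly packed commands.
void submitDraws(GLuint cmdBuf, GLsizei drawCount, bool useMultiDraw)
{
    glBindBuffer(GL_DRAW_INDIRECT_BUFFER, cmdBuf);
    if (!useMultiDraw) {
        // CPU-driven loop: one driver call per object.
        for (GLsizei i = 0; i < drawCount; ++i)
            glDrawElementsIndirect(GL_TRIANGLES, GL_UNSIGNED_INT,
                (const void*)(i * sizeof(DrawElementsIndirectCommand)));
    } else {
        // One call: the per-draw loop now runs in the GPU frontend instead of the CPU.
        glMultiDrawElementsIndirect(GL_TRIANGLES, GL_UNSIGNED_INT,
            nullptr /* offset 0 into the bound indirect buffer */,
            drawCount, 0 /* tightly packed */);
    }
}
```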
Async compute is also pretty irrelevant to trends... it's a performance optimization for some hardware but does nothing to improve power efficiency (assuming decent power gating on execution units going forward) and honestly beyond the initial "we can fill in a bit of empty time during shadow rendering" if you're getting much of a gain then the hardware wasn't very efficient before.
You are underestimating async compute. Before async compute we GPU programmers had to try to minimize all the numerous graphics pipeline bottlenecks to make every shader pass utilize the whole GPU as close to 100% as possible (lots of extra work to make stuff like lookup tables for ALU-heavy / TEX-light shaders, etc). Now you can use the leftovers to run compute. It makes our work easier and improves the GPU utilization a lot (much bigger gains than you might think). And as I said earlier it makes GPU synchronization points much less painful. The GPU never needs to idle at the end of reductions or during indirect parameter setups.
For instance, I think there was a panel at SIGGRAPH last week that pretty much uniformly agreed that shared memory was a bad idea.
Agreed (in the long run). L1 caches of modern GPUs are already similar in size and in speed (only slightly slower, if at all). Why have another temp memory pool of the same size as L1 that requires manual management? Also this pool is split manually between thread groups, just like registers are split between waves/warps (= yet another way to reduce your occupancy). L1 is shared (fully automatic) and never affects occupancy (but can of course thrash).
So what exactly is the inner loop? The point in the demo was not to show the best way to render an asteroid field (which obviously would involve instancing-like-stuff), but rather to show that the most brute force/naive way of doing state and binding changes is actually quite reasonable on DX12 - even on a tablet! The objects are all similar due to art time/budget constraints, not technical issues :) They could of course be completely unique meshes with different vertex/texture/constant formats and data, etc.
Obviously. I also just loaded the same rock mesh (and other dummy meshes) to the GPU dozens of times (replicated data) to ensure that I get correct cache behaviour (reusing the same data in memory would have resulted in better performance because all vertices would come from the caches).

Unfortunately I cannot reveal what is inside our inner loop and what kind of techniques we use in a public forum. Let's continue this discussion elsewhere.
Obviously using texture arrays, GPU indexing, etc. will be "faster" in terms of CPU overhead but that's not really the point here. Many game developers have said definitively that they want even complicated command buffers to be cheap(er) to create and that the AZDO-style rendering is not an acceptable replacement for that.
Devs of course want different things. We have for example been using virtual texturing since the DX9 era, so I do not even know how much texture changes actually cost on modern hardware (it's been a long time since I last changed textures between draw calls). Having all the constants and mesh data in big GPU indexable raw buffers only makes things more uniform and easier. But as I said everyone has their unique needs. GPUs and APIs need to evolve to make both approaches more efficient in the future.
Heh, have you tried doing ~50k+ indirect draws per frame? (i.e. get the GPU to actually gather all the draw parameters.) You may re-evaluate somewhat after that ;).
Modern NVIDIA and AMD GPUs can push 500k+ indirect draw calls per frame at 60 fps (using multidraw). By emulating multidraw (by custom shader code and single indirect draw call) you can push over one million separate objects (per frame) with unique meshes on a lower middle class Radeon 7790 (at 60 fps). However the custom software emulation makes the vertex shader more expensive. But at least there is lots of parallel work (no occupancy problems with small meshes).
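A generic illustration of the idea (not our actual implementation): one big indirect draw covers the summed index counts of all objects, and the vertex shader maps the flat index back to an object via a prefix-sum table. The lookup below is written in C++ purely to show the logic that makes the vertex shader more expensive; in practice it is a binary search over a buffer inside the shader:

```cpp
#include <cstdint>
#include <vector>

// Exclusive prefix sums of per-object index counts, e.g. {0, 1200, 5400, ...}.
// Returns the object that owns a given flat vertex/index id.
uint32_t objectFromFlatIndex(const std::vector<uint32_t>& firstIndexOfObject,
                             uint32_t flatIndex)
{
    uint32_t lo = 0, hi = (uint32_t)firstIndexOfObject.size() - 1;
    while (lo < hi) {                        // binary search: O(log objectCount) per vertex
        uint32_t mid = (lo + hi + 1) / 2;
        if (firstIndexOfObject[mid] <= flatIndex) lo = mid; else hi = mid - 1;
    }
    return lo;  // use this id to fetch the object's transform, material, etc.
}
```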
 
Our CPU work scheduler for example uses the first thread that has completed its parallel execution to schedule more work for all the cores.
I'm well aware of distributed scheduling, work stealing, etc. and have advocated it myself for many years. But there are always serial bottlenecks in there, even if it's just the atomic that you need to go pick a victim. You won't see these issues on the relatively narrow machines we're talking about, but when you start talking hundreds or thousands of workers/cores/etc. it's a different ball game.

The end tail of a reduction only occupies one shader core. Other cores are already working on something else.
On an infinitely wide parallel machine, the "tail" is precisely the bottleneck. There's no "something else" as anything that could be done in parallel would have already completed. The step complexity is irrelevant, the work complexity determines performance. The architectural lessons from HPC are going to be relevant everywhere if technology continues on its current trajectory. We'll end up in a world where basically IO and coordination (via atomics or something like transactional memory) are what you design for, as everything else is relatively free.

It saves lots of CPU cycles. Try pushing 500k indirect draws per frame (at 60 fps) without multidraw and you will need multiple CPU cores to crunch through that work. Single multidraw has practically zero CPU cost and does the same. Lots of power saved.
Any modern CPU can put a few DWORDs into a command buffer faster than the GPU frontend can read those *and* go look up the indirect parameters. In the long run shifting the bottleneck to a slower processor isn't really a win. Obviously on discrete there are different memory pools and larger latencies to consider but on integrated the GPU frontend is at best a slow CPU. Obviously there are ways to improve things if we do really find ourselves needing to do 500k (tiny...) "independent draw calls", but the request is definitely moving towards questionable territory there IMO.

Before async compute we GPU programmers had to try to minimize all the numerous graphics pipeline bottlenecks to make every shader pass utilize the whole GPU as close to 100% as possible (lots of extra work to make stuff like lookup tables for ALU-heavy / TEX-light shaders, etc).
Honestly, I think that's mostly legacy thinking about GPUs that are not power-limited and/or have poor power management. I get why it's a win on GCN, but don't pretend that other GPUs sit there burning power on idle execution unit cycles ;) We're getting to the point where all GPUs are power limited, even high end discrete, so "filling in all the gaps" and "using all the silicon at 100%" is not the goal anymore. This is similar to the conventional notion that you have to try and balance the workload so that the CPU and GPU are both running at 100%... the demo at SIGGRAPH was really to show that on modern devices that is not the way it works anymore.

Now don't get me wrong, I'm absolutely all for more flexibility in execution units running independent stuff (and many GPUs can already do that to a large extent), but I don't think the specific example of async compute really has any bearing on this architectural discussion. Like I said, it's an optimization for some hardware.

Unfortunately I cannot reveal what is inside our inner loop and what kind of techniques we use in a public forum. Let's continue this discussion elsewhere.
Ok fair enough :)

Devs of course want different things. ... GPUs and APIs need to evolve to make both approaches more efficient in the future.
Absolutely and I basically play devil's advocate for each party ;) I don't think the solutions at either extreme solve all of the problems to be honest.

By emulating multidraw (by custom shader code and single indirect draw call) you can push over one million separate objects (per frame) with unique meshes on a lower middle class Radeon 7790 (at 60 fps).
Yeah but what you call "emulating multidraw" is also called "fancy instancing" :) If you entirely bypass the API/binding and use all virtual texturing, AZDO-style stuff, etc. then of course you can bypass the GPU frontend too (in contrast to looping in the frontend as was the context I mentioned). But again, that's not what the DX12 demo was about, and that is not directly comparable with separate draw calls.

Ultimately we can throw around increasingly ill-defined numbers but the real question is the best way to implement a given workload. If you need 500k vertex ranges per frame that's completely doable even in a brute force fashion (perhaps surprising, even in DX11). If you need 500k different bits of texture or constants that's pretty easy too with arrays. If you need 500k unique textures with different formats and shapes (i.e. unique descriptors), or 500k different shaders you might run into some trouble (but, but, ubershaders! ;)).

This is all fairly academic without a real use case. I think it was a legitimate ask from game developers to lower the overhead vs. what was in conventional APIs, but it's absolutely not clear that we need to end up in a space where we change shaders (not just parameters) every ~10 triangles or anything.
 
I get why it's a win on GCN, but don't pretend that other GPUs sit there burning power on idle execution unit cycles ;)
Unless you can start and stop power gating just as efficiently as clock gating (1 cycle) you're burning power on idle execution cycles.

Since GPUs have fixed function logic with fixed throughput rates, not all algorithms will saturate the compute cores. By doing other necessary work during this time you can hurry up and then wait (power gate). Unless the async compute work thrashes the cache too much and slows down the graphics work such that the combination is slower than each running independently, it's unlikely that gating logic will be more efficient than running everything in parallel.

it's absolutely not clear that we need to end up in a space where we change shaders (not just parameters) every ~10 triangles or anything.
I assume you're exaggerating, but in my opinion it is absolutely clear that we don't want to change shaders every ~10 triangles unless the triangles cover a lot of pixels. If someone has a good reason for it I'd like to hear it.
 
Unless you can start and stop power gating just as efficiently as clock gating (1 cycle) you're burning power on idle execution cycles.
Sure, but we're talking about a case where not a single hardware thread on a given execution unit has any work to do for many cycles. Stalled threads can't be reused for async compute, so it really is only entirely idle ones, and by necessity that is going to be relatively large blocks of time so I don't think the overhead is a big deal to be honest. Also to be honest I think the trend is going to be increasingly towards finer and finer-grained power gating in the future.

... it's unlikely that gating logic will be more efficient than running everything in parallel.
I'm not arguing that it's more efficient, just that the benefit varies a lot depending on the architecture. It may be a big benefit on GCN, but conversely it might be a lot smaller or even negligible on other architectures. It all depends on the specifics of how things like scheduling, caches, shared memory and power are handled.

I assume you're exaggerating, but in my opinion it is absolutely clear that we don't want to change shaders every ~10 triangles unless the triangles cover a lot of pixels.
Absolutely, that's my point (that's why I said *not* clear :)). When people say things like we need to be doing "500k draw calls" per frame (and to be clear, I don't think sebbbi was saying this, but others have) you're talking about one draw call for every 4 pixels @ 1080p... Now of course "draw call" isn't a particularly well-defined notion these days with things like instancing, AZDO and multidraw but I'm using it to mean the thing that is processed independently by the GPU frontend command processor. In DX there are some further semantics as well (in terms of memory ordering) but let's leave that aside for now.
 
Sure, but we're talking about a case where not a single hardware thread on a given execution unit has any work to do for many cycles.
This is obviously the case where you can power gate an entire execution unit, but async compute can also be useful when graphics threads are idle for a small number of cycles as they can fill in the holes.

It is very content dependent: if the graphics shaders have maxed out bandwidth, having compute jump in with an additional bandwidth requirement could be detrimental to graphics. So I don't argue async compute is a win in all cases.

Maybe I'm too familiar with GCN to see where it doesn't help, but it feels like other architectures should see some benefit from async compute.
 
On an infinitely wide parallel machine, the "tail" is precisely the bottleneck. There's no "something else" as anything that could be done in parallel would have already completed. The step complexity is irrelevant, the work complexity determines performance. The architectural lessons from HPC are going to be relevant everywhere if technology continues on its current trajectory. We'll end up in a world where basically IO and coordination (via atomics or something like transactional memory) are what you design for, as everything else is relatively free.
We are not yet there, and will not be there during our lifetime (at least on consumer devices). I have read lots of academic papers about algorithms that slightly decrease the big O notation of some algorithms, but are useless on modern day computers since the constant factor is so big that you can't reach a speedup with any N that fits the memory of modern computers. In reality (consumer devices) cache optimized linear algorithms often beat asymptotically optimal algorithms by a large margin. It's fine to discuss the theory, but if we want to assume that we have infinitely many execution resources, we don't need to discuss warps or waves or AVX-512 anymore, because none of that matters. All that matters is the clock rate, since the infinitely many parallel resources already exploit all ILP/TLP/DLP. Haswell is getting all its performance boost by going wider and extracting more ILP from the instruction stream. When we have exhausted all the parallelism from the code, the only way to get faster is to improve the clocks, and we all know how well Intel's plan to get Pentium 4 to 10 GHz worked out. It would be nice to live long enough to see us hitting the parallel scaling limit as well.
Honestly, I think that's mostly legacy thinking about GPUs that are not power-limited and/or have poor power management. I get why it's a win on GCN, but don't pretend that other GPUs sit there burning power on idle execution unit cycles ;) We're getting to the point where all GPUs are power limited, even high end discrete, so "filling in all the gaps" and "using all the silicon at 100%" is not the goal anymore.
I am not talking about wasting execution cycles. All that work needs to be done. Either the GPU can be fully utilized for a shorter period of time, or be active for a longer period of time executing code filled with stalls and holes. The GPU will of course partially power down during some of these holes, but not all of the units idle. I need some further proof to accept that the execution filled with holes and stalls is more power efficient than the other way (when both do exactly the same tasks).
Sure, but we're talking about a case where not a single hardware thread on a given execution unit has any work to do for many cycles. Stalled threads can't be reused for async compute, so it really is only entirely idle ones, and by necessity that is going to be relatively large blocks of time so I don't think the overhead is a big deal to be honest. Also to be honest I think the trend is going to be increasingly towards finer and finer-grained power gating in the future.
In reality the rasterization workloads have very erratic shader unit utilization. Vertex shaders take small random slices here and there and the pixel work fluctuates rapidly based on hi-z culling, triangle size, backface culling, etc. There's no time to shut down the units between these small holes (and still wake up fast enough not to slow things down), so better use the idle execution cycles for something else.

Race to sleep has proven to be the most efficient way of computation. Using the whole GPU 100% for a shorter period of time (with asynchronous compute to fill the GPU fully) and then going to deep sleep for the remainder of the frame should win the efficiency race against a longer execution filled with holes (no asynchronous compute). The longer execution needs to keep the memory controllers, buses, caches and other shared hardware alive for a longer time. Shutting down and resuming units also has a nonzero power cost (and a nonzero latency impact).
Absolutely, that's my point (that's why I said *not* clear :)). When people say things like we need to be doing "500k draw calls" per frame (and to be clear, I don't think sebbbi was saying this, but others have) you're talking about one draw call for every 4 pixels @ 1080p...
We have just been discussing going infinitely wide with parallel processing. 500k doesn't sound like a big number really compared to that.

Let me explain why we will need to be able to render 500k independently transformed geometry chunks with unique mesh data ("draw calls") per frame. Our games rely on user created (dynamic) content that doesn't exist offline. We have a Turing-complete scripting language exposed to our level creators (players). All our lighting and shadows are calculated at runtime. Anything can happen: scripts can be used to animate the sun and other light sources every frame, objects can move any way the player wants (we even have user created levels where every single object moves every frame, the whole scenes "materialize"). This kind of game needs to render lots of shadow maps per frame. Let's assume that the sun light cascaded shadow map is rendered for the whole view range, and because of the mip/cascade boundaries every object is rendered roughly twice to the sun light cascades. Compared to rendering only the visible objects to the back buffer, this already triples the needed draw call count. Then you want to add local light sources. Let's assume that we want to guarantee our artists that on average 4 local lights can hit every single surface in the game world. This means that every object needs to be rendered on average to 4 local light source shadow maps. Now we have reached 7x draw call amplification, and we are only talking about direct lighting. That 500k total draw call count thus only provides us with 71.4k visible objects.

Let's talk about draw distance. Most games fake backgrounds with big hand drawn/modeled low polygon back drops. This is completely fine, as the character movement is locked to a small area, and the developers know that the camera can never get close to the background. However if you are instead building your world from real objects (open world game and/or user created content with free roaming), the background will have the majority of the draw calls. If you assume that the world is roughly planar (terrain based game), the object count rises by n^2 when the draw distance (n) increases. Double draw distance means 4x visible object count. Last gen games could get away with a 200 meter average object draw distance (lots of popping), but nowadays you'd ideally want 2000 meters or more (1080p). And that means 100x the visible object count. 100x more draw calls, assuming that the background is made from real objects (you can move there) and you want zero popping. Is 71.4k visible objects really too much to ask in this situation (assuming that fully dynamic lighting/shadowing brings 7x "draw call" amplification)?
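To spell out the arithmetic (a quick sketch in C++, all factors being the assumptions stated above):

```cpp
#include <cstdio>

int main() {
    // Draw call amplification from fully dynamic lighting (assumptions from the post above).
    const int backBuffer  = 1;   // every visible object rendered once to the back buffer
    const int sunCascades = 2;   // ~2 cascade overlaps per object
    const int localLights = 4;   // ~4 local shadow maps touch each surface on average
    const int amplification = backBuffer + sunCascades + localLights;    // = 7

    const int drawBudget = 500000;
    std::printf("visible objects: %d\n", drawBudget / amplification);    // ~71428 (71.4k)

    // Roughly planar world: object count scales with the square of draw distance.
    const int scale = (2000 / 200) * (2000 / 200);                       // 10x distance -> 100x objects
    std::printf("object count scale for 10x draw distance: %dx\n", scale);
    return 0;
}
```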

If you want GI, you might want even more. Imagine rendering light probes in real time... That's going to need quite big "draw call" counts if you choose to use the rasterizer for this purpose.
Yeah but what you call "emulating multidraw" is also called "fancy instancing" :) If you entirely bypass the API/binding and use all virtual texturing, AZDO-style stuff, etc. then of course you can bypass the GPU frontend too (in contrast to looping in the frontend as was the context I mentioned). But again, that's not what the DX12 demo was about, and that is not directly comparable with separate draw calls.
You could basically call everything that doesn't change shaders between draw calls "fancy instancing". You don't necessarily need "separate draw calls" anymore, unless you want to use traditional forward rendering of course (and thus have no way to solve the problem of different shader permutations).
If you need 500k unique textures with different formats and shapes (i.e. unique descriptors), or 500k different shaders you might run into some trouble (but, but, ubershaders! ;)).
I don't know much about using a huge pile of unique descriptors since we have been using virtual texturing, and that has been working well enough (we have a huge 256k*256k address space even on Xbox 360). Bindless textures sound fun, but I fail to see how that will be efficient if you have warp/wave divergence (different pixels in the same warp/wave need to access different resources). Virtual texturing has no problems with divergence. Also if you need to load a new resource descriptor for a small amount of (compressed) pixels, you will burn a lot of unnecessary bandwidth (and in the worst case cause another cache miss).

Ubershaders are bad for GPUs (long branchy code = low occupancy). Compute shaders make it entirely possible for you to split your work items efficiently to be processed by multiple simpler shaders. You don't need or want to run the same big shader for all your elements anymore.
it's absolutely not clear that we need to end up in a space where we change shaders (not just parameters) every ~10 triangles or anything.
Agreed. I don't think anyone needs to change the shaders at that frequency. I see the opposite trend actually. Modern games separate the lighting code from the rasterization pass and that removes lots of shader permutations. Physically based shading has lowered the shader counts even more (no need to hack reflections, etc in separate ways). The number of required pixel shader permutations has dropped a lot.
Maybe I'm too familiar with GCN to see where it doesn't help, but it feels like other architectures should see some benefit from async compute.
CUDA has supported this for ages. NVIDIA cards also have significant gains from executing multiple kernels simultaneously.
 
When we have exhausted all the parallelism from the code, the only way to get faster is to improve the clocks, and we all know how well Intel's plan to get Pentium 4 to 10 GHz worked out. It would be nice to live long enough to see us hitting the parallel scaling limit as well.
Obviously I'm not talking about practically getting "infinite" parallel performance, as if such a thing were even possible. My point was that these bottlenecks exist *now* in code that runs on current machines with no viable path to making it better. Sure we're likely not going to get to 10GHz, but if you give up your 3GHz core many games will get slower. The magnitude depends on the game, but Amdahl's law is about the dependent thread of execution through any game... for some it's longer, for some it's shorter. I'm glad that you are confident that you guys have made it very short or whatever, but I'll wait and see :) Most games still take a decent amount of time on the CPU even with trivial draw calls.

I need some further proof to accept that the execution filled with holes and stalls is more power efficient than the other way (when both do exactly the same tasks).
It's a question of how the architecture works. To give an example, how does a given architecture handle shared memory with async compute? What is that memory doing when you're doing graphics tasks? If it's unused or only partially used, that's less efficient when executing *all* graphics tasks (as I mentioned, memory operations tend to dominate power requirements). Now an architecture might always run that way and thus it's "free" on that architecture to put compute work in there, but that's not an argument for overall power efficiency going forward.

Do you instead wait until an entire execution unit can switch to compute so you can reconfigure the memory/cache? But then you have tough questions about how you schedule 3D work: greedily onto fewer units or onto more units for wider and potentially faster execution? 3dcgi gave another example with cache thrashing.

The point is it actually depends on the details of both the architecture and the workload. I agree in general that if it is "free" to fit work in to spare cycles that should generally be a win, but the cost vs. benefit trade-off in hardware is far from proven generally, *especially* for 3D/compute overlap (it's more clear for different PS/state for instance, and everyone pipelines that already AFAIK).

There's no time to shut down the units between these small holes (and still wake up fast enough not to slow things down), so better use the idle execution cycles for something else.
Yeah but that's why pipelines and buffers are very deep... I'm not sure how it is on GCN but outside of the obvious places like depth-only rendering, we don't see a huge amount of idle time even on Haswell, which is very execution-unit heavy compared to other architectures.

Let me explain why we will need to be able to render 500k independently transformed geometry chunks with unique mesh data ("draw calls") per frame.... Now we have reached 7x draw call amplification, and we are only talking about direct lighting.
I think we need to start asking questions going forward about more efficient ways to do this in general. If a single object is going into 6+ shadow maps it's safe to say that's not the most power-efficient solution in the long run. You can definitely build an acceleration structure and query it faster than that.

In the short term I understand your request though. I think as you mentioned this isn't what I'm talking about as a "draw call". It sounds like you've already bypassed the command processor and that's really the right thing to do. For such a high amplification factor it might even be worth laying out all of the vertices in a big long VB and rendering them all as a single draw. I'm guessing you're doing something similar with an indirection in the VS.

Double draw distance means 4x visible object count.
The reality is that LOD has to start operating at a higher level than just "per object". I know that's (a lot) more difficult but it needs to happen both for aliasing reasons and for overhead. It may make sense to switch to a more volumetric approach once object sizes are fractions of pixels.

If you want GI, you might want even more. Imagine rendering light probes in real time... That's going to need quite big "draw call" counts if you choose to use the rasterizer for this purpose.
Yep but like with shadows, rendering objects a bazillion times into light probes is clearly not the most power efficient path in the long run. Think about the crossover point where rasterizing the same draw/triangle/whatever Nx becomes less efficient than putting it in an acceleration structure and querying that instead.

You could basically call everything that doesn't change shaders between draw calls "fancy instancing".
Absolutely and that was my point. The comment thread here started with where I said that multidraw via looping in the command processor is not a win. I think you're agreeing with me, you are just using a different definition of "draw" in this context than I am :)

Anyways good discussion, and thanks for laying out some of your reasoning on objects/amplification in your engine!
 
I'm curious how great an advantage we can expect the tight integration of CPU and GPU in the new consoles to have over discrete CPU/GPU setups in the future.

For example, say you have a decent quad core AMD/Intel CPU coupled with a higher end GPU from either the Kepler or GCN (1.0) ranges, say a GTX 670 or 7950. Are there many realistic scenarios that could potentially be used in future games where the weaker consoles could achieve macro level results that are simply out of reach for the discrete setup? Where in effect, regardless of how cleverly the developer tries to work around the limitations of a discrete setup, there just isn't enough brute force there to achieve the same overall level of visuals that the weaker APUs are outputting?

And how would that differ, if at all, between DX11 and Mantle/DX12?
 
Love the annotated die photo of an Intel integrated CPU/GPU:
https://software.intel.com/sites/de...tel_Processor_Graphics_Gen7dot5_Aug4_2014.pdf

The CPU is sliding into irrelevance. GPUs won. At least at Intel...
First of all die area does not determine the winner. Secondly, there are no winners or losers in unification. Did pixel processing win against vertex processing when the GPU cores unified? No, they both gained new capabilities and we got additional pipeline stages and compute. Likewise the unification of the CPU and GPU will retain the qualities of both while enabling a whole new set of applications to run efficiently.
 
First of all die area does not determine the winner. Secondly, there are no winners or losers in unification. Did pixel processing win against vertex processing when the GPU cores unified? No, they both gained new capabilities and we got additional pipeline stages and compute. Likewise the unification of the CPU and GPU will retain the qualities of both while enabling a whole new set of applications to run efficiently.
Your analogy would work better if CPU and GPU were actually unifying, as in the merging of units of your GPU example. But that's currently just not the case. Only the communication between them is improving because they are on the same die. That has benefits, but it's different than one starting to do more of the other.
With pixels/s needs going up quadratically and more, CPUs are not catching up with GPUs, they're getting further behind. Maybe this will stop when we've stabilized at a point of 8K panels on each desk, but we're not there at all.
 
Your analogy would work better if CPU and GPU were actually unifying, as in the merging of units of your GPU example. But that's currently just not the case.
I'm not talking about currently. I'm talking about 10 years from now. The analogy with pixel and vertex processing unification is just to illustrate that it doesn't create winners or losers. It offers the best of both worlds, and additional advantages on top. We still need more convergence between the CPU and GPU to happen before that's feasible, but with AVX-512 on the CPU and scalar cores on the GPUs there's clearly a strong momentum toward unification.
With pixels/s needs going up quadratically and more, CPUs are not catching up with GPUs, they're getting further behind.
Why would an increase in pixels put the CPU further behind? I can't play games at full resolution on my MacBook Pro with Retina display even though it has a discrete graphics chip, so clearly it's just as much a struggle for GPUs to keep up. Wider vectors have allowed the CPU to catch up to GPUs fast, both in terms of density and power consumption. A quad-core with AVX-512 will be able to deliver a teraflop of computing power! That really strips the GPU of its title of being the only massively parallel architecture. And I expect that within 10 years we'll have AVX-1024, possibly in the form of two 512-bit clusters which alternatingly issue 1024-bit instructions over two cycles (very similar to how GCN has four clusters and splits 2048-bit instructions over four cycles), to help hide latency. Meanwhile GPUs can't beat Moore's law and they have to invest more die space in caches and versatile scalar units.
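The teraflop figure works out if you assume two 512-bit FMA units per core and a clock around 4 GHz (both are my assumptions for the sake of the estimate); a quick back-of-the-envelope check:

```cpp
#include <cstdio>

int main() {
    // Assumed configuration: illustrative numbers, not a specific product.
    const double cores       = 4;
    const double fmaUnits    = 2;    // 512-bit FMA units per core (assumed)
    const double lanes       = 16;   // 512 bits / 32-bit float
    const double flopsPerFma = 2;    // multiply + add
    const double clockGHz    = 4.0;

    const double gflops = cores * fmaUnits * lanes * flopsPerFma * clockGHz;
    std::printf("peak single precision: %.0f GFLOPS\n", gflops);  // 1024 GFLOPS ~ 1 TFLOP
    return 0;
}
```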
Maybe this will stop when we've stabilized at a point of 8K panels on each desk, but we're not there at all.
That won't happen, because 8K is overkill that nobody wants to pay for. Even 4K is ludicrous for the majority of consumers. 1080p is the most popular right now for TVs and desktops, and there's little momentum towards higher resolutions. Mobile has higher pixel density in comparison, but that's because we want a useful amount of text on the screen. For games and such the rendering resolution is typically lower and then upsampled, or it's very simple in comparison to the desktop titles. There's little demand to go much bigger.

I'm not denying that there's an evolution toward higher resolutions, but it's much slower than you think. In 2004 the average resolution was 1024x768, and ten years later we've only got 2.6x more pixels on average. Meanwhile the computing power has increased about 500-fold. So you can't rely on Gustafson's law to keep scaling without architectural changes. To beat Amdahl's law you have to lower latencies. They've been massively reduced in the past, and will continue to be lowered in the future by using fast and big caches, prefetching, out-of-order execution, etc. Then the CPU and GPU will unify.
 
So disheartening hearing the "power efficiency" argument over and over again..

I want to be a good global citizen, and leave a good carbon footprint BUT I also want my powerful engines and amazing looking games.. :(

I hope we don't have to compromise too much for the sake of power efficiency :(
 
So disheartening hearing the "power efficiency" argument over and over again..

I want to be a good global citizen, and leave a good carbon footprint BUT I also want my powerful engines and amazing looking games.. :(

I hope we don't have to compromise too much for the sake of power efficiency :(

It's not about carbon footprints, it's about performance. If you're limited by a 150W power budget (for practical reasons like cooling system cost, weight, noise, etc.) then the fastest card/APU/whatever you can build is the one with the most power-efficiency at 150W.
 
It's a question of how the architecture works. To give an example, how does a given architecture handle shared memory with async compute? What is that memory doing when you're doing graphics tasks? If it's unused or only partially used, that's less efficient when executing *all* graphics tasks (as I mentioned, memory operations tend to dominate power requirements). Now an architecture might always run that way and thus it's "free" on that architecture to put compute work in there, but that's not an argument for overall power efficiency going forward.

Do you instead wait until an entire execution unit can switch to compute so you can reconfigure the memory/cache? But then you have tough questions about how you schedule 3D work: greedily onto fewer units or onto more units for wider and potentially faster execution? 3dcgi gave another example with cache thrashing.
We have been discussing Amdahl's law a lot in this thread. We have been discussing reductions (such as mip pyramid generation or depth min/max pyramid generation) and we have been discussing synchronization points, such as an indirect draw call waiting for the previous step to end and write the problem size to a buffer, or rendering to a back buffer and using that back buffer as a texture in the next draw call. All of these are real problems, and waste real GPU cycles. Asynchronous compute can solve this problem right now on the currently available hardware.
The point is it actually depends on the details of both the architecture and the workload. I agree in general that if it is "free" to fit work in to spare cycles that should generally be a win, but the cost vs. benefit trade-off in hardware is far from proven generally, *especially* for 3D/compute overlap (it's more clear for different PS/state for instance, and everyone pipelines that already AFAIK).
Asynchronous compute is a bit like hyperthreading. It doesn't help with code that already maximizes the execution units all the time. But when you have stalls (such as those described above) that need serialization (Amdahl's law kicks in), or you have a rendering task that is fill rate, triangle setup, or front-end bound, you will see big gains. Again I am not talking about some future hardware, I am talking about measured big gains on AMD and NVIDIA hardware running actual game engine code. And we are not talking about small ten percent gains here.
I think we need to start asking questions going forward about more efficient ways to do this in general. If a single object is going into 6+ shadow maps it's safe to say that's not the most power-efficient solution in the long run. You can definitely build an acceleration structure and query it faster than that.
Could you elaborate? I haven't seen any production quality solutions that offer better shadow rendering performance for direct lights compared to shadow mapping on current hardware. SVO techniques are surely interesting in the future, but currently the memory footprint and performance of these techniques is not yet acceptable (especially for a 60 fps game).
The reality is that LOD has to start operating at a higher level than just "per object". I know that's (a lot) more difficult but it needs to happen both for aliasing reasons and for overhead. It may make sense to switch to a more volumetric approach once object sizes are fractions of pixels.
SVOs are good at this (they seamlessly solve LOD issues for scenes of any complexity and solve geometry antialiasing as well). The current solution to this problem is to make the "draw call" (separate object) cost disappear: implement a renderer that doesn't care how finely the geometry is split into independent chunks. As a bonus, dynamic destruction becomes much easier.
The comment thread here started with where I said that multidraw via looping in the command processor is not a win. I think you're agreeing with me, you are just using a different definition of "draw" in this context than I am :)
I must say that I strongly disagree with this. I feel that multi-draw indirect is one of the most important API advances in ages. I even seriously considered switching our PC renderer to OpenGL (4.4) to have access to it. And that tells a lot :)

If you only think about traditional CPU-driven rendering engines, multi-draw doesn't seem to be that big an improvement. In OpenGL 4.4 it gives you roughly 8x speedup compared to a CPU side tight draw call loop (http://gdcvault.com/play/1020791/). On close to hardware APIs the gain is much smaller. Unfortunately we can't discuss that yet here (let's wait until Mantle and DX12 become public APIs and we can have public benchmarks). However even if the draw call overhead were zero (which will never happen on PC), without multi-draw you would still need to spend a considerable amount of extra CPU cycles to push a big amount of draw calls. Try pushing 500k draw calls using a tight CPU loop even in a low level API, and you will end up with a CPU cost that is far from zero. A single multi-draw indirect is 499999x cheaper for the CPU (regardless of the driver overhead amount). On a future 100% close to metal (non-cross platform) API on an on-chip GPU, this might be just a few milliseconds, but that's still a few milliseconds of wasted CPU time (wasted power budget that could be used elsewhere).

But that's not why multidraw is the most important future API feature. The reason why multidraw (OpenGL 4.4 version with arb_indirect_parameters, not the old OpenGL 4.3 version) is important is that it enables the GPU to decide how many draw calls it renders by itself. Multidraw takes the draw call count from a GPU buffer. This enables important things like fine grained GPU driven viewport and occlusion culling. The culling compute shader outputs visible draw calls to 4xUINT append buffer. This buffer is given as the argument buffer to a single multi-draw, and the append buffer structure count is given as the draw call count to the multidraw command. The cost of a single draw call thus becomes "append a 4xUINT to an append buffer". You can even reuse the argument buffers (separate static objects from dynamic, etc if you want to shave the last 0.01 ms off by adding complexity to your system).

By moving the decision making to GPU, the data movement is reduced (saves power) and the decisions can be made using the data generated by the GPU itself (solves the CPU->GPU latency problem). This allows much more precise data culling (early out), and it much reduces the wasted work the GPU needs to do. My experience has been that moving the graphics engine to GPU saves both the CPU cycles and the total GPU cycles needed to render a single frame.
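A minimal GL 4.4 host-side sketch of this setup (ARB_indirect_parameters; cullCS, drawProgram and the buffers are placeholders, not our engine code):

```cpp
#include <GL/glew.h>   // needs GL 4.4 + ARB_indirect_parameters

// 'cullCS' is a hypothetical culling compute shader that appends
// {count, instanceCount, first, baseInstance} records into 'drawArgsBuf'
// and bumps a counter in 'drawCountBuf'. All objects/buffers created elsewhere.
void gpuDrivenSubmit(GLuint cullCS, GLuint drawProgram,
                     GLuint drawArgsBuf, GLuint drawCountBuf,
                     GLuint objectCount, GLsizei maxDrawCount)
{
    // Culling pass: one thread per object decides visibility on the GPU.
    glUseProgram(cullCS);
    glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 0, drawArgsBuf);
    glBindBufferBase(GL_ATOMIC_COUNTER_BUFFER, 0, drawCountBuf);
    glDispatchCompute((objectCount + 63) / 64, 1, 1);

    // Make the GPU-written draw arguments visible to the command processor.
    glMemoryBarrier(GL_COMMAND_BARRIER_BIT | GL_ATOMIC_COUNTER_BARRIER_BIT);

    // Single submission; the surviving draw count is read from the GPU buffer.
    glUseProgram(drawProgram);
    glBindBuffer(GL_DRAW_INDIRECT_BUFFER, drawArgsBuf);
    glBindBuffer(GL_PARAMETER_BUFFER_ARB, drawCountBuf);
    glMultiDrawArraysIndirectCountARB(GL_TRIANGLES,
        nullptr /* offset 0 into drawArgsBuf */,
        0 /* offset into drawCountBuf */,
        maxDrawCount, 0 /* tightly packed */);
}
```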
 
You talk about 10 years in the future. And about 1TFLOPS being a big deal. Let's look 7 years in the past, and I see a GT200 breaking the 1TFLOPS barrier. On a worse process. Without full custom design. With dedicated resources for texture and ROP, something your Intel processor will have to do with the general FLOPS pool.

You talk about a discrete GPU not being able to drive a retina display as if that somehow strengthens the argument about unification. I see it as exactly the reason why Intel is increasing GPU area in their dies instead of the other way around.

And 1080p is supposed to remain the gold standard, but at the same time you're talking about a retina display on your laptop.

It would be very interesting to see what you wrote 5 years ago about this upcoming unification, because I'm pretty sure that 5 years from now, we'll still be talking about GPUs the way we do right now.

By carving out a very narrow part of the market, tailored to your argument, you can make everything work. Hell, business desktops have been able to get by without GPUs worth the name since forever. But unification? If it took more than 7 years for a CPU to barely catch up with a GPU, then the slowdown of Moore's Law is more likely to stall the march towards unification than to further it.
 
My point was that these bottlenecks exist *now* in code that runs on current machines with no viable path to making it better. Sure we're likely not going to get to 10GHz, but if you give up your 3GHz core many games will get slower. The magnitude depends on the game

...mind elaborating? I really do not get it (supposing games are optimized for the PS4/XB1/PS3 model, not like now where they often favour Intel's architecture).
 
Again I am not talking about some future hardware, I am talking about measured big gains on AMD and NVIDIA hardware running actual game engine code.
I'm not disputing that. But this thread is about the future, and I don't think your current experience with these architectures necessarily predicts that very well frankly for the reasons that I've outlined.

By moving the decision making to GPU, the data movement is reduced (saves power) and the decisions can be made using the data generated by the GPU itself (solves the CPU->GPU latency problem).
This is a perfect example of a current problem that is not something inherent to the necessary design, at least on integrated chips (ultimately I don't think discrete chips are going to remain the primary design point - the writing is kind of on the wall on that issue already).

Caches are increasingly shared further and further up the chain... on Haswell already there's not really a concept of data "on the GPU" at the granularity you're talking about. Future APIs, OS updates and hardware will further blur these lines and drive down "latency". Ultimately there's no compelling reason to assume that communicating with the various execution units on the GPU is any more expensive than communicating with another core on the system... the hardware is mostly there already (at least big core Intel parts) and the software is on the path.

That's the part of the future that I think is totally clear, and that's really why I'm questioning some of your assumptions. I'm not saying you're wrong (personally I have no strong opinion on these directions), just trying to understand the logic and where you're coming from :)
 