I think you'd be surprised at how badly serial convergence points actually hurt modern games. Reductions, mip generation, etc. are all measurably becoming larger parts of game frames due to not getting any faster for several generations. This isn't a huge problem yet, but it will continue to get worse each year.

Modern games yes, but not the future games that use better APIs than we currently have on PC.
Asynchronous compute helps a lot with that issue, and it also helps with other GPU utilization issues, such as fixed function pipelines being the bottleneck (leftover ALU or bandwidth that could be used for compute). As long as you always have more than one task for the GPU to do, convergence points do not matter that much (usually both pipelines are not at a convergence point at the same time). Asynchronous compute also helps with indirect draw calls / indirect dispatch, since you need to set up these stages with a simple compute shader (a single thread + a convergence point).
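To make the two points above concrete (reductions as chains of convergence points, and the single-thread compute shader that sets up an indirect dispatch), here is a minimal CUDA sketch; CUDA stands in for an HLSL compute pass, and all kernel and buffer names are illustrative rather than anything from the posts. Each reduction pass depends on the previous one and keeps getting narrower, and the tiny argument-writing kernel is the kind of single-thread setup stage described above: in D3D12 or Vulkan its output would feed DispatchIndirect/ExecuteIndirect, while plain CUDA has to read the count back on the host.

```cuda
#include <cstdio>
#include <utility>
#include <vector>
#include <cuda_runtime.h>

// One pass: each 256-thread block reduces 256 inputs to one partial sum.
__global__ void reducePass(const float* in, float* out, int count)
{
    __shared__ float tile[256];
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (gid < count) ? in[gid] : 0.0f;
    __syncthreads();
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        __syncthreads();                          // intra-pass sync point
    }
    if (threadIdx.x == 0)
        out[blockIdx.x] = tile[0];
}

// The "single thread + convergence point" setup stage: one thread turns a
// GPU-side count into the argument block for the next dispatch. In D3D12/Vulkan
// this buffer feeds DispatchIndirect/ExecuteIndirect with no CPU readback.
// (Not called in this standalone sketch; shown only to illustrate the pattern.)
__global__ void writeDispatchArgs(const int* count, uint3* args)
{
    args->x = (*count + 255) / 256;
    args->y = 1;
    args->z = 1;
}

int main()
{
    const int N = 1 << 20;
    std::vector<float> ones(N, 1.0f);
    float *a, *b;
    cudaMalloc(&a, N * sizeof(float));
    cudaMalloc(&b, ((N + 255) / 256) * sizeof(float));
    cudaMemcpy(a, ones.data(), N * sizeof(float), cudaMemcpyHostToDevice);

    int count = N;
    while (count > 1) {                           // each iteration is a serial convergence point
        int blocks = (count + 255) / 256;
        reducePass<<<blocks, 256>>>(a, b, count);
        std::swap(a, b);
        count = blocks;
    }
    float sum;
    cudaMemcpy(&sum, a, sizeof(float), cudaMemcpyDeviceToHost);
    printf("sum = %.0f (expected %d)\n", sum, N);
    return 0;
}
```

Asynchronous compute helps exactly because a second queue can keep the machine busy while each of those narrow passes drains.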
When DX11 first came out we went through a similar period of excitement about all the great new stuff you can do with compute... then after a while you start to realize how fundamentally limited the model (and GPUs to some extent) really is.

DX11 didn't have multidraw or asynchronous compute, or other ways to create GPU work using the GPU. A compute pipeline that cannot be used properly to feed the graphics front end is quite limited. Also, first-party console developers are only now getting their hands dirty with compute shaders. And cross-platform games could not be designed around a feature that is not available on all the platforms (DirectX 10 PCs and last-gen consoles). These facts limited compute shader adoption.
Thus we basically only saw some games add extra optional eye candy with compute shaders. No developer has yet been able to completely rewrite their engine with compute shaders in mind.
Nesting, self-dispatch and work stealing in compute are limited and inefficient even if the hardware is capable.

You can implement these with modern GPUs, but sadly the cross-platform APIs still do not expose all the tools needed.
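As a rough sketch of what that emulation usually looks like today (all names hypothetical): each pass appends the work it spawns to a buffer through an atomic counter, and the next dispatch consumes that buffer. With DispatchIndirect the per-level argument setup can stay on the GPU, but you still pay one dispatch, and one convergence point, per nesting level.

```cuda
#include <cstdio>
#include <utility>
#include <cuda_runtime.h>

struct WorkItem { int depth; };   // hypothetical payload

// One "level" of nested work: each item may append follow-up items for the next
// pass via an atomic counter (the AppendStructuredBuffer pattern, in HLSL terms).
__global__ void expandPass(const WorkItem* in, int inCount,
                           WorkItem* out, int* outCount, int maxDepth)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= inCount) return;

    // ... real work for in[i] would go here (culling, subdivision, etc.) ...

    if (in[i].depth < maxDepth) {
        int base = atomicAdd(outCount, 2);        // "spawn" two child items
        out[base + 0].depth = in[i].depth + 1;
        out[base + 1].depth = in[i].depth + 1;
    }
}

int main()
{
    const int maxDepth = 4, capacity = 1 << (maxDepth + 1);
    WorkItem *bufA, *bufB;
    int* count;
    cudaMalloc(&bufA, capacity * sizeof(WorkItem));
    cudaMalloc(&bufB, capacity * sizeof(WorkItem));
    cudaMalloc(&count, sizeof(int));

    WorkItem root = {0};
    cudaMemcpy(bufA, &root, sizeof(WorkItem), cudaMemcpyHostToDevice);

    // Host-driven loop standing in for GPU self-dispatch: with DispatchIndirect the
    // per-level argument setup stays on the GPU, but the one-dispatch-per-level
    // structure (and its convergence points) remains.
    int items = 1;
    while (items > 0) {
        cudaMemset(count, 0, sizeof(int));
        expandPass<<<(items + 255) / 256, 256>>>(bufA, items, bufB, count, maxDepth);
        cudaMemcpy(&items, count, sizeof(int), cudaMemcpyDeviceToHost);
        std::swap(bufA, bufB);
        printf("next level: %d items\n", items);
    }
    return 0;
}
```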
Memory layouts and access patterns are opaque and impossible to reconcile through compute shaders in a portable way. And so on...

This is true. Access patterns are not easy to describe, and are not portable. You need to optimize the code for a single architecture and for the needs of your data set to get the best performance.
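A small illustration of why access patterns do not travel well between architectures: the same logical data in two layouts (the particle struct and field names are made up for the example). Whether the strided AoS kernel or the SoA kernel wins depends on how many fields each pass touches, on the hardware's coalescing and cache-line behaviour, and on the data set size, none of which the shading language lets you express portably.

```cuda
#include <cuda_runtime.h>

struct ParticleAoS { float x, y, z, mass, age, pad[3]; };   // hypothetical 32-byte record

__global__ void advanceAoS(ParticleAoS* p, float dt, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    // Strided access: each thread touches 4 bytes out of every 32, so most of each
    // cache line / memory transaction is wasted when only one field is needed.
    p[i].age += dt;
}

__global__ void advanceSoA(float* age, float dt, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    // Contiguous, fully coalesced loads/stores of exactly the field we need.
    age[i] += dt;
}

int main()
{
    const int n = 1 << 20;
    ParticleAoS* aos;
    float* age;
    cudaMalloc(&aos, n * sizeof(ParticleAoS));
    cudaMalloc(&age, n * sizeof(float));
    cudaMemset(aos, 0, n * sizeof(ParticleAoS));   // placeholder data
    cudaMemset(age, 0, n * sizeof(float));

    advanceAoS<<<(n + 255) / 256, 256>>>(aos, 0.016f, n);
    advanceSoA<<<(n + 255) / 256, 256>>>(age, 0.016f, n);
    cudaDeviceSynchronize();
    return 0;
}
```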
This is probably the biggest issue I see with current GPU *hardware*, and it really does limit what you can do in terms of targeting them with more "grown up" languages. I think it's solvable without massively changing the hardware but I don't see any way around taking *some* GPU efficiency/area hit in the process. It's legitimately more complicated, but it's ultimately necessary.

Kepler can spawn kernels from a kernel (function calls basically). This can be used to avoid the static register allocation problem in some cases. It is a good start, but we need more.
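For reference, this is the Kepler feature in question, CUDA dynamic parallelism: a kernel launches another kernel with a configuration decided on the GPU. A rough sketch with illustrative names follows; it needs compute capability 3.5+ and relocatable device code (nvcc -rdc=true -lcudadevrt). Because the child is a separate kernel it also gets its own register allocation, which is how it can sidestep the parent's static allocation in some cases.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void childKernel(float* data, int count)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count)
        data[i] *= 2.0f;              // the "real" work, sized by the parent
}

__global__ void parentKernel(float* data, const int* count)
{
    // A single thread reads a GPU-produced count and launches exactly the work
    // that is needed: no CPU round trip, and the child kernel gets its own
    // register/thread configuration independent of the parent's.
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        int n = *count;
        childKernel<<<(n + 255) / 256, 256>>>(data, n);
    }
}

int main()
{
    const int n = 1024;
    float* data;
    int* count;
    cudaMalloc(&data, n * sizeof(float));
    cudaMalloc(&count, sizeof(int));
    cudaMemset(data, 0, n * sizeof(float));
    cudaMemcpy(count, &n, sizeof(int), cudaMemcpyHostToDevice);

    parentKernel<<<1, 32>>>(data, count);
    cudaDeviceSynchronize();          // waits for the parent and its child grid
    printf("done: %s\n", cudaGetErrorString(cudaGetLastError()));
    return 0;
}
```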
This has yet to be proven in general and fundamentally is pretty architecture-specific. Certainly in the past writing command buffers has been a big overhead, but that is changing quickly, and it is far from proven that GPU frontends are as fast/efficient at "gathering" data as CPUs, especially given the frequency disparity. I think it's fair to say that some element of this stuff will be good to do "on the GPU" in the future, but I think you're making assumptions for integrated chips, where CPUs are very close and memory is fully shared, that are not warranted.

I just tested my GPU driven DX11 "asteroid" proto on Haswell HD 4600. It runs slightly faster than your DirectX 12 proto, and renders 5 times as many animating "asteroids" (random objects). My proto also uses less than 0.5 milliseconds of CPU time (from a single CPU thread), so it should give the GPU an even bigger slice of the thermal budget. Just like in your proto, every "asteroid" in my proto can also have its own unique mesh, material and textures (I do not use GPU instancing). In light of this result, I still believe that a GPU driven pipeline is more energy efficient than a traditional CPU driven renderer. All the scene setup steps are highly parallel (viewport and occlusion culling, matrix setup, etc.) and the GPU goes through all of that super quickly.
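A rough sketch of the kind of scene-setup work described above: one thread per object does frustum culling and compacts the survivors into an indirect-argument buffer. It is written in CUDA for illustration with made-up struct layouts; in a D3D11/D3D12 renderer this would be an HLSL compute shader whose output feeds indirect draws (DrawInstancedIndirect / ExecuteIndirect) with no per-object CPU work.

```cuda
#include <cuda_runtime.h>

// Hypothetical per-object data and indirect-draw argument layout.
struct ObjectBounds { float cx, cy, cz, radius; };   // bounding sphere
struct DrawArgs     { unsigned indexCount, instanceCount, firstIndex, baseVertex, firstInstance; };

// One thread per object: test the bounding sphere against the view frustum,
// compact the survivors, and count them into the indirect argument buffer.
__global__ void cullObjects(const ObjectBounds* bounds, int objectCount,
                            const float4* frustumPlanes,   // 6 planes, xyz = normal, w = d
                            unsigned* visibleIndices, DrawArgs* drawArgs)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= objectCount) return;

    ObjectBounds b = bounds[i];
    bool visible = true;
    for (int p = 0; p < 6; ++p) {
        float4 pl = frustumPlanes[p];
        float dist = pl.x * b.cx + pl.y * b.cy + pl.z * b.cz + pl.w;
        if (dist < -b.radius) { visible = false; break; }  // fully outside this plane
    }

    if (visible) {
        unsigned slot = atomicAdd(&drawArgs->instanceCount, 1u);
        visibleIndices[slot] = i;   // the draw stage fetches per-object data by this index
    }
}

int main()
{
    const int n = 10000;
    ObjectBounds* bounds;
    float4* planes;
    unsigned* visible;
    DrawArgs* args;
    cudaMalloc(&bounds, n * sizeof(ObjectBounds));
    cudaMalloc(&planes, 6 * sizeof(float4));
    cudaMalloc(&visible, n * sizeof(unsigned));
    cudaMalloc(&args, sizeof(DrawArgs));
    cudaMemset(bounds, 0, n * sizeof(ObjectBounds));   // placeholder scene data
    cudaMemset(planes, 0, 6 * sizeof(float4));
    cudaMemset(args, 0, sizeof(DrawArgs));

    cullObjects<<<(n + 255) / 256, 256>>>(bounds, n, planes, visible, args);
    cudaDeviceSynchronize();
    return 0;
}
```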
As Nick noted, it's available pretty much anywhere and compiles basic obj/lib with optimization targets available for a wide variety of architectures, Jaguar included. It's basically a frontend on LLVM, so I don't think compatibility should be a major concern. There's even support for ARM/Neon in there and IIRC an experimental PTX fork too. And of course it's open source, so people can add to it and modify it as required.

I need to take another look at it. The last time I looked at it more closely was when it was originally launched. ARM/Neon support is good to have for porting purposes. It's good to know that these platforms are also supported.
With the current consoles having integrated GPUs, it won't take all that long for the resulting characteristics of unified memory and low round-trip latency to become the norm. Discrete GPUs will suffer from this, leaving them no option but to become capable of running scalar sequential code really fast, thus allowing them to run significant portions of the game's code entirely on the GPU. Of course that's just a few steps away from getting rid of the CPU and running everything on the unified GPU. Some might call that a unified CPU though.

Discrete GPUs will eventually be hurt by the unified memory consoles. But first we need the low level APIs (such as DX12 and Mantle) with manual resource management to become popular, before PC ports can fully exploit the unified nature of integrated GPUs. Obviously the parallel code might also take most of the scene data structures with it, and all of that moves to the discrete GPU. So many ways to go...
Meh, you could just as well look at the die shot of a server CPU and proclaim that caches have won, cores are sliding into irrelevance.

Caches won... That's the reality. Most of the optimization work is nowadays done to improve memory access patterns.