Perhaps it has more bearing, then, on the use of general compute or side functions that involve paging or updating the IOMMU, since the GPU can only operate as a guest, and something like updating or invalidating the IOMMU's TLB or page data can lead to stalls with the early implementations of shared virtual memory.
Some of that is CPU-driven when it comes to privileged operations, with other things like paging dominated by HDD latency and bandwidth.
Yes, it's very unfortunate that current GPUs cannot do things like update virtual memory page mappings from compute shaders. If this were possible, it would make many things faster and easier to implement (no need for software indirection in virtual texturing or custom mesh data fetching). It's awkward that you need to fire CPU-side interrupts to change your mappings. The latency of a GPU->CPU->GPU round trip in the middle of a frame is just too long for it to be practical (especially on PC).
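To make the "software indirection" concrete: since the shader can't touch real page tables, virtual texturing translates addresses itself through an indirection table. A rough sketch of that translation in plain C++ (the names, the 128-texel page size and the table layout are only illustrative assumptions; in practice this runs per sample in the pixel shader):

```cpp
#include <cstdint>
#include <utility>

// Illustrative page table entry: maps a virtual page to a page slot in the
// physical texture cache (atlas). Field names and sizes are assumptions.
struct PageEntry {
    uint16_t physPageX;   // page coordinates inside the physical texture cache
    uint16_t physPageY;
};

constexpr int kPageSize      = 128;   // texels per page side (assumption)
constexpr int kVirtualPagesX = 1024;  // virtual address space width in pages

// Translate a virtual texel coordinate to a physical texel coordinate by
// reading a software-maintained indirection table instead of real page tables.
std::pair<int, int> translate(const PageEntry* indirection, int virtX, int virtY)
{
    const PageEntry& e = indirection[(virtY / kPageSize) * kVirtualPagesX
                                     + (virtX / kPageSize)];
    int physX = e.physPageX * kPageSize + (virtX % kPageSize);
    int physY = e.physPageY * kPageSize + (virtY % kPageSize);
    return { physX, physY };
}
```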
I suspect we'll find that the consoles were a few years too early and certain architectural and manufacturing advances a few years too late.
I would have at least liked to have transactional memory (TSX instructions or similar). AMD was ahead of both Intel and IBM in announcing transactional memory extensions (x86 ASF instructions in 2009), but they still haven't said anything about the timeline for getting these instructions into real products (source: http://en.wikipedia.org/wiki/Transactional_memory).
GPU compute in the next-gen consoles has really spawned lots of interesting research. GPU compute in consumer devices has taken a big jump forward. Tools have also improved a lot (for example the Visual Studio 2013 DirectCompute debugger/profiler is miles ahead of the previous ones). It would have been nice to see how TSX would have changed things regarding multithreaded programming, but maybe that's what we'll see in the generation after this one. I would have also liked to see even more CPU cores, just to force everyone to use actual parallel programming methods instead of continuing to run six serial programs simultaneously on separate cores and synchronizing them between frames (= additional frames of input latency and poor CPU utilization).
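For reference, this is roughly what Intel's RTM/TSX interface looks like from C++ (the counter and the spinlock fallback are purely illustrative; it needs TSX-capable hardware and -mrtm or equivalent, and a production version would be more careful about retries and abort codes):

```cpp
#include <immintrin.h>   // _xbegin / _xend / _xabort (Intel RTM intrinsics)
#include <atomic>

std::atomic<bool> fallbackLocked{false};  // simple fallback spinlock (illustrative)
long long counter = 0;

void incrementCounter()
{
    unsigned status = _xbegin();
    if (status == _XBEGIN_STARTED) {
        // Subscribe to the fallback lock so the transaction aborts if another
        // thread is currently on the fallback path.
        if (fallbackLocked.load(std::memory_order_relaxed))
            _xabort(0xff);
        ++counter;          // executed as a hardware transaction
        _xend();
    } else {
        // Transaction aborted (conflict, capacity, no TSX): take the lock instead.
        while (fallbackLocked.exchange(true, std::memory_order_acquire))
            ;               // spin
        ++counter;
        fallbackLocked.store(false, std::memory_order_release);
    }
}
```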
Do you have one in mind that already exists and can be evolved to meet your needs? Or does it need to be a radically new language?
C++? Or something more high-level/safe/managed?
There are multiple nice new parallel languages in development, but I haven't had time to properly experiment with any of them.
Personally I like C++11 a lot. It improved things nicely over the old C++ standard, and C++14 seems to be continuing the trend. C++ is moving forward nicely. You can even finally create variable-parameter functions that are type safe (thanks to variadic templates).
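For example, a small type-safe variable-argument helper built on variadic templates might look like this (the logging-style use case is just an illustration):

```cpp
#include <iostream>
#include <sstream>
#include <string>

// Type-safe "printf-style" helper using C++11 variadic templates: every
// argument goes through operator<<, so there is no format string to mismatch.
inline void appendAll(std::ostringstream&) {}  // recursion terminator

template <typename First, typename... Rest>
void appendAll(std::ostringstream& out, const First& first, const Rest&... rest)
{
    out << first;
    appendAll(out, rest...);
}

template <typename... Args>
std::string makeMessage(const Args&... args)
{
    std::ostringstream out;
    appendAll(out, args...);
    return out.str();
}

// Usage: std::cout << makeMessage("frame ", 42, " took ", 16.6, " ms") << "\n";
```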
There are many nice parallel programming libraries developed on top of modern C++. There are CPU-based ones such as Intel TBB and Microsoft PPL. Both can be used to implement modern task/job based systems, but unfortunately neither is cross-platform compatible (Windows, Linux, Mac, XB1, PS4, iOS, Android, etc.), so we game developers still need to write our own systems. Nothing bad about that either, since C++ is a very flexible language, so you can usually add things that the core language doesn't support. One thing that is not possible to implement on top of straight C++ (without super excessive expression template hackery that takes years to compile) is writing SPMD-style programs that compile to optimized SoA AVX2 vector code (and are automatically multithreaded). For this you need to use something like the Intel SPMD Program Compiler (ISPC), and that's not cross-platform compatible either.
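To illustrate the kind of primitive TBB and PPL provide, here is a toy cross-platform parallel_for built on nothing but std::thread (a real engine job system would keep a persistent worker pool and use work stealing rather than spawning threads per call):

```cpp
#include <thread>
#include <vector>
#include <functional>
#include <algorithm>

// Toy parallel_for: splits [begin, end) into one chunk per hardware thread.
// Purely illustrative; it shows the shape of the API, not production quality.
void parallelFor(int begin, int end, const std::function<void(int)>& body)
{
    const int workers = static_cast<int>(
        std::max(1u, std::thread::hardware_concurrency()));
    const int chunk = (end - begin + workers - 1) / workers;

    std::vector<std::thread> threads;
    for (int w = 0; w < workers; ++w) {
        const int chunkBegin = begin + w * chunk;
        const int chunkEnd   = std::min(end, chunkBegin + chunk);
        if (chunkBegin >= chunkEnd) break;
        threads.emplace_back([=] {
            for (int i = chunkBegin; i < chunkEnd; ++i)
                body(i);
        });
    }
    for (auto& t : threads) t.join();
}
```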
C++AMP is nice for writing GPU code inside your CPU code. It uses standard C++11 primitives (lambdas) to express kernels, and it's fully C++11 compatible in syntax, with templates and all (one extra language keyword added). Auto-completion and all the other modern IDE productivity improvements work properly with it. You could also theoretically compile C++AMP code to standard CPU code by implementing a small wrapper library; however, this wouldn't automatically SoA-vectorize your code (though you could multithread it easily). C++AMP debugging is also fully integrated into Visual Studio: you can step inside the code easily, inspect memory and variables, and go through call stacks. I would prefer writing all my GPU code in C++AMP instead of HLSL, but unfortunately C++AMP is only available on Windows, and that's a big showstopper.
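A minimal C++AMP kernel looks roughly like this (the saxpy-style math is just an illustration); everything is ordinary C++11 apart from the restrict(amp) keyword:

```cpp
#include <amp.h>
#include <vector>

// Minimal C++AMP example: the lambda marked restrict(amp) is the GPU kernel,
// and array_view handles data transfer between host and device.
void scaleAndAdd(std::vector<float>& y, const std::vector<float>& x, float a)
{
    concurrency::array_view<float, 1>       yView(static_cast<int>(y.size()), y);
    concurrency::array_view<const float, 1> xView(static_cast<int>(x.size()), x);

    concurrency::parallel_for_each(yView.extent,
        [=](concurrency::index<1> idx) restrict(amp)
        {
            yView[idx] = a * xView[idx] + yView[idx];
        });

    yView.synchronize();  // copy results back to the std::vector
}
```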
Compared to an ideal solution, C++AMP sorely lacks the ability to call other kernels from inside a kernel. Kepler now has that feature in the latest CUDA version (Dynamic Parallelism). This feature is badly needed in all GPU hardware and all GPU APIs. The inability to spawn other shaders (proper function calls) from inside a shader is one of the things that has limited the ability to run general purpose code on GPUs.
Obviously even NVIDIA's cards and APIs have limitations. You can't use Dynamic Parallelism to submit draw calls, for example, which would be awesome for graphics code. Multidraw indirect is nice, as it allows the GPU to write an array of draw call parameters and the draw call count, but it's not as flexible as directly issuing draw calls from a shader.
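For reference, the per-draw record that multidraw indirect consumes from a GPU-writable buffer looks like this (OpenGL's DrawArraysIndirectCommand layout; Direct3D's indirect draw arguments are similar). A compute shader can fill an array of these, but it still can't change shaders or state, or issue the draws itself:

```cpp
#include <cstdint>

// Per-draw record for multidraw indirect (OpenGL DrawArraysIndirectCommand).
// A GPU kernel writes an array of these plus a draw count; the CPU (or an
// indirect-parameters extension) then kicks the actual draws.
struct DrawArraysIndirectCommand {
    uint32_t count;          // vertices per draw
    uint32_t instanceCount;  // instances per draw
    uint32_t first;          // first vertex
    uint32_t baseInstance;   // first instance
};
```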
Frame latency comes into play when we realize that parallelization without improving sequential performance could only continue if we were fine with computing a hundred frames simultaneously, each with a full second of latency. GPUs already resort to executing commands from multiple frames to keep all their cores busy, but it's causing problems with interactivity. For virtual reality we've already exceeded what's acceptable, and touch interfaces also prefer very low input-to-response latencies. APIs and drivers are being rewritten for low latency as we speak, mobile CPUs have adopted out-of-order execution, and GPUs will have to lower their latencies further as well.
There are people saying that GPUs need to be more like CPUs to survive: GPUs need to utilize ILP better, and GPUs need to run less parallel programs (fewer threads) better. But if you look at both NVIDIA's and AMD's latest architectures, you see a change in the opposite direction.
AMD's new GCN architecture needs at least three times as many threads as their old VLIW architecture to run efficiently. VLIW exploited ILP (through static analysis), while the new architecture is a pure DLP architecture: it doesn't extract any ILP. AMD has made the GPUs even wider since (you need a considerably bigger problem to fill an R9 290X compared to a 7970).
NVIDIA made similar changes with Kepler. Fermi had double-clocked shader cores with dynamic hardware instruction scheduling (a limited form of OoO). The new Kepler architecture instead relies on static (compile time) scheduling and halved the shader core clocks. The new architecture needs over twice as many threads (more parallelism) as the old one to be fully utilized.
So it seems that GPUs are still not wide enough to hit the scaling wall. Both companies are adding more and more compute units and at the same time making architectural choices that need even wider problems to fill even the existing cores efficiently.
And it's not hard to see why. 720p = 0.9M pixels. 1080p = 2M pixels (2.25x). 2560x1600 (a popular Android tablet and high-end PC monitor resolution) = 4M pixels (4.44x). 4096x2160 (the 4K video standard, coming to all future TV sets) = 8.8M pixels (9.6x). A year ago all console games still rendered at 720p (or lower). Now most games render at 1080p on consoles, and tablet games are starting to sport Retina resolutions (our Trials Frontier, for example, runs at native Retina resolution on iPad). Next-gen consoles will surely support 4K, and many PC players are already enjoying 2560x1600. Almost 10x growth in pixel count compared to the base 720p resolution over three GPU generations should be enough to provide the needed extra parallelism. What happens after that, nobody knows. Some people are already talking about 8K. It has benefits for movie theaters already, but my personal opinion is that 4K is enough for all current home use as long as the content is properly antialiased. But who knows whether displays and tables will be fully covered with display ink in the future. Devices like that would need obscene resolutions to look good at close distance.
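The ratios above, spelled out as a quick sanity check (plain C++, using 1280x720 as the baseline):

```cpp
#include <cstdio>

int main()
{
    const double base = 1280.0 * 720.0;                          // ~0.92M pixels
    std::printf("1080p:     %.2fx\n", 1920.0 * 1080.0 / base);   // 2.25x
    std::printf("2560x1600: %.2fx\n", 2560.0 * 1600.0 / base);   // 4.44x
    std::printf("4K:        %.2fx\n", 4096.0 * 2160.0 / base);   // 9.60x
}
```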
Ideas have also been thrown out (in B3D threads) that the CPU and GPU could share a big pool of vector execution units. I don't see this as a realistic future direction, since a greater distance between units means more electricity used to transfer the data and the commands, and it also adds latency. One thing NVIDIA did with Kepler was to replicate some hardware in multiple cores instead of sharing it. This cost some die area, but according to NVIDIA it both improved performance and reduced power usage. This is also one of the problems in AMD's Bulldozer vector unit design: the shared unit (between two cores) has much higher latencies than Intel's units (one unit per core). I also expect the data transfer cost to be higher compared to Intel's design.
I'm actually observing some of that happening already. CPUs are very competitive at generic computing, and will become even more competitive with AVX-512. GPU manufacturers are thus not able to monetize their GPGPU efforts, and therefore they focus more on pure graphics again. Ironically, games are doing ever more generic computing, which will require greater versatility again. So CPUs and GPUs are inevitably on a collision course, which can only result in unification.
Ironically, right now I would be perfectly happy if GPU rasterizers reverted to fixed-function pixel processing. It's the vertex processing and compute shaders that I need.