A single image buffer/texture trashes the slow L3 cache completely! Your 8 MBytes are nothing - NOTHING - in modern graphics.
No need to yell. 8 MB is definitely something, and with 25.6 GB/s of RAM bandwidth it can be refilled over 50 times per frame at 60 Hz. So the fact that writing a "single" image buffer/texture, with no other accesses happening in between, would thrash the L3 cache "completely" isn't all that terrible.
In practice it's far better than that worst case, thanks to the 16-way associativity and (P)LRU replacement policy: data that is accessed a few times per frame doesn't get evicted. This is probably the case for FQuake, for instance. It achieves up to 160 FPS at 2560x1600, which is a 16 MB color buffer, so the L3 can't hold it, but that amounts to only 10% of the RAM bandwidth. The textures are tiny and repetitive, so it's entirely possible for them to stay in the 8 MB L3 even though 16 MB is written to the color buffer every frame.
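For what it's worth, here's a quick back-of-envelope check of those figures in plain C++ (the 8 MB L3, 25.6 GB/s, 60 Hz and 32-bit color at 2560x1600 are the assumptions stated above):

// Back-of-envelope check of the figures above. Assumed: 8 MB L3,
// 25.6 GB/s of RAM bandwidth, 60 Hz, and a 32-bit 2560x1600 color buffer
// rendered at 160 FPS.
#include <cstdio>

int main()
{
    const double l3_bytes     = 8.0 * 1024 * 1024;
    const double ram_bytes_s  = 25.6e9;
    const double refills      = ram_bytes_s / 60.0 / l3_bytes;   // ~51 per frame

    const double color_buffer = 2560.0 * 1600.0 * 4.0;           // ~16 MB
    const double fquake_usage = color_buffer * 160.0;            // bytes/s at 160 FPS
    const double share        = fquake_usage / ram_bytes_s;      // ~0.10

    std::printf("L3 refills per frame at 60 Hz: %.0f\n", refills);
    std::printf("Color buffer: %.1f MB, share of RAM bandwidth: %.0f%%\n",
                color_buffer / (1024.0 * 1024.0), share * 100.0);
    return 0;
}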
Also note that thanks to automatic prefetching, once the cache has been thrashed it takes only a very short while for the hit ratio to become very high again, as long as the access pattern is coherent (which is more often than not the case for graphics). In fact I found it impossible to beat the automatic prefetching with software prefetching. So you can get good use of those 50 refills while maintaining a high average hit ratio, which keeps the cores from stalling a lot.
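To be concrete about what I mean by software prefetching: issuing _mm_prefetch a fixed distance ahead of a linear walk, something like this sketch (the 16-cache-line distance is an arbitrary assumption you would have to tune per CPU):

// Minimal sketch of software prefetching on a coherent, linear walk.
#include <xmmintrin.h>  // _mm_prefetch
#include <cstddef>

float sum_with_prefetch(const float* data, size_t count)
{
    const size_t ahead = 16 * 64 / sizeof(float);  // 16 cache lines of 64 bytes
    float sum = 0.0f;
    for (size_t i = 0; i < count; i++)
    {
        if (i + ahead < count)
            _mm_prefetch(reinterpret_cast<const char*>(data + i + ahead), _MM_HINT_T0);
        // Sequential access: the hardware prefetcher already has this covered.
        sum += data[i];
    }
    return sum;
}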
In my experience software rendering is arithmetic limited rather than bandwidth limited, and the L3 cache is one of the reasons it isn't bandwidth limited. It would take AVX-512 to change that, and by then DDR4 will be standard, while an L4 cache would retain all color buffers, depth buffers and lots of other data, with roughly twice the bandwidth and much lower power consumption than RAM. So I don't see bandwidth as an issue at all for CPU-GPU unification.
We are not talking about a Java lecture class teaching how to draw something on the screen in the easiest and most elegant way! We are talking about how you can do the best graphics with the limited resources given!
The reality is that sacrificing some low-level performance is necessary to gain high-level performance. Nobody writes their shaders in assembly any more, because it's too tedious and it's hell to maintain. It is humanly impossible to keep track of all the little details and keep everything tuned to perfection while the high-level rendering approaches change and are sometimes completely overhauled. So we use high-level languages to crank up productivity, and we rely on compilers to do the repetitive low-level optimization work. The transition from assembly shaders to high-level shaders happened around the time of shader unification, because that's when you no longer had to worry about which instructions your code would translate into or how to optimize register usage for the different cores.
We're now approaching the point where heterogeneous computing is riddled with too many low-level details for the developer to manage effectively. The solution is unification to make it compiler-friendly, and while that definitely comes at a cost, that cost will eventually be lower than what is gained by letting developers focus on the high-level computing problems, which have a greater impact on performance.
This is an important driving force for more convergence. In the near future we're getting unified memory for the exact same reasons, and even though it comes at a hardware cost it's advertised as a performance feature, because it allows developers to worry less about managing memory transfers and lets the hardware handle them more efficiently overall.
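To make concrete what "worrying less about managing memory transfers" means, here's a sketch with made-up names (gpu_alloc, gpu_upload, gpu_readback and shared_alloc are hypothetical stand-ins, backed by plain malloc/memcpy just so it compiles, not any real API):

#include <cstddef>
#include <cstdlib>
#include <cstring>

// Hypothetical stand-ins, not a real API; backed by malloc/memcpy only so
// this sketch is self-contained.
void* gpu_alloc(size_t bytes)                                { return std::malloc(bytes); }
void  gpu_upload(void* dst, const void* src, size_t bytes)   { std::memcpy(dst, src, bytes); }
void  gpu_readback(void* dst, const void* src, size_t bytes) { std::memcpy(dst, src, bytes); }
void* shared_alloc(size_t bytes)                             { return std::malloc(bytes); }

void without_unified_memory(float* host, size_t n)
{
    // The developer schedules every transfer explicitly, in both directions.
    void* device = gpu_alloc(n * sizeof(float));
    gpu_upload(device, host, n * sizeof(float));
    // ... GPU works on 'device' ...
    gpu_readback(host, device, n * sizeof(float));
    std::free(device);
}

void with_unified_memory(size_t n)
{
    // One allocation visible to both sides; the hardware and runtime decide
    // when data actually has to move.
    float* data = static_cast<float*>(shared_alloc(n * sizeof(float)));
    // ... CPU and GPU both read and write 'data' ...
    std::free(data);
}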
Having a unified processing pipeline requires additional explicit commands in order to tell the CPU that we don't want to trash the caches, stream data, etc. and trying to write efficient code without any structured boundaries is logistically a contradiction.
Separating critical code sections for finely tuned processing of sorted data, with as little interaction with outside code as possible, is the most fundamental principle of performance-efficient software on the CPU, GPU, APU, everywhere.
It is a pipe dream to think that you can finely control all these things and gain a large benefit, across hundreds of different heterogeneous setups.
Things like streaming writes and software prefetches rarely result in a significant speedup, but can easily work adversely in scenarios you didn't foresee. For example a small render target texture should not use streaming stores if it's soon reused, but it should if it isn't, and what's considered small and what's considered soon will change between hardware generations and may not be deterministic due to user input. So it's generally better to let the hardware use conservative heuristics and to focus on your algorithms and universally applicable optimizations. Don't get me wrong, I'm not saying you should be oblivious to the architecture(s) you're targeting, but don't take Knuth's advice lightly: premature optimization is the root of all evil.
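To illustrate that render target example with SSE intrinsics: whether to use _mm_stream_ps is exactly the kind of decision that hinges on "small" and "soon", which shift between hardware generations. A sketch (pixels must be 16-byte aligned and count a multiple of 4):

#include <xmmintrin.h>
#include <cstddef>

// Sketch of the streaming-store dilemma when clearing a render target.
void clear_target(float* pixels, size_t count, bool reused_soon)
{
    const __m128 value = _mm_set1_ps(0.0f);
    if (reused_soon)
    {
        // Regular stores: keep the target in the cache hierarchy for the reuse.
        for (size_t i = 0; i < count; i += 4)
            _mm_store_ps(pixels + i, value);
    }
    else
    {
        // Non-temporal stores: write around the caches so other data isn't evicted.
        for (size_t i = 0; i < count; i += 4)
            _mm_stream_ps(pixels + i, value);
        _mm_sfence();  // make the streaming writes visible before anyone reads them
    }
}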
Showing examples where a scattered access to a data stream here and there can improve something is not an argument for your radical denial that there are problem classes which will always be processed fundamentally more efficiently by hardware that is designed to process them.
That argument instantly falls apart when we look at unified shaders. Vertices and pixels are no longer fundamentally processed more efficiently by hardware that is designed to process them.
I won't even start to argue with your view that the failure of Larrabee as a GPU is no argument against your agenda...
I don't view it as that black and white. Larrabee taught us things that work and things that don't work, to various degrees and under various circumstances. All I was saying is that Larrabee is definitely not direct proof that unification of the CPU and integrated GPU won't work, because that's not what it is by a long shot. You're going to have to use very specific arguments, and yet not lose track of the bigger picture, to argue against that.