> That's absurd. A texture lookup only requires the coordinates and the sampler index. This sampler index can be extended to samplers from multiple shaders. In other words, if a texture unit can sample different textures within the same shader, it can also sample textures from different shaders. Right?

In theory it could, but I'm pretty sure it doesn't do it in practice: it still assumes that your shader units are able to context switch very quickly between shaders of different contexts while texture operations from a different context are still in flight. Right now, when you're running CUDA, *all* shader units across the chip have to run and complete(!) the same shader program before it can move on to the next one. That's two steps away from what you want.
Now it's possible (likely?) that the architecture allows starting up a new shader program as soon as the last one has been launched, but that's still a far cry from switching between contexts all the time. Let's put it this way: if the architecture allowed doing that efficiently, it'd be an incredible amount of wasted area for a feature that nobody uses.
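For what it's worth, here's how I read the quoted argument, as a purely hypothetical C++ sketch (not any real hardware or driver structure): the texture request itself is tiny, so tagging it with a context ID looks cheap from the texture unit's side. The objection is about everything around it.

```cpp
// Hypothetical texture request record, for illustration only: not any real
// hardware or API structure.
#include <cstdint>
#include <cstdio>

struct TextureRequest {
    float    u, v;           // texture coordinates
    uint16_t sampler_index;  // which sampler/texture descriptor to use
    uint16_t context_id;     // hypothetical extension: which context issued it
};

int main() {
    TextureRequest req{0.25f, 0.75f, /*sampler_index=*/3, /*context_id=*/1};
    std::printf("request: (%.2f, %.2f) sampler %d, context %d\n",
                req.u, req.v, req.sampler_index, req.context_id);
    // The objection above isn't about this record: it's about the shader units
    // having to hold and swap per-context program state while requests like
    // this one, issued by other contexts, are still in flight.
    return 0;
}
```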
> Games have dozens of shaders with many different characteristics. Some perform a Gaussian blur and are completely TEX limited, while for instance vertex shaders are typically purely arithmetic, and particle shaders are ROP limited.

Of course they do, but there'll always be a few shaders that will eat up the majority of the time.
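To put rough numbers on both sides (completely made up, just to show the shape of the argument): interleaving a TEX-bound shader with an ALU-bound one can keep both units busy, but the benefit shrinks quickly once a single shader dominates the frame.

```cpp
// Toy utilization model with made-up numbers, not measurements: two shaders
// with different bottlenecks, run back-to-back vs. perfectly interleaved.
#include <algorithm>
#include <cstdio>

struct Shader {
    double tex_cycles;  // cycles of texture-unit work per frame
    double alu_cycles;  // cycles of ALU work per frame
};

int main() {
    Shader blur      {100.0, 20.0};   // TEX-bound, e.g. a Gaussian blur
    Shader particles { 20.0, 100.0};  // ALU-bound, e.g. a particle shader

    // Run one after the other: each pass is limited by its own bottleneck.
    double serial = std::max(blur.tex_cycles, blur.alu_cycles)
                  + std::max(particles.tex_cycles, particles.alu_cycles);

    // Perfectly interleaved: limited by the busier unit over the combined work.
    double interleaved = std::max(blur.tex_cycles + particles.tex_cycles,
                                  blur.alu_cycles + particles.alu_cycles);

    std::printf("serial: %.0f cycles, interleaved: %.0f cycles (%.0f%% faster)\n",
                serial, interleaved, 100.0 * (serial - interleaved) / serial);
    // If one shader dominates (say the blur at 1000 TEX cycles), the gap
    // between serial and interleaved shrinks to a few percent.
    return 0;
}
```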
> Combined they utilize the hardware way better than one could do on its own.

Even if that's true (which I doubt), you're still ignoring the little detail of switching contexts. It's the key to this whole discussion. And we're not talking about 2 contexts, but 10.
I'm not saying GPUs currently don't support switching contexts: somehow they have to run multiple 3D programs in Windows. But the idea that they have been engineered to do this extremely quickly is doubtful. It's easy to test, BTW: just run two demanding 3D programs at the same time and check whether performance drops more or less than you'd expect.
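If you want a scriptable stand-in for that experiment, something like the CUDA sketch below would do (a compute loop is obviously not a full 3D workload, and the kernel, sizes, and time window are arbitrary choices): run one copy and note the launches per second, then run two copies at once. If context switching were cheap, you'd expect each copy to land near half the single-copy rate; a much bigger drop suggests expensive switches.

```cuda
// Minimal contention probe: one copy reports its own kernel throughput; run a
// second copy in parallel and compare. All sizes here are arbitrary.
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

__global__ void burn(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = data[i];
        for (int k = 0; k < 1000; ++k)      // arbitrary amount of ALU work
            x = x * 1.000001f + 0.5f;
        data[i] = x;
    }
}

int main() {
    const int n = 1 << 22;
    float* d = nullptr;
    cudaMalloc((void**)&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));

    using Clock = std::chrono::steady_clock;
    const auto start = Clock::now();
    int iters = 0;
    // Keep the GPU busy for ~5 seconds and count completed kernel launches.
    while (std::chrono::duration<double>(Clock::now() - start).count() < 5.0) {
        burn<<<(n + 255) / 256, 256>>>(d, n);
        cudaDeviceSynchronize();
        ++iters;
    }
    const double secs =
        std::chrono::duration<double>(Clock::now() - start).count();
    std::printf("%.1f kernel launches per second\n", iters / secs);

    cudaFree(d);
    return 0;
}
```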
> Thread contexts can be switched cheaply.

How?