But isn't this the case already? Latency-inducing reads on current GPUs are dealt with by a specialized unit outside the ALU (processor = ALU in this terminology), and latency is hidden simply by switching contexts.
R580 does. NVidia's current GPUs don't. Actually, NVidia blurs this distinction because, in its view, the single long ALU/TEX pipeline inside G71's quads, say, is actually executing hundreds of threads, one after another (each quad of pixels in a quad-pipeline counts as a thread in NVidia's terms). ATI and NVidia don't see eye to eye on this terminology.
You seem to be suggesting taking a shader and splitting it along TEX boundaries. Each TEX instruction turns into "metadata" attached to the piece of code that follows it, so no piece contains a texture load. Each piece then marks what input data it requires in order to run (fragment inputs, texture results, registers from the previous code clause). The GPU then picks fragments and feeds them to the "stream processors" only when their data is available, initiating loads for fragments whose metadata requirements can be satisfied.
Yeah. R580 and Xenos already do this.
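As a toy sketch of that scheme (everything here is invented for illustration - `alu.run` and `tmu.fetch` stand in for the real execution units, and no GPU exposes anything like this directly):

```python
from dataclasses import dataclass, field

@dataclass
class Clause:
    instructions: list                 # ALU-only code, no texture loads inside
    needs_tex: int = -1                # id of the TEX fetch gating this clause (-1: none)

@dataclass
class Fragment:
    clause_idx: int = 0                          # next clause to run
    tex_ready: set = field(default_factory=set)  # fetches that have landed
    tex_requested: set = field(default_factory=set)

def schedule_pass(fragments, clauses, alu, tmu):
    """One scheduling pass: issue a fragment's next clause only if its
    gating texture result is present; otherwise kick off the fetch."""
    for frag in fragments:
        if frag.clause_idx >= len(clauses):
            continue                              # shader finished
        clause = clauses[frag.clause_idx]
        if clause.needs_tex < 0 or clause.needs_tex in frag.tex_ready:
            alu.run(clause.instructions, frag)    # data ready: run the clause
            frag.clause_idx += 1
        elif clause.needs_tex not in frag.tex_requested:
            tmu.fetch(clause.needs_tex, frag)     # initiate the load, once
            frag.tex_requested.add(clause.needs_tex)
```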
My problem with this "split up the shader and only run segments when the data is there" approach is that it reminds me of out-of-order execution (OOOE) rather than threading. With threading, you run until you're about to block on I/O, then yield so someone else can do work. That adjusts dynamically and is very simple to implement.
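To make the distinction concrete, here's a toy cooperative-threading sketch using Python generators (nothing here is how any GPU actually implements it):

```python
# Each "thread" runs until it issues a texture fetch, then yields;
# the scheduler resumes it once the data has arrived.

def shader(frag_id):
    # ...some ALU work...
    texel = yield ("fetch", frag_id)    # about to block on I/O: yield
    # ...more ALU work using texel...
    return texel

def run(num_fragments):
    threads = {i: shader(i) for i in range(num_fragments)}
    pending = {i: next(t) for i, t in threads.items()}   # run each to its first yield
    for i, request in pending.items():
        try:
            threads[i].send(f"texel for {request}")      # data arrived: resume
        except StopIteration:
            pass                                         # shader finished

run(4)
```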
In G71, say, the threading model is driven by the size of the register file. If there's space for 4 FP32s for each of 880 fragments in a batch, then there's alternatively space for 2 FP32s for each of 1760 fragments, or 8 FP32s for each of 440. The quad-pipeline is 220 clocks long and processes a quad of 4 fragments per clock, hence 880 is the default batch size.
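Back-of-envelope, using those figures (4 fragments per quad per clock is the assumption):

```python
REGISTER_FILE = 4 * 880      # FP32 slots available (4 regs x 880 fragments)

for regs_per_fragment in (2, 4, 8):
    batch = REGISTER_FILE // regs_per_fragment
    print(f"{regs_per_fragment} FP32 regs/fragment -> {batch} fragments/batch")
# 2 -> 1760, 4 -> 880, 8 -> 440
```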
If the shader contains some seriously awkward texture fetches/filtering then you'll lose performance, as the latency can only be partly hidden. There's no easy way to predict that, since it depends on cache-thrashing and bandwidth. Latency hiding depends on there being enough pixels in the batch, because batch size determines the count of clock cycles from one instruction to the next for a given pixel.
So the 440-fragment batch will run at roughly half-performance on bilinearly filtered textures, because a 440-fragment batch only covers 110 of the 220 clocks and the other half are bubbles. If the shader has no texture fetches, then there's no effect on performance.
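The same arithmetic as a sketch - again assuming the quad-pipeline issues 4 fragments per clock, so a batch covers batch/4 clocks of the 220-clock latency:

```python
PIPELINE_DEPTH = 220          # clocks from one instruction to the next

def utilisation(batch_size, frags_per_clock=4):
    clocks_covered = batch_size / frags_per_clock    # issue time of one batch
    return min(1.0, clocks_covered / PIPELINE_DEPTH)

for batch in (880, 440):
    print(f"{batch} fragments -> {utilisation(batch):.0%} of peak")
# 880 -> 100%, 440 -> 50%
```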
On the other hand, splitting a shader into pure-functional, I/O-less chunks and scheduling data loads and packets seems to require a lot more logic, because of the potential for out-of-order execution. You can't run chunk N+1 if it depends on registers written in chunk N, and those registers were in turn dependent on calculations from a texture load. And so on. We then have to do a lot of bookkeeping.
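A minimal sketch of that bookkeeping (the `Chunk` shape and register names are invented): per fragment, you need a scoreboard of registers still in flight from texture loads:

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    reads: set    # registers this chunk consumes
    writes: set   # registers it produces

def can_issue(chunk, outstanding):
    """Chunk N+1 must stall while any register it reads is still being
    filled by a texture load issued from an earlier chunk."""
    return not (chunk.reads & outstanding)

# r2 is still in flight from a TEX load in the previous chunk:
print(can_issue(Chunk(reads={"r2"}, writes={"r5"}), outstanding={"r2"}))  # False
```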
This is what Xenos and R5xx do. Even the cheap $50 X1300HM does this.
Not only do you fill your ALU pipeline (no bubbles) but you can minimise the bubbles in the (decoupled) TMU pipeline.
G71's TMU pipeline, not being decoupled from the ALU pipeline, doesn't offer any flexibility in latency-hiding. So the batch size remains high to hide the typical worst-case latency. R5xx doesn't depend on a single batch executing for enough cycles to hide latency, it swaps batches repeatedly. Which is how it can get away with batch sizes of 16 or 48 - and then you get into the whole argument about efficient dynamic branching requiring small batches.
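A crude model of why small batches help branching - assume a branch whose outcome flips at one screen position, and batches built from contiguous pixels; only batches straddling the edge pay for both paths (numbers are purely illustrative):

```python
EDGE = 500     # hypothetical pixel where the branch outcome flips
WIDTH = 1024   # pixels in the region being shaded

def straddling(batch_size):
    """Fraction of batches containing pixels on both sides of the branch."""
    starts = range(0, WIDTH, batch_size)
    return sum(1 for s in starts if s < EDGE < s + batch_size) / len(starts)

for batch in (16, 48, 880):
    print(f"batch {batch}: {straddling(batch):.1%} of batches run both paths")
# batch 16: ~1.6%, batch 48: ~4.5%, batch 880: 50%
```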
The problem that D3D10 introduces is that textures are not the only source of latency. The best example of this is "constant buffers". SM4 allows devs to create fantastically complex structures as constants. You could easily have one structure amounting to tens of KB, and SM4 supports an effectively unlimited count of these constant structures. The point being that constants are simply too large to keep entirely on die (in much the same way as shader programs can't be, if they're very long). So the GPU now has to implement some kind of latency hiding when referring to apparently innocuous constants - by default, all constants live in video RAM.
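Illustrative numbers only (the 256-bone palette is a made-up example, not an SM4 limit):

```python
FLOAT4 = 16                      # bytes per float4 constant
MATRIX = 4 * FLOAT4              # one 4x4 matrix = 64 bytes

palette = 256 * MATRIX           # e.g. a 256-bone skinning palette
print(palette // 1024, "KB in one constant buffer")       # 16 KB
print(64 * palette // 1024, "KB across 64 such buffers")  # 1024 KB - far beyond on-die storage
```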
Clearly there'll be some kind of caching for constants. I think it's fair to say that a small population of constants will probably fit entirely in cache.
Jawed