Sorry, the exact details could reveal IP. But you can probably imagine that each of these steps takes multiple instructions to implement, versus a nearly 1-to-1 translation for add, mul, dot4, etc. With SSE5/AVX, CPUs will even have fused multiply-add... Compared to that, texture sampling is very expensive (though per-component shift and multiply-add do optimize address generation). Again, a gather instruction would be quite useful...
Do you have a breakdown for address generation, decompression and filtering, say for simple 8bpc bilinear?
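To give a feel for why sampling costs so much more than an add or mul, here is a generic textbook sketch of bilinear filtering on a single-channel 8-bit texture. It is not SwiftShader's implementation (those details aren't public); the `Texture8` struct, the `sampleBilinear` and `clampInt` names, the clamp addressing mode and the uncompressed row-major layout are all assumptions made for illustration, and there is no decompression step here.

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical single-channel 8-bit texture (illustration only).
struct Texture8 {
    int width;
    int height;
    std::vector<std::uint8_t> texels;  // row-major, width * height entries
};

// Clamp an integer coordinate to [lo, hi].
static int clampInt(int value, int lo, int hi) {
    return value < lo ? lo : (value > hi ? hi : value);
}

// Bilinear sample at normalized (u, v), clamp addressing, result in [0, 1].
float sampleBilinear(const Texture8& tex, float u, float v) {
    // 1. Address generation: scale to texel space (texel centers sit at +0.5),
    //    then split into an integer texel index and a fractional weight.
    float x = u * tex.width - 0.5f;
    float y = v * tex.height - 0.5f;
    float floorX = std::floor(x);
    float floorY = std::floor(y);
    float fx = x - floorX;
    float fy = y - floorY;
    int x0 = static_cast<int>(floorX);
    int y0 = static_cast<int>(floorY);

    // Clamp the coordinates of the four neighbors to the texture edges.
    int xa = clampInt(x0, 0, tex.width - 1);
    int xb = clampInt(x0 + 1, 0, tex.width - 1);
    int ya = clampInt(y0, 0, tex.height - 1);
    int yb = clampInt(y0 + 1, 0, tex.height - 1);

    // 2. Fetch: four separate loads from computed addresses. This is the part
    //    a gather instruction would collapse into one operation.
    float t00 = tex.texels[ya * tex.width + xa];
    float t10 = tex.texels[ya * tex.width + xb];
    float t01 = tex.texels[yb * tex.width + xa];
    float t11 = tex.texels[yb * tex.width + xb];

    // 3. Filter: three linear interpolations, each one a multiply-add.
    float top = t00 + fx * (t10 - t00);
    float bottom = t01 + fx * (t11 - t01);
    return (top + fy * (bottom - top)) / 255.0f;
}
```

Even in this stripped-down form, one sample needs a couple of dozen scalar operations plus four dependent loads, whereas an add or mul in the shader maps to roughly one instruction.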
Sure. Press 's' to see the stats. By varying resolution you can get an idea of how much time is spent doing vertex processing and setup, versus pixel processing.
So I could install RenderMonkey, put the SS DX9.dll in the same folder and get some realtime execution stats from some simple shaders?
Automatic prefetch and out-of-order execution do a great job as far as I'm aware. And Hyper-Threading also means that at a hardware level twice the work is "in flight".
Presumably, as a result of this, SS doesn't need to have many pixels in flight to hide texturing latency.