Apart from the re-ordering of memory operations, a TU's addressing, fetching, caching and filtering form a pipeline for the purpose of delivering texture results to the cluster. The fact that the stages have variable latency doesn't mean the TU is not working as fast as it can.
You can make that excuse for any unit that isn't achieving its peak throughput. That's what this is about - throughput.
I'm simply saying that during compilation it appears to be one class but in execution it turns out to be the opposite.
Veined Marble is just a gross example of the way that instantaneous ALU:TEX ratio can radically differ from a shader's overall ALU:TEX, making estimates of performance tricky.
I fully agree that compilation and execution can be different. My point lies elsewhere.
Take shader A, which has clustered texture access, and create a new shader B that has the same number/type of instructions (including the same sampling characteristics) but with a more even distribution. My claim is that the execution time of A will not differ from the execution time of B. Local ALU:TEX ratio is inconsequential.
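To put the claim in concrete terms: once enough batches are in flight to saturate the units, steady-state execution time is set by the busier unit's total work, not by where the fetches sit in the instruction stream. A toy sketch (the instruction counts are made up for illustration, and the single-issue rates are an assumption):

```python
# Toy steady-state throughput model: with enough batches in flight,
# execution time is max(total ALU work, total TEX work) in clocks.
# Local clustering of fetches never appears in the formula.

def shader_time(total_alu, total_tex, alu_per_clock=1.0, tex_per_clock=1.0):
    """Clocks to drain one batch's worth of work once all units are saturated."""
    return max(total_alu / alu_per_clock, total_tex / tex_per_clock)

# Shader A: 40 ALU + 10 TEX with the fetches clustered.
# Shader B: the same 40 ALU + 10 TEX spread evenly.
# The model only sees the totals, so the two are identical by construction.
time_a = shader_time(total_alu=40, total_tex=10)
time_b = shader_time(total_alu=40, total_tex=10)
assert time_a == time_b  # local ALU:TEX distribution is inconsequential
```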
But for 80% of the execution time, both ALUs and TUs are "fully utilised" - i.e. for the majority of execution time it's class C.
You're bastardizing my system of classification. My entire point is that you can't be one class part of the time and another class another part of the time, unless you're talking about gradual changes over thousands of cycles due to spatial workload variation (e.g. transitioning from magnification to minification).
In this case, it's class A 100% of the time. A shader is only class C if it is class A and B all the time. In retrospect I never should have included it, and instead should have called it class A-B or something.
Well, if it takes 2 clocks for the TU to produce each texture result, then that halves the ALU:TEX ratio of the GPU, making it approximately 0.85 for Veined Marble on R600. Obviously it can't hide that latency.

Exactly, but it's not a latency problem. It's a throughput problem. Even if you had more threads, the shader would still run at the same speed.
Just out of curiosity, why do you call it 0.85 instead of 3.4?
And seeing that the X1800XT runs Car Surface and Veined Marble at almost exactly the same speed (implying that Veined Marble is at least very close to being ALU-limited), it's worth noting that GPUSA says this shader is 6.75 ALU:TEX on X1800XT. Since the hardware is 1:1, that implies these volume texture lookups are taking far more than 2 clocks on average!
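The arithmetic behind that inference, as I read it (the 6.75 figure is from GPUSA; treating "nearly ALU-limited" as "TEX time nearly equals ALU time" is my paraphrase):

```python
# On 1:1 hardware, the ALU side spends alu_per_fetch clocks of work for
# every fetch issued. If the shader is only barely ALU-limited, the TU
# must be taking nearly that many clocks per volume lookup.
alu_per_fetch = 6.75   # GPUSA's ALU:TEX for Veined Marble on X1800XT
hw_tex_per_alu_clock = 1.0  # X1800XT can issue one fetch per ALU clock

# Upper bound on clocks per lookup consistent with being ALU-limited:
clocks_per_lookup_bound = alu_per_fetch / hw_tex_per_alu_clock
assert clocks_per_lookup_bound > 2  # far more than the 2 clocks assumed earlier
```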
What makes you say Veined Marble is merely "close" to being ALU limited on the X1800XT? Judging by the X1900XT's notable boost, it definitely is (although it's texture limited on the X1900XT).
But yes, it does look like 4+ clocks per lookup. Trilinear filtering would take 4 clocks, FYI.
It seems it's C if the X1800XT data is meaningful (since that's so heavily ALU-limited) - though it is a different architecture, with no L2 and with a less effective memory system...
I still think it's A, because volume textures take a while to filter. R600 is actually a hair slower per clock than R580 in this test, and I would think that it has more register space.
The G80 results you found suggest the same thing. Even if we assume that its scalar architecture lets it compile down to needing only 16 FP32 registers, it can only have 4092 pixels in flight, which is half of what R600 can hold. It gets twice the framerate, though, meaning it has 1/4 of the time to get the fetches from memory. I'm very sure that this shader is filtering-throughput limited.
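Spelling out that 1/4 figure with the numbers from the post (the 16-register compilation result is an assumption, as noted; the pixels-in-flight counts are the post's own):

```python
# G80 latency budget relative to R600 for this shader.
g80_pixels_in_flight  = 4092              # from the 16-register assumption
r600_pixels_in_flight = 2 * g80_pixels_in_flight  # post: G80 holds half of R600

# Half the pixels in flight means half the latency-hiding capacity per
# unit time; twice the framerate halves the available time again.
relative_latency_budget = (g80_pixels_in_flight / r600_pixels_in_flight) / 2
assert relative_latency_budget == 0.25    # 1/4 the time to cover each fetch
```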
Anyway, look closer at C. Do you have any objection to the 432 cycle figure? That's how long a batch has been waiting for data. That's how long memory latency has to be for the 27 batches to be insufficient when situation A doesn't apply, i.e. the shader is ALU throughput limited. I assumed the bare minimum ALU instructions, too, as more instructions would mean more latency hiding.
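My reconstruction of where 432 comes from, since 432 / 27 = 16 exactly (the 16 ALU cycles per batch is my assumption about the "bare minimum ALU instructions", not a figure stated in the post):

```python
# While a batch waits on its fetch, the other batches keep the ALUs busy
# round-robin. Memory latency is hidden only up to the total ALU time
# the in-flight batches can supply.
num_batches = 27
alu_cycles_per_batch = 16  # assumed: bare-minimum ALU issue time per batch

hidden_latency = num_batches * alu_cycles_per_batch
assert hidden_latency == 432  # latency beyond this stalls the ALUs
```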