Upcoming ATI Radeon GPUs (45/40nm)

BTW - Comparing no-AF/AF relative performance differences doesn't really tell you much. For one, the caching mechanisms between the chips are completely different, and this affects AF significantly.
I agree that there are a lot of factors in AF, so you can't take the percentage drop and determine anything from it. But if there is a low drop, and the texture resolution isn't low enough to limit AF to a tiny percentage of texture fetches, then there's a good chance that you're rarely texture limited.

The only problem with this logic is that some texture operations aren't ever affected by AF, like shadow map samples and post-processing. Given the prevalence of these in recent games, you may be right that no-AF/AF doesn't help us much.

I suppose you can still look at the difference between NVidia and ATI, though, to see if AF is ever the reason that ATI's chips become less competitive than NVidia's. I remember this being the case with Crysis with R600 at one point, but it may have changed with drivers.
Additionally, RV770 has 32 texture interpolators and 40 texture units, meaning, again, that it does a maximum of 32 bilinear filters per clock, while it can use all 40 in AF scenarios; so comparing no-AF/AF results between this and R6xx doesn't really give you a like-for-like comparison.
Given that you work at ATI I don't want to sound like a prick, but I think you're wrong here. RV770 can do 40 bilinear fetch and filter ops/clock, even without AF. The only problem is when an application uses one vec4 interpolator per fetch in the shader, but that's really not necessary. Not only can you pack two UV pairs per interpolator, but the coordinates are often related between fetches, allowing you to use PS math instead of many interpolants from the VS to generate the coordinates.
 
No. If you have a shader that uses 10 registers then you only have 25 batches in flight, which is 100 clocks of latency hiding - about half of what's required to hide memory latency. So if there's a section of the shader with 2-level dependent texturing coupled with a low ALU:TEX ratio, then that part of the shader is going to bottleneck in a way that's not represented by the shader as a whole - the cluster simply runs out of threads.

Jawed
You're way off here.

First of all, each batch takes 16 clocks to be serviced by the TMU quad in the SIMD, so it's 400 clocks of latency hiding with 10-register shaders.

Second of all, even if you are latency limited due to 20+ regs per shader, it doesn't change the fact that clustered dependent tex ops are no different than evenly spread out tex ops followed by math dependent on each result. Statistically, you'll only be idling either the TMU or the ALU, not both, so the overall ratio is all that matters. You'd need a pretty pathological shader for even 10 batches to be insufficient for this to be the case.
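The two latency-hiding figures being argued over come straight from the batch arithmetic. A minimal sketch, assuming a 16384-entry vec4 register file per SIMD and 64-pixel batches (both assumptions consistent with the numbers in the thread, not stated outright):

```python
REG_FILE = 16384   # vec4 registers per SIMD (assumed)
BATCH    = 64      # pixels per batch (assumed)

def batches_in_flight(regs_per_pixel):
    """Batches that fit in the register file for a given per-pixel count."""
    return REG_FILE // (regs_per_pixel * BATCH)

n = batches_in_flight(10)     # 25 batches for a 10-register shader
print(n, n * 4, n * 16)       # 25 100 400
# 4 clocks/batch gives Jawed's 100-clock figure;
# 16 clocks of TMU service/batch gives Mintmaster's 400-clock figure.
```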
 
RV770 can do 40 bilinear fetch and filter ops/clock, even without AF. The only problem is when an application uses one vec4 interpolator per fetch in the shader, but that's really not necessary. Not only can you pack two UV pairs per interpolator, but the coordinates are often related between fetches, allowing you to use PS math instead of many interpolants from the VS to generate the coordinates.
This is correct. Also note that the majority of texture fetches in applications have trilinear filtering applied, which increases the ratio of fetch operations to interpolations, so AF is not actually necessary to get benefits from the extra texture fetch power. On a lot of shaders full advantage will be taken of the extra texture fetch units even without AF applied, even in cases where an initial glance at the shader code might suggest that you are interpolation-limited.
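The trilinear point can be illustrated with a rough sketch: treating a trilinear fetch as two bilinear passes, a shader that looks interpolation-bound under bilinear filtering flips to being fetch-bound. The 32-interpolator/40-TU split and per-clock rates below are assumptions for illustration:

```python
INTERPOLATORS = 32   # vec4 interpolations per clock (assumed RV770-like figure)
TEX_UNITS     = 40   # bilinear fetch/filter ops per clock (assumed)

def limiting_rate(fetches, interps, bilinear_passes=1):
    """Which side needs more clocks per shader iteration:
    texture fetching (scaled by filtering cost) or interpolation."""
    tex_clocks    = fetches * bilinear_passes / TEX_UNITS
    interp_clocks = interps / INTERPOLATORS
    return "TEX" if tex_clocks > interp_clocks else "interp"

# 8 fetches, 8 vec4 interpolants: bilinear leaves this interpolation-bound,
# while trilinear (two bilinear passes per fetch) makes it fetch-bound.
print(limiting_rate(8, 8))                      # interp
print(limiting_rate(8, 8, bilinear_passes=2))   # TEX
```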
 
So it would be a combination of the TUs not being attached to the interpolators but scaling evenly with them anyway, coupled with the notion that two extra SIMDs were added, which is why the interpolators stayed at 32?
Certainly they're independent. As to the ratio Interpolator:TU it's harder to say. The primary driving force is the rasteriser, which in theory is only generating 16 fragments per clock.

Really we should be talking about Interpolator:Fragment ratio.

Is RV670 commonly interpolator-bound?...

It's worth noting that GPUSA provides an indicator for pixel shader code when it is interpolator-bound. Messing about with some silly shaders in GPUSA it appears that both RV670 and RV770 have 32 interpolators:

Code:
struct vertex 
{ 
   float4 colorA     : color0; 
   float4 colorB     : color1; 
   float4 colorC     : color2; 
   float4 colorD     : color3; 
};
float4 main(vertex IN) : COLOR 
{ 
 float4 A = IN.colorA.xyzw; 
 float4 B = IN.colorB.xyzw; 
 float4 C = IN.colorC.xyzw;
 float4 D = IN.colorD.xyzw;
 
 // 2 cycles of math are used to check that the GPU is not interpolator bound
 // Note that ALU:Fragment ratios vary across these GPUs
 
 // RV610 and RV630 - which are Interpolator bound? 
 //return A*A; // Both GPUs run at full speed here
 
 // 2 attributes makes both GPUs interpolation bound:
 //return A*B; // 1 instruction group 
 // RV610 needs 2 cycles
 //return A*B*B; // 2 instruction groups - 2 cycles on RV610
 
 // find the limit for RV630:
 //return 1/(A-B); // 5 instruction groups - interpolation bound
 //return A/(A-B); // 6 instruction groups - 2 cycles on RV630
 
 // RV670 and RV770
 // 4 attributes makes both GPUs interpolation bound:
 //return A*(A-B)*(A-B)*(C-D)/C; // 7 instruction groups
 
 // RV670 needs 2 cycles for 4 attributes, while RV770 is interpolation bound:
 //return A*B*(A-B)*(C-D)/C; // 8 instruction groups - 2 cycles on RV670
 
 // RV770 needs 2 cycles for 4 attributes:
 return A*A*B*B*C*D*exp(B)*exp(C)/exp(D); // 20 instruction groups - 2 cycles on RV770
}

RV610 and RV630 both appear to have 4 interpolators.

So the smaller GPUs are 1:1 Interpolator:Fragment while the bigger GPUs are 2:1.

Jawed
 
First of all, each batch takes 16 clocks to be serviced by the TMU quad in the SIMD, so it's 400 clocks of latency hiding with 10-register shaders.
I'm talking about keeping the ALUs busy. My point is that when the ALUs stall for a long period in a situation like this, the overall ALU:TEX ratio of the shader "hides" this kind of problem.

You can't average-out this kind of bottleneck unless you have substantially more batches in flight.

The Veined Marble (Procedural Stone) shader in Shadermark is a good example of this. The hardware's effective ALU:TEX ratio (>1) looks fine...

Jawed
 
Since R300 at least, ATI GPUs have had dedicated interpolators. Prolly all of them.

Since not all attributes are for texture coordinates, it wouldn't make sense to do interpolation in the TUs.

Jawed

Thanks - somehow I managed to totally overlook this. Maybe I was too busy trying to understand the NV-architectures in the past. :(
 
I'm talking about keeping the ALUs busy. My point is that when the ALUs stall for a long period in a situation like this, the overall ALU:TEX ratio of the shader "hides" this kind of problem.
Why do you need 100's of cycles of latency hiding to keep the ALU busy?

If you're saturating the TMUs, you can't avoid having the ALU idle, and vice versa. Give me some more details in your example. You mentioned 10 registers and 2-level dependency, but can you give me an instruction stream characterization like X ALU ops, Y dependent tex ops, Z ALU ops?

You can't average-out this kind of bottleneck unless you have substantially more batches in flight.
The odds of all 25 batches waiting for texture fetches simultaneously is very low if you aren't TEX throughput limited. There's almost always going to be something for the ALU to do if the overall ALU:TEX ratio is high enough not to be limited by texture throughput, which is the larger of hardware limit (e.g. 4/clk/SIMD for bilinear RGBA8) and latency limit (pixels in flight divided by latency per fetch).
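The "very low odds" claim can be put in numbers under a crude independence assumption - each batch modelled as a 50/50 coin flip between waiting on the ALU and waiting on the TEX unit at any instant (clearly unrealistic, but it shows the scale):

```python
# Probability that all N batches simultaneously need the TMU, assuming
# (crudely) independent 50/50 odds per batch at any instant.
def p_all_at_tmu(n_batches, p_tex=0.5):
    return p_tex ** n_batches

print(p_all_at_tmu(25))   # ~3.0e-08 - about 1 in 34 million
print(p_all_at_tmu(32))   # ~2.3e-10 - under one in a billion, as claimed
```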

I almost feel like I need to make a Flash animation of GPU scheduling to prove this to you.

The Veined Marble (Procedural Stone) shader in Shadermark is a good example of this. The hardware's effective ALU:TEX ratio (>1) looks fine...
Want to give me some details?
 
Moreover (and I admit that I don't fully grasp this concept) this means attribute interpolation is done neither in "the texture unit" itself nor in the shader core. Assuming that the SIMDs are functionally identical.
My theory is that ATI does attribute interpolation at the rasterization stage and feeds them into the register file when initializing each batch. The compiler overwrites the register holding the attribute when it feels that there's no need for it anymore. That r600isa document makes no mention of an interpolation instruction, which is obviously different from NVidia's approach, so I can't think of anything else.

This, along with the scalar setup, may be one of the reasons that NVidia can get away with smaller register files, but they have to keep vertex attribute information around for a lot longer and worry about feeding it into the SF units in the SPs. OTOH, ATI can discard it as soon as the batches are initialized, keeping the data flow neat and simple.
 
Why do you need 100's of cycles of latency hiding to keep the ALU busy?
:?: This is normal.

If you're saturating the TMUs, you can't avoid having the ALU idle, and vice versa.
While the TUs are fully occupied it's also possible to have the ALUs fully occupied - which is why GPUs have latency-hiding, i.e. excess threads. Clearly this can't be the case 100% of the time.

Give me some more details in your example. You mentioned 10 registers and 2-level dependency, but can you give me an instruction stream characterization like X ALU ops, Y dependent tex ops, Z ALU ops?
Those are broad generalisations based on the behaviour of some shaders I've been looking at - about one of which I will be posting a specific follow-up at some point...

The odds of all 25 batches waiting for texture fetches simultaneously is very low if you aren't TEX throughput limited. There's almost always going to be something for the ALU to do if the overall ALU:TEX ratio is high enough not to be limited by texture throughput, which is the larger of hardware limit (e.g. 4/clk/SIMD for bilinear RGBA8) and latency limit (pixels in flight divided by latency per fetch).
I agree with all that. My argument is with your earlier statement which entirely omitted "texture throughput":

The same thing goes for shaders with texture heavy parts and math heavy parts. The overall ratio is all that matters because there are enough batches in flight to statistically even this out.

With dependent texturing, 3D textures and dynamic branching, texturing latency can be severely increased. When this happens the shader's apparent ALU:TEX ratio becomes useless and the "shader-wide averaging" you allude to is no longer meaningful.

I almost feel like I need to make a Flash animation of GPU scheduling to prove this to you.

Want to give me some details?
Download Shadermark and load hlsl_stone.fx into GPUSA and have a play. Compare with published performance. See my discussion starting here:

http://forum.beyond3d.com/showthread.php?p=1164394#post1164394

which uses 8 vec4 fp32s under PS2.0 with hardware ALU:TEX of 1.69 on RV670.

Note if you compile for PS3.0 then the shader uses even more registers :???: Looking at the PS3.0 assembly some of it looks sloppy, e.g. R10 looks entirely superfluous.

As far as I can tell the poor performance of the shader (i.e. it is not ALU-limited as suggested by the ALU:TEX ratio) is down to the dependent fetches into the 3D texture.

I know you're going to retort "of course a TEX-limited shader is going to leave the ALUs idling". But I'm correcting your earlier statement about the overall ALU:TEX ratio of a shader, which is clearly not true in general.

The shader's overall ratio can easily lie due to short sections of unruly code which turn out to have a radically different ratio.

Jawed
 
That r600isa document makes no mention of an interpolation instruction, which is obviously different from NVidia's approach, so I can't think of anything else.
If you look at GPUSA output for R5xx and earlier GPUs you'll see explicit reference to RS (rasteriser) instructions, e.g. for the veined marble shader:

Code:
 RS Instructions:
 
   rs 00:                            r00.rgb- = txc00
   rs 01:                            r01.rgb- = txc01
   rs 02:                            r02.rgb- = txc02
   rs 03:                            r03.rgb- = txc03
Not very exciting, I know.

Not sure why GPUSA doesn't show this stuff for R600 and later.

Jawed
 
While the TUs are fully occupied it's also possible to have the ALUs fully occupied - which is why GPUs have latency-hiding, i.e. excess threads. Clearly this can't be the case 100% of the time.
Jawed, you're looking at this all wrong. A shader has a certain number of ALU instructions and TMU instructions. If the TUs are fully occupied when running a shader, the ALUs can only be fully occupied if it happens to have a very specific ALU:TEX ratio.

There are 4 possibilities about utilization:

A) TMU is saturated
B) ALU is saturated
C) Both are saturated
D) Neither are saturated

C) is extremely rare, as it can only happen for a small subset of shaders. If a GPU is running a shader at the maximum possible speed, then it will be either A or B.

Your job is to tell me why your example shader would fall into class D.

With dependent texturing, 3D textures and dynamic branching, texturing latency can be severely increased. When this happens the shader's apparent ALU:TEX ratio becomes useless and the "shader-wide averaging" you allude to is no longer meaningful.
I still think 25 batches is plenty to average this out, and the only factor here will be spatial coherence making most batches go one way or the other, possibly changing where the limit is. However, you will still be in just situation A or just situation B for spans of thousands of cycles.

Again, your job is to tell me why you'd be in situation D.

Download Shadermark and load hlsl_stone.fx into GPUSA and have a play. Compare with published performance. See my discussion starting here:

http://forum.beyond3d.com/showthread.php?p=1164394#post1164394
I'll take a look later. Does GPUSA take into account volume textures taking extra cycles? I know assembly language doesn't specify it. Does ShaderMark use trilinear filtering with the volume textures?

I know you're going to retort "of course a TEX-limited shader is going to leave the ALUs idling". But I'm correcting your earlier statement about the overall ALU:TEX ratio of a shader, which is clearly not true in general.
Well by "overall" I mean run-time average, of course. Filtering and branching will always affect what the effective ratio is. The point is that you only need the ALU or the TMU to be saturated for the GPU to be running a shader as fast as possible.
 
There are 4 possibilities about utilization:

A) TMU is saturated
B) ALU is saturated
C) Both are saturated
D) Neither are saturated

C) is extremely rare, as it can only happen for a small subset of shaders.
Wrong. The normal operation of these out-of-order GPUs is that there'll be significant periods when both ALUs and TUs are fully occupied. With a >1 ALU:TEX ratio in the shader for the hardware, the TUs will be fully occupied for significant periods. This is due to the hardware having a 4:1 ALU:TEX ratio and due to the fact that surface textures can be pre-fetched/filtered ahead of when they're required by the ALU code.

Your job is to tell me why your example shader would fall into class D.
It wouldn't, don't know why you're asking - TUs waiting for fetches from memory count as occupied.

I'm talking about a class B shader as defined by its overall ALU:TEX ratio which executes as a class A shader due to sustained worst-case TEX throughput that exhausts the available ALU batches.

The PS2.0 compilation of the veined marble shader, with only 8 registers, exhausts all 32 batches in flight.

I'll take a look later. Does GPUSA take into account volume textures taking extra cycles?
Apparently not. The anisotropic column suggests this shader has a 1.21 effective ALU:TEX (down from 1.69 bilinear).

If the texture operations are cache-thrashing dependent-2D, GPUSA certainly can't take account of extra cycles caused by repeated L2 misses.

I know assembly language doesn't specify it. Does ShaderMark use trilinear filtering with the volume textures?
I don't know how you'd be able to tell. Unless you mean by experimental determination, twiddling control panel settings.

Well by "overall" I mean run-time average, of course. Filtering and branching will always affect what the effective ratio is. The point is that you only need the ALU or the TMU to be saturated for the GPU to be running a shader as fast as possible.
Yes, and with more batches in flight, the percentage of time that ALUs and TUs are simultaneously saturated increases...

Jawed
 
My theory is that ATI does attribute interpolation at the rasterization stage and feeds them into the register file when initializing each batch. The compiler overwrites the register holding the attribute when it feels that there's no need for it anymore. That r600isa document makes no mention of an interpolation instruction, which is obviously different from NVidia's approach, so I can't think of anything else.

This, along with the scalar setup, may be one of the reasons that NVidia can get away with smaller register files, but they have to keep vertex attribute information around for a lot longer and worry about feeding it into the SF units in the SPs. OTOH, ATI can discard it as soon as the batches are initialized, keeping the data flow neat and simple.

I agree with your overall point but want to quibble with Nvidia having to keep vertex attribute information around "a lot longer". Unless you're talking about pretty small triangles, a single triangle is used for more than just one quad or batch of pixels, which may not run completely concurrently -- they may be serialized for larger triangles, or interleaved with pixel batches from other triangles, etc. The vertex attribute information has to be kept around until *all* of them have finished interpolating their attributes. Sure, you can throw them away earlier in the execution of the very last set of pixels from that triangle, but you're still going to keep it around for the entire execution of many (most?) of the pixels.
 
Vertex attribute information is going to be in the post-transform cache for both methods, but in ATI's case the rasterizer only has to access at most three of those vertices every clock cycle. Nothing downstream from the rasterizer needs that, and it processes triangles serially so data flow is exceedingly simple. You may have quads from different triangles in a batch, and a triangle can have quads in different batches, but the rasterizer will only be working on one triangle at a time and won't move to the next one until the current one is completely fed to the shader arrays.

With NVidia's method, each cluster that got quads from a triangle will have to store a copy of its vertex attributes, as I seriously doubt all the SMs have direct access to the post-transform cache. Each quad is going to have to keep pointers to the vertices so that the on-the-fly interpolation can be done.

It's not just how long, but how many have to be kept for that long. Again, I'm not talking about the post-transform cache. With ATI, the rasterizer only needs data from 3 verts at a time. With NVidia, each of the 16 SMs could need tens or even hundreds of vertices available to feed the SF/interpolator unit.
 
It wouldn't, don't know why you're asking - TUs waiting for fetches from memory count as occupied.
Well this is the source of confusion, then. When I say saturated, I mean it's matching its theoretical throughput (i.e. filtering as fast as it can for the given data types) for the duration of the workload.

I'm talking about a class B shader as defined by its overall ALU:TEX ratio which executes as a class A shader due to sustained worst-case TEX throughput that exhausts the available ALU batches.
A class B shader can't execute as a class A shader, or the chip would explode :D. If you have a shader with 20 ALU instruction groups and 4 bilinear TEX instructions, it's class A. If you saturated the tex units (i.e. if it executed as a class B shader), the SIMD would be processing 4 TEX instructions per clock, and thus 20 ALU groups per clock. The SIMD can only do 16.
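The arithmetic behind the "chip would explode" remark, as a sketch (16 ALU groups/clock and 4 TEX/clock per SIMD are the rates used in the thread):

```python
ALU_RATE = 16   # ALU instruction groups per clock per SIMD (from the thread)
TEX_RATE = 4    # bilinear TEX instructions per clock per SIMD (from the thread)

def required_alu_rate(alu_groups, tex_instrs):
    """ALU throughput the SIMD would need if the TEX units ran flat out."""
    clocks_for_tex = tex_instrs / TEX_RATE
    return alu_groups / clocks_for_tex

# 20 ALU groups + 4 TEX: saturating the TEX side would demand 20 ALU
# groups per clock, but the SIMD can only do 16 - so it's class A.
print(required_alu_rate(20, 4))   # 20.0, which exceeds ALU_RATE
```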

The PS2.0 compilation of the veined marble shader, with only 8 registers, exhausts all 32 batches in flight.
I don't agree. I think it's just a matter of GPUSA not realizing how long it takes to filter the fetches, and thus overestimating the texture throughput.

Just think about it: If there are no available ALU instructions, then one of three things happened:

A) This is a texture throughput limited shader. All the threads in the world couldn't help you, as you're limited by texture filtering and/or the TMU data bus.

B) You just got astronomically unlucky. In a theoretically ALU limited shader, at any given time a batch ready for the sequencer to dispatch has a greater chance of waiting for the ALU to service it than the TEX unit (I didn't quite describe that right, but hopefully you know what I mean). For 32 batches, we're looking at less than a one-in-a-billion chance of all needing to go to the TMU.

C) Latency. Maybe you got unlucky and 5 batches were just given to the TMU in the last few cycles. The other 27, though, have had the TMU make memory requests for them, but nothing has come back yet. That's 432 cycles, though. Not the 128 figure your math comes up with.

I really think an animation would help explain all this, but I don't know how to do it without taking up gobs of time. Maybe I can come up with a text-based simulation that's descriptive enough.
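A text-based version of that simulation is not hard to sketch. The model below is deliberately crude - one ALU slot and one TEX issue slot per clock, abstract units, an invented shader encoding, none of it from the thread - but it shows the claimed behaviour: with enough batches in flight the ALU stays saturated despite long memory latency, while a single batch leaves it almost idle.

```python
def simulate(n_batches, shader, mem_latency, clocks):
    """Toy scheduler: one ALU slot and one TEX issue slot per clock.
    shader is a list of ('ALU', groups) / ('TEX', _) ops, looped forever.
    A batch blocks for mem_latency clocks after issuing a TEX op."""
    pc    = [0] * n_batches   # next op per batch
    work  = [0] * n_batches   # remaining ALU groups for the current op
    ready = [0] * n_batches   # clock at which the batch's fetch returns
    alu_busy = 0
    for t in range(clocks):
        tex_issued = alu_done = False
        for b in range(n_batches):
            if ready[b] > t:
                continue                    # still waiting on memory
            if work[b] == 0:                # decode the batch's next op
                kind, amount = shader[pc[b] % len(shader)]
                pc[b] += 1
                if kind == 'TEX':
                    if tex_issued:          # TEX slot taken this clock
                        pc[b] -= 1          # retry next clock
                        continue
                    tex_issued = True
                    ready[b] = t + mem_latency
                    continue
                work[b] = amount
            if not alu_done and work[b] > 0:
                work[b] -= 1                # execute one ALU group
                alu_done = True
        alu_busy += alu_done
    return alu_busy / clocks

# 4 ALU groups per TEX, 100-clock memory latency:
shader = [('ALU', 4), ('TEX', 0)]
print(simulate(1,  shader, 100, 10000))   # ~0.04: one batch, ALU mostly idle
print(simulate(30, shader, 100, 10000))   # ~1.0: enough batches, ALU saturated
```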

I don't know how you'd be able to tell.
You'd probably be able to see the mipmap line.

Yes, and with more batches in flight, the percentage of time that ALUs and TUs are simultaneously saturated increases...
I don't think you understand what I mean by saturated. Saying "saturated for a greater percent of the time" is redundant, because saturated already means near-100% of the time.
 
Vertex attribute information is going to be in the post-transform cache for both methods, but in ATI's case the rasterizer only has to access at most three of those vertices every clock cycle. Nothing downstream from the rasterizer needs that, and it processes triangles serially so data flow is exceedingly simple. You may have quads from different triangles in a batch, and a triangle can have quads in different batches, but the rasterizer will only be working on one triangle at a time and won't move to the next one until the current one is completely fed to the shader arrays.
Your ideas seem pretty compelling - I've long had a bit of a gap in my understanding at this point in the pipeline.

With NVidia's method, each cluster that got quads from a triangle will have to store a copy of its vertex attributes, as seriously I doubt all the SMs all have direct access to the post-transform cache. Each quad is going to have to keep pointers to the vertices so that the on-the-fly interpolation can be done.
Hmm, maybe this is relevant:

http://forum.beyond3d.com/showthread.php?p=1165632#post1165632

:p

It's not just how long, but how many have to be kept for that long. Again, I'm not talking about the post-transform cache. With ATI, the rasterizer only needs data from 3 verts at a time. With NVidia, each of the 16 SMs could need tens or even hundreds of vertices available to feed the SF/interpolator unit.
Apparently G80 does screen-space tiled rasterisation, i.e. clusters/SMs own regions of screen space.

The curious thing is, if you are doing one-triangle-at-a-time rasterisation and you are doing screen-space tiled rasterisation, why can't you have multiple rasterisers running in parallel, one per screen-space tile with the triangles per tile serialised?

Jawed
 
Your ideas seem pretty compelling - I've long had a bit of a gap in my understanding at this point in the pipeline.
One thing to note is that ATI's implementation wouldn't allow the use of NVidia's patent (merged SFU/interpolator). If indeed ATI is saving space with their method, that patent may be a bit too clever for its own good.

Well there we go! I'm pretty sure there is a global cache as well, as you need it for the primitive assembler and to divide the pixels among the SMs. It's just used strictly for attribute storage, not accessed for interpolation. I'm sure you and others can see that this method results in attribute duplication between the SMs.

Apparently G80 does screen-space tiled rasterisation, i.e. clusters/SMs own regions of screen space.
I wonder what the advantage of that would be. It makes sense to do that for the ROPs, because then you can tie them directly to the memory channels without a crossbar.

The curious thing is, if you are doing one-triangle-at-a-time rasterisation and you are doing screen-space tiled rasterisation, why can't you have multiple rasterisers running in parallel, one per screen-space tile with the triangles per tile serialised?
I don't see why we can't. The only thing is that multiple rasterizers are more expensive than one double speed one, and it's not useful unless you have faster setup as well.
 
Well this is the source of confusion, then. When I say saturated, I mean it's matching its theoretical throughput (i.e. filtering as fast as it can for the given data types) for the duration of the workload.
Apart from the re-ordering of memory operations, a TU's addressing, fetching, caching and filtering form a pipeline for the purpose of delivering texture results to the cluster. The fact that the stages have variable latency doesn't mean the TU is not working as fast as it can.

A class B shader can't execute as a class A shader, or the chip would explode :D.
I'm simply saying that during compilation it appears to be one class but in execution it turns out to be the opposite.

Veined Marble is just a gross example of the way that instantaneous ALU:TEX ratio can radically differ from a shader's overall ALU:TEX, making estimates of performance tricky.

If you have shader with 20 ALU instruction groups and 4 bilinear TEX instructions, it's class A. If you saturated the tex units (i.e. if it executed as a class B shader), the SIMD would be processing 4 TEX instructions per clock, and thus 20 ALU groups per clock. The SIMD can only do 16.
But for 80% of the execution time, both ALUs and TUs are "fully utilised" - i.e. for the majority of execution time it's class C.

I don't agree. I think it's just a matter of GPUSA not realizing how long it takes to filter the fetches, and thus overestimating the texture throughput.
Well if it takes 2 clocks for the TU to produce each texture result then that halves the ALU:TEX ratio of the GPU, making it approximately 0.85 for Veined Marble on R600. Obviously it can't hide that latency.

Using the X1800XT numbers that Carsten posted here:

http://forum.beyond3d.com/showthread.php?p=1194350#post1194350

and seeing that X1800XT runs Car Surface and Veined Marble at almost exactly the same speed (implying that Veined Marble is, at least, very close to being ALU-limited), it's worth noting that GPUSA says that this shader is 6.75 ALU:TEX on X1800XT. Since the hardware is 1:1, that implies that these volume texture lookups are taking way more than 2 clocks on average!
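That inference can be made explicit with a sketch using the numbers from the post (6.75 apparent ALU:TEX from GPUSA, 1:1 hardware), treating the crossover as the per-lookup cost at which the shader stops being ALU-limited:

```python
APPARENT_RATIO = 6.75   # GPUSA's ALU:TEX for the shader on X1800XT (from the post)

def bottleneck(clocks_per_lookup):
    """On 1:1 ALU:TEX hardware, a lookup costing fewer clocks than the
    shader's apparent ALU:TEX ratio leaves the ALU as the limit."""
    return "ALU" if clocks_per_lookup <= APPARENT_RATIO else "TEX"

for cost in (1, 2, 4, 6.75, 8):
    print(cost, bottleneck(cost))
# The crossover sits at ~6.75 clocks/lookup: a shader that is only just
# ALU-limited implies average lookup costs far above 2 clocks.
```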

Just think about it: If there are no available ALU instructions, then one of three things happened:

A) This is a texture throughput limited shader. All the threads in the world couldn't help you, as you're limited by texture filtering and/or the TMU data bus.

B) You just got astronomically unlucky. In a theoretically ALU limited shader, at any given time a batch ready for the sequencer to dispatch has a greater chance of waiting for the ALU to service it than the TEX unit (I didn't quite describe that right, but hopefully you know what I mean). For 32 batches, we're looking at less than a one in billion chance of all needing to go to the TMU.

C) Latency. Maybe you got unlucky and 5 batches were just given to the TMU in the last few cycles. The other 27, though, have had the TMU make memory requests for but nothing has come back yet. That's 432 cycles, though. Not the 128 figure your math comes up with.
It seems it's C if the X1800XT data is meaningful (since that's so heavily ALU-limited) - though it is a different architecture, with no L2 and with a less effective memory system...

Jawed
 
It's not just how long, but how many have to be kept for that long. Again, I'm not talking about the post-transform cache. With ATI, the rasterizer only needs data from 3 verts at a time. With NVidia, each of the 16 SMs could need tens or even hundreds of vertices available to feed the SF/interpolator unit.

I've been wondering if this could account for a significant amount of the efficiency per mm2 discrepancy between ATI and NV. Going forward, it seems undesirable to have interpolator hw that scales with compute power rather than texturing rate.

Tying the interpolation rate to the number of TA units in each cluster (regardless of whether interpolation is done just in time or up front) seems like the way to go - am I missing something? The fact that NV sometimes makes multipliers otherwise used for interpolation available for general shading would seem to support that they actually have too much interpolation capacity.

So I guess my question is, would it be beneficial to have dedicated interpolation HW per cluster? This would basically involve dropping the 8 "sometimes missing" multipliers per SM, and some scheduler (and possibly register file) simplifications.
 
Apart from the re-ordering of memory operations, a TU's addressing, fetching, caching and filtering form a pipeline for the purpose of delivering texture results to the cluster. The fact that the stages have variable latency doesn't mean the TU is not working as fast as it can.
You can make that excuse for any unit that isn't achieving its peak throughput. That's what this is about - throughput.

I'm simply saying that during compilation it appears to be one class but in execution it turns out to be the opposite.

Veined Marble is just a gross example of the way that instantaneous ALU:TEX ratio can radically differ from a shader's overall ALU:TEX, making estimates of performance tricky.
I fully agree that compilation and execution can be different. My point lies elsewhere.

Take shader A, which has clustered texture access, and create a new shader B that has the same number/type of instructions (including the same sampling characteristics) but with a more even distribution. My claim is that the execution time of A will not differ from the execution time of B. Local ALU:TEX ratio is inconsequential.

But for 80% of the execution time, both ALUs and TUs are "fully utilised" - i.e. for the majority of execution time it's class C.
You're bastardizing my system of classification. My entire point is that you can't be one class part of the time and another class another part of the time, unless you're talking about gradual changes over thousands of cycles due to spatial workload variation (e.g. transitioning from magnification to minification).

In this case, it's class A 100% of the time. A shader is only class C if it is class A and B all the time. In retrospect I never should have included it, and instead should have called it class A-B or something.

Well if it takes 2 clocks for the TU to produce each texture result then that halves the ALU:TEX ratio of the GPU, making it approximately 0.85 for Veined Marble on R600. Obviously it can't hide that latency.
Exactly, but it's not a latency problem. It's a throughput problem. Even if you had more threads, the shader would still run at the same speed.

Just out of curiosity, why do you call it 0.85 instead of 3.4?

and seeing that X1800XT runs Car Surface and Veined Marble at almost exactly the same speed (implying that Veined Marble is, at least, very close to being ALU-limited), it's worth noting that GPUSA says that this shader is 6.75 ALU:TEX on X1800XT. Since the hardware is 1:1, that implies that these volume texture lookups are taking way more than 2 clocks on average!
What makes you say Veined Marble is merely "close" to being ALU limited on the X1800XT? Judging by the X1900XT's notable boost, it definitely is (although texture limited on the X1900XT).

But yes, it does look like 4+ clocks per lookup. Trilinear filtering would take 4 clocks, FYI.

It seems it's C if the X1800XT data is meaningful (since that's so heavily ALU-limited) - though it is a different architecture, with no L2 and with a less effective memory system...
I still think it's A, because volume textures take a while to filter. R600 is actually a hair slower per clock than R580 in this test, and I would think that it has more register space.

The G80 results you found suggest the same thing. Even if we assume that it can compile to needing only 16 FP32 registers due to scalar architecture, it can only have 4092 pixels in flight, which is half of what R600 can hold. It gets twice the framerate, though, meaning it has 1/4 the time to get the fetches from memory. I'm very sure that this shader is filtering throughput limited.

Anyway, look closer at C. Do you have any objection to the 432 cycle figure? That's how long a batch has been waiting for data. That's how long memory latency has to be for the 27 batches to be insufficient when situation A doesn't apply, i.e. the shader is ALU throughput limited. I assumed the bare minimum ALU instructions, too, as more instructions would mean more latency hiding.
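The two latency figures being contrasted in C) follow directly from the batch counts; a sketch assuming the 16-clock TMU service time per batch from earlier in the thread:

```python
BATCH_SERVICE = 16   # clocks for the TMU quad to service one batch (assumed above)

def latency_covered(waiting_batches):
    """Memory latency hidden while the TMU works through the queued batches."""
    return waiting_batches * BATCH_SERVICE

print(latency_covered(27))   # 432 - the figure for the 27 queued batches
print(32 * 4)                # 128 - the earlier 32-batch x 4-clock estimate
```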
 