I was referring to shaders with a mixture of texturing types. So you might have 6 bilinear fetches + a volume texture (and for good measure you might put the volume texture fetch inside a 3-instruction dynamic loop).
I'll send you my little program to play with before I release it, since it does all this stuff (branching isn't there yet, though). As suggested by Shebanow's presentation, the single bilinear texture fetch groups are the toughest to hide. Other texture groups that take longer (due to more fetches or longer filter ops) are easier.
To be more precise, cache misses don't actually mean that TEX throughput is memory bandwidth constrained.
I wasn't saying that. I meant cache misses only affect performance when you are BW limited, assuming latency can be hidden. For a low-BW shader, a good cache may only use 20% of the BW, and a bad one will use 40%. Both will have the same performance, though.
Though it appears to be common with render-target post-processing shaders, e.g. HDR tone-mapping with fp16 texels.
This only makes a difference for short shaders, because when you look at an area of pixels the data needed is only 8 bytes per pixel (plus another 4 for the ROP). For example, a 64x128 pixel area needs FP16 texels from an 80x144 area.
While you can indeed be BW limited in such cases, it doesn't satisfy the other two conditions I mentioned, especially the last. 8,192 pixels in flight (8-reg shader) won't have a texel footprint of even half of R600's texture cache.
A 2048x2048 matrix multiply, however, is very different. A 64x128 section of the final matrix has a texel footprint of 1.5 MB. Good data management is very important.
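To put numbers on that contrast, here's a quick back-of-the-envelope check in Python (the 256 KB figure for R600's texture cache is my assumption for comparison, not something established in this thread):

```python
KB = 1024

# Post-processing case: a 64x128 output tile reads fp16 RGBA texels
# (8 bytes each) from an 80x144 source region.
pp_bytes = 80 * 144 * 8
print(pp_bytes / KB)            # ~90 KB -- under half of an assumed 256 KB cache

# Matrix multiply case: a 64x128 tile of a 2048x2048 result needs
# 64 rows of A plus 128 columns of B, at 4 bytes per fp32 element.
mm_bytes = (64 * 2048 + 2048 * 128) * 4
print(mm_bytes / (KB * KB))     # 1.5 MB
```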
I presume you're referring to macro Peano or Hilbert curves, e.g. where each cell is 16x16 elements in size.
Not quite, because those curves aren't that great for matrix multiply. You'd basically want to use strips of a certain width (or height), but scan convert them in the short direction first. I'm sure plenty of people know how to do this, but the GPGPU pdf you linked to is more about compilation strategy for a given single-quad workload.
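A minimal sketch of the traversal I mean, with illustrative tile counts and strip width (nothing here comes from the pdf):

```python
def strip_order(tiles_x, tiles_y, strip_w):
    """Visit output tiles in vertical strips strip_w tiles wide,
    scan-converting each strip in the short (x) direction first, so
    the strip's slice of one input matrix stays cache-resident."""
    for x0 in range(0, tiles_x, strip_w):
        for y in range(tiles_y):
            for x in range(x0, min(x0 + strip_w, tiles_x)):
                yield x, y

# e.g. a 2048x2048 result as 128x128 tiles of 16x16 elements, strips 4 tiles wide
order = list(strip_order(128, 128, 4))
```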
And all I'm saying is that if you want to model shader throughput then you have to model all the resource constraints. Without modelling variations in latency due to cache-access patterns, the naive ALU:TEX model falls short.
Well, I already have min and max parameters to randomly determine latency. Regardless of shortcomings in this cache model or any other, the point is that I can look at worst-case analysis - i.e. 0% hit rate - for various instruction arrangements and see that not only is performance unaffected, but I can get peak throughput quite easily (ALU and/or TF with high utilization) and can hide pretty close to the predicted latency.
It's actually not that tough to model analytically. Take N = lambda * L / G, but do a sort of shader average for (lambda / G), i.e. sum the cycles needed to filter every fetch and divide by the total number of groups. Combining with ALU throughput, you get:
Latency hiding ability = # batches * max{total ALU cycles/batch, total TF cycles/batch} / # tex groups
It's easy to estimate this from the input shader, particularly worst case. Now, it's not too accurate because...
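For what it's worth, the estimate is easy to compute mechanically; a rough sketch (the function name and example cycle costs are mine, just for illustration):

```python
def latency_hiding(num_batches, alu_cycles_per_batch, tf_cycles_per_group):
    """Latency hiding ability =
    # batches * max{total ALU cycles/batch, total TF cycles/batch} / # tex groups.
    tf_cycles_per_group holds the filtering cost of each texture group,
    e.g. 1 for a plain bilinear fetch, more for volume or heavier filters."""
    tf_total = sum(tf_cycles_per_group)
    return num_batches * max(alu_cycles_per_batch, tf_total) / len(tf_cycles_per_group)

# e.g. 8 batches, 24 ALU cycles per batch, six bilinear groups plus one
# 8-cycle volume fetch: 8 * max(24, 14) / 7 ~= 27 cycles of hideable latency
print(latency_hiding(8, 24, [1] * 6 + [8]))
```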
Listen to what he says starting at the 52-minute mark - the audience is arguing your point, but he says it's not that simple. He's quite vague, but what he's saying is that you want to maintain ALU:TEX in smaller and smaller windows of instructions - if you don't, then the shader's overall throughput (effectively its ALU:TEX) will be affected.
He is basically arguing my point B from above, i.e. sometimes you just get unlucky (and you also lose cycles for this reason for the first set of batches, obviously, hence the transient part). He just didn't get into the details of how often it happens.
This is where the scheduler comes in. I put a naive algorithm in my simulation, prioritizing untouched batches first and then going by batch age. At first I only had the latter, and ran into a sort of "batch aliasing" that caused me to be unlucky more often (well, it's not really luck...). I'm sure more sophisticated algorithms can do better, but I still get near-perfect utilization for usual latency figures and 80-95% when reaching the limit predicted by the above equation. After that it's a linear dropoff as expected, i.e. utilization = latency hiding / actual latency.
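The naive policy is just a two-key priority; a sketch of the arbitration step (the batch bookkeeping fields are my own naming, not from the actual program):

```python
def pick_next_batch(ready):
    """Arbitration as described above: untouched batches win outright;
    among the rest, the oldest goes first. Scheduling on age alone is
    what produced the "batch aliasing" mentioned above."""
    return min(ready, key=lambda b: (b["touched"], -b["age"]))

# ready = [{"touched": True, "age": 40}, {"touched": False, "age": 3}, ...]
```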
I might play around with other ideas to help stagger the batches more randomly.
Agreed on those matters about G, NVidia vs ATI, and compilation. It should be noted that it's not hard for a compiler to group texture fetches.
Despite our earlier conclusion that it's filtering-throughput limited, this increase in performance implies that something other than filtering throughput is playing a part.
I think doubled L1 in RV770 is a potentially big factor.
Well, we already knew that something else matters owing to the per-clock comparison with RV770.
An L1 factor would show BW dependence, but there is none. I think the only way it wouldn't show BW dependence is if there were insufficient BW between the L1 and L2, which would be quite odd.
Remember, latency really shouldn't be a factor. Assuming volume textures, I'm finding 1000+ cycles of latency are easily absorbed by this shader. The longer it takes to filter, the more latency can be hidden.