Upcoming ATI Radeon GPUs (45/40nm)

Going forward, it seems undesirable to have interpolator hw that scales with compute power rather than texturing rate.

Well, it doesn't make much sense to scale it with texturing rate either. Being interpolator limited is a rather rare event. It could likely be kept constant for several generations to come without causing much of a problem. The number of inputs shaders require isn't scaling anywhere near the rate at which shaders are expanding in ALU and TEX work. A typical SM2.0 shader might have used 4-6 interpolators and perhaps 30 instructions. A typical SM3.0 shader could be 6-8 interpolators and 150 instructions. And with SM4.0 we might see 6-10 interpolators and 500 instructions.
 
It probably makes sense to scale it with fillrate, though, since you only get interpolator limited with simple pixels (and only occasionally at that). Incidentally, that's exactly what ATI has done by doing it all at the rasterizer stage. The register cost is ALU related, though.

At first it may seem like Nvidia isn't really scaling it with ALU count either, because the same hardware is used for special functions. However, overall costs are scaling with cluster count due to the storage and routing overhead I mentioned earlier.
 
Take shader A, which has clustered texture access, and create a new shader B that has the same number/type of instructions (including the same sampling characteristics) but with a more even distribution. My claim is that the execution time of A will not differ from the execution time of B. Local ALU:TEX ratio is inconsequential.
This is only true if you have "infinite" cache.

In the real world, the differences in clustering of TEX accesses and the load-balancing versus batch-ordering scenarios will produce different cache access patterns.

The result is differences in effective ALU:TEX ratio.

I'm sure I didn't download this presentation from IBM last week, but I can't find it on AMD's site right now :oops:

http://www.research.ibm.com/people/h/hind/pldi08-tutorial_files/GPGPU.pdf

so get it while you can. (My local copy of this presentation is called PLDI08Tutorial.pdf )

Starting on page 39 is a tutorial on maximising the performance of matrix multiplication. Skip past the Brook+ implementation to page 48 where the IL versions start.

The initial version of the code, for a matrix size of 2048x2048, starts at:
  • ~122GFLOPs
  • 97.486% cache hit
  • 49:12 ALU:TEX - 1.02:1 hardware
  • 26 registers
  • 85.31% ALU utilisation
Version 2
  • ~156GFLOPs
  • 97.688% cache hit
  • 135:48 ALU:TEX - 0.7:1 hardware
  • 27 registers
  • 87.85% ALU utilisation
Version 3
  • ~162GFLOPs
  • 97.981% cache hit
  • 136:48 ALU:TEX - 0.71:1 hardware
  • 28 registers
  • 87.35% ALU utilisation
Version 4
  • 180GFLOPs
  • 98.264% cache hit
  • 136:48 ALU:TEX - 0.71:1 hardware
  • 32 registers
  • 87.35% ALU utilisation
Version 5
  • ~218GFLOPs
  • 98.999% cache hit
  • 136:48 ALU:TEX - 0.71:1 hardware
  • 38 registers
  • 87.35% ALU utilisation
Exactly, but it's not a latency problem. It's a throughput problem. Even if you had more threads, the shader would still run at the same speed.
Yes, you're right.

Just out of curiosity, why do you call it 0.85 instead of 3.4?
Instead of 4 ALU instructions per TEX instruction, with a 2-cycle TEX result the hardware has a halved ALU:TEX ratio. The shader shifts from being ALU-bound (hardware: 1.69:1) to TEX bound 0.85:1.

What makes you say Veined Marble is merely "close" to being ALU limited on the X1800XT? Judging by the X1900XT's notable boost, it definitely is (although texture limited on the X1900XT).
According to GPUSA, on X1800XT Veined Marble should be faster than Car Surface, but it's slower. Veined Marble should be about 13% faster than it is, i.e. ~176fps at 2560x1600...

So it looks like X1950XTX is less TEX bound - which would be down to cache-ordering/cache-hit. Apart from X1950XTX's 3:1 ALU:TEX, don't forget the large bump in bandwidth (33%), too.

I still think it's A, because volume textures take a while to filter. R600 is actually a hair slower per clock than R580 in this test, and I would think that it has more register space.
Hmm at 2560x1600 Carsten's results put R600 (742MHz) as 2.8% faster than X1950XTX (650MHz) - 1% at the lower resolutions.

The G80 results you found suggest the same thing. Even if we assume that it can compile to needing only 16 FP32 registers due to scalar architecture, it can only have 4096 pixels in flight, which is half of what R600 can hold. It gets twice the framerate, though, meaning it has 1/4 the time to get the fetches from memory. I'm very sure that this shader is filtering throughput limited.
I hypothesised that G80 is bandwidth limited:

The raw rates for G80 appear to be ~1675fps for GTX and ~1200fps for GTS (I hate picking numbers out of graphs like this, but what can I do?), which means GTS is about 72% of GTX performance. ALU rate (~65%) and texture rate (67%) when comparing these cards seem to indicate that these tests are bandwidth limited, as bandwidth for GTS is ~74% of GTX's.

Anyway, look closer at C. Do you have any objection to the 432 cycle figure? That's how long a batch has been waiting for data. That's how long memory latency has to be for the 27 batches to be insufficient when situation A doesn't apply, i.e. the shader is ALU throughput limited. I assumed the bare minimum ALU instructions, too, as more instructions would mean more latency hiding.
I got lost when you introduced 5 batches, I'm afraid.

To be strict according to the numbers GPUSA is showing, 1.69 ALU:TEX * 32 batches * 4 ALU clocks = 216 ALU clocks limit, on average, for all the batches' TEX instructions to complete.

I didn't realise that this shader's volume texture filter operations would take longer than bilinear, so was assuming this is purely down to memory-latency (cache-thrashing). The fact that all the L2 in R600 basically doesn't help at all (not to mention the huge increase in bandwidth) would appear to indicate memory-latency isn't a factor.

Jawed
 
This is only true if you have "infinite" cache.

In the real world, the differences in clustering of TEX accesses and the load-balancing versus batch-ordering scenarios will produce different cache access patterns.
I can see random differences but I don't see why clustered accesses are worse off. The whole point of small batch size is that it allows

The result is differences in effective ALU:TEX ratio.
For the last time, ALU:TEX is about throughput. Cache makes no difference there unless it causes BW limitations.

Starting on page 39 is a tutorial on maximising the performance of matrix multiplication. Skip past the Brook+ implementation to page 48 where the IL versions start.
27-38 register GPGPU programs with huge data requirements (we're talking about one 4xFP32 fetch after another) are a little different from 8 register pixel shaders.

Instead of 4 ALU instructions per TEX instruction, with a 2-cycle TEX result the hardware has a halved ALU:TEX ratio. The shader shifts from being ALU-bound (hardware: 1.69:1) to TEX bound 0.85:1.
I know about the filtering, but I'm wondering why a shader that most people call 6:1 is called 1.5:1 by you. It's like you consider 4 ALU instruction groups (each having up to 5 scalar instructions) just one single ALU instruction.

According to GPUSA, on X1800XT Veined Marble should be faster than Car Surface, but it's slower.
Forget about what GPUSA says. We are already quite sure that it doesn't distinguish between bilinear and volume fetches in the shader.

I hypothesised that G80 is bandwidth limited:
First of all, you're splitting hairs with your rationale there. The framerates are so high that any little thing could be holding back the GTX. If it took just 50 microseconds less time, it would be back to the 50% lead over the GTS that you expect.

Secondly, it doesn't matter if BW is a limit. R600 has 4 times as long as the GTX between a memory request and having a filtered result for any pixel, so it can't be latency limited.

I got lost when you introduced 5 batches, I'm afraid.
You can ignore those 5, then. I just mixed in a less ridiculous case of B into C.

To be strict according to the numbers GPUSA is showing, 1.69 ALU:TEX * 32 batches * 4 ALU clocks = 216 ALU clocks limit, on average, for all the batches' TEX instructions to complete.
Here's that factor of 4 again. 27 instruction groups and 4 tex instructions means ALU:TEX is, as most would define it, 6.75:1. You can't divide this ratio by 4 and say it only takes 4 ALU clocks per batch. If you're going to divide it by 4, then it's 16 ALU clocks per batch because now you're talking about executing 4 ALU instruction groups instead of 1 on the batch.
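To make the accounting concrete, here's a minimal sketch in Python of the two ways of counting, using the Veined Marble figures from this thread and assuming RV670-style throughputs (an ALU instruction group retires on a 64-pixel batch every 4 clocks; a TEX instruction every 16 clocks):

```python
# Veined Marble figures as quoted in this thread.
alu_groups = 27   # ALU instruction groups in the shader
tex_instrs = 4    # TEX instructions in the shader

# Assumed RV670-style per-batch costs (64-pixel batches):
alu_clocks = alu_groups * 4    # 108 clocks of ALU work per batch
tex_clocks = tex_instrs * 16   # 64 clocks of TEX work per batch

shader_ratio = alu_groups / tex_instrs   # 6.75:1, as most people would quote it
hw_ratio = alu_clocks / tex_clocks       # ~1.69:1, normalised for 4:1 hardware

print(f"shader ALU:TEX {shader_ratio:.2f}:1 -> hardware {hw_ratio:.2f}:1")
# hw_ratio > 1 means each batch has more ALU clocks (108) than TEX clocks (64),
# i.e. nominally ALU-bound. Dividing the ratio by 4 pairs with the 16-clock
# (4-group) unit per batch, not 4 clocks - the double-counting noted above.
```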
 
I can see random differences but I don't see why clustered accesses are worse off.
The ordering of texture fetches affects cache eviction/thrashing.

http://forum.beyond3d.com/showthread.php?t=42852

NVidia's architecture is looser in the way that texture fetches are scheduled - not clumped like ATI's. This appears to allow it to use fewer batches to hide texturing latency, because batch scheduling is more finely grained. In general, ALU instructions are not forced to wait for TEX results that they aren't even directly dependent upon, whereas this is often the case in ATI.

Again, this finer granularity in NVidia's architecture is an instance of relaxed instantaneous ALU:TEX ratio. NVidia's architecture has shorter TEX->ALU dependency chains, which generally reduces the chances of ALUs stalling.

The whole point of small batch size is that it allows
:?:

For the last time, ALU:TEX is about throughput. Cache makes no difference there unless it causes BW limitations.
Cache misses clearly affect bandwidth and latency. Variations in the structure of the fetches in a shader (which are variations in instantaneous ALU:TEX), even when all versions of the shader have constant ALU:TEX, produce differing proportions of cache misses.

The pictures in that presentation cannot be more explicit. Did you look at the presentation?

(One thing I haven't done is to check what the ALU:TEX of the loop is. I'm not sure if AMD is reporting the shader as a whole or the loop.)

27-38 register GPGPU programs with huge data requirements (we're talking about one 4xFP32 fetch after another) are a little different from 8 register pixel shaders.
Why? Throughput is throughput - the same rules to evaluate throughput apply to both. In both cases each TEX fetch is slower than bilinear in its rate.

I know about the filtering, but I'm wondering why a shader that most people call 6:1 is called 1.5:1 by you. It's like you consider 4 ALU instruction groups (each having up to 5 scalar instructions) just one single ALU instruction.
When I refer to the effective hardware ALU:TEX I'm referring to the per-clock throughput of the hardware, which is 4 instruction groups per single TEX instruction in the case of RV670.

GPUSA makes this easy: the column labelled ALU:TEX(Bi) takes account of the hardware's throughput. >1 and you're theoretically ALU-bound.

So, nominally per fragment, Veined Marble spends 6.75 ALU clocks and 4 TEX clocks, i.e. the hardware's effective ALU:TEX for this shader is 1.69:1. (Of course we now know that the bilinear throughput is useless.)

First of all, you're splitting hairs with your rationale there. The framerates are so high that any little thing could be holding back the GTX. If it took just 50 microseconds less time, it would be back to the 50% lead over the GTS that you expect.
Well, the shader is very consistent in its execution and GTS is faster per TEX clock than GTX.

Secondly, it doesn't matter if BW is a limit. R600 has 4 times as long as the GTX between a memory request and having a filtered result for any pixel, so it can't be latency limited.
I don't understand how you get "4 times as long" here.

Jawed
 
The ordering of texture fetches affects cache eviction/thrashing.

http://forum.beyond3d.com/showthread.php?t=42852
That thread is one where I don't agree with a lot of your assertions. For example:
That means there are 4 ALUs that can run in parallel with the 3 TEXs. That's a 1.33:1 ALU:TEX ratio. But R580 can run 3 ALUs per clock, so the effective ratio is actually 0.44:1.
This is just a plain misunderstanding of how scheduling works. Local ALU:TEX ratio doesn't matter overall. It's just a transient problem for the first batches entering the shader unit.

I think I will make that little GPU scheduler simulator for you. It'll be fun.

NVidia's architecture is looser in the way that texture fetches are scheduled - not clumped like ATI's. This appears to allow it to use fewer batches to hide texturing latency, because batch scheduling is more finely grained. In general, ALU instructions are not forced to wait for TEX results that they aren't even directly dependent upon, whereas this is often the case in ATI.
I see the difference, but the implications are not as severe as you think. They're both equivalent in terms of worst case (shader-wide semaphore), but NVidia has it easier with the average case if there are enough ALU instructions.

My bad. I meant to come back to finish that point, but I can't remember my train of thought from yesterday.

Why? Throughput is throughput - the same rules to evaluate throughput apply to both. In both cases each TEX fetch is slower than bilinear in its rate.
I know throughput is the same, but between the matrix multiply shaders and the Veined Marble shader there are massive differences in batch count (~6-9 vs 32) and pressure on the cache. Qualitatively your arguments matter for these pathological cases, but not for regular graphics stuff. By the way, did you notice that the later versions have a bigger clump of texture accesses than the earlier ones?

When I refer to the effective hardware ALU:TEX I'm referring to the per-clock throughput of the hardware, which is 4 instruction groups per single TEX instruction in the case of RV670.
Right, 4 instruction groups. Why does a SIMD only take 4 clocks per batch, then? It takes 16 clocks to do 4 instruction groups on a batch just like it takes 16 clocks to do a single TEX instruction per batch. You're double-counting the 4:1 hardware ratio in your math later on, as I mentioned in my previous post (and you unfortunately chose not to address that major point).

Well, the shader is very consistent in its execution and GTS is faster per TEX clock than GTX.

I don't understand how you get "4 times as long" here.
Okay, forget about whether you're right about BW. Remember the good old fundamental equation:

time in flight (i.e. latency hiding ability) = threads in flight / throughput

Even if we assume G80 can get away with half the register space, it only has half as many threads in flight as R600, yet has twice the throughput for this shader. This is my point about why R600 can't be latency limited in this shader.
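As a toy illustration of that equation - using the corrected thread counts from later in this exchange (8192 in flight on both parts), and normalising R600's throughput to 1 pixel per clock, so the numbers are illustrative rather than measured:

```python
def time_in_flight(threads_in_flight, throughput):
    # Little's law: average clocks a thread spends in flight, which is an
    # upper bound on the memory latency the GPU can hide for free.
    return threads_in_flight / throughput

r600 = time_in_flight(8192, 1.0)   # 8192 pixels, throughput normalised to 1
g80  = time_in_flight(8192, 2.0)   # same pixel count, twice the throughput

print(f"R600: {r600:.0f} clocks in flight, G80: {g80:.0f} clocks")
# G80 has half the time per thread yet isn't latency limited here, so R600,
# with double the time, can't be latency limited on this shader either.
```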
 
I think I will make that little GPU scheduler simulator for you. It'll be fun.
You're arguing against Mike Shebanow: Lecture 12:

http://courses.ece.uiuc.edu/ece498/al1/Archive/Spring2007/Syllabus.html

The audio is best.

I know throughput is the same, but between the matrix multiply shaders and the Veined Marble shader there are massive differences in batch count (~6-9 vs 32) and pressure on the cache. Qualitatively your arguments matter for these pathological cases, but not for regular graphics stuff.
Both shaders have all TEX fetches that are significantly slower than bilinear, so yes they're pathological in that sense. But with both of them being TEX throughput limited, performance only varies with cache behaviour - and that depends on the order that batches make their TEX requests (which is the whole reason I linked that AMD presentation).

A 1.5% difference in cache hit ratio is the difference between ~122GFLOPs and ~218GFLOPs :oops:

The number of batches, 6 or 32, doesn't alter the fact these shaders are TEX limited. Just like a graphics shader with 4 instruction groups and 2 bilinear TEX instructions.

In general some TEX requests in a shader will be throughput-limited and others won't. The extent of TEX throughput limitations will determine the shift of performance away from that indicated by the shader's headline ALU:TEX.

Obviously when a dev is doing GPGPU programming they'll be hyper sensitive to performance - they're optimising for a limited variety of GPUs (or even just the one), so they can afford to be that sensitive.

By the way, did you notice that the later versions have a bigger clump of texture accesses than the earlier ones?
Yeah, not particularly obvious stuff.

The loop itself is 113 ALU instruction groups and 48 TEX instructions, so 0.59 effective hardware ALU:TEX on RV670.

Right, 4 instruction groups. Why does a SIMD only take 4 clocks per batch, then? It takes 16 clocks to do 4 instruction groups on a batch just like it takes 16 clocks to do a single TEX instruction per batch. You're double-counting the 4:1 hardware ratio in your math later on, as I mentioned in my previous post (and you unfortunately chose not to address that major point).
Yes you're right, I should have been multiplying 1.69 * 32 batches * 16 clocks = 865 total clocks for all the TEX results.

Okay, forget about whether you're right about BW. Remember the good old fundamental equation:

time in flight (i.e. latency hiding ability) = threads in flight / throughput

Even if we assume G80 can get away with half the register space, it only has half as many threads in flight as R600, yet has twice the throughput for this shader. This is my point about why R600 can't be latency limited in this shader.
G80 has a maximum of 24 batches per SIMD in flight versus 128 per SIMD in R600. G80 has 32KB of register file per SIMD whereas R600 has 256KB. It's all quite stark :!:

G80's throughput in this shader is in some way TEX throughput bound, even if it has 2x/4x the per-clock TEX throughput of R600 - though admittedly I don't know the relative rates for volume textures in both architectures...

Anyway, I agree, R600 isn't memory-latency bound.

At least RV770 is faster :p

http://www.computerbase.de/artikel/...50_rv770/3/#abschnitt_theoretische_benchmarks

Also, RV670 at 1920x1200 matches the performance of R600 in Carsten's results.

Jawed
 
A 1.5% difference in cache hit ratio is the difference between ~122GFLOPs and ~218GFLOPs :oops:
Yes, because matrix multiplies need gobs of FP32 data accesses. For the most part it's a BW test, so you have to make use of the available BW as efficiently as you can.

Texture BW used is proportional to cache miss rate, so look at that instead. That's why AMD's GPU PerfStudio reports cache miss rate as bytes per pixel.
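Something like this, as a rough sketch - the fetch parameters below are placeholders of mine, not anything PerfStudio actually reports:

```python
def texture_bytes_per_pixel(fetches_per_pixel, texels_per_fetch,
                            bytes_per_texel, miss_rate):
    # Only missed texels cost external bandwidth; hits come from the cache,
    # so BW consumed per pixel scales directly with the miss rate.
    return fetches_per_pixel * texels_per_fetch * bytes_per_texel * miss_rate

# e.g. 4 bilinear int8 fetches (4 texels each, 4 bytes/texel) at 2.5% misses:
print(texture_bytes_per_pixel(4, 4, 4, 0.025), "bytes/pixel")  # 1.6
```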

Yes you're right, I should have been multiplying 1.69 * 32 batches * 16 clocks = 865 total clocks for all the TEX results.
Well there we go. That's all I've been trying to say all this time.

You still think 25 batches (for 10 reg shaders) is insufficient? They give you a minimum of 400 cycles of latency hiding, not 100. Recall that this is where the whole discussion started:
http://forum.beyond3d.com/showpost.php?p=1193047&postcount=167

Now I just have to show you that clustering isn't an issue...

G80 has a maximum of 24 batches per SIMD in flight versus 128 per SIMD in R600. G80 has 32KB of register file per SIMD whereas R600 has 256KB. It's all quite stark :!:
I just realized I made a mistake - G80 has the same number of threads in flight (8192), not half, if we assume 16 FP32 registers for the shader on G80.

Nonetheless, time in flight is still half that of R600 so my point stands about memory latency unlikely to be an issue. You agree with me on that point, though, so I think the case is closed for this shader. It's the texture filtering.
 
Well there we go. That's all I've been trying to say all this time.

You still think 25 batches (for 10 reg shaders) is insufficient? They give you a minimum of 400 cycles of latency hiding, not 100. Recall that this is where the whole discussion started:
http://forum.beyond3d.com/showpost.php?p=1193047&postcount=167
Yes, sorry about that. Bilinear int8 fetches are pretty safe territory - definitely need more registers to screw with throughput.

Mix in some TEX bottlenecks though (a volume texture fetch followed immediately by a dependent ALU instruction, say) and cache behaviour coupled with shader layout and batch scheduling will lead to variations in performance.

Now I just have to show you that clustering isn't an issue...
But the matrix multiply results make it clear that it is. Listen to Shebanow.

Without knowing the detail of a GPU's cache system you can't simulate the effects on throughput of differing shader layouts and scheduling algorithms.

I just realized I made a mistake - G80 has the same number of threads in flight (8192), not half, if we assume 16 FP32 registers for the shader on G80.
It'd be nice if there was an equivalent of GPUSA for NVidia's GPUs. I forgot to mention earlier that PDC, 16KB per SIMD, can be used as additional "register file" in GPGPU code.

Nonetheless, time in flight is still half that of R600 so my point stands about memory latency unlikely to be an issue. You agree with me on that point, though, so I think the case is closed for this shader. It's the texture filtering.
Meanwhile, HD4850 appears to be 2.25x faster than HD3870/HD2900XT. So HD4850 is faster than theoretical throughput would indicate: 625/775 * 2.5 = 2x. L1 is doubled in size in RV770 (though I'm a bit cagey on this because it's described as "2x increase in effective storage per L1").

http://www.winland.fr/dmdocuments/03ScottHartog-RV770Architecture-Final.pdf

Interestingly, HD4870 appears to be 33% faster than HD4850:

http://www.computerbase.de/artikel/..._260_sli/6/#abschnitt_theoretische_benchmarks

so bandwidth (and therefore cache system) appears to be playing a part in RV770's performance.

Jawed
 
Mix in some TEX bottlenecks though (a volume texture fetch followed immediately by a dependent ALU instruction, say) and cache behaviour coupled with shader layout and batch scheduling will lead to variations in performance.
Remember that when you're texturing throughput limited (ALU:TEX < 1) you don't multiply by the ratio since the ALU will necessarily have to idle. Also, with volume textures, you need a minimum of 32 clocks per batch and sometimes more. That means we're looking at 1024 cycles of latency hiding.

We are already assuming dependent texturing in our calcs. If there's no dependent texturing, all N texture fetches can be grouped, and number of batches needed to hide the latency is divided by N (or, conversely, latency hiding ability goes up by N for a given number of batches).

But the matrix multiply results make it clear that it is. Listen to Shebanow.

Without knowing the detail of a GPU's cache system you can't simulate the effects on throughput of differing shader layouts and scheduling algorithms.
Cache can only be an issue when the GPU is BW limited, most of the BW consumption is by the shader as opposed to the ROP, and you can't fit all batches' texels in the cache. That is very, very rare in games. More pertinent to our discussion is the fact that when this happens, more batches doesn't help you. Matrix multiply is a special case because each pixel needs a whole row of input data from one matrix and a whole column from another. For nxn matrices, that's 2*n^3 fetches without cache (8 bytes per scalar MAD - anyone got 4TB/s BW? ;)) or block multiply. Intelligent data access is crucial to performance. I personally think that they can do even better with clever pixel marching.
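For the curious, the back-of-envelope behind that quip, as a sketch - the ~500 GFLOPs peak is my assumption for an RV670-class part, so treat the result as order-of-magnitude only:

```python
n = 2048
madds = n ** 3              # scalar multiply-adds in an n x n matrix multiply
bytes_no_cache = madds * 8  # 8 bytes (two FP32 operands) fetched per MAD

peak_flops = 500e9                   # assumed ALU peak; 1 MAD = 2 flops
runtime_s = 2 * madds / peak_flops   # time to do the maths at full rate
bw_needed = bytes_no_cache / runtime_s

print(f"{bytes_no_cache / 2**30:.0f} GiB fetched without cache/blocking, "
      f"{bw_needed / 1e12:.1f} TB/s to keep the ALUs fed")
```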

I'm just going to assume every texture access needs between X and Y cycles to get the result from memory, i.e. a no-cache, infinite BW scenario. Surely you can see that such a model overestimates the impact of latency.

As for the Shebanow lecture, I don't know what you're talking about. Everything he says supports my arguments. Even "Little's law modified" shows you that you need fewer threads when you have clusters of TEX access (G goes up), or when you have volume textures (lambda goes down).

so bandwidth (and therefore cache system) appears to be playing a part in RV770's performance.
Good catch. So the 4850 is probably a bit BW limited and the 4870 isn't, because these tests are mostly homogeneous workloads. It's surprising, then, that it is still 2.25x faster than RV670/R600, because they are not affected by BW. I guess there's some sort of bug or limitation that we don't know about, as per-clock throughput for the 4870 is 3.1x that of the 3870.

It seems we keep finding out how good the texturing units on RV770 are. You gotta wonder what the designers were smoking when they made those monstrous TMUs in R600...
 
Remember that when you're texturing throughput limited (ALU:TEX < 1) you don't multiply by the ratio since the ALU will necessarily have to idle. Also, with volume textures, you need a minimum of 32 clocks per batch and sometimes more. That means we're looking at 1024 cycles of latency hiding.
I was referring to shaders with a mixture of texturing types. So you might have 6 bilinear fetches + a volume texture (and for good measure you might put the volume texture fetch inside a 3-instruction dynamic loop :p ). It's all really tricky and this problem appears to be a cornerstone of the difficulty of writing compilers for these architectures - e.g. how does the compiler decide the best-performing balance between the number of registers and the number of batches in flight?

Apparently the compiler ideally needs to model average cache latencies and cache sizes and decide the degree of randomness in fetches and how big the textures are.

We are already assuming dependent texturing in our calcs. If there's no dependent texturing, all N texture fetches can be grouped, and number of batches needed to hide the latency is divided by N (or, conversely, latency hiding ability goes up by N for a given number of batches).

Cache can only be an issue when the GPU is BW limited,
To be more precise, cache-misses don't actually mean that TEX throughput is memory bandwidth constrained - simply that memory latency and bandwidth affect throughput (which is effective ALU:TEX). It depends on the miss ratio and the size of the texels. Obviously things like vec4 fp32 fetches are going to be severely bandwidth constrained if they're truly random.

As far as I can tell in ATI's GPUs the rasteriser drives the pre-fetching of texels. So common surface texturing (int8 bilinear, at least) should have 100% cache hit apart from the start-up period for a shader. I interpret this as a key reason why ATI's 16/40 TUs are so competitive against NVidia's 64/80.

most of the BW consumption is by the shader as opposed to the ROP, and you can't fit all batches' texels in the cache. That is very, very rare in games.
Though it appears to be common with render-target post-processing shaders, e.g. HDR tone-mapping with fp16 texels.

More pertinent to our discussion is the fact that when this happens, more batches doesn't help you. Matrix multiply is a special case because each pixel needs a whole row of input data from one matrix and a whole column from another. For nxn matrices, that's 2*n^3 fetches without cache (8 bytes per scalar MAD - anyone got 4TB/s BW? ;)) or block multiply. Intelligent data access is crucial to performance. I personally think that they can do even better with clever pixel marching.
I presume you're referring to macro Peano or Hilbert curves, e.g. where each cell is 16x16 elements in size.

I'm just going to assume every texture access needs between X and Y cycles to get the result from memory, i.e. a no-cache, infinite BW scenario. Surely you can see that such a model overestimates the impact of latency.
And all I'm saying is that if you want to model shader throughput then you have to model all the resource constraints. Without modelling variations in latency due to cache-access patterns, the naive ALU:TEX model falls short.

All I've been saying is that you can generalise the problem by saying that the effective ALU:TEX for windows of code become relevant to the throughput of the shader. A volume texture fetch followed immediately by a dependent ALU instruction might have a 1:4 effective ratio. The effect of this specific window's ratio in the context of a shader that is, overall, 8:1, is hard to predict. The model is very subtle.

As for the Shebanow lecture, I don't know what you're talking about. Everything he says supports my arguments.
Listen to what he says starting at the 52 minute mark - the audience is arguing your point, but he says it's not that simple. He's quite vague, but what he's saying is that you want to maintain ALU:TEX in smaller and smaller windows of instructions - if you don't then the shader's overall throughput (effectively ALU:TEX) will be affected.

NVidia's architecture suffers from extra resource restrictions on throughput in comparison with ATI's, the clearest example being register read-after-write latency.

Even "Little's law modified" shows you that you need fewer threads when you have clusters of TEX access (G goes up), or when you have volume textures (lambda goes down).
When G goes up you are effectively increasing the ALU:TEX for the window that is bounded by the first TEX instruction in the group and the first ALU instruction that is dependent upon the last TEX instruction. Even if you don't explicitly increase G in your code, the chances are the NVidia GPU will do it for you (since the second TEX can be issued as soon as the ALU instructions it's dependent upon are done).

With ATI GPUs this is a static compilation question.

And as I pointed out in the other thread, the looser, more finely-grained scheduling in NVidia's architecture means that ALU:TEX will tend not to fall too low as a direct result of load-balancing/instruction-scheduling.

Good catch. So the 4850 is probably a bit BW limited and the 4870 isn't, because these tests are mostly homogeneous workloads. It's surprising, then, that it is still 2.25x faster than RV670/R600, because they are not affected by BW. I guess there's some sort of bug or limitation that we don't know about, as per-clock throughput for the 4870 is 3.1x that of the 3870.
It's interesting that R600 performance for Veined Marble has increased by 2% from Cat 8.4 to Cat 8.7 at 2560x1600. Despite our earlier conclusion that it's filtering-throughput limited, this increase in performance implies that something other than filtering-throughput is playing a part.

4% for Car Surface.

I think doubled L1 in RV770 is a potentially big factor. It's worth observing that R6xx's L1s were being crucified by having to support 4 SIMDs (which basically means 4 distinct screen-space tiles). I think L1s in R6xx were resting far more heavily on L2 cache. ATI seemingly thought that the sheer ring-bus bandwidth from L2 to L1 was enough to mitigate this problem, but the ring-bus prolly adds latency (when compared with a hub/crossbar, as in RV770).

So, I dunno, between R600 and RV770 we could be seeing nothing more than a 10-20% reduction in average latency (when taking account of the full memory hierarchy, L1, L2 and GDDR).

Jawed
 
I was referring to shaders with a mixture of texturing types. So you might have 6 bilinear fetches + a volume texture (and for good measure you might put the volume texture fetch inside a 3-instruction dynamic loop :p ).
I'll send you my little program to play with before I release it as it does all this stuff, though branching isn't there yet. As suggested by Shebanow's presentation, the single bilinear texture fetch groups are the toughest to hide. Other texture groups that take longer (due to more fetches or longer filter ops) are easier.

To be more precise, cache-misses don't actually mean that TEX throughput is memory bandwidth constrained
I wasn't saying that. I meant cache misses only affect performance when you are BW limited, assuming latency can be hidden. For a low BW shader, a good cache may only use 20% of the BW, and a bad one will use 40%. Both will have the same performance, though.

Though it appears to be common with render-target post-processing shaders, e.g. HDR tone-mapping with fp16 texels.
This only makes a difference for short shaders, because when you look at an area of pixels the data needed is only 8 bytes per pixel (plus another 4 for the ROP). For example, a 64x128 pixel area needs FP16 texels from an 80x144 area.

While you can indeed be BW limited in such cases, it doesn't satisfy the other two conditions I mentioned, especially the last. 8,192 pixels in flight (8 reg shader) won't have a texel footprint of even half of R600's texture cache.

A 2048x2048 matrix multiply, however, is very different. A 64x128 section of the final matrix has a texel footprint of 1.5 MB. Good data management is very important.
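A quick footprint comparison for the two cases above (FP16 RGBA at 8 bytes per texel is my assumption for the post-processing example):

```python
# Post-processing: a 64x128 pixel tile sampling FP16 texels from the slightly
# larger 80x144 source area quoted above.
postproc_bytes = 80 * 144 * 8            # ~90 KB - fits in a texture cache

# Matrix multiply: a 64x128 block of a 2048x2048 FP32 result needs 64 full
# rows of one input matrix plus 128 full columns of the other.
matmul_bytes = (64 + 128) * 2048 * 4     # 1.5 MB - far too big to cache

print(f"post-process tile: {postproc_bytes / 1024:.0f} KB")
print(f"matmul block:      {matmul_bytes / 2**20:.1f} MB")
```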

I presume you're referring to macro Peano or Hilbert curves, e.g. where each cell is 16x16 elements in size.
Not quite, because those curves aren't that great for matrix multiply. You'd basically want to use strips of a certain width (or height), but scan convert them in the short direction first. I'm sure plenty of people know how to do this, but the GPGPU pdf you linked to is more about compiling strategy for a given single quad workload.
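Something along these lines, as a rough sketch of the traversal being described - the strip width is an arbitrary assumption, and this is my reading of the idea, not anything from the presentation:

```python
def strip_order(width, height, strip_w=64):
    """Yield (x, y) output positions strip by strip, scanning each strip in
    its short (x) direction first so the fetched rows stay hot in cache."""
    for x0 in range(0, width, strip_w):
        for y in range(height):                            # long direction
            for x in range(x0, min(x0 + strip_w, width)):  # short direction
                yield x, y

# First few positions of a 256x256 traversal in 64-wide strips:
it = strip_order(256, 256)
print([next(it) for _ in range(5)])  # [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0)]
```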

And all I'm saying is that if you want to model shader throughput then you have to model all the resource constraints. Without modelling variations in latency due to cache-access patterns, the naive ALU:TEX model falls short.
Well, I already have a max and min parameter to randomly determine latency. Regardless of shortcomings in this cache model or any other, the point is that I can look at worst case analysis - i.e. 0% hit rate - for various instruction arrangements and see that not only is performance unaffected, but I can get peak throughput quite easily (ALU and/or TF with high utilization) and can hide pretty close to the predicted latency.

It's actually not that tough to model analytically. Take N = lambda * L / G, but do a sort of shader average for (lambda / G), i.e. sum of total cycles to filter every fetch and divide by the total number of groups. Combining with ALU throughput, you get:

Latency hiding ability = # batches * max{total ALU cycles/batch, total TF cycles/batch} / # tex groups

It's easy to estimate this from the input shader, particularly worst case. Now, it's not too accurate because...
Listen to what he says starting at the 52 minute mark - the audience is arguing your point, but he says it's not that simple. He's quite vague, but what he's saying is that you want to maintain ALU:TEX in smaller and smaller windows of instructions - if you don't then the shader's overall throughput (effectively ALU:TEX) will be affected.
He is basically arguing my point B from above, i.e. sometimes you just get unlucky (and you also lose cycles for this reason for the first set of batches, obviously, hence the transient part). He just didn't get into the details of how often it happens.

This is where the scheduler comes in. I put in a naive algorithm in my simulation, prioritizing untouched batches first and then going by batch age. At first I only had the latter, and ran into a sort of "batch aliasing" causing me to be unlucky more often (well, it's not really luck...). I'm sure more sophisticated algorithms can do better, but I still get near perfect utilization for usual latency figures and 80-95% when reaching the limit predicted by the above equation. After that it's a linear dropoff as expected, i.e. utilization = latency hiding / actual latency.
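For anyone following along, here's a toy Python rendering of that priority rule - my reconstruction from the description above, not the actual simulator code:

```python
from dataclasses import dataclass

@dataclass
class Batch:
    age: int            # clocks since the batch entered the shader unit
    touched: bool       # has it issued its first instruction yet?
    ready: bool = True  # False while waiting on a TEX result

def pick_next(batches):
    """Prefer untouched batches, then the oldest ready batch; None if stalled."""
    ready = [b for b in batches if b.ready]
    if not ready:
        return None
    untouched = [b for b in ready if not b.touched]
    pool = untouched or ready
    return max(pool, key=lambda b: b.age)

batches = [Batch(5, False), Batch(2, False), Batch(9, True)]
print(pick_next(batches).age)   # 5: oldest of the *untouched* batches
```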

I might play around with other ideas to help stagger the batches more randomly.
Agreed on those matters about G, NVidia vs ATI, and compilation. It should be noted that it's not hard for a compiler to group.
Despite our earlier conclusion that it's filtering-throughput limited, this increase in performance implies that something other than filtering-throughput is playing a part.

I think doubled L1 in RV770 is a potentially big factor.
Well, we already knew that something else matters owing to the per-clock comparison with RV770.

An L1 factor would show BW dependence, but there is none. I think the only way it wouldn't is if there is insufficient BW between the L1 and L2, which would be quite odd.

Remember, latency really shouldn't be a factor. Assuming volume textures, I'm finding 1000+ cycles of latency are easily absorbed by this shader. The longer it takes to filter, the more latency can be hidden.
 
This is where the scheduler comes in. I put in a naive algorithm in my simulation, prioritizing untouched batches first and then going by batch age. At first I only had the latter, and ran into a sort of "batch aliasing" causing me to be unlucky more often (well, it's not really luck...). I'm sure more sophisticated algorithms can do better, but I still get near perfect utilization for usual latency figures and 80-95% when reaching the limit predicted by the above equation. After that it's a linear dropoff as expected, i.e. utilization = latency hiding / actual latency.
Can you describe the aliasing you see from using age as the criteria? Are you focusing on one thread type, like pixel, and ignoring the vertex/geometry types to keep things simple?

I'm curious to see the results of the program. Thanks.
 
This is just a quick, partial, reply - on the basis that some of this stuff is easier to discuss with your modelling program in both our hands.
I'll send you my little program to play with before I release it as it does all this stuff, though branching isn't there yet. As suggested by Shebanow's presentation, the single bilinear texture fetch groups are the toughest to hide. Other texture groups that take longer (due to more fetches or longer filter ops) are easier.
It sounds like you are using his B, T, G (block, texture, gate) notation to describe a shader as input to your program, which is good.

An L1 factor would show BW dependence, but there is none. I think the only way it wouldn't is if there is insufficient BW between the L1 and L2, which would be quite odd.
I'm intrigued to see if your program can provide an indicative simulation of this and other shaders for which we have real-world data on a number of ATI GPUs, i.e. changes in performance depending on GPU that are similar in magnitude to what's observed.

Ideally you'd be simulating the combination of a vertex shader and a pixel shader and the amount of bandwidth consumed by non-texture fetches.

I'm sure you've been having fun.

Remember, latency really shouldn't be a factor. Assuming volume textures, I'm finding 1000+ cycles of latency are easily absorbed by this shader. The longer it takes to filter, the more latency can be hidden.
A set of 32 batches (8 registers per batch) should execute for 864 clocks, so how are you able to hide 1000+ cycles of latency?

Jawed
 
It sounds like you are using his B, T, G (block, texture, gate) notation to describe a shader as input to your program, which is good.
Not quite, as I'm following ATI's model since that's what we were talking about in the first place. An execution unit (three types: ALU, TA, TF) stays on a batch until the instruction type changes, and each block is finished before the next one is started.

Don't expect too much from it! I'm not making a competitor to ATTILA. BTW, YGPM.

A set of 32 batches (8 registers per batch) should execute for 864 clocks, so how are you able to hide 1000+ cycles of latency?
Remember that volume textures change the number of cycles per batch for one TEX instruction to 32 or even 64. Thus 32 batches can hide at least 1024 to 2048 cycles. Alternatively, you can look at the equation above. Of course the ALU is partially idle, but it has to be, as the TF unit is nearly 100% occupied.
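Plugging the thread's numbers into the equation from a few posts back (27 ALU groups at 4 clocks each, and 4 fetches at an assumed 32 TF clocks per batch for volume textures):

```python
def latency_hiding(batches, alu_cycles, tf_cycles, tex_groups):
    # latency hiding = #batches * max(ALU, TF cycles per batch) / #tex groups
    return batches * max(alu_cycles, tf_cycles) / tex_groups

alu = 27 * 4   # 108 ALU clocks per batch
tf = 4 * 32    # 128 TF clocks per batch with volume fetches
print(latency_hiding(32, alu, tf, 4), "cycles")  # 1024.0, the floor given above
```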
 
Would it be possible to adjust shader compilers to extract more ILP by giving them a tad more CPU time, or just by changing the algorithm to a more CPU-intensive variant?

I'm asking because, comparing Cat 8.4 to 8.7 (not only in Shadermark 2.1), I've been seeing losses in CPU-bound situations (games and/or low resolutions) of up to 10 percent for the newer driver, whereas in the very same tests 8.7 comes out victorious at very high resolutions by 2-5 percent. All this on an R600, so RV770-specific improvements do not count here.
 