Dynamic Branching Granularity

Demirug - can you provide a link that says WDDM2.1 will support "SMT" (and not just a context switch that isn't insanely slow)?

Serge
 
Nick said:
So it's the time between one shader instruction and the next, for a given quad?
I was referring to the total shader latency. It's basically #ALUops * #latency_cycles/ALUop + #TEXops * #latency_cycles/TEXop, modulo parallel ALUs and/or texture units. Basically, it's how long it takes from when the shader starts executing until it ends execution for one particular thread.

Since you allocate the register file for the whole duration of the shader execution, the shader latency you can cover depends on the number of registers used per thread, the total register file size and the latency of a single thread.

This effectively tells you your throughput: you can use these factors to approximate the maximum.

Here's an example, using Manager Math, that I totally made up: 1 quad pipe, 16 clocks/ALU of latency, 2 ALUs in parallel, 50% of ALU instructions can run on both ALUs, one texture unit in series with the ALUs with an average latency of 500 cycles. The shader has 60 ALU instructions and 2 texture instructions, and uses 5 registers.

In addition, the RF size is 32 KB per pixel pipe (128 KB/quad pipe)

Total latency of the shader is then:
16 cycles/ALU op * (60 ALU ops - 0.5 pairable * 60 ALU ops / 2 ALUs) + 2 TEX ops * 500 cycles/TEX = 16 * 45 + 1000 = 1720 cycles.

You can only run that many threads to cover the total latency:
32 KB / 16 bytes/reg / 5 registers/thread = 409 threads

So your best case performance is: 409/1720 * 4 pixels/quad = 0.95 pixels/clock per quad pipe, or about a quarter of full speed.


If you take the TEX unit and put it in parallel with the ALUs instead of in series, you then need to consider your dependency chain. If you can put 10 non-dependent instructions between each TEX look-up and the point where you need the texture results, your total latency becomes 1480 cycles, pushing your max speed to 1.10 pixels/clock per quad pipe.

Does it make sense now?


Edit: I just realized I made a small mistake up there. I took the RF size as being per pixel, and not for the whole quad pipe. If it's 32KB for the whole quad pipe, just divide all the resulting numbers by 4.

Edit 2: Here's another example. If you had 2 texture pipelines in parallel (NV30-style), both in serial with the two ALUs as in the first example, then, everything else being the same, you would get a total latency of 1220 cycles, for a total speed of 1.34 pixels/clock/quad pipe.
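The "Manager Math" above can be sketched in a few lines of code. This uses Bob's made-up figures; the function names and the 16-bytes-per-register assumption are mine, purely for illustration:

```python
# Sketch of Bob's Manager Math: estimate best-case throughput for one quad
# pipe from total shader latency and register-file occupancy. All numbers
# are the made-up example figures, not real hardware parameters.

def shader_latency(alu_ops, tex_ops, alu_latency=16, tex_latency=500,
                   pairable=0.5, num_alus=2):
    # Pairable ALU ops co-issue on the second ALU, so they add no latency.
    effective_alu_ops = alu_ops - pairable * alu_ops / num_alus
    return alu_latency * effective_alu_ops + tex_ops * tex_latency

def max_threads(rf_bytes, regs_per_thread, bytes_per_reg=16):
    # Registers are allocated for a thread's whole lifetime, so the RF
    # size caps how many threads can be in flight at once.
    return rf_bytes // (bytes_per_reg * regs_per_thread)

latency = shader_latency(alu_ops=60, tex_ops=2)               # 1720 cycles
threads = max_threads(rf_bytes=32 * 1024, regs_per_thread=5)  # 409 threads
pixels_per_clock = threads / latency * 4                      # ~0.95/quad pipe
```

Dropping `tex_latency` (TEX in parallel, latency hidden by independent work) or splitting fetches across two TEX pipes reproduces the 1480- and 1220-cycle variants the same way.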
 
Nick said:
It's impossible to hide latencies of dozens of memory accesses through threading. In fact you'd need a cache-like structure to store all thread register sets. :) And that would only 'solve' the latency problem.
It's entirely possible to hide long latencies. You just need a reasonable amount of storage on chip. GPUs use caches as do CPUs for just this purpose.

The technique for handling branches when all pixels of a batch/thread don't take the same path is called predication. Itanium is one chip that uses predication, and its docs have some good explanations. For a simple case like this...

if ( x )
a = b
else
a = c

it is actually quicker to execute all the instructions and throw away the result you don't want than to branch.
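The if/else above can be sketched as a lane-wise select across a SIMD batch. This is an illustrative scalar model (names are mine), not how any particular GPU implements it:

```python
# Predication sketch: evaluate BOTH sides of the branch for every lane of
# the batch, then use a per-lane condition mask to select which result to
# keep. No lane ever actually branches, so there is no divergence penalty.

def predicated_select(cond, then_vals, else_vals):
    # Lane-wise: a = b where cond is true, a = c elsewhere.
    return [t if c else e for c, t, e in zip(cond, then_vals, else_vals)]

cond = [True, False, True, False]   # per-pixel branch condition for one quad
b    = [1, 2, 3, 4]                 # "then" results, computed for all lanes
c    = [10, 20, 30, 40]             # "else" results, computed for all lanes
a    = predicated_select(cond, b, c)  # -> [1, 20, 3, 40]
```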
 
Nick said:
The way I understand it, ATI's Ultra-Threading is very similar to Intel's Hyper-Threading. It hides latencies (mainly from texture sampling) by scheduling whichever instruction from a group of shader threads is ready to execute. Still in-order execution though. I think that's what you meant but please correct me when I'm wrong.
Your explanation seems OK, but it's more like the coarse-grained multithreading used in Sun's Niagara (UltraSPARC T1).
 
psurge said:
Demirug - can you provide a link that says WDDM2.1 will support "SMT" (and not just a context switch that isn't insanely slow)?

Serge
I'm pretty sure only context switching is required although I'm not sure what the minimum switch time is. I didn't notice Demirug mention SMT so maybe you read too much into the post.
 
3dcgi, Demirug

I quite possibly did read too much into it. This part of the post :

Demirug said:
(snip)
CPUs are going in the direction of running more threads at the same time, but the program has to be split into threads to get more speed.

GPUs will do this too (WDDM 2.1) ...
(snip)

seemed to imply it at first glance. Actually though it isn't necessarily "SMT" but could instead be closer to "SoEMT" (so that the whole GPU pipe doesn't have to flush for a context switch to happen).

Either one is quite a bit more than I would have expected :smile: . Anyway, sorry if I misread.
Serge
 
psurge said:
Demirug - can you provide a link that says WDDM2.1 will support "SMT" (and not just a context switch that isn't insanely slow)?

Serge

Sorry, it's not publicly available yet, but even with WDDM 2.1 it will still be a context switch, as this is the only way to do multitasking on a GPU. The difference is that with WDDM 1.0 the graphics subsystem/driver (both of which run on the CPU) switch the context, and the fastest rate is after every DMA buffer packet. With WDDM 2.1 the GPU itself will switch from one context to another. Those switches will happen without any help from the driver or the CPU. In the case of page faults in virtual memory, a switch should be possible even within a pixel. From the GPU's point of view this means it must be possible for the different shader units to work on different programs at the same time. The main difference will be in the command processor: instead of reading from only one DMA buffer, it has to read from multiple buffers and keep the current state for every buffer.
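The command-processor change Demirug describes can be sketched as a toy model. This is purely illustrative (the names and the round-robin policy are my assumptions, not a real design): the processor drains several per-context command buffers instead of one, keeping separate current state for each.

```python
# Toy model: a command processor round-robins over multiple per-context DMA
# buffers, holding separate state for each context, so contexts interleave
# without a full pipeline flush between them.

from collections import deque

def run_command_processor(contexts):
    # contexts: {name: deque of (op, key, value) commands}
    state = {name: {} for name in contexts}   # per-context current state
    log = []
    while any(contexts.values()):
        for name, buf in contexts.items():    # hardware context switch
            if buf:
                op, key, val = buf.popleft()
                if op == "set":               # state updates stay per-context
                    state[name][key] = val
                elif op == "draw":
                    log.append((name, dict(state[name])))
    return log

ctxs = {"A": deque([("set", "shader", 1), ("draw", None, None)]),
        "B": deque([("set", "shader", 2), ("draw", None, None)])}
result = run_command_processor(ctxs)
# Each draw sees only its own context's state, even though the two
# contexts' commands were interleaved:
# [("A", {"shader": 1}), ("B", {"shader": 2})]
```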
 
psurge said:
3dcgi, Demirug

seemed to imply it at first glance. Actually though it isn't necessarily "SMT" but could instead be closer to "SoEMT" (so that the whole GPU pipe doesn't have to flush for a context switch to happen).

Serge

With WDDM 2.1 it should be possible to have more than one context running on every part of the GPU at the same time, making a full pipeline flush unnecessary.
 
Nick said:
The way I understand it, ATI's Ultra-Threading is very similar to Intel's Hyper-Threading. It hides latencies (mainly from texture sampling) by scheduling whichever instruction from a group of shader threads is ready to execute. Still in-order execution though. I think that's what you meant but please correct me when I'm wrong.

Well, not really. The Intel threading is more like virtual machines. Our threading has the concept of in-order execution within a thread, but threads are executed out of order, as they have no dependency within shading. They need to be retired in order. As well, the number of threads in flight differs by at least an order of magnitude between us.

As for granularity, you want it as small as possible to improve branching efficiency (i.e. every line of shader executed for a pixel that's going down the wrong path is a wasted cycle). However, for SIMD efficiency, you want it as large as possible (though the benefit drops off past a certain size). As well, as stated above somewhere, the cost of metadata for each thread is significant, so you want to minimize that (also, for texture cache efficiency, large granularity helps a lot). You can do lots of performance tuning to find the best size/area tradeoff for a given metric.

Oh yes, as for the latency hiding, it's really the number of threads * the clause size which determines the latency hiding for a given group of threads. As clauses get larger (i.e. more instructions can be executed before we need to go fetch data), the number of threads required to hide all the texture fetch latency goes down. That means that the clauses can use more GPRs and still hide the same latency, more or less. That's why on R5xx you see pretty much linear performance as GPR usage increases (using more GPRs means that the threads are longer, so you need fewer threads to hide the same latency) -- you can see the Stanford data for that.
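The threads-times-clause-size relation above can be put into one line of arithmetic. This is my reading of the point, not ATI's actual scheduler:

```python
# Each resident thread contributes `clause_size` instructions of work that
# can overlap an outstanding texture fetch, so larger clauses need fewer
# threads (and can therefore afford more GPRs each) to hide a given latency.

def threads_to_hide(fetch_latency, clause_size):
    # Ceiling division: enough threads that their combined clause work
    # covers the fetch latency.
    return -(-fetch_latency // clause_size)

# With a 500-cycle fetch: 10-instruction clauses need 50 resident threads,
# while 25-instruction clauses need only 20.
```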
 
Bob said:
Does it make sense now?
I still have to wrap my head around the relationship between the number of registers and the latencies. But it's probably just a matter of taking a pencil, sketching the architecture and visualizing the processes. I'm sure it will make a lot of sense after I let it sink in. I got way more information in this thread than expected. :D

So thank you very much and everybody else here!
 
Nick said:
I still have to wrap my head around the relationship between the number of registers and the latencies. But it's probably just a matter of taking a pencil, sketching the architecture and visualizing the processes. I'm sure it will make a lot of sense after I let it sink in. I got way more information in this thread than expected.

So thank you very much and everybody else here!
Hi Nick,

Branching granularity reflects the tradeoff between the amount of parallel floating-point processing power in a GPU and the amount of control-flow thread processing power. If each pixel could branch independently, then you'd need one thread of control flow for each pixel, and thus you wouldn't be much more efficient than a CPU. This is basically the summary of all the contributions/posts in this thread, IMO.
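The granularity tradeoff can be illustrated with a rough model (entirely made-up numbers and a simplifying independence assumption, mine, not from the thread): a batch that diverges pays for both sides of the branch, and divergence becomes more likely as batches grow.

```python
# Rough divergence model: if each pixel takes the branch independently with
# probability p_taken, a whole batch stays coherent (all pixels agree) with
# probability p**n + (1-p)**n. Coherent batches execute one side of the
# branch; divergent batches effectively execute both (2x work).

def branch_efficiency(batch_size, p_taken):
    p_coherent = p_taken ** batch_size + (1 - p_taken) ** batch_size
    return 1 / (p_coherent * 1 + (1 - p_coherent) * 2)

# With a fairly coherent branch (p_taken = 0.9), small batches waste much
# less work than large ones:
# branch_efficiency(4, 0.9)  ~ 0.74
# branch_efficiency(64, 0.9) ~ 0.50
```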
 
Bob said:
If your shader takes 20 clocks through ALUs and 100 clocks through texturing, you need to keep its registers around for 120 clocks. Hence, to run at full speed, you need 120 threads to run serially.
Note that this is not the case unless you can execute 20 ALU instructions per clock - throughput limitations accordingly reduce thread requirements. (This is not tricky to model on paper, and left as an exercise to the reader ;) ).
 
This should be stickied somewhere... way too much good info here to allow it to disappear into oblivion.
 
One of these days I suspect we'll get around to a "Hall of Fame" forum, or a "B3D Indispensable Threads" or somesuch, take nominations, and copy/move (I think COPY, personally, so we can also delete out the dross at the same time) threads into it. It keeps coming up as a suggestion from folks in one form or another, and it just makes a lot of sense.
 
Dio said:
Note that this is not the case unless you can execute 20 ALU instructions per clock - throughput limitations accordingly reduce thread requirements. (This is not tricky to model on paper, and left as an exercise to the reader ).
I was going for one instruction/clock. If you can do 20 instructions/clock (presumably through different threads), then you need 20x the threads to keep the machine fully busy.
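Bob's clarification boils down to a one-line scaling rule (the function name is mine, for illustration):

```python
# Threads needed to keep the machine fully busy scale with both the total
# shader latency and the per-clock issue width: a 120-cycle shader at
# 1 instruction/clock needs 120 threads in flight; a machine issuing 20
# instructions/clock (across threads) needs 20x as many.

def threads_needed(latency_cycles, instrs_per_clock=1):
    return latency_cycles * instrs_per_clock

# threads_needed(120)     -> 120
# threads_needed(120, 20) -> 2400
```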
 
Dear Bob,

Do you mean each pixel in a quad will execute a total of "60 ALU instructions and 2 texture instructions", which takes 1720 cycles? That is only for one pixel in the quad pipe. Don't you need to multiply the total latency by 4: 1720 * 4 = 6880 cycles?

If we want the total latency to be 1720 cycles, it means that one thread is a quad, and each pixel in the quad pipe has its own "2 ALUs in parallel and one texture unit in series". So the shader could process a whole quad in parallel. To achieve the max throughput of 1 quad/clock, we need to feed 1720 threads (quads) into the quad pipeline, otherwise the performance degrades by a ratio depending on how many hardware thread resources you invest.

Full Pipeline (1720-stage pipeline, throughput = 1 quad/clock)
Thread#.pixel# => T#.p#
T[n].p0 fetch -> |T1719.p0|T1718.p0|.........................|T1.p0|T0.p0| -> Frame buffer
T[n].p1 fetch -> |T1719.p1|T1718.p1|.........................|T1.p1|T0.p1| -> Frame buffer
T[n].p2 fetch -> |T1719.p2|T1718.p2|.........................|T1.p2|T0.p2| -> Frame buffer
T[n].p3 fetch -> |T1719.p3|T1718.p3|.........................|T1.p3|T0.p3| -> Frame buffer
Full Parallel (1720 threads in parallel, throughput = 1720 quads/1720 clocks)
Thread#.pixel# => T#.p#
T[n].p0 fetch -> |T0.p0..................................................................| -> Frame buffer
T[n].p1 fetch -> |T0.p1..................................................................| -> Frame buffer
T[n].p2 fetch -> |T0.p2..................................................................| -> Frame buffer
T[n].p3 fetch -> |T0.p3..................................................................| -> Frame buffer
T[n].p0 fetch -> |T1.p0..................................................................| -> Frame buffer
T[n].p1 fetch -> |T1.p1..................................................................| -> Frame buffer
T[n].p2 fetch -> |T1.p2..................................................................| -> Frame buffer
T[n].p3 fetch -> |T1.p3..................................................................| -> Frame buffer
.
.
.
T[n].p0 fetch -> |T1719.p0.............................................................| -> Frame buffer
T[n].p1 fetch -> |T1719.p1.............................................................| -> Frame buffer
T[n].p2 fetch -> |T1719.p2.............................................................| -> Frame buffer
T[n].p3 fetch -> |T1719.p3.............................................................| -> Frame buffer


Thank you.
Eddie

 
Eddie Tsao said:
Don't you need to multiply 4 to the total latency, 1720*4 = 6880 cycles.
No, because the quad pipe processes 4 pixel threads in parallel. Thus, you have 4x the instructions, but 4x the units working all in parallel, so the total thread time stays the same. Dynamic branching changes this, of course, but I'm not considering that in any of my calculations (since they're all approximations anyhow).
 