Nick said: So it's the time between one shader instruction and the next, for a given quad?

I was referring to the total shader latency. It's basically #ALUops * #latency_cycles/ALUop + #TEXops * #latency_cycles/TEXop, modulo parallel ALUs and/or texture units. Basically, it's how long it takes from when the shader starts executing until it finishes execution, for one particular thread.
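As a quick sketch, that formula is just a weighted sum of instruction counts and per-unit latencies. The numbers below are illustrative only (they happen to match the worked example later in the thread), and the function name is mine:

```python
# First-order total shader latency for one thread, ignoring parallel
# ALUs and texture units:
#   latency = ALU_ops * cycles_per_ALU_op + TEX_ops * cycles_per_TEX_op
def shader_latency(alu_ops, alu_cycles, tex_ops, tex_cycles):
    return alu_ops * alu_cycles + tex_ops * tex_cycles

# Illustrative: 60 ALU ops at 16 cycles each, 2 texture fetches at
# 500 cycles each, on a purely serial machine.
print(shader_latency(60, 16, 2, 500))  # 1960 cycles
```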
ERK said: So much for 3:1...

What?
Nick said: It's impossible to hide latencies of dozens of memory accesses through threading. In fact you'd need a cache-like structure to store all thread register sets. And that would only 'solve' the latency problem.

It's entirely possible to hide long latencies. You just need a reasonable amount of storage on chip. GPUs, like CPUs, use caches for just this purpose.
Nick said: The way I understand it, ATI's Ultra-Threading is very similar to Intel's Hyper-Threading. It hides latencies (mainly from texture sampling) by scheduling whichever instruction from a group of shader threads is ready to execute. Still in-order execution though. I think that's what you meant but please correct me when I'm wrong.

Your explanation seems OK, but it's more like the coarse-grained multithreading used in Sun's Niagara (UltraSPARC T1).
psurge said: Demirug - can you provide a link that says WDDM2.1 will support "SMT" (and not just a context switch that isn't insanely slow)?
Serge

I'm pretty sure only context switching is required, although I'm not sure what the minimum switch time is. I didn't notice Demirug mention SMT, so maybe you read too much into the post.
Demirug said: (snip)
CPUs are going in the direction of running more threads at the same time. But the program has to be split into threads to get more speed.
GPUs will do this too (WDDM 2.1) ...
(snip)
psurge said: Demirug - can you provide a link that says WDDM2.1 will support "SMT" (and not just a context switch that isn't insanely slow)?
Serge
psurge said: 3dcgi, Demirug seemed to imply it at first glance. Actually though it isn't necessarily "SMT" but could instead be closer to "SoEMT" (so that the whole GPU pipe doesn't have to flush for a context switch to happen).
Serge
Nick said: The way I understand it, ATI's Ultra-Threading is very similar to Intel's Hyper-Threading. It hides latencies (mainly from texture sampling) by scheduling whichever instruction from a group of shader threads is ready to execute. Still in-order execution though. I think that's what you meant but please correct me when I'm wrong.
Bob said: Does it make sense now?

I still have to wrap my head around the relationship between the number of registers and the latencies. But it's probably just a matter of taking a pencil, sketching the architecture and visualizing the processes. I'm sure it will make a lot of sense after I let it sink in. I got way more information in this thread than expected.
Nick said: I still have to wrap my head around the relationship between the number of registers and the latencies. But it's probably just a matter of taking a pencil, sketching the architecture and visualizing the processes. I'm sure it will make a lot of sense after I let it sink in. I got way more information in this thread than expected.
So thank you very much and everybody else here!

Hi Nick,
Bob said: If your shader takes 20 clocks through ALUs and 100 clocks through texturing, you need to keep its registers around for 120 clocks. Hence, to run at full speed, you need 120 threads to run serially.

Note that this is not the case unless you can execute 20 ALU instructions per clock - throughput limitations accordingly reduce thread requirements. (This is not tricky to model on paper, and left as an exercise to the reader.)
Dio said: Note that this is not the case unless you can execute 20 ALU instructions per clock - throughput limitations accordingly reduce thread requirements. (This is not tricky to model on paper, and left as an exercise to the reader.)

I was going for one instruction/clock. If you can do 20 instructions/clock (presumably through different threads), then you need 20x the threads to keep the machine fully busy.
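Bob's 120-thread figure and the issue-rate correction together amount to concurrency = latency × issue rate. A minimal sketch, using the numbers from this exchange (the function name is mine):

```python
# Threads needed to keep the machine fully busy: each thread's registers
# stay live for (ALU clocks + TEX clocks), and the machine issues
# `issue_rate` instructions per clock, so occupancy = latency * issue_rate.
def threads_needed(alu_clocks, tex_clocks, issue_rate=1):
    return (alu_clocks + tex_clocks) * issue_rate

print(threads_needed(20, 100))                 # 120 threads at 1 instr/clock
print(threads_needed(20, 100, issue_rate=20))  # 2400 threads at 20 instr/clock
```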
This effectively tells you your throughput: you can use these factors to approximate the maximum throughput.
Here's an example, using Manager Math, that I totally made up: 1 quad pipe, 16 clocks/ALU of latency, 2 ALUs in parallel, 50% of ALU instructions can run on both ALUs, one texture unit in series with the ALUs with an average latency of 500 cycles. The shader has 60 ALU instructions and 2 texture instructions, and uses 5 registers.
In addition, the register file (RF) size is 32 KB per pixel pipe (128 KB per quad pipe).
Total latency of the shader is then:
16 cycles of ALU latency * (60 ALU ops - 0.5 pairable fraction * 60 ALU ops / 2 ALUs) + 2 TEX ops * 500 cycles of TEX latency = 720 + 1000 = 1720 cycles.
You can only run as many threads as fit in the register file to cover the total latency:
32 KB / 16 bytes per register / 5 registers per thread = 409 threads
So your best-case performance is: 409/1720 * 4 pixels/quad = 0.95 pixels/clock per quad pipe, or about a quarter of full speed.
If you took the TEX unit and put it in parallel with the ALUs instead of in series, you then need to consider what your dependency chain is. If you can put 10 non-dependent instructions between each TEX look-up and when you need the texture results, your total latency then becomes 1480 cycles, pushing your max speed to 1.10 pixels/clock per quad pipe.
Does it make sense now?
Edit: I just realized I made a small mistake up there. I took the RF size as being per pixel, and not for the whole quad pipe. If it's 32KB for the whole quad pipe, just divide all the resulting numbers by 4.
Edit 2: Here's another example. If you had 2 texture pipelines in parallel (NV30-style), both in serial with the two ALUs as in the first example, then, everything else being the same, you would get a total latency of 1220 cycles, for a total speed of 1.34 pixels/clock/quad pipe.
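All three variants can be reproduced in a few lines of Python. The numbers are Bob's made-up "Manager Math" figures; the helper names are mine, and the 32 KB register file is taken per pixel pipe, as in the original calculation before the Edit 1 correction:

```python
# Issue latency for a run of ALU ops: the pairable fraction shares issue
# slots across the two ALUs, so the serial count drops accordingly.
def alu_cycles(ops, latency=16, num_alus=2, pairable=0.5):
    return (ops - pairable * ops / num_alus) * latency

# Best-case pixels/clock per quad pipe: threads that fit in the register
# file, spread over the total latency, times 4 pixels per quad.
def pixels_per_clock(total_latency, rf_bytes=32 * 1024,
                     bytes_per_reg=16, regs=5, pixels_per_quad=4):
    threads = rf_bytes // (bytes_per_reg * regs)   # 409 threads fit
    return threads / total_latency * pixels_per_quad

alu = alu_cycles(60)                    # 720 cycles for 60 ALU ops
serial = alu + 2 * 500                  # TEX in series: 1720 cycles
print(round(pixels_per_clock(serial), 2))           # 0.95

# TEX in parallel with the ALUs, 10 non-dependent ALU ops per look-up:
hidden = 2 * alu_cycles(10)             # 240 cycles hidden -> 1480 total
print(round(pixels_per_clock(serial - hidden), 2))  # 1.11 (post says ~1.10)

# Two TEX pipes in parallel with each other, in series with the ALUs,
# so the two look-ups overlap: 720 + 500 = 1220 cycles total.
print(round(pixels_per_clock(alu + 500), 2))        # 1.34
```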
Eddie Tsao said: Don't you need to multiply the total latency by 4: 1720 * 4 = 6880 cycles?

No, because the quad pipe processes 4 pixel threads in parallel. Thus, you have 4x the instructions, but 4x the units all working in parallel, so the total thread time stays the same. Dynamic branching changes this, of course, but I'm not considering that in any of my calculations (since they're all approximations anyhow).