Fine-grained SMT on the GPU

Nick

Hi all,

I've been trying to understand the new GPU architectures' implementation details, and while I think I grasp the bulk of it, I'm curious about the SMT capabilities.

Say we have a pixel shader full of MAD instructions and a vertex shader full of special-function instructions: do they execute concurrently and maximize ALU usage? Likewise, if the top half of a pixel shader is full of MAD instructions, the bottom half is full of special-function instructions, and hard dependencies exist between the two parts, do other batches of pixels (i.e. threads) increase ALU utilization?

Or is this form of 'Hyper-Threading' still a CPU-only feature (to be reintroduced by Nehalem)? If it is, what can be expected for the foreseeable future?

Thanks,

Nicolas

Edit: Since Arun clarified that ADD and MUL use the same ALU (at least on G80), I've changed the example to MAD and special-function.
 
There are [generally] two major types of SMT implementations -- one, like Hyper-Threading, aimed at actually filling the pipeline gaps of potentially unused resources (think of a long pipeline), and the other found in much simpler architectures (Sun's Niagara), targeting instruction/operation dependencies in a relatively short, in-order (i.e. simple) pipe. The latter is more attractive because of the prospect of supporting many more "virtual" threads in flight on a shared resource in a heavily threaded environment.
It's all about code nature and the compiler output.
 
The MAD is a single instruction executed in a single unit; I'm not aware of any architecture that benefits from SMT in that scenario :) If the unit was divided in two, you'd obviously need 6 registers per clock, rather than 4. Also, scheduling complexity in general would be higher, you'd need more threads being handled at the same time, etc. - it really doesn't make sense for a GPU to do that, imo.

And to follow on what fellix said, if you divide SMT that way, you could argue that GPUs are like even more exotic versions of Niagara with ultra-wide FPUs. (okay, it's a stretch, I know!)
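Just to make the operand-bandwidth argument concrete, here's the back-of-the-envelope counting I have in mind (a sketch in C, simply counting operand reads/writes, not a description of any real register file):

#include <stdio.h>

/* Operand-port counting (simplified):
 *   fused MAD  d = a*b + c            -> 3 reads + 1 write = 4 ports/clock
 *   split into an independent MUL and ADD serving two different instructions:
 *     MUL t = a*b                     -> 2 reads + 1 write
 *     ADD d = x + y                   -> 2 reads + 1 write
 *   total                             -> 4 reads + 2 writes = 6 ports/clock
 */
int main(void) {
    int mad_ports   = 3 + 1;              /* fused multiply-add */
    int split_ports = (2 + 1) + (2 + 1);  /* independent MUL + ADD */
    printf("fused MAD: %d register ports per clock\n", mad_ports);
    printf("split MUL+ADD: %d register ports per clock\n", split_ports);
    return 0;
}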
 
There are [generally] two major types of SMT implementations -- one, like Hyper-Threading, aimed at actually filling the pipeline gaps of potentially unused resources (think of a long pipeline), and the other found in much simpler architectures (Sun's Niagara), targeting instruction/operation dependencies in a relatively short, in-order (i.e. simple) pipe. The latter is more attractive because of the prospect of supporting many more "virtual" threads in flight on a shared resource in a heavily threaded environment.
How are these fundamentally different? As far as I can tell they're both the same kind of SMT, and the difference is really the out-of-order or in-order nature of the architecture. Or am I missing some crucial difference between these two that affects SMT itself?

Anyway, I was curious whether GPUs do any kind of SMT at all, now, or if we can expect it for the future...
 
The MAD is a single instruction executed in a single unit;
Interesting. So when a DP3 is split into three MULs and two ADDs on a scalar architecture it's really three MADs and there's no possibility to schedule an extra ADD?

Are the SFUs in the same unit as well then? I assume not because higher GFLOPS are claimed when they are counted in...
If the unit was divided in two, you'd obviously need 6 registers per clock, rather than 4.
I see. I wasn't aware of that increase in complexity.
...it really doesn't make sense for a GPU to do that, imo.
As in: now and forever?

Thanks!
 
How are these fundamentally different? As far as I can tell they're both the same kind of SMT, and the difference is really the out-of-order or in-order nature of the architecture.
Well, you could easily say that one is the extension of the other, yeah. So they're not fundamentally different imo, but still perhaps worth differentiating.

I wouldn't say the difference is OoOE though; I'd rather say it's the number of individual units you have. If you only have one unit you can send instructions to in your core, well...

In the case of G80 and R600, it depends how you look at it. Obviously, you don't have 2 threads in flight at any given time; you have hundreds of threads, or rather dozens of batches. In G80's case, up to 12 batches per multiprocessor, with 16 multiprocessors on the chip. In each multiprocessor, you have two units: an 8-wide ALU (which is double-pumped, so it arguably looks 16-wide from the scheduler's POV) and an interpolation/SFU unit.

So in G80's case, you could think of it as a 2-issue core. You can think of it as SMT if you want, I guess. In R600's case, well, the leaks imply that unlike G80, it's VLIW and 5-wide. You could argue that makes each ALU block apply temporal multithreading exclusively, instead of SMT; but in the end, I'm not even sure what the point is of thinking about it that way.
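To make the 'many batches, two units' picture a bit more concrete, here's a toy issue loop (purely illustrative; the batch count is just the figure I quoted above, and the real scheduler obviously isn't a C loop):

#include <stdio.h>

#define NUM_BATCHES 12            /* resident batches per multiprocessor, as above */

enum unit { UNIT_ALU, UNIT_SFU }; /* the two issue targets: MAD ALU and interpolation/SFU */

struct batch {
    int pc;                       /* next instruction for this batch */
    int stalled;                  /* e.g. waiting on a texture fetch */
    enum unit next_unit;          /* which unit the next instruction needs */
};

/* One scheduler clock: scan the resident batches and issue the first ready one.
 * Latency gets hidden by having lots of batches in flight, not by reordering
 * instructions within a batch. */
static void issue_one_clock(struct batch b[], int n) {
    for (int i = 0; i < n; i++) {
        if (!b[i].stalled) {
            printf("issue batch %d, instr %d, to %s\n", i, b[i].pc,
                   b[i].next_unit == UNIT_ALU ? "ALU" : "SFU");
            b[i].pc++;
            return;
        }
    }
    printf("bubble: every batch is stalled this clock\n");
}

int main(void) {
    struct batch b[NUM_BATCHES] = {{0}};
    b[0].stalled = 1;             /* pretend batch 0 is waiting on memory */
    b[1].next_unit = UNIT_SFU;    /* batch 1's next instruction is a special function */
    for (int clk = 0; clk < 4; clk++)
        issue_one_clock(b, NUM_BATCHES);
    return 0;
}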
 
Interesting. So when a DP3 is split into three MULs and two ADDs on a scalar architecture it's really three MADs and there's no possibility to schedule an extra ADD?
I think so, but I'll admit I'm not 100% sure in the specific case of a DP3. Actually, ironically, I'm sure for R600 but not for G80 - oops. (given the VLIW nature of R600, it shouldn't be very hard to conclude what this means...)
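For what it's worth, this is the scalar lowering I have in mind for a DP3 -- one MUL (or a MAD with a zero addend) feeding two dependent MADs, with no spare ADD slot to fill (just a sketch, not actual compiler output):

#include <stdio.h>

/* dp3(a, b) on a scalar MAD-only pipeline: each step is already a full
 * multiply-add feeding the next one, so there's nowhere to tuck in an
 * extra independent ADD from the same instruction. */
static float dp3(const float a[3], const float b[3]) {
    float acc;
    acc = a[0] * b[0];           /* MUL  acc, a.x, b.x       */
    acc = a[1] * b[1] + acc;     /* MAD  acc, a.y, b.y, acc  */
    acc = a[2] * b[2] + acc;     /* MAD  acc, a.z, b.z, acc  */
    return acc;
}

int main(void) {
    float a[3] = {1.0f, 2.0f, 3.0f};
    float b[3] = {4.0f, 5.0f, 6.0f};
    printf("dp3 = %f\n", dp3(a, b));   /* 32.0 */
    return 0;
}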
Are the SFUs in the same unit as well then? I assume not because higher GFLOPS are claimed when they are counted in...
See above reply! :)
As in; now and till forever?
It's hard to say, so I won't pretend to know for sure. In this context, however, I like to bring up the G7x's pipeline. It had two MADs, but couldn't use both units as MAD (at least in FP32 mode!) due to the number of registers that could be fetched in one pipeline loop.

Instead, this allowed you to do the following: ADD+MAD, MAD+ADD, MUL+MAD, MAD+MUL. So they could never reach 100% utilization in FP32 mode, but in the end the tradeoff was probably quite valid for the architecture in question and more flexible than NV40's structure.

If your question can be summarized by "Does it make sense in a GPU to increase scheduling complexity to increase ALU utilization?" - then I guess the answer is that you just need to consider your total die size in both cases, and the average utilization you'd achieve. In most cases, I'd suspect the answer is definitely 'No!', though - today's complexity feels like a relatively good sweetspot to me. But we'll see, these are the kinds of things that are quite hard to predict! Especially so depending on where the industry wants to go forward in terms of branch coherence...
 
The definition of SMT you linked is a single pipeline stage simultaneously executing multiple threads (presumably each thread in a different execution unit of a superscalar architecture). Whether you consider GPUs as doing SMT kind of depends on how you chop up the shading units -- i.e. where you draw the line between multiple processors and multiple execution units within a single processor.

One way of looking at G80 is that each SP is a separate processor with two execution units (FPU/ALU and quarter-speed SFU). In this case, whether it is SMT depends on whether the FPU/ALU can be working on one thread while the SFU is still working on a previous thread. I suspect it can't issue instructions to both units in the same clock cycle, but I suspect it can issue instructions from one thread to the FPU/ALU while the SFU is still working on instructions from a previous thread. But I don't really know.. I don't think this quite meets the definition of SMT you're using, though it's marginal.

Another way of looking at G80 is that each processor cluster (8 SPs) is a single processor with 16 (effectively) FPU/ALU units and 4 (effectively) SFU units. Looked at this way, G80 is intrinsically SMT: each execution unit is *expected* and *designed* to be working on a different thread, all at the same time.

But at least in spirit GPUs are using massive SMT, even if they don't quite fit every detail of a particular CPU-oriented definition of SMT.
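If it helps, here's the 'every lane is a different thread' view written out as a toy loop (the lane counts are just the effective widths I quoted above; obviously the hardware isn't a C loop):

#include <stdio.h>

#define FPU_LANES 16   /* effective FPU/ALU width per cluster, as above */
#define SFU_LANES 4    /* effective SFU width per cluster, as above */

/* One "instruction" applied across a batch: every lane does the same MAD,
 * but each lane belongs to a different pixel/vertex thread -- which is the
 * sense in which the cluster is always running many threads at once. */
static void batch_mad(float d[], const float a[], const float b[],
                      const float c[], int lanes) {
    for (int lane = 0; lane < lanes; lane++)
        d[lane] = a[lane] * b[lane] + c[lane];   /* thread #lane's MAD */
}

int main(void) {
    float a[FPU_LANES], b[FPU_LANES], c[FPU_LANES], d[FPU_LANES];
    for (int i = 0; i < FPU_LANES; i++) { a[i] = (float)i; b[i] = 2.0f; c[i] = 1.0f; }
    batch_mad(d, a, b, c, FPU_LANES);
    printf("lane 5 result: %f\n", d[5]);   /* 5*2 + 1 = 11 */
    return 0;
}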
 
It's hard to say, so I won't pretend to know for sure. In this context, however, I like to bring up the G7x's pipeline. It had two MADs, but couldn't use both units as MAD (at least in FP32 mode!) due to the number of registers that could be fetched in one pipeline loop.
Depends:

MAD r0,r1,r2,r3
MAD r5,r1,r2,r4

should work. The limit is simply four fp32 registers (each register actually being a vec4 of fp32). I wouldn't be surprised if the use of one or more inline constants (i.e. not dynamically indexed) also helped:

MAD r0, r1, r2, c0
MAD r5, r3, r4, c1

as constants can be encoded into the instruction word.
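Here's a little checker expressing that rule of thumb (my own simplified model: count distinct fp32 register operands across the pair against a budget of four, with inline constants free because they're encoded into the instruction word):

#include <stdio.h>
#include <string.h>

#define REG_BUDGET 4   /* fp32 (vec4) register fetches per pipeline loop */

/* Count distinct register operands across two co-issued instructions.
 * Operands named c0, c1, ... are inline constants and treated as free. */
static int distinct_regs(const char *ops[], int n) {
    const char *seen[16];
    int count = 0;
    for (int i = 0; i < n; i++) {
        if (ops[i][0] == 'c') continue;   /* constant: encoded in the instruction word */
        int dup = 0;
        for (int j = 0; j < count; j++)
            if (strcmp(seen[j], ops[i]) == 0) dup = 1;
        if (!dup) seen[count++] = ops[i];
    }
    return count;
}

int main(void) {
    /* MAD r0,r1,r2,r3 + MAD r5,r1,r2,r4 -> sources {r1,r2,r3,r4} = 4, fits   */
    const char *shared[] = { "r1", "r2", "r3", "r1", "r2", "r4" };
    /* MAD r0,r1,r2,r3 + MAD r5,r4,r6,r7 -> 6 distinct sources, doesn't fit   */
    const char *disjoint[] = { "r1", "r2", "r3", "r4", "r6", "r7" };
    /* MAD r0,r1,r2,c0 + MAD r5,r3,r4,c1 -> constants free, 4 registers, fits */
    const char *constants[] = { "r1", "r2", "c0", "r3", "r4", "c1" };

    printf("shared operands:   %d regs (budget %d)\n", distinct_regs(shared, 6), REG_BUDGET);
    printf("disjoint operands: %d regs (budget %d)\n", distinct_regs(disjoint, 6), REG_BUDGET);
    printf("with constants:    %d regs (budget %d)\n", distinct_regs(constants, 6), REG_BUDGET);
    return 0;
}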

As to the question of truly SMT behaviour inside a GPU, this patent application appears to describe precisely this:

Vertex data processing with multiple threads of execution


But the catch is, this programmable unit isn't user-programmable in the strict sense. The only programs it runs are vertex attribute interpolations (vertex coordinate, vertex colour, texture coordinate). Since the attributes that need interpolating across a triangle (as it's rasterised) vary depending on what's output by the vertex shader, the driver, I assume, submits the required program.

But I've no idea if this actually appears in any GPU - hard to be sure with patent applications, lol.

Jawed
 
armchair_architect said:
But at least in spirit GPUs are using massive SMT, even if they don't quite fit every detail of a particular CPU-oriented definition of SMT.
Couldn't have put it better :)

As for whether you can *truly* co-issue SFU and MAD, I think you can, but that might not be perfectly obvious at first, because there is a preprocessing step for the SFU's data that happens in the MAD unit... So you might think the FPU is losing a cycle whenever you send something to the SFU, but it's really just computing something the SFU needs via the FPU. It could be that you are also correct, however, thus adding even more overhead - I'd need to test that specifically to be absolutely sure it is not the case.
Jawed said:
Depends:
MAD r0,r1,r2,r3
MAD r5,r1,r2,r4
should work.
Ah yes, thankies, I definitely forgot to mention that! As for constants, clearly since G7x does not really have PS constants afaik (it just recompiles the program), it would be quite surprising if that wasn't completely free indeed.
 
"SMT" types on different GPUs, in light of PS3.0 dynamic branch support:
  • NV40 - single wide and deep pipeline (variable batch size, depending on # of quads), very high-latency thread switching;
  • G70 - wide, but "partitioned" long pipes (fixed batch size, per quad), high-latency thread switching, but allows for some minor performance tweaks;
  • R520 - fine-grained pipe (four independent 128-batch "channels", per quad), very short thread switching, but insufficient processing power;
  • R580 - same as above, but with three times the resources, at the expense of tripling the thread-switching time (extending the pipe depth);
Someone correct me or add more info. ;)
 
As far as I can tell they're both the same kind of SMT, and the difference is really the out-of-order or in-order nature of the architecture.
If you look at the pure implementation, in 2-way hyperthreading like the P4, you're going to have 2 copies of the processor state along the pipeline, but only 1 copy of the non-state logic (= combinational logic), with muxes that select between copy 1 and copy 2. In essence, you really create 2 virtual instances of the CPU.

As long as 1 copy doesn't stall, you let it flow. When it stalls, you switch from one copy to the next. As armchair_architect pointed out, this means that at all points in time, you'll have 2 threads in flight at all locations along the pipeline.

In CMT, this is not the case: here you maintain a list of threads and automatically schedule from one thread to the next when a certain thread stalls or after a particular time slice.
In this case, you will still have multiple threads in the pipeline, due to the pipelined nature of the processor, but at each stage along the pipeline, you only have 1 piece of thread state.

The SMT case is not very scalable: it's a hack to increase utilization in case of stalls, but only doable for very low multiples (usually 2), because you'll soon run into timing paths due to the multiplexers themselves. This is not the case for CMT, but the granularity of CMT is higher and its ability to fill up little bubbles consequently lower.

A GPU is very similar to the CMT case, except that they do this for multiple threads in parallel. I don't think it will ever make sense to use SMT for a GPU.
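To spell the policy I described out in code terms (a sketch, not a model of any real core -- the state copies and stall flags are made up):

#include <stdio.h>

#define HW_THREADS 2   /* hyperthreading-style: two copies of architectural state */

/* The per-thread architectural state: this is the part an SMT core duplicates.
 * The combinational logic is shared, and a mux picks which copy feeds it. */
struct arch_state {
    unsigned pc;
    unsigned regs[8];
    int stalled;        /* e.g. waiting on a cache miss */
};

/* One fetch clock: keep running the current copy while it flows; when it
 * stalls, the mux simply selects the other copy, so the shared logic stays
 * busy as long as at least one thread is ready. */
static int select_thread(const struct arch_state t[], int current) {
    if (!t[current].stalled)
        return current;
    return (current + 1) % HW_THREADS;
}

int main(void) {
    struct arch_state t[HW_THREADS] = {{0}};
    int cur = 0;
    t[0].stalled = 1;   /* thread 0 just missed in the cache */
    for (int clk = 0; clk < 3; clk++) {
        cur = select_thread(t, cur);
        printf("clock %d: fetch from thread %d (pc=%u)\n", clk, cur, t[cur].pc);
        t[cur].pc++;
    }
    return 0;
}

A CMT-style scheduler would look the same except with many more entries in that array and a coarser switch policy (time slice or long-latency event) -- which is basically what the GPU batch machinery does, many times over in parallel.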

Edit: Clearly, I'm taking a more CPU-oriented purist view about SMT <> CMT than others in this thread. ;)

Edit 2: Ok, I see the reason for the disparity. On Intel, I believe SMT stood for Symmetric Multi-Threading. In the Wikipedia article, it's simultaneous multi-threading, which is then divided in 2 categories, like the ones I described above.
 
"SMT" types on different GPUs, in light of PS3.0 dynamic branch support:
  • NV40 - single wide and deep pipeline (variable batch size, depending on # of quads), very high-latency thread switching;
  • G70 - wide, but "partitioned" long pipes (fixed batch size, per quad), high-latency thread switching, but allows for some minor performance tweaks;
  • R520 - fine-grained pipe (four independent 128-batch "channels", per quad), very short thread switching, but insufficient processing power;
  • R580 - same as above, but with three times the resources, at the expense of tripling the thread-switching time (extending the pipe depth);
Someone correct me or add more info. ;)

R520->R580 was extending the pipe width, not the depth. It's still x number of clocks, but 3x the number of pixel pipelines, so 3x the number of pixels per batch.
 
R520->R580 was extending the pipe width, not the depth. It's still x number of clocks, but 3x the number of pixel pipelines, so 3x the number of pixels per batch.
Actually -- from the Thread Dispatcher's point of view -- it is precisely extending the depth of the batch, by increasing the number of successive clock ticks, so that the burst of fragments matches the extended width of the shader core. ;)
 
I've just re-read the G80 architecture article, and noticed that the SFUs are mentioned completely separately. Furthermore, it says that each special-function operation takes 4 clock cycles. Is that execution clocks, latency, or can they only be issued once every 4 clock cycles? I sometimes see people mention 518 GFLOPS for G80 at 1.35 GHz, which I assumed adds in the SFUs.

I'm trying to understand the limitations (if any) of this architecture, to have an idea of what might lie ahead...
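If I do the arithmetic myself (assuming the 518 figure counts a MAD as two flops plus a co-issued MUL as a third, rather than the SFU transcendentals -- that's my guess at the counting, so correct me if it's wrong):

#include <stdio.h>

int main(void) {
    const double sps       = 128;     /* streaming processors on G80 */
    const double hot_clock = 1.35e9;  /* shader clock in Hz */

    double mad_only     = sps * 2 * hot_clock;  /* MAD = mul + add = 2 flops */
    double mad_plus_mul = sps * 3 * hot_clock;  /* plus a co-issued MUL      */

    printf("MAD only:  %.1f GFLOPS\n", mad_only / 1e9);      /* ~345.6 */
    printf("MAD + MUL: %.1f GFLOPS\n", mad_plus_mul / 1e9);  /* ~518.4 */
    return 0;
}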
 