Do scalar pipelines need to be bigger?

KimB

Just a thought I had...if you're going to be doing SIMD on blocks of pixels, why should scalar pipelines need to be any bigger than vector pipelines? You might be able to have them take up the same amount of space by just working on more pixels per block.
 
You might be able to have them take up the same amount of space by just working on more pixels per block.
There you go, answering your own question ;)
Now, of course, in theory, that'd quadruple the number of threads in flight in the ALUs at a given time, compared to a Vec4 implementation, so there'd be some overhead there. As I love to repeat, though, there is a very simple way to "fix" that, but people would most likely kill me if I said "the word"...


Uttar
 
There you go, answering your own question ;)
Now, of course, in theory, that'd quadruple the number of threads in flight in the ALUs at a given time, compared to a Vec4 implementation, so there'd be some overhead there. As I love to repeat, though, there is a very simple way to "fix" that, but people would most likely kill me if I said "the word"...
Well, yeah, it requires wider blocks. But since the basic unit is a quad, 16 scalar ALUs grouped into one block would be a natural size.
 
Now, of course, in theory, that'd quadruple the number of threads in flight in the ALUs at a given time, compared to a Vec4 implementation, so there'd be some overhead there. As I love to repeat, though, there is a very simple way to "fix" that, but people would most likely kill me if I said "the word"...
I don't think it does, because vector instructions take longer to complete.

The number of threads in flight is only there to hide data latency (instruction latency is over an order of magnitude less). You'd only quadruple the number of threads if you want to get data from the texture units between each scalar instruction, which is silly. Do 4 scalar instructions instead.

Take Xenos for example, but let's assume it was vec3+scalar. It works on groups of 64 pixels or vertices, and the shader arrays complete vec3+scalar operations on 16 quads at a time for 4 clocks, then switch to another group (i.e. batch). Imagine instead that the arrays operate on all 64 pixels at a time, but one channel at a time for 4 clocks. You pretty much get the same thing, except you have an opportunity to change the instruction each clock. The only issue is data dependency, because each instruction has latency, so you may need to switch batches more frequently if you want the full efficiency gain.
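
If it helps to picture it, here's a rough Python sketch of the two issue orders (the 64-pixel batch and the 4-clock cadence are the ones described above; everything else is made up purely for illustration):

Code:
# Toy comparison of the two issue orders described above (illustrative only;
# the batch size and cadence come from the post, nothing here is real Xenos code).
QUADS_PER_CLOCK = 4   # 16 pixels per clock in either scheme

def vec4_schedule(instructions):
    # AOS-style: one 4-channel instruction applied to 4 quads per clock, 4 clocks per batch.
    for instr in instructions:
        for clock in range(4):
            print(f"{instr} (xyzw) on quads {clock*QUADS_PER_CLOCK}-{clock*QUADS_PER_CLOCK+3}")

def scalar_schedule(instructions):
    # SOA-style: all 64 pixels each clock, one channel at a time; the instruction
    # is free to change every clock.
    for instr in instructions:
        for channel in "xyzw":
            print(f"{instr}.{channel} on all 64 pixels")

vec4_schedule(["MUL", "ADD"])
scalar_schedule(["MUL", "ADD"])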

Chalnoth, I agree with you, and that's why all that instruction packing nonsense in the R600 thread seems a bit far-fetched to me. However, there probably are some issues with register access and routing that require more silicon. Also, the dependency issue would be important for dot-products via scalar MADD, so there could be more silicon there.
 
The number of threads in flight is only there to hide data latency (instruction latency is over an order of magnitude less). You'd only quadruple the number of threads if you want to get data from the texture units between each scalar instruction, which is silly. Do 4 scalar instructions instead.
Certainly; however, if in a Vec4 architecture the instruction latency is 10 cycles and the data latency is 200 cycles, the total latency to hide would be 210 cycles. With a scalar architecture it becomes 240 cycles, a ~14% increase. While that might be quite small, it's still not negligible.
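
(The arithmetic, spelled out, with those assumed numbers:)

Code:
# The latency figures above, spelled out (all numbers are the assumed ones).
DATA_LATENCY, INSTR_LATENCY = 200, 10
vec4_hide   = DATA_LATENCY + INSTR_LATENCY      # 210 cycles to hide
scalar_hide = DATA_LATENCY + 4 * INSTR_LATENCY  # 240 cycles to hide
print((scalar_hide - vec4_hide) / vec4_hide)    # ~0.14 -> the ~14% above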

You can get that to work for you, instead of against you, by doing a very basic form of out-of-order execution directed by the driver/compiler. Simply have the compiler add "checkpoints" where all previous instructions must have completed before continuing. That way, if you had a Vec4 MUL, such a checkpoint could be added after the 4 instructions, and you'd have 4 instructions of the same thread/batch running in the same ALU at the same time. It's easy to see how, if you have two Vec4 instructions that can be scheduled independently of the MUL, that alone could reduce the necessary latency-hiding power by 80 cycles in the scenario above.
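
To show what I mean, a minimal sketch of such a compiler pass in Python (the 10-cycle latency and the instruction format are just the assumptions from above, not anything any real driver does):

Code:
# Hypothetical checkpoint insertion (a sketch of the idea above, not a real compiler).
# Instructions are (op, dest, src1, src2); the ALU latency is assumed.
ALU_LATENCY = 10  # cycles, per the 10-cycle scenario above

def insert_checkpoints(instrs):
    in_flight = {}            # dest register -> issue slot (assume 1 issue per cycle)
    out, slot = [], 0
    for op, dest, a, b in instrs:
        # If a source was written too recently to be complete, emit a checkpoint:
        # "all previous instructions must have completed before continuing".
        if any(s in in_flight and slot - in_flight[s] < ALU_LATENCY for s in (a, b)):
            out.append("CHECKPOINT")
            in_flight.clear()
        out.append(f"{op} {dest}, {a}, {b}")
        in_flight[dest] = slot
        slot += 1
    return out

# A Vec4 MUL expanded to scalars, then a dependent scalar ADD:
prog = [("MUL", "r0.x", "r1.x", "r2.x"), ("MUL", "r0.y", "r1.y", "r2.y"),
        ("MUL", "r0.z", "r1.z", "r2.z"), ("MUL", "r0.w", "r1.w", "r2.w"),
        ("ADD", "r3.x", "r0.x", "r2.x")]
print("\n".join(insert_checkpoints(prog)))  # checkpoint lands after the 4 MULs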

Now, the big question is if G80 implements this, and I honestly don't know. I wouldn't be surprised if it didn't, but it'd certainly make sense if it did; and a very basic implementation (KISS!) of this wouldn't be too complex.


Uttar
 
Just a thought I had...if you're going to be doing SIMD on blocks of pixels, why should scalar pipelines need to be any bigger than vector pipelines? You might be able to have them take up the same amount of space by just working on more pixels per block.

Are you talking transistor real estate, or what? A SIMD implementation is always going to be more compact regardless, since you don't have to have redundant issue logic for each scalar ALU. That's been the whole point of SIMD for ages (at the cost of flexibility in data structure requirements).
 
archie, G80's "scalar" processors are SIMD.

They do the same scalar instruction on one channel of many pixels (32 pixels per batch) instead of the same vector instruction on all 4 channels of the pixels in a quad. From what DemoCoder said in the other thread, the processors are 8-wide SIMD, with 2 groups of these in each cluster.
 
Now, the big question is if G80 implements this, and I honestly don't know. I wouldn't be surprised if it didn't, but it'd certainly make sense if it did; and a very basic implementation (KISS!) of this wouldn't be too complex.


Uttar


Wouldn't the thread execution manager take care of that though?
 
Uttar, that 14% difference is certainly smaller than "quadrupling the number of threads in flight".

I'm pretty sure that the compiler talk you speak of has been around since R300. If you want to take advantage of co-issue, you really need to be able to separate independent instruction streams from a shader.

Razor1, it's a pretty simple software task to resolve dependencies in that way. The hardware will take care of dynamic changes.
 
I forget - what's the thinking behind such a configuration again? What's the advantage vs 16-wide SIMD per cluster?
I'm not really sure. Maybe it helps flexibility in some way.

Assume instruction latency is 8 cycles. The 8-wide units would only have to alternate between two 32-pixel batches to completely eliminate dependency issues. It would spend 4 cycles in each batch, and by then the first instruction's result has appeared. A single 16-wide unit would have to alternate between 4 batches, switching every two cycles. This might be a bit more expensive to do, and/or less amenable to high-speed pipelining.
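
In numbers, under those assumptions (pure back-of-envelope, nothing measured):

Code:
# How many batches a SIMD unit must rotate through to hide ALU latency, assuming
# back-to-back dependent scalar instructions (numbers from the paragraph above).
def batches_needed(batch_pixels, simd_width, instr_latency):
    cycles_per_instr = batch_pixels // simd_width   # cycles spent per batch per instruction
    return -(-instr_latency // cycles_per_instr)    # ceiling division

print(batches_needed(32, 8, 8))    # -> 2: two 32-pixel batches, 4 cycles each
print(batches_needed(32, 16, 8))   # -> 4: four batches, switching every 2 cycles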

I'm just guessing, really.
 
Uttar, that 14% difference is certainly smaller than "quadrupling the number of threads in flight".
Obviously; I'm not sure I presented it as something quite that dramatic. I did say "that'd quadruple the number of threads in flight in the ALUs at a given time", but heh :)
I'm pretty sure that the compiler talk you speak of has been around since R300. If you want to take advantage of co-issue, you really need to be able to separate independent instruction streams from a shader.
The compiler already had to do it for co-issue to a certain extent, but I would assume that the hardware worked on what basically amounts to instruction "blocks" that included descriptions of what each unit had to do in a given cycle. A relatively good proof of that lies in R300's instruction limits for DX9; ATI claimed the limit was higher than the specification, because they could handle both 96 Vec3 instructions AND 96 scalar instructions. So you'd clearly assume they were stored together... And anyway, if that wasn't the case, you'd basically have an OoOE engine in hardware...

A similar reasoning holds true for NV30, NV40 and G70; given the non-decoupled nature of texture processing, I can't even see how such an idea is theoretically possible. On the other hand, there's nothing preventing R5xx from doing this. Hmm.


Uttar
 
archie, G80's "scalar" processors are SIMD.

They do the same scalar instruction on one channel of many pixels (32 pixels per batch) instead of the same vector instruction on all 4 channels of the pixels in a quad. From what DemoCoder said in the other thread, the processors are 8-wide SIMD, with 2 groups of these in each cluster.

I think the actual configuration is err, configurable somehow. A TCP only has 16 SPs, and if all 16 SPs had to be always working on the same instruction, the G8x would have "unified shading" at the TCP level only, being able to toggle TCPs between doing vertex work vs pixel work.

Apparently though, this is wrong, and within a given clock cycle a single G8x TCP can be executing a maximum of *two* thread types: either PS and VS, or VS and GS (and maybe PS and GS, dunno about that combo). This explains the subtle "grouping" you see on NVidia's architectural diagrams of the TCP, where the 16 SPs are grouped into 2 clusters of 8 within a TCP.

So, it seems a given TCP can be executing either 4 quads' worth of shaders (16-wide), or 2 quads' worth of shaders (8-wide) plus 8 SPs allocated to something else (VS or GS). It is unknown whether you could have two separate PS threads running within a single TCP, each dealing with a separate instruction.

I think it is worth pointing out that while one could consider the G8x to be SIMD (what do you call Xenos then, super-SIMD? SIMD-extreme? Seems like any GPU with a batch size > 1 would be labeled SIMD then), each has different granularity issues. Threads have local storage context, and there are definite differences between allocating scalar registers to thread contexts and allocating vectorized registers. You need extra hardware to implement swizzling and replication which you do not need on the G8x, for example.

Also, arranging independent instruction streams to optimally pack the ALUs and registers is much more complicated than simply grouping quads into batches of 2 or 4, all sharing the same PC, but being dispatched to 8-16 SPs.

Fragments are independent by design of the shader model, so there is no need to try to extract ILP from interference-free sequences of instructions for packing. I would say that a scalar design actually fits the "embarrassingly parallel" nature of the shader model a little better, by running more threads over more clock cycles instead of fewer threads over fewer clock cycles with wider registers.

Performance-wise you may be able to make the two equivalent, but implementation-wise, scalar seems a lot simpler to implement.
 
Performance-wise you may be able to make the two equivalent, but implementation-wise, scalar seems a lot simpler to implement.
I think it's a combination of both hardware-level and driver-level support for instruction packing. If G80's instruction scheduler (driver side) doesn't implement checkpointing or a small sliding instruction window (less likely) before asking the hardware to do work, I'd be pretty surprised at this point.

As for the thread mix per cluster, I'm not sure why any thread type combo isn't possible on any given cycle.
 
I think it is worth pointing out that while one could consider the G8x to be SIMD (what do you call Xenos then, super-SIMD? SIMD-extreme? Seems like any GPU with a batch size > 1 would be labeled SIMD then), each has different granularity issues. Threads have local storage context, and there are definite differences between allocating scalar registers to thread contexts and allocating vectorized registers. You need extra hardware to implement swizzling and replication which you do not need on the G8x, for example.
Of course all GPUs are SIMD. Single instruction, multiple data, right?

Swizzling is just an extra stage of data routing, but because G80's SPs can access any channel of any register for more pixels, data routing complexity is probably roughly equal. Consider access to 10 float4 registers by a shader (640 bytes per quad of pixels), and for comparison's sake we'll keep it fixed between G71 and G80. For each operand, a G71 shader quad can select one of ten 64-byte blocks of data in one stage, then for each channel select one of four 16-byte blocks in the next stage. A G80 cluster must select one of forty 64-byte blocks (2560 bytes per 16 pixels). I'm not sure which is more expensive, but I think it depends on how everything is implemented.
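
For reference, the byte counts above spelled out (still assuming the 10-register shader):

Code:
# The byte counts from the comparison above, spelled out (10 float4 registers assumed).
REGS, FLOAT4_BYTES = 10, 16
per_pixel = REGS * FLOAT4_BYTES              # 160 bytes of registers per pixel
print("per quad:", per_pixel * 4)            # 640 bytes  (G71 view: one quad of pixels)
print("per 16 pixels:", per_pixel * 16)      # 2560 bytes (G80 view: 16 pixels in a cluster)
# G71 operand select: one of ten 64-byte blocks (one register across a quad),
# then one of four 16-byte channel sub-blocks.
# G80 operand select: one of forty 64-byte blocks (one scalar channel across 16 pixels).
print("G80 blocks:", (per_pixel * 16) // 64) # 40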

Also, arranging independent instruction streams to optimally pack the ALUs and registers is much more complicated than simply grouping quads into batches of 2 or 4, all sharing the same PC, but being dispatched to 8-16 SPs.
You referring to the rumours of how R600 works? I'll believe it when I see it, because it seems like a dumb way of increasing utilization given the obvious SOA-esque method of G80.

EDIT: Fixed SIMD. :oops: (thanks Simon)
 
Perhaps someone can better explain why the G80 needs to care about whether something is a vec4 or not, and what the checkpointing would accomplish.

AFAIK, the G80 driver erases any semblance of the code being vectorized before feeding it to the HW. Whether 4 instructions were generated via a vec4 MUL, or just 4 scalar MULs added by the developer, is immaterial.

Maybe the way it's being described is unclear.

Let's say you've got the following code:

Code:
MUL R0, R1, R2
ADD R3, R0, R2

The compiler is going to turn that into

Code:
MUL R0x, R1x, R2x
MUL R0y, R1y, R2y
MUL R0z, R1z, R2z
MUL R0w, R1w, R2w

ADD R3x, R0x, R2x
ADD R3y, R0y, R2y
ADD R3z, R0z, R2z
ADD R3w, R0w, R2w

Now, if, say, MUL latency is 8 cycles, then of course the ADD's dependency on R0x would incur a wait of 4 cycles (the MUL on R0x issues at cycle 0, the ADD reaches issue at cycle 4 after the y/z/w MULs, and the result isn't ready until cycle 8). Now, we could posit an uber-compiler that knows all deterministic instruction latencies (i.e. everything except TEX and I/O) and would try to schedule some non-dependent instructions between the last MUL and the first ADD (besides NOPs), but why? The great thing about uberthreading is that we don't need to commit many resources to exotic compiler schemes.

Moreover, I don't even necessarily think you need the compiler to put an instruction there saying "Warning, you can't execute this ADD because you are still waiting for MUL #1". Why not just arrange the # of threads so that they cover the worst-case instruction latency? So if you have a latency of 8 cycles, and you've got 8 SPs to handle, just make sure you have at least 64 threads to work on; then you can even handle cases like

Code:
MUL r1x, r0x, r0x
MUL r2x, r1x, r1x
MUL r3x, r2x, r2x

The 'checkpoint' proposal seems to suggest that fully scalar (non-vec4) code like the above would have compiler-generated checkpoint bits like so:

Code:
MUL r1x, r0x, r0x
*checkpoint, please do not progress until this MUL is done*
MUL r2x, r1x, r1x
*checkpoint, please do not progress until this MUL is done*
MUL r3x, r2x, r2x
*checkpoint...*

But I really don't think it helps the HW, and seems to complicate the design. Maybe I'm missing the purported benefits.
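
To put numbers on the "just have enough threads" point above, here's a crude model (8 SPs and an 8-cycle latency as above; purely illustrative, not a description of the real scheduler):

Code:
# Back-of-envelope check of "just have enough threads" (assumed numbers, crude model).
SPS, LATENCY = 8, 8

def covers_latency(threads):
    # Round-robin: each cycle one instruction is issued for a group of SPS threads.
    # A group's next (dependent) instruction can issue once LATENCY cycles have passed,
    # which holds as long as there are at least LATENCY groups in the rotation.
    return threads // SPS >= LATENCY

print(covers_latency(64))  # True:  8 groups -> even back-to-back dependent MULs never stall
print(covers_latency(32))  # False: 4 groups -> a chain of dependent MULs would stall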
 
Checkpointing (if that's even a word) in this instance would also have to take into account multiple thread types and ALU pipelining, when helping the hardware schedule threads. Dropping in a checkpoint on dependent instructions for one thread type like that is just automatic, surely?
 
I'm not sure which is more expensive, but I think it depends on how everything is implemented.

I don't think you're considering replication. With the G80, it's resolved via copy propagation. Write masks are also handled by simply not emitting an instruction for that channel (dead code elimination). There is a definite saving.

Effectively, we're talking about whether FLOAT[N] is cheaper than FLOAT[N/4][N%4]. N needs the same number of addressing bits, sure. But one stage plus no replication or write masks looks a lot simpler to me. The G71/R5xx, for example, would need a compiler to detect write masks, mark those components as "unused", try to pack other independent ops there for co-issue, and then untangle everything back again. It's not going to be as efficient.
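
In toy form, the two views look like this (hypothetical layout, just to show the index arithmetic):

Code:
# Toy version of the two register-file views (hypothetical layout, indexing only).
N = 40                                        # e.g. 10 vec4 registers = 40 scalar slots
flat = [0.0] * N                              # FLOAT[N]: one-stage select, no swizzle stage
vec4 = [[0.0] * 4 for _ in range(N // 4)]     # FLOAT[N/4][4]: register select, then channel

def read_flat(i):
    return flat[i]                            # single index

def read_vec4(i):
    return vec4[i // 4][i % 4]                # two-stage: block i//4, then channel i%4

# Same number of addressable slots and index bits either way; the difference is the
# extra swizzle/replication/write-mask plumbing the vec4 view drags in.
print(read_flat(13), read_vec4(13))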
 
Checkpointing (if that's even a word) in this instance would also have to take into account multiple thread types and ALU pipelining, when helping the hardware schedule threads. Dropping in a checkpoint on dependent instructions for one thread type like that is just automatic, surely?

Well, outside of I/O and flow control instructions, which have indeterminate latency, why would you need to do this, when you can just cover the latency of contiguous sequences of ALU instructions with threads? Look at ATI's CTM, for example: you have checkpoints (wait semaphores) only when you have dependencies between mixed instruction types (TEX->ALU, ALU->flow control, etc.). This makes sense to me, since you can't statically schedule between those mixed types. However, with a contiguous group of ALU instrs, you can.

Thus, I can see the compiler inserting checkpoints (IMHO, "condition variables" is more accurate, as they are synchronization primitives) between I/O and ALU clauses, or between flow control and other clauses. But between groups of 4 scalars just because the original source code they came from was a vec4 MUL? I still don't understand why the GPU should care whether the original *source code shader* MUL was 1D/2D/3D/4D/etc.

Putting checkpoints between high level SM4.0 MULs seems quite arbitrary to me.
 