Do scalar pipelines need to be bigger?

IMHO you're making this much too complicated.

Let’s assume that every thread has an instruction pointer and a state flag. When a thread enters the shader unit, the IP is zero and the flag says ready. Every time the scheduler needs a new thread, it looks in its thread pool for ready threads and selects one. The thread is then moved to the execution unit and the flag is changed. The scheduler will not touch this thread again until it is back from the execution unit and the flag is set back to ready. This saves all the synchronization work.
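To make that concrete, here's a toy Python sketch of exactly that scheme (all names are made up for illustration; this isn't any real hardware's logic):

Code:
from dataclasses import dataclass

@dataclass
class Thread:
    ip: int = 0        # instruction pointer: zero when entering the shader unit
    ready: bool = True # state flag: ready until handed to the execution unit

def pick_thread(pool):
    """Scheduler: scan the pool for a ready thread and hand it to execution."""
    for t in pool:
        if t.ready:
            t.ready = False  # scheduler won't touch it until it comes back
            return t
    return None  # nothing ready this cycle

def retire(t):
    """Execution unit hands the thread back; flag returns to ready."""
    t.ip += 1
    t.ready = True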
 
I don't think you're considering replication. With the G80, it's resolved via copy propagation.
It's still just the inverse of what I was talking about with swizzling. Writing scalars to 16 pixels has a larger number of register destination possibilities than vectors to 4 pixels. Just as with swizzling, I don't think it's clear you're saving much.

Write masks are likewise handled by simply not emitting an instruction for the masked channel (dead code elimination). There are definite savings there.
In terms of speed, yeah. I totally agree that going scalar gives you a speed boost. In terms of hardware, though, it's almost nothing.
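To make the write-mask point above concrete, here's a toy sketch of how a compiler might expand a vec4 instruction into scalars, emitting nothing at all for masked channels; the instruction format is invented for illustration:

Code:
def expand_vec4(op, dst, srcs, write_mask):
    """Emit one scalar instruction per enabled channel of the write mask."""
    scalar_ops = []
    for ch in "xyzw":
        if ch not in write_mask:
            continue  # masked channel: emit nothing at all (dead code elimination)
        scalar_ops.append((op, f"{dst}.{ch}", [f"{s}.{ch}" for s in srcs]))
    return scalar_ops

# e.g. a MUL with an .xz mask becomes just two scalar MULs, not four:
print(expand_vec4("MUL", "r0", ["r1", "r2"], "xz"))
# [('MUL', 'r0.x', ['r1.x', 'r2.x']), ('MUL', 'r0.z', ['r1.z', 'r2.z'])]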
 
Perhaps someone can better explain why the G80 needs to care about whether something is a vec4 or not, and what the checkpointing would accomplish.
I'm going to agree with you that it really doesn't do anything between arithmetic instructions. I assume they designed the thread handling so that the result of a scalar instruction is available by the time the SPs attack that group of pixels again for the next instruction. Dependencies won't be an issue then.
 
This saves all the synchronization work.
Synchronization doesn't have to be expensive. It's a way to save transistors by minimizing the number of threads in flight, which improves performance for a given number of registers and triangles in the shader core's pipelines at any single time.
Let’s assume that every thread has an instruction pointer and a state flag. When a thread enters the shader unit, the IP is zero and the flag says ready.
As a very simple extension of this scheme, consider the addition of a per-instruction flag. If this flag is not set, the thread will remain ready. This is, of course, not the best way to implement the scheme, but it shows you can have a very, very cheap implementation that already improves things slightly. Of course, a much better way to illustrate my point might just be to link the proper patent ;) The bolding is mine.
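In toy Python, that extension is tiny; this builds on the Thread sketch earlier in the thread, and the flag polarity follows the description above (flag clear: the thread stays ready):

Code:
from dataclasses import dataclass

@dataclass
class Instr:
    op: str
    flag: bool  # per-instruction flag: set means "wait for my result"

def pick_instruction(pool):
    """pool: list of (Thread, program) pairs, reusing Thread from above."""
    for t, program in pool:
        if not t.ready or t.ip >= len(program):
            continue
        instr = program[t.ip]
        t.ip += 1
        if instr.flag:
            t.ready = False  # classic case: thread waits for the result
        # flag not set: the thread remains ready and can issue again next cycle
        return t, instr
    return None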

http://aiw1.uspto.gov:80/.aiw?Docid=20050138328&PageNum=&Rtype=&SectionNum=&idkey=FE4F7A87B524
"Across-thread out of order instruction dispatch in a multithreaded graphics processor"
Inventors: Moy, Simon S.; (Los Altos, CA) ; Lindholm, John Erik; (Saratoga, CA)
Filed: December 18, 2003
[0006] However, round-robin issue does not always hide the latency. For example, pixel processing programs often include instructions to fetch texture data from system memory. Such an instruction may have a very long latency (e.g., over 100 clock cycles). After a texture fetch instruction is issued for a first thread, the issue control circuit may continue to issue instructions (including subsequent instructions from the first thread that do not depend on the texture fetch instruction) until it comes to an instruction from the first thread that requires the texture data. This instruction cannot be issued until the texture fetch instruction completes. Accordingly, the issue control circuit stops issuing instructions and waits for the texture fetch instruction to be completed before beginning to issue instructions again. Thus, "bubbles" can arise in the execution pipeline, leading to idle time for the execution units and inefficiency in the processor.

[0007] One way to reduce this inefficiency is by increasing the number of threads that can be executed concurrently by the core. This, however, is an expensive solution because each thread requires additional circuitry. For example, to accommodate the frequent thread switching that occurs in this parallel design, each thread is generally provided with its own dedicated set of data registers. Increasing the number of threads increases the number of registers required, which can add significantly to the cost of the processor chip, the complexity of the design, and the overall chip area. Other circuitry for supporting multiple threads, e.g., program counter control logic that maintains a program counter for each thread, also becomes more complex and consumes more area as the number of threads increases.

[0008] It would therefore be desirable to provide an execution core architecture that efficiently and effectively reduces the occurrence of bubbles in the execution pipeline without requiring substantial increases in chip area.
16. A method for processing instructions in a microprocessor configured for parallel processing of a plurality of threads, wherein each thread includes a sequence of instructions, the method comprising: fetching a first instruction from a first one of the plurality of threads into an instruction buffer configured to store an instruction from each of the plurality of threads; subsequently fetching a second instruction from a second one of the plurality of threads into the instruction buffer; determining whether one or more of the first instruction and the second instruction is ready to execute; and issuing a ready one of the first instruction and the second instruction for execution, wherein the second instruction is issued prior to issuing the first instruction in the event that the second instruction is ready to execute and the first instruction is not ready to execute.
[0061] FIG. 5 is a simplified block diagram of a dispatch circuit 140 according to an embodiment of the present invention. Dispatch circuit 140 includes a scoreboard circuit 502, a scheduler 504, and an issue circuit (or issuer) 506. Scoreboard circuit 502, which may be of generally conventional design, reads each of the (valid) instructions in buffer 138. For each instruction, scoreboard circuit 502 checks register file 144 to determine whether the source operands are available. Scoreboard circuit 502 generates a set of ready signals (e.g., one bit per thread) indicating which instructions in buffer 138 are ready to be executed, i.e., have their source operands available in register file 144. Scheduler 504 receives the ready signals from scoreboard 502 and the valid signals from buffer 138 and selects a next instruction to dispatch. The selected instruction is dispatched to issuer 506, which issues the instruction by forwarding it to execution module 142. The thread identifier of the thread to which the selected instruction belongs may also be forwarded to issuer 506 and/or execution module 142, e.g., to enable selection of the appropriate registers for the source operands and result data.
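To make [0061] concrete, here's a toy model of that dispatch circuit; the scoreboard is reduced to a set of pending destination registers, and everything else is simplified guesswork:

Code:
def dispatch(buffer, pending_regs):
    """buffer: one slot per thread, each None or (thread_id, op, dst, srcs).
    pending_regs: set of destination registers still being written."""
    # Scoreboard: one ready bit per slot (operands available in the register file)
    ready = [slot is not None and not any(s in pending_regs for s in slot[3])
             for slot in buffer]
    for i, ok in enumerate(ready):
        if ok:
            tid, op, dst, srcs = buffer[i]
            buffer[i] = None          # slot frees up for the next fetch
            pending_regs.add(dst)     # cleared again at writeback (not modelled)
            return tid, op            # issue, regardless of thread order
    return None  # every buffered instruction is waiting on operands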
It is also interesting to note that this scheme benefits from the compiler being aware of it.

Anyway, before anyone tries to tell me this most likely isn't used in the G80, may I point out that Lindholm is the lead architect of the G80 shader core, and has a number of patents with him as the sole inventor on unified architectures? Obviously, a scheme without such an optimization would be simpler to implement. It would also arguably be more elegant to describe. And it might even "feel" better in some people's minds. But when designing such an architecture, elegance is of very little importance. If you must sacrifice some simplicity to reduce your die size by 5% for a given performance level, then you just do it, no questions asked.


Uttar
 
Dammit, Uttar, I was just about to post that :) *Both* the per-instruction ready-flag *and* the patent. :sigh:

Only one more question left, then. Are we really sure that the compiler/driver transforms Vec instructions to scalars? It would seem advantageous to keep them as Vec instructions for two reasons: first, when you find a unit of work, you have to think a lot less hard to find your next unit of work; and second, you use less instruction cache.

Consider the work that the scheduler has to do (taking that patent from above into account). First you have to check whether the next instruction is in your instruction cache. If not, you need to issue a fetch, and then try a different batch. Once you find a batch, you need to check whether its operand registers are ready or not. If not, you need to find a different batch. That's a bunch of speculative work -- once you find an operation, it'd sure be nice if it were more than just a scalar op; otherwise all of that work has to happen again on the next clock (and that train of thought is where the per-instruction ready flag idea arose -- if you can indicate that the next instruction is independent of preceding instructions [up to some pipeline depth], you can reduce the speculative work of the scheduler).
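In toy Python, the speculative loop I'm describing looks something like this (every structure and name here is guesswork):

Code:
def find_work(batches, icache, operands_ready, request_fetch):
    """batches: dicts with a "pc" key; icache: dict pc -> instruction;
    operands_ready / request_fetch: callables standing in for real logic."""
    for b in batches:
        instr = icache.get(b["pc"])
        if instr is None:
            request_fetch(b["pc"])  # kick off the fetch, try another batch
            continue
        if not operands_ready(instr):
            continue                # operand registers not ready yet
        return b, instr             # all that searching, for one scalar op
    return None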

That could be why you have two schedulers per cluster. Each scheduler might allocate all 16 units if it finds enough work, allowing the other scheduler to continue looking. If both have work, they can split the ALUs. [My assumption behind 32 vs. 16 is that it's a memory coherence optimization -- a little extra logic ought to be able to reactively reduce batch-width from 32 to 16 under extremely long, divergent branches -- I don't know if that wins a lot, though.]
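As a purely speculative toy, that reactive split might look like this (positional halves, made-up interface):

Code:
def maybe_split(batch, taken_mask, min_width=16):
    """batch: list of pixels/threads; taken_mask: one bool per element."""
    taken = sum(taken_mask)
    divergent = 0 < taken < len(batch)
    if divergent and len(batch) > min_width:
        half = len(batch) // 2
        return [batch[:half], batch[half:]]  # two independently schedulable halves
    return [batch]  # coherent (or already narrow): keep the full batch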
 
The mechanism being described above, though, is just a traditional scoreboarding technique, not a "checkpoint". That's why I was confused. There is no need for the compiler to do anything with scoreboarding; the HW handles everything.

NVidia might be using scoreboards or "done" bits, but this is different from (and somewhat orthogonal to) the original claim of some kind of compiler-generated 'checkpoint' being inserted after each high-level vector operation in the source.
 
Obviously; though I'm not sure I presented it as something quite that dramatic. I did say: "that'd quadruple the number of threads in flight in the ALUs at a given time", but heh :)
I figured you'd reply with that.

The thing is that if threads are in flight, they're in flight. There's no special distinction for in-flight "in the ALUs". Even if a cluster isn't executing a texture instruction on a particular batch of pixels at a particular time, you still have the same number of threads in flight.
 
DemoCoder said:
There is no need for the compiler to do anything with scoreboarding; the HW handles everything.
The HW might still benefit from having a compiler that's aware of it; if the scheme used is the "two instructions checked" one, it's easy to see how a naive compiler would be suboptimal for, say, dot products.
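To illustrate with a hypothetical scalar lowering of a dp4 (register names and mnemonics invented): the naive ordering is one long dependency chain, so a dispatch window that only looks a couple of instructions ahead never finds anything ready, whereas an aware compiler interleaves independent work between the chain links.

Code:
naive = [
    "MUL t, r1.x, r2.x",
    "MAD t, r1.y, r2.y, t",    # depends on the MUL just above
    "MAD t, r1.z, r2.z, t",    # depends on the MAD just above
    "MAD r0.x, r1.w, r2.w, t", # and so does this one
]

interleaved = [
    "MUL t0, r1.x, r2.x",
    "MUL t1, r3.x, r4.x",      # independent work from elsewhere in the shader
    "MAD t0, r1.y, r2.y, t0",
    "MAD t1, r3.y, r4.y, t1",  # the window always sees something ready
]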
NVidia might be using scoreboards or "done" bits, but this is different from (and somewhat orthogonal to) the original claim of some kind of compiler-generated 'checkpoint' being inserted after each high-level vector operation in the source.
Both schemes have the exact same effect, with differing efficiency levels. Given the docs at your disposal, I'm sure you can imagine why I was thinking of checkpointing in that specific post... :) It had been a while since I read that patent and other related Lindholm ones.
Mintmaster said:
The thing is that if threads are in flight, they're in flight. There's no special distinction for in-flight "in the ALUs". Even if a cluster isn't executing a texture instruction on a particular batch of pixels at a particular time, you still have the same number of threads in flight.
I guess the terminology I'm using is different from yours, and most likely yours is much more accurate than mine. The way I was defining things is that if a thread has its registers allocated but has no instructions waiting for data and no instructions running in the ALUs, it's "idle". If it has instructions waiting for data or running in the ALUs, it's "in flight". Now that I think about it, I can see how such definitions might be a tad confusing, and far from perfect.

My point remains, though, that merely by reducing the number of distinct threads running in the ALU pipelines (by running several instructions of the same thread in them at the same time), you can most likely reduce the necessary register file size by up to 10%. That's not huge yet, but a slightly more complex scheme (as described in the patent) can probably get you much more than that 10% of extra efficiency.
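Back-of-the-envelope, with made-up numbers, here's how a figure like that 10% can come out:

Code:
threads_hiding_texture = 90  # assumed: threads parked on long memory latency
threads_hiding_alu = 10      # assumed: threads covering the ALU pipeline depth

total = threads_hiding_texture + threads_hiding_alu
# If one thread can keep the ALU pipeline busy on its own, the other
# ALU-covering threads (and their register allocations) can be dropped:
saved = (threads_hiding_alu - 1) / total
print(f"register file saving = {saved:.0%}")  # 9%, i.e. "up to 10%"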


Uttar
 
I guess the terminology I'm using is different from yours, and most likely yours is much more accurate than mine. The way I was defining things is that if a thread has its registers allocated but has no instructions waiting for data and no instructions running in the ALUs, it's "idle". If it has instructions waiting for data or running in the ALUs, it's "in flight". Now that I think about it, I can see how such definitions might be a tad confusing, and far from perfect.
I think your definition of "in-flight" is closer to that of "stalled". The usage of in-flight I'm used to just means an instruction (or in this case part of a thread) has gotten past the decode phase in the execution pipeline.

Edit: Make that after entering the fetch stage. Once something takes up any on-core real-estate (execution core), it's in-flight.
Edit edit: Make that just after being fetched. Fetch is in that weird transition zone.
 