A new patent issued by NVidia

nAo

Nutella Nutellae
Veteran
Across-thread out of order instruction dispatch in a multithreaded graphics processor
Instruction dispatch in a multithreaded microprocessor such as a graphics processor is not constrained by an order among the threads. Instructions are fetched into an instruction buffer that is configured to store an instruction from each of the threads. A dispatch circuit determines which instructions in the buffer are ready to execute and may issue any ready instruction for execution. An instruction from one thread may be issued prior to an instruction from another thread regardless of which instruction was fetched into the buffer first. Once an instruction from a particular thread has issued, the fetch circuit fills the available buffer location with the following instruction from that thread.
It's a pretty broad patent, and it addresses (among other things) unified shading too, as well as balancing the execution of different thread types on specialized and non-specialized shading processors.
 
Sorry to be picky, but it's the second time this has happened recently and it really gets on my nerves for some reason.
Graphics companies don't issue patents; the patent office does. (Although really, graphics companies might as bloody well.)
 
The patent is just claiming a glorified instruction window for a multithreaded fetch/dispatch/issue unit (a.k.a. a shader) in a graphics processor.

Other than the graphics processor reference, the only difference from the instruction window in an out-of-order CPU (the Pentium 4, for example) is that each instruction in the window comes from a different thread. In fact, I would call it a 'thread window' rather than an instruction window.
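To make the 'thread window' idea concrete, here is a minimal sketch in Python of a buffer that holds exactly one instruction per thread, issues any ready one, and refills the slot from the same thread. Everything in it is my own assumption for illustration and not taken from the patent: the names (ThreadWindow, Slot, issue), the lowest-thread-id-first selection policy, and the simplification that a thread's next instruction becomes ready once the previous one's latency has elapsed.

```python
from dataclasses import dataclass


@dataclass
class Slot:
    instr: str     # the single buffered instruction for this thread
    ready_at: int  # first cycle at which it may issue (assumed dependency model)


class ThreadWindow:
    """Hypothetical one-instruction-per-thread buffer with any-ready dispatch."""

    def __init__(self, programs):
        # programs: {thread_id: [(instruction_name, latency_in_cycles), ...]}
        self.programs = {t: list(p) for t, p in programs.items()}
        self.pc = {t: 0 for t in programs}  # one PC per thread
        self.slots = {}
        for t in programs:
            self._fetch(t, ready_at=0)

    def _fetch(self, tid, ready_at):
        # Fill the thread's slot with its next instruction, or free the slot
        # when the thread has run out of instructions.
        prog = self.programs[tid]
        if self.pc[tid] < len(prog):
            name, _latency = prog[self.pc[tid]]
            self.slots[tid] = Slot(name, ready_at)
        else:
            self.slots.pop(tid, None)

    def issue(self, cycle):
        # Issue at most one instruction per cycle: the first ready slot found,
        # regardless of which thread's instruction was fetched first.
        for tid, slot in list(self.slots.items()):
            if slot.ready_at <= cycle:
                _name, latency = self.programs[tid][self.pc[tid]]
                self.pc[tid] += 1
                self._fetch(tid, ready_at=cycle + latency)
                return tid, slot.instr
        return None  # nothing ready this cycle: a bubble
```

With two threads, where thread 0 starts with a 4-cycle texture fetch, thread 1 keeps issuing while thread 0 waits, which is the whole point of not constraining dispatch to thread order.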

An instruction window that selects instructions from among dozens of threads is better suited to stream-like architectures and algorithms (as vertex and fragment shading are) than one that selects from a sequence of instructions in one or just a few threads. But as both the P4 and the Power5 actually support two threads and an instruction window, I fail to see how this patent application provides anything new. And the cancelled EV8 processor, if I remember well, supported 4 or 8 threads with their instructions stored in arbitrary order in the instruction window. I don't see how the wording of just one instruction per thread can make a difference to the fact that a lot of prior work exists. But I guess that's just me ranting about the patent system ...

The patent claims that one of the possible implementations (an embodiment, in patent language) could execute two types of threads (vertex and fragment). That's the only reference to unified shaders in the patent, and it is actually relatively independent of the concept proposed in the patent.

Other than that, this patent may be hinting that current or previous NVidia architectures executed vertex or fragment threads with a circular queue (or using a round-robin policy). That implementation, as is also explained in the patent (which BTW is quite easy to read compared with other patents I have come across), allows for a very simple fetch and instruction cache design where only one PC is required for a whole batch of vertex/fragment inputs/threads, and a fetch from the instruction cache is only required when all the threads in a batch have executed the current instruction. The fetch bandwidth requirements of such an implementation are ridiculously small compared with what the fetch bandwidth of a CPU beast like the EV8 would be (there is a lot of research work on improving fetch bandwidth for general-purpose processors). In fact, I would say that with a one- or two-bit bus you could still have spare bandwidth. It would also be quite useful for hiding misses in the instruction cache.

On the other hand, as the patent also discusses, a pure circular thread queue may stall if the latency of a texture access happens to be greater than the length of the queue. As it is likely impossible to have enough thread state to hide, for example, an access to a texture stored in system memory, the thread window implementation could improve performance.
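The stall is easy to reproduce with a toy model. The sketch below is only an illustration under my own assumptions, not the patent's design or any real GPU: one issue per cycle, each thread's next instruction becomes ready a fixed number of cycles after the previous one issues, and the function name run and the policy strings are made up. Under "round_robin" only the thread whose turn it is may issue; under "window" any ready thread may.

```python
def run(programs, policy):
    """Return the number of cycles needed to issue every instruction.

    programs: one list of latencies per thread (hypothetical workload).
    policy:   "round_robin" (pure circular queue) or "window" (any ready thread).
    """
    n = len(programs)
    pc = [0] * n        # next instruction index per thread
    ready_at = [0] * n  # cycle at which each thread's next instruction is ready
    remaining = sum(len(p) for p in programs)
    cycle = 0
    turn = 0
    while remaining:
        if policy == "round_robin":
            # Skip threads that have already finished, then consider only
            # the thread at the head of the circular queue.
            while pc[turn] >= len(programs[turn]):
                turn = (turn + 1) % n
            candidates = [turn]
        else:
            candidates = [t for t in range(n) if pc[t] < len(programs[t])]
        for t in candidates:
            if ready_at[t] <= cycle:
                ready_at[t] = cycle + programs[t][pc[t]]
                pc[t] += 1
                remaining -= 1
                turn = (t + 1) % n
                break
        cycle += 1  # if nothing issued, this cycle was a stall
    return cycle


# Four threads; thread 0 hits a 20-cycle texture access, far longer than the queue.
workload = [[1, 20, 1, 1], [1, 1, 1, 1], [1, 1, 1, 1], [1, 1, 1, 1]]
rr_cycles = run(workload, "round_robin")
window_cycles = run(workload, "window")
```

In this workload the circular queue sits idle behind thread 0 while the other three threads have ready instructions, so the window policy finishes in fewer cycles. Note that if every thread has the same latencies the two policies can come out equal in this model; the gain shows up when latencies differ between threads.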

The thread window implementation increases the complexity of the fetch/dispatch and issue logic. The technique requires a PC per thread and increases the fetch bandwidth needed, as each thread may now be fetching instructions from (slightly) different locations. It also requires at least one instruction fetch per cycle rather than one instruction fetch every N cycles (N being the number of threads in the shader queue). The requirements aren't that large in number of transistors, but if GPUs ever get to be multi-GHz with current or near-future technology (not very likely), the logic for selecting the instructions in the window may become a limiting factor for the GPU's maximum frequency (as happens with current CPUs), since at least one instruction must be woken up and dispatched per cycle (there is research on multicycle wakeup and dispatch for CPUs, but those schemes have some problems).
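As a back-of-the-envelope version of that fetch comparison, here is a small Python sketch. The function names are hypothetical, and the per-thread figure is an upper bound that assumes no two threads ever share a fetch even when they sit at the same PC.

```python
from fractions import Fraction


def fetch_rate_circular_queue(num_threads):
    # One shared PC: one instruction fetch serves all N threads in the queue,
    # so on average one fetch is needed every N issue cycles.
    return Fraction(1, num_threads)


def fetch_rate_thread_window():
    # One PC per thread: every issued instruction frees a slot that must be
    # refilled, so at least one fetch per cycle is needed to keep issuing.
    return Fraction(1, 1)


def total_fetches(program_len, num_threads, shared_pc):
    # Total instruction fetches to run one program on every thread.
    # Assumption: with per-thread PCs, no fetch is shared between threads.
    return program_len if shared_pc else program_len * num_threads


# Hypothetical numbers: a 10-instruction shader running on 32 threads.
shared = total_fetches(10, 32, shared_pc=True)       # 10 fetches
per_thread = total_fetches(10, 32, shared_pc=False)  # 320 fetches
```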

Also, if the described thread window technique is used, you could provide fewer threads (and less state) per shader unit and still get the same or better performance than with a circular queue. As thread state (input, temporary and output registers) is likely to represent a significant percentage of the shader unit's area/transistors, you could either implement a GPU with fewer transistors and/or lower power requirements, or one with more shader units (and more shading power) within the same transistor/area and power budgets.

As an example of how non-innovative I think this patent is: I actually implemented the fetch/issue unit in my simulator to fetch from any ready thread without even giving it a thought. It was only later that I thought about the circular queue implementation with a single PC for all the threads.
 
The majority of patents are non-innovative; they're just defensive, filed so the company doesn't get sued.

These days, if you can patent 2+2, you'd better do it, or else someone else will and then sue you.
 