I couldn't be bothered to read that closely, but I suspect what's going on there is that the compression is used to cater for horizontal or vertical register file addressing, i.e. registers can be allocated in either direction, depending on the access patterns of the instructions. One or the other layout then plays ball with compression - there's a toy sketch of the two layouts below.

One consideration about the patent most recently posted is that I'm not sure how necessary part of it really is.
The desire to compress down runs of sequential RAW hazards seems laudable, but the maximum number of registers addressable in CUDA does not exceed that of existing, conventionally scoreboarded designs.
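For what it's worth, here is a toy Python sketch of what I mean by horizontal versus vertical allocation - the bank/row counts and both addressing functions are entirely made up, purely to illustrate the two directions:

# Toy model of a banked register file: N_BANKS banks x N_ROWS rows.
# "Horizontal" allocation: a thread's registers r0, r1, ... sit in one row,
# one per bank, so a single row fetch returns several registers of one thread.
# "Vertical" allocation: a thread's registers stack up within a single bank,
# so a single row fetch returns the same register for several threads.
# All sizes and names here are invented for illustration only.

N_BANKS = 4   # hypothetical
N_ROWS = 64   # hypothetical

def horizontal_addr(thread, reg, regs_per_thread):
    """Registers of one thread spread across the banks, wrapping to the next row."""
    flat = thread * regs_per_thread + reg
    return flat % N_BANKS, flat // N_BANKS   # (bank, row)

def vertical_addr(thread, reg, regs_per_thread):
    """Registers of one thread stacked within a single bank."""
    bank = thread % N_BANKS
    row = (thread // N_BANKS) * regs_per_thread + reg
    return bank, row

if __name__ == "__main__":
    for reg in range(4):
        print("r%d of thread 5: horizontal %s, vertical %s" %
              (reg,
               horizontal_addr(5, reg, regs_per_thread=4),
               vertical_addr(5, reg, regs_per_thread=4)))

The same architectural register of the same thread lands in a different bank/row depending on the allocation direction, so whichever direction is chosen determines which accesses can be ganged together per fetch.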
Oh, and here's the main instruction-issue patent document, since I stumbled upon it:
http://v3.espacenet.com/publication...7214343A1&KC=A1&FT=D&date=20070913&DB=&locale=
The title uses "out of order" to refer to hardware threads, rather than intra-thread instruction ordering.
Compilation can be used to re-order instructions for minimal hazards per issue clock:
[0043] The number of threads in a given core may also be varied according to the particular implementation and the amount of latency that is to be hidden. In this connection, it should be noted that in some embodiments, instruction ordering can also be used to hide some latency. For instance, as is known in the art, compilers for graphics processor code can be optimized to arrange the instructions of the program such that if there is a first instruction that creates data and a second instruction that consumes the data, one or more other instructions that do not consume the data created by the first instruction are placed between the first and second instructions. This allows processing of a thread to continue while the first instruction is executing. It is also known in the art that, for instructions with long latencies, it is usually not practical to place enough independent instructions between creator and consumer to fully hide the latency. In determining the number of threads per core, consideration may be given to the availability (or lack thereof) of such optimizations; e.g., the number of threads supported by a core may be decided based on the maximum latency of any instruction and the average (or minimum or maximum) number of instructions that a particular compiler can be expected to provide between a maximum-latency instruction and its first dependent instruction.
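To hang some entirely made-up numbers off that last sentence (a deliberately crude model - one instruction issued per clock, round-robin over T resident threads - none of this comes from the patent):

from math import ceil

# Back-of-envelope version of paragraph [0043], with invented numbers.
# Crude model: the core round-robins over T threads, issuing one instruction
# per clock in total, so each thread gets an issue slot every T clocks.
max_latency = 200       # cycles of the slowest instruction (say, a texture fetch)
independent_insts = 6   # instructions the compiler can slot between creator and consumer

# After issuing the long-latency op, a thread can fill its next
# independent_insts issue slots with independent work, so the dependent
# instruction first comes up for issue (independent_insts + 1) * T cycles later.
# To cover the latency: (independent_insts + 1) * T >= max_latency.
threads_needed = ceil(max_latency / (independent_insts + 1))
print("threads needed to hide the latency:", threads_needed)   # -> 29

So with those numbers you'd want something like 29 threads resident per core just to cover the worst-case instruction, which is the sort of sizing argument [0043] is gesturing at.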
But, regardless, because instruction issue is keyed upon register-dependency, intra-thread instructions can issue out of order. With the proviso that the document isn't a 100% guarantee of what's inside a GPU...

[0076] Buffer 510 is advantageously configured to store collected operands together with their instructions while other operands for the instruction are being collected. In some embodiments, issuer 506 is configured to issue instructions to execution units 142 as soon as their operands have been collected. Issuer 506 is not required to issue instructions in the order in which they were dispatched. For example, instructions in buffer 510 may be stored in a sequence corresponding to the order in which they were dispatched, and at each clock cycle issuer 506 may select the oldest instruction that has its operands by stepping through the sequence (starting with the least-recently dispatched instruction) until an instruction that has all of its operands is found. This instruction is issued, and instructions behind it in the sequence are shifted forward; newly dispatched instructions are added at the end of the sequence. The sequence may be maintained, e.g., by an ordered set of physical storage locations in buffer 510, with instructions being shifted to different locations as preceding instructions are removed.
[0077] In one embodiment, an instruction that has been dispatched to issuer 506 remains in buffer 138 until it has been issued to execution module 142. After dispatch, the instruction is advantageously maintained in a valid but not ready state (e.g., the valid bit 210 for a dispatched instruction may remain in the logical true state until the instruction is issued). It will be appreciated that in embodiments where issuer 506 may issue instructions out of the dispatch order, this configuration can help to prevent multiple instructions from the same thread from being concurrently present in buffer 510, thereby preserving order of instructions within a thread.
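To check my reading of [0076]/[0077], here is a toy Python model of the issue scheme as I understand it - mine, not NVIDIA's, and all the class and field names are invented:

# Rough software sketch of the scheme in [0076]/[0077]: instructions sit in a
# buffer in dispatch order; each cycle the oldest instruction whose operands
# have all been collected is issued; and at most one instruction per thread is
# allowed in the buffer, which is what keeps intra-thread order intact.

class Instr:
    def __init__(self, thread, name, operands_ready):
        self.thread = thread
        self.name = name
        self.operands_ready = operands_ready  # stand-in for the operand collector's state

class Issuer:
    def __init__(self):
        self.buffer = []              # dispatch order, oldest first
        self.threads_in_flight = set()

    def dispatch(self, instr):
        # [0077]: only one instruction per thread at a time in the buffer.
        if instr.thread in self.threads_in_flight:
            return False              # thread waits for its previous instruction
        self.threads_in_flight.add(instr.thread)
        self.buffer.append(instr)
        return True

    def issue(self):
        # [0076]: step from the least-recently dispatched instruction until one
        # with all operands collected is found; issue it, shift the rest forward.
        for i, instr in enumerate(self.buffer):
            if instr.operands_ready:
                self.buffer.pop(i)
                self.threads_in_flight.discard(instr.thread)
                return instr
        return None                   # nothing ready this cycle

issuer = Issuer()
issuer.dispatch(Instr(thread=0, name="MAD r2, r0, r1", operands_ready=False))
issuer.dispatch(Instr(thread=1, name="ADD r5, r3, r4", operands_ready=True))
print(issuer.issue().name)            # thread 1's ADD overtakes thread 0's MAD

In that model dispatch order isn't issue order across threads, but because a thread can only ever have one instruction in the buffer, instructions within a thread still go in program order - which is how I read [0077].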
Actually it's possible to read that as merely stating that the hardware-thread ordering isn't necessarily maintained by issuer 506.
Anyway, whichever way we take it, we agree that the fine-grained register-dependency and operand-readiness scoreboarding is relatively costly.
Jawed