Jawed
Legend
GPUs only have registers for "fast" context. I don't understand how D3D11 works, but part of the intention is for "optimal" register allocation for a suite of virtual functions.

Are registers allocated at invocation for all the potential operands used by whatever exception handler is coded?
In HD5870 it seems that the number of clause temporary registers available is 8, compared with 4 in earlier GPUs. This increase might have been targeted at usage by an exception handler.
Additionally, ATI has the concept of shared registers: registers shared across all currently resident hardware threads, accessed by work items with the same work-item ID. So if 3 shared registers are allocated, then 64 work items * 3 shared registers = 192 registers need to be allocated.
NVidia could implement the same thing - it's simply a matter of removing the hardware thread ID from the addressing (assuming that the register file is designed to handle more than one kernel at a time, e.g. VS+PS). Since the hardware will only run a single exception handler at any one time, there's no need to allocate a monster wodge of these shared registers.
So the context at the time the exception occurs is "locked" simply by control flow passing to the handler. The handler then has dedicated working registers (the shared registers allocated earlier + clause temporaries if this is ATI) and it gets to work with all registers in the context.
An alternative is to use L1 for working memory during the exception handler. The context (registers) remains untouched as the handler starts, work is done in L1 and then registers updated according to the handler. L1 is "full speed" in GF100 (i.e. not really, due to limited bandwidth and increased latency compared with registers) but the latencies encountered by LDSTs in the handler will be covered by the other hardware threads which will be running as normal.
All that's speculation, of course. I haven't spent any time on ATI's implementation.
I'm referring to flushing the pipeline of succeeding instructions from the same hardware thread that might be in flight. Exceptions should only arise in limited places in the pipeline.

Is this 12 cycles from the point of view of the affected thread, or from the hardware? With the warp schedulers, we have multiple warps in progress, and we would need to track them separately.
Per lane exception handling is hardly a big deal. A queue of exceptions (i.e. a per-work-item FIFO) would allow the hardware to track all hardware threads if they all hit a shitstorm of exceptions one after the other, before the first invocation of the handler has completed.

Worst-case, the hardware would have to track exceptions (possibly different ones?) from every lane in every warp that is currently in progress at the point of the first exception, wherever that first appears in the pipeline.
Since GT200 has exception handling, perhaps this means it's cheaper in GF100?

NVidia claims to have changed the internal ISA to a load/store one, whereas the earlier variants had memory operands that would have been nightmarish to track as part of an ALU instruction.
I zoomed in and I think I can see what you're referring to. There are hints of structure along those two edges, with a "paired" structure on each side.

My interpretation is that within each core, there are rectangular bands of straight silicon on the upper and lower edges, with one end marked by regions that look like the SRAMs for the register file.
Sandwiched between them would be stuff I attribute to special function and scheduling.
Also, looking at the die it seems you can tell how fast the logic's clocked by whether it's light or dark!
Shame the whole picture is so blurry.
Jawed