As far as GPUs are concerned I'm not even a software guy, just an armchair pundit :smile:
In terms of the memory interface, per se, I'm not proposing a change. Instead I'm proposing that the coalescing of data accesses against memory is mirrored by allowing the same (or similar) re-sorting to be performed on work items, regardless of "warp", for the purposes of execution. By dynamically constructing warps that end up matching the accesses against memory, the GPU needs fewer work items in flight to alleviate the latency increases caused by incoherent memory accesses - which would allow it to allocate more registers per work item.
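To make the idea concrete, here's a toy sketch of that re-sorting (nothing to do with how NVidia actually builds warps - the segment size, warp width and so on are all invented): take the pool of ready work items, bucket them by the memory segment their next access touches, and carve execution warps out of those buckets, so one fetch serves a whole warp.

```cpp
// Toy sketch of dynamic warp formation: regroup ready work items so that
// each newly formed warp touches as few memory segments as possible.
// All parameters (segment size, warp width) are invented for illustration.
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <map>
#include <vector>

constexpr int kWarpWidth    = 32;  // lanes per warp (assumed)
constexpr int kSegmentBytes = 64;  // coalescing granularity (assumed)

struct WorkItem {
    int      id;       // original (static) work item index
    uint64_t address;  // address of its next pending memory access
};

// Bucket work items by the segment their next access falls in, then carve
// warps out of each bucket: items hitting the same segment land in the
// same warp, so one memory transaction serves all of its lanes.
std::vector<std::vector<WorkItem>> formWarps(const std::vector<WorkItem>& ready) {
    std::map<uint64_t, std::vector<WorkItem>> bySegment;
    for (const WorkItem& w : ready)
        bySegment[w.address / kSegmentBytes].push_back(w);

    std::vector<std::vector<WorkItem>> warps;
    for (auto& [segment, items] : bySegment)
        for (std::size_t i = 0; i < items.size(); i += kWarpWidth) {
            std::size_t end = std::min(i + std::size_t(kWarpWidth), items.size());
            warps.emplace_back(items.begin() + i, items.begin() + end);
        }
    return warps;
}

int main() {
    std::vector<WorkItem> ready;
    for (int i = 0; i < 64; ++i)  // 64 items whose accesses hit 4 distinct pages
        ready.push_back({i, uint64_t(i % 4) * 4096 + uint64_t(i)});
    std::printf("formed %zu warps\n", formWarps(ready).size());  // prints 4
}
```

With static warps those 64 items would scatter four transactions' worth of addresses across every warp; after re-sorting, each dynamically formed warp hits exactly one segment.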
With coalesced memory fetches, say, but un-coalesced use of that data, the data has to wait around longer on die before it is consumed. That reduces the overall effective capacity for data fetched from memory, in comparison with a GPU that shortens the lifetime of fetched data by coalescing its use.
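Back-of-envelope version of that (the rates and lifetimes below are pure inventions): by Little's law, the on-die buffering needed is the fetch rate multiplied by how long fetched data sits before use, so quadrupling the lifetime quadruples the buffering for the same fetch rate.

```cpp
// Little's law: bytes buffered on die = fetch rate x residency time.
// All numbers are invented purely for illustration.
#include <cstdio>

int main() {
    const double fetch_rate_bytes_per_ns = 64.0;   // ~64 GB/s sustained (assumed)
    const double coalesced_use_ns        = 400.0;  // data consumed soon after arrival
    const double uncoalesced_use_ns      = 1600.0; // data lingers awaiting stragglers

    std::printf("coalesced use:   %.0f KB buffered\n",
                fetch_rate_bytes_per_ns * coalesced_use_ns / 1024.0);   // 25 KB
    std::printf("uncoalesced use: %.0f KB buffered\n",
                fetch_rate_bytes_per_ns * uncoalesced_use_ns / 1024.0); // 100 KB
}
```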
In G80/GT200 the windower has a private block of memory out of which it feeds the ALUs. Currently it seems this doesn't have the capacity to prevent waterfalling, e.g. on random fetches from constants/registers. I'm proposing an extension to this: first so that it can mitigate waterfalls, and secondly (due to increased capacity) so that it can perform inter-work-group coalescing of operands, regardless of where the operands are sourced: register, constant or memory.
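Here's the constant-waterfall case as a toy model (the register-bank case is analogous, with conflicts per bank rather than per address - and again, everything here is invented): a broadcast bus delivers one address per cycle, so a warp's fetch serialises into one cycle per distinct address its lanes want.

```cpp
// Toy model of waterfalling on a constant fetch: the broadcast bus can
// deliver one address per cycle to every lane that wants it, so the fetch
// serialises into one cycle per distinct address. Parameters are invented.
#include <cstdio>
#include <set>
#include <vector>

int fetchCycles(const std::vector<int>& laneAddresses) {
    std::set<int> distinct(laneAddresses.begin(), laneAddresses.end());
    return static_cast<int>(distinct.size());  // one serialised cycle per address
}

int main() {
    std::vector<int> uniform(32, 7);    // every lane reads the same constant
    std::vector<int> scattered(32);
    for (int i = 0; i < 32; ++i)
        scattered[i] = i * 3;           // every lane reads a different one

    std::printf("uniform indexing:   %d cycle(s)\n", fetchCycles(uniform));   // 1
    std::printf("scattered indexing: %d cycle(s)\n", fetchCycles(scattered)); // 32
}
```

A bigger operand memory would give the windower room to overlap those serialised cycles with other warps' fetches instead of stalling the ALUs.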
But what about nested branching and loops, with the worst case being a nesting of loops?
I'm looking to the windower's operand memory as a method to insulate the ALUs from register bank conflicts (waterfalling).
Currently the windower does 16-wide fetches from registers, though it only feeds 8-wide to the ALUs. This allows it "time" to re-sort operands for a warp, thus hiding the banking latency when fetching r0 and r13, say, for an instruction. It also covers for the greed of the ALUs, since they'll happily suck in 4 operands per lane per clock (MAD+MI: 3 operands for the MAD, 1 for the transcendental/interpolator).
It seems that two 16-wide fetches from registers are actually required to keep up with the MAD+MI units, since together they can consume 32 operands per clock (8 lanes each; 3 operands for the MAD, 1 for the MI).
I presume NVidia abstracts this - e.g. the hardware is actually doing one 32-wide fetch but it's doing it for a pair of warps. It presents this to the programmer as a 16-wide fetch per warp, but of course a pair of warps is time-sliced "per instruction".
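Putting numbers on that supply/demand balance (just the figures already quoted above, nothing new):

```cpp
// Operand supply vs demand, using the G80-style figures quoted above.
#include <cstdio>

int main() {
    const int lanes       = 8;   // ALU lanes fed per clock
    const int madOperands = 3;   // a MAD reads 3 operands
    const int miOperands  = 1;   // the MI/transcendental reads 1
    const int fetchWidth  = 16;  // operands per register-file fetch

    const int demand = lanes * (madOperands + miOperands);  // 32 operands/clock
    std::printf("ALUs consume %d operands/clock; each fetch supplies %d,\n",
                demand, fetchWidth);
    std::printf("so %d fetches per clock - or one %d-wide fetch time-sliced\n",
                demand / fetchWidth, demand);
    std::printf("across a pair of warps.\n");
}
```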
To give the windower increased capacity, as I'm suggesting, would necessarily increase the width of fetches or increase the number of parallel fetches (ports). Since NVidia uses the windower and "wider-than-the-ALU" fetches to "simulate" multi-porting, I figure they'd elect to widen the fetches.
It may be that a 32-wide fetch is as wide as practicable. In which case, ahem, the ALU bandwidth would have to be reduced - half-clocked or only 4 lanes wide. Any which way, clearly the cost of the windower/instruction-issuer increases in my proposal. No idea if this cost is actually worth paying, though - would need a simulator for that.
Jawed