If, in this scenario, the first GT200 warp completes after 750 cycles, then all the data fetched in the first 749 cycles sits around on die waiting to be used.
So the problem here is on-chip storage pressure: long-lived data starves the buffers or registers, rather than the SIMDs themselves sitting idle.
Buffers probably won't allow incomplete warps to take an entry, and registers may be intractable to reclaim for most of the duration.
Just how complex a deallocation procedure is present for the partially completed warps?
We could clear holes of data out of the register sections devoted to a work group, but that leaves a fragmented space that probably isn't worth reclaiming until the warp is nearly finished anyway.
Is the scheme able to make multiple windows, or does it stall if it runs into another gather op?
The ability to continue deeper is more desirable for getting work done, but the amount of persistent and dynamically generated metadata can make the cost insane.
I'm suggesting a scoreboard for a single "barrier" per work item (thread). It's a bit field, 1 meaning pending barrier, 0 all clear - this amounts to 128 bytes per multiprocessor.
If the windower knows that any barriers are outstanding it can scan across work items for warp-wide sets (32-wide, 4 phases of 8). If some work items happen to constitute default warp allocations, then cool. Otherwise, coalesce work items to make temp-warps.
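As a rough illustration of what that scoreboard and scan could look like, here's a minimal sketch assuming 1024 work items per multiprocessor and 32-bit scoreboard words; the struct and member names are made up for the example, not anything a real part exposes.

```cpp
#include <cstdint>

constexpr int WORK_ITEMS = 1024;  // work items tracked per multiprocessor

struct BarrierScoreboard {
    // 1 bit per work item: 1 = barrier pending, 0 = all clear.
    // 1024 bits = 128 bytes per multiprocessor.
    uint32_t bits[WORK_ITEMS / 32] = {};

    bool pending(int item) const { return (bits[item / 32] >> (item % 32)) & 1u; }
    void mark(int item)          { bits[item / 32] |=  (1u << (item % 32)); }
    void clear(int item)         { bits[item / 32] &= ~(1u << (item % 32)); }

    // The windower only bothers scanning if any barrier is outstanding.
    bool any_pending() const {
        for (uint32_t word : bits)
            if (word) return true;
        return false;
    }

    // "Then cool": a default 32-wide warp allocation is issuable as-is when
    // all 32 of its work items are clear, i.e. one scoreboard word is zero.
    bool default_warp_ready(int warp_index) const {
        return bits[warp_index] == 0;
    }
};
```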
What would "then cool" entail? We've already built up our scheduler to handle uncool situations, so how much would we save if we still kept standard scheduling logic around?
What probability is there that one work item in a set of 32 is not ready, thus necessitating the more involved method?
How frequently are we going to scan this 1024-entry table for resolved barriers?
At a size of 1024, there is going to be a fair amount of time where we would expect at least one item to change from cycle to cycle.
The method for coalescing is a degenerate form of a sort, where we move the 0-flagged units into a space that becomes the issue list for the temp-warps to draw from.
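A sketch of that compaction, reusing the 1024-entry scoreboard idea from above; `build_issue_list` is a hypothetical name, and this is software standing in for what would really be wired-up selection logic.

```cpp
#include <bitset>
#include <vector>

// Move the 0-flagged (ready) work items into a contiguous issue list that the
// temp-warps draw from, 32 entries at a time. No real sorting is needed, just
// a stable compaction pass over the scoreboard.
std::vector<int> build_issue_list(const std::bitset<1024>& barrier_pending) {
    std::vector<int> issue_list;
    issue_list.reserve(1024);
    for (int item = 0; item < 1024; ++item)
        if (!barrier_pending[item])
            issue_list.push_back(item);
    return issue_list;  // temp-warps are cut from this in groups of 32
}
```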
Does this method permit going down any deeper than one window? Is encountering another barrier and setting the work-unit's flag back to 1 ambiguous?
We will be rescanning every entry of this 1024-entry table every cycle, and doing the degenerate sort up to 32 times in the lifespan of a barrier.
The worst-case example would be 32 ready units spread out evenly throughout all 1024 entries, and this happening 32 times.
(This means the work group has been beating around the issue logic for 1024 cycles, hopefully with no other useful work needing to be done on other workgroups.)
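A back-of-envelope tally of that worst case, using only the figures above; the constants are illustrative, not measured.

```cpp
// Worst case as described: a full scan of the table every cycle, plus the
// degenerate sort happening up to 32 times over the lifespan of a barrier.
constexpr int  kTableEntries     = 1024;  // work items tracked per multiprocessor
constexpr int  kCoalescingPasses = 32;    // up to 32 temp-warps formed
constexpr int  kWorstCaseCycles  = 1024;  // work group churning in the issue logic

constexpr long kEntriesTouchedByCoalescing = kCoalescingPasses * kTableEntries;      // 32768
constexpr long kEntriesTouchedByScanning   = long(kWorstCaseCycles) * kTableEntries; // ~1M bits examined

static_assert(kEntriesTouchedByCoalescing == 32768, "32 passes over a 1024-entry table");
```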
This is the scheduling cost, and it would be prior to the other costs that I suspect might pop up from breaking the association between the SIMD lane and work units.
Separately, there's clearly a cost involved in implementing a crossbar from the windower out to the ALU lanes, as there's no associativity between a work-item and the ALU lane it'll occupy.
That would be one of the large costs I am concerned about.
The windower hardware is trying to map from a set of 1024 to a warp of 32.
The lack of associativity also means that translating a register identifier into a hardware location gets no help from reference values derived from hardware lane position and the scheduler's cycle counter, since neither value can be assumed consistent across the units in the SIMD at the point of issue.
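Roughly the difference I mean, assuming a per-work-item register file addressed as base + register id, 8 lanes per issue phase, and a 4-phase warp; the function names and parameters here are hypothetical, just to show what information has to travel in each case.

```cpp
#include <cstdint>

// Associativity kept: the lane's position and the shared 2-bit cycle counter
// identify the work item, so only the plain register id has to be sent.
uint32_t reg_location_associative(uint32_t warp_base_item,
                                  uint32_t lane,        // implicit in the wiring
                                  uint32_t cycle_phase, // shared 2-bit counter
                                  uint32_t reg_id,
                                  uint32_t regs_per_item) {
    uint32_t work_item = warp_base_item + cycle_phase * 8 + lane;
    return work_item * regs_per_item + reg_id;
}

// Association broken: the work item occupying a lane is arbitrary, so an
// explicit work-item identifier (10 bits for 1024 items) must be shipped per
// lane; nothing can be reconstructed from lane position or the cycle counter.
uint32_t reg_location_coalesced(uint32_t explicit_work_item,
                                uint32_t reg_id,
                                uint32_t regs_per_item) {
    return explicit_work_item * regs_per_item + reg_id;
}
```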
I'm not sure why you include the whole chip.
I misinterpreted the original description as relaxing the SIMD relationship to the point where it sounded like any SIMD, regardless of multiprocessor, would do.
The scoreboard doesn't need to score barriers per operand - merely per work-item.
The math concerned the amount of data that would have to be generated by the scheduler and sent out for instruction issue, depending on what level of the hierarchy is made aware of the coalesced warps.
With associativity kept, the SIMD scheme can exploit the placement of the lanes. It might be as little as the plain register identifiers and a 2-bit cycle counter for the 4-cycle issue, with this signal shared by the whole SIMD and everything past that point able to derive the needed values from SIMD lane position and the cycle counter.
For a MADD, that's 26 bits of operand identifiers.
edit: add 8 for the destination, was only thinking reads
Breaking the association, the next step up is sending the work-unit identifier along with the register identifiers. In the worst case, this would be calculated SIMD-width times.
That's 8*(10+24) bits for a MADD on a warp that has been coalesced.
edit: 8*(10+32) with destination computed
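Tallying those figures under the assumptions implied above (8-bit register identifiers, three MADD source operands, a 10-bit work-item id for 1024 items, 8 lanes per issue phase):

```cpp
constexpr int kRegIdBits     = 8;   // assumed plain register identifier width
constexpr int kSrcOperands   = 3;   // MADD reads three sources
constexpr int kCycleCtrBits  = 2;   // 2-bit counter for the 4-phase issue
constexpr int kWorkItemBits  = 10;  // identifies one of 1024 work items
constexpr int kLanesPerPhase = 8;   // 32-wide warp, 4 phases of 8

// Associativity kept: one signal shared by the whole SIMD.
constexpr int kAssocSrcOnly = kSrcOperands * kRegIdBits + kCycleCtrBits;  // 26 bits
constexpr int kAssocWithDst = kAssocSrcOnly + kRegIdBits;                 // 34 bits

// Association broken: explicit work-item id plus register ids, per lane.
constexpr int kCoalSrcOnly  = kLanesPerPhase * (kWorkItemBits + kSrcOperands * kRegIdBits);       // 8*(10+24) = 272 bits
constexpr int kCoalWithDst  = kLanesPerPhase * (kWorkItemBits + (kSrcOperands + 1) * kRegIdBits); // 8*(10+32) = 336 bits

static_assert(kAssocSrcOnly == 26 && kCoalSrcOnly == 272 && kCoalWithDst == 336,
              "matches the per-issue figures quoted above");
```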
The register data total was something I was pondering for the case where we cut the post-issue logic down to the point that the units just get shoveled data directly when an instruction issues.