The decoder scheme seems simplest when it looks at the same SIMD position in a different warp, not at any thread in any warp in the work group.
A simple decoder running on the 1/0 barrier table would be inadequate, since the decoders, if they did actually issue a work unit out of warp order, would not have the state to know not to do it again the next cycle.
It would follow that the incoming threads are inserted into a warp with the same PC. I'm not sure what it would mean to assign a thread to a SIMD that was executing a different instruction entirely.
I assumed that the collection process would be multi-cycle. The mention of bank conflicts and other items would indicate that throwing work units around effectively at random would make it much worse.

I don't want to be a bore before heading off to bed, but the patents I linked need a closer read. e.g. you'll find the operand collector has latency (i.e. it takes more than one clock for an operand to cross from collector to ALUs) and a crossbar for feeding the ALUs. Any operand can go to any one or more ALU lanes (well, we knew this from CUDA documentation).
The compiler would most likely give up trying anything complex as far as allocation goes if it finds a barrier-inducing condition. Nothing statically compiled to avoid bank conflicts or other issues is going to be respected anyway and we're leaning on the random operand collector to make up for it.

Also, as it turns out, register allocation within the register file is a whole kettle of fish. There's potential here for interaction with registers/blocks of registers that the compiler/driver sees as being the target of incoherent accesses (e.g. registers used within a loop).
Additionally, there is not a single scoreboard per multiprocessor: the instruction-dependency (per-warp and per-instruction) and operand-collection (per-warp, per-instruction and per-operand) scoreboards run independently.
Is there a reason why the operand collector would care what operation it's feeding?

Available instructions from the instruction-dependency scoreboard are submitted to the operand collector (i.e. op-code + operand addresses + resultant address to handle the resultant's return).
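For what it's worth, this is roughly how I picture a collector entry - the op-code carried along, the source operand addresses still to be fetched, and a resultant address for write-back. Every field and method name below is my own invention, not something from the patents:

[code]
from dataclasses import dataclass, field

@dataclass
class CollectorEntry:
    warp_id: int
    opcode: str                 # e.g. "FMAD" - carried along but not acted on here
    src_regs: list              # register-file addresses still to be fetched
    dst_reg: int                # where the resultant will be written back
    fetched: dict = field(default_factory=dict)   # reg address -> collected value

    def ready_to_issue(self):
        # The entry can be dispatched to the ALU lanes once every source
        # operand has been read out of the (single-ported) register banks.
        return all(r in self.fetched for r in self.src_regs)
[/code]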
How many of those rules are unaffected by a switch to arbitrary lane allocation?

So scoreboarding is, effectively, a hierarchical process and it seems the operand collector can reject requests for instruction execution based on a pile of rules.
The 1024 entry barrier scoreboard that tracks over the duration of a gather would be allocated for how long?

Also of note is the lifetime of rows in the scoreboards - in the operand collector, lifetime is very short generally. Warps are effectively permanently allocated only in the instruction dependency scoreboard, as far as I can tell.
So something like a barrier scoreboard would be orthogonal, again. Warps that get picked by the barrier scorer will then be tested against the instruction-dependency scoreboard, and then the operand collector will perform final arbitration for the "random" fetching required to populate ALU lanes, taking a stream of warps' bit-masks of valid work items instead of the single warp-instruction it would receive when barrier handling is not required.
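As a rough sketch of that flow - every name below (pick_ready_warps, no_hazard, valid_lane_mask, arbitrate) is a stand-in I've made up, not anything from the patents:

[code]
def schedule_cycle(barrier_scoreboard, dep_scoreboard, operand_collector):
    # 1. The barrier scorer nominates warps whose work units have had their
    #    barriers (e.g. outstanding gathers) resolved.
    candidates = barrier_scoreboard.pick_ready_warps()

    # 2. The instruction-dependency scoreboard filters out warps whose next
    #    instruction still has register hazards.
    issuable = [w for w in candidates if dep_scoreboard.no_hazard(w)]

    # 3. The operand collector arbitrates over a stream of (warp, bit-mask of
    #    valid work items) pairs, rather than a single whole warp-instruction.
    stream = [(w, w.valid_lane_mask()) for w in issuable]
    return operand_collector.arbitrate(stream)
[/code]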
It's not so much blind as aware of a wider problem space than what static linkage would need to scan.

The operand collector already does most of this - just the final piece of the jigsaw, being "blind" to warp when assembling operands for ALU lanes, is needed, I believe.
I just started skimming.

Tomorrow, I'll go through Fung's stuff...
It can be made to work. My concern is that it can be made to do so in a reasonable physical envelope.

Well if there is any chance of this working it seems locking threads into a SIMD lane would be much easier so that's probably the way it would go.
The barrier table proposal is a bit more complex. The decoders on that table would try to detect ready units that can proceed to process through the instructions in a given window of instructions.

Not sure what you mean by this. The decoders don't know anything about warp ordering. Don't they just blindly calculate register addresses based on thread ID?
The work units are sitting on barriers, and the example given was a gather which by the vagaries of memory access might be able to randomly supply some units far in advance of others.

Yeah, but the point of the indexing is to limit the breadth of the search for an insertion location in the warp pool. Why do you think it has to scan the entire 1K set of work units each cycle?
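Something like this toy model is what I have in mind - a small PC-keyed index into the warp pool, so each arriving work unit goes straight to the warp being assembled for its PC instead of a scan of the whole 1K pool (sizes and names purely illustrative):

[code]
class WarpPool:
    def __init__(self, warp_size=32):
        self.warp_size = warp_size
        self.by_pc = {}      # pc -> partially assembled warp (list of work unit ids)
        self.ready = []      # completed warps handed on to the scheduler

    def insert(self, unit_id, pc):
        # O(1) lookup: the index points straight at the warp for this PC,
        # so no scan over the full pool is needed per inserted work unit.
        warp = self.by_pc.setdefault(pc, [])
        warp.append(unit_id)
        if len(warp) == self.warp_size:
            self.ready.append((pc, self.by_pc.pop(pc)))
[/code]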
Even without dynamic warp formation there'll still be bank conflicts.

I assumed that the collection process would be multi-cycle. The mention of bank conflicts and other items would indicate that throwing work units around effectively at random would make it much worse.
Register allocation is very complex in NVidia's design. The complexity comes from single-porting of the register file and the opportunities to reduce bank conflicts that the operand collector provides. In other words, for given patterns of register usage there are good and bad allocations. I'm merely inferring that there's an opportunity to implement rules based specifically upon a warp-forming operand collector, if such a thing were built, simply to maximise performance in this situation... e.g. use of contiguous register slots versus use of discontiguous slots versus fat-allocation versus thin allocation versus warp-phased allocation, etc.

The compiler would most likely give up trying anything complex as far as allocation goes if it finds a barrier-inducing condition. Nothing statically compiled to avoid bank conflicts or other issues is going to be respected anyway and we're leaning on the random operand collector to make up for it.
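A toy model of why that is - assume 16 single-ported banks and a trivial address-to-bank mapping (both assumptions mine, not the real arrangement); a layout the compiler made conflict-free serialises as soon as work units are shuffled into arbitrary lanes:

[code]
from collections import Counter

NUM_BANKS = 16

def bank_of(reg_addr):
    return reg_addr % NUM_BANKS

def cycles_to_collect(reg_addr_per_lane):
    # A single-ported bank serves one read per cycle, so the most-subscribed
    # bank determines how many cycles this operand takes to collect.
    demand = Counter(bank_of(r) for r in reg_addr_per_lane)
    return max(demand.values()) if demand else 0

# Static layout: one address per bank, collected in a single cycle.
static_layout = list(range(16))
# After re-grouping, lanes hold work units from different home warps and the
# addresses pile up on a few banks.
shuffled_layout = [0, 16, 32, 3, 19, 5, 5, 21, 8, 8, 8, 11, 12, 13, 14, 15]

print(cycles_to_collect(static_layout))    # 1
print(cycles_to_collect(shuffled_layout))  # 3
[/code]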
Plus, by running them in a pipelined fashion, evaluation throughput (even with more complex rules) can be maintained at the cost of latency.

From that description, the two scoreboards don't need the same data, and there is economy of signalling and data generation work if they are kept separate.
Yes, definitely. Whatever scheme one comes up with for determining the scope and depth of divergence tracking, actual assembly of temp warps is costly.

The instruction dependency scoreboard might not care too much about the coalesced warps.
The operand collection scoreboard's job gets measurably more complex the more arbitrary the assignment of work units to lanes becomes.
As far as I can tell the operand collector is effectively performing instruction issue, so it is assembling instructions with entirely literal data for the ALU lanes to consume.

Is there a reason why the operand collector would care what operation it's feeding?
Lane allocation is already arbitrary as far as operands are concerned. I have to admit my first reaction to these rules is "uh? why reject instructions?". It may be due to limited capacity in the scoreboard if it's seen as having to stick around for too long.

How many of those rules are unaffected by a switch to arbitrary lane allocation?
Simplistically, as long as a gather persists - and it's effectively de-activated when there's no incoherence causing empty ALU lanes or ALU stalls.

The 1024 entry barrier scoreboard that tracks over the duration of a gather would be allocated for how long?
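In pseudo-code terms the rule I have in mind is no more than this (all names invented for illustration):

[code]
def barrier_scoreboard_active(outstanding_gathers, empty_lane_events, alu_stall_events):
    # Keep the barrier scoreboard allocated only while a gather is in flight
    # and incoherence is actually costing throughput (empty lanes or stalls).
    return outstanding_gathers > 0 and (empty_lane_events > 0 or alu_stall_events > 0)
[/code]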
It's interesting just how complicated NVidia's made its warp/instruction/operand windowing. The salient point, where I came in as it were, is that I suspect with relatively minor tweaks in the grand scheme, it could make a really robust scheduler that eats all the forms of incoherence that afflict SIMD machines.

It's a functional idea.
It's words like random, arbitration, and stream - which look fine on paper or in software - that worry me.
As far as a physical implementation goes, I have concerns.
I haven't got there - note my earlier caveat that lane association is partly a red herring (only partly because work items still need to be sieved to obtain non-colliding lanes in a temp warp).

I just started skimming.
They didn't like breaking lane association either.
The bank conflicts from trying to move things around led to some pretty significant performance drops in simulation.
Yeah - branch-divergence, on its own, if it can be made to work, would be a significant step forward.

The problem for branch divergence is in many ways more tidy than the one for coalescing work units based on barrier resolution.
Branches are encountered and divergence is determined within a fixed and determinate period of time.
Barrier scoring is just an outline idea - my desire is to consider how to solve more than just control flow divergence, since DWF should be a solution to all these problems.

The barrier table persists for the lifetime of a work group waiting on an event such as a gather operation, and it needs to be updated repeatedly.
The scheme actually could be significantly simplified, I think, if we enforced some kind of lane restriction instead of trying to cover every part of the problem space.
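A minimal sketch of what I mean by a lane restriction - each work item keeps its home lane, and two warps at the same PC are only merged where their active-lane masks don't collide (a hypothetical model, not Fung's exact scheme):

[code]
def try_merge(mask_a, mask_b):
    """mask_a, mask_b: integer bit-masks of active lanes for two warps at the same PC."""
    if mask_a & mask_b:
        return None            # some lane is wanted by both warps - no merge
    return mask_a | mask_b     # disjoint masks combine into a fuller warp

print(bin(try_merge(0b0101, 0b1010)))   # 0b1111 - merges cleanly
print(try_merge(0b0101, 0b0110))        # None - both want lane 2
[/code]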
Yep, which is why I was inspired by Timothy's initial questioning of the likelihood of something more like MIMD...

In the intervening time, I think we've learned a lot more about dispatch, to the point where I thought we had decided that register fetching wasn't necessarily as straightforward as we had originally thought, which would make it easier to juggle threads around...?
I'm not implying that there wouldn't be. Compiler or programmer efforts to help minimize such conflicts would be negated past a barrier, once the hardware starts to reorder units.

Even without dynamic warp formation there'll still be bank conflicts.
A catch-all solution may be hard to optimize for all cases.

I'm merely inferring that there's an opportunity to implement rules based specifically upon a warp-forming operand collector, if such a thing were built, simply to maximise performance in this situation... e.g. use of contiguous register slots versus use of discontiguous slots versus fat-allocation versus thin allocation versus warp-phased allocation, etc.
I haven't seen the part where he said it was impossible, just that having multiple threads with the same "home lane" was undesirable.

For example, one thing Fung doesn't realise is that any operand can go to any lane - there's a crossbar after the operands have been retrieved. This makes some of his modelling flawed - I'm at section 4.2 by the way, reading carefully...
It may be difficult or undesirable to turn the process off. The data paths and logic don't vanish if the GPU shifts down to DWF-off, and the DWF hardware could just as easily generate the standard case on its own, if put to it.

Predication must be retained because it's not always possible to construct non-divergent warps, and also it might be prudent to apply thresholding to DWF (varies by type?), e.g. only perform DWF if the multiprocessor is due to run out of available warps within X cycles, that kind of thing - since DWF has a start-up latency due to an increase in operand collection latency.
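The thresholding could be as dumb as this - the 8-cycle figure and all the names are pure guesswork on my part:

[code]
def should_form_warps(runnable_warps, warps_retired_per_cycle, threshold_cycles=8):
    # Only pay DWF's extra operand-collection latency when the pool of ready
    # warps is about to run dry.
    if warps_retired_per_cycle <= 0:
        return False
    cycles_of_cover = runnable_warps / warps_retired_per_cycle
    return cycles_of_cover < threshold_cycles
[/code]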
Yep, that's what I'm getting at, but not just for gather, but for the register file. Nice bank access patterns are clearly the reason NVidia has implemented a rule set for register allocation. They'll never be perfect, but bear in mind that the worst-case waterfall is 1/16th throughput, so any improvement to be gained there is a huge huge benefit.

Compiler or programmer efforts to help minimize such conflicts would be negated past a barrier, once the hardware starts to reorder units. [...]
A catch-all solution may be hard to optimize for all cases.
For gather operations, perhaps an optimization would be to keep operands fed by gathers as spread evenly as possible between banks, in order to minimize the number of times bank conflicts arise.
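Sketching that policy - rotate the starting bank every time a destination register is allocated for a gather result, so outstanding gather operands end up spread across the banks (the 16 banks and the modulo mapping are my assumptions, as before):

[code]
NUM_BANKS = 16
REGFILE_SIZE = 256

class GatherRegAllocator:
    def __init__(self):
        self.next_bank = 0
        # Free registers grouped by the bank they map to (addr % NUM_BANKS).
        self.free = {b: list(range(b, REGFILE_SIZE, NUM_BANKS)) for b in range(NUM_BANKS)}

    def alloc(self):
        # Round-robin over banks so consecutive gather results land in
        # different banks and can be collected without conflicting.
        for i in range(NUM_BANKS):
            bank = (self.next_bank + i) % NUM_BANKS
            if self.free[bank]:
                self.next_bank = (bank + 1) % NUM_BANKS
                return self.free[bank].pop(0)
        raise RuntimeError("register file exhausted")
[/code]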
None of the operand collector architectures he describes in section 4.1 match what NVidia has implemented. He's simply missed what NVidia's done.

I haven't seen the part where he said it was impossible, just that having multiple threads with the same "home lane" was undesirable.
It was vague. The CUDA documentation is quite thorough in describing the restrictions on bandwidth that result from various RF->ALU lane mappings.

What is the throughput of the crossbar, and with what restrictions?
I presume you're referring to the MAD and MI pipes as having distinct subsets of register file? My interpretation is that that's unworkable...

The diagrams I saw seemed to indicate each SIMD pipe had a direct connection with its local subset, and then there were connections between the sets of unspecified capacity.
Agreed.

It may be difficult or undesirable to turn the process off. The data paths and logic don't vanish if the GPU shifts down to DWF-off, and the DWF hardware could just as easily generate the standard case on its own, if put to it.
Having discovered the rat's nest (bit harsh) that is NVidia's windowing, I'm not convinced there'd be much difference either way. Which is why I think DWF would actually be an easy win for NVidia in the D3D11 generation. And why I'm kinda excited. Fingers crossed!

Having both modes seems akin to having both a manual and an automatic transmission stuck on a car.
So, I remain hopeful that NVidia will tackle the gamut of incoherency issues with one fell swoop in GT300 :smile:

Bank conflicts in shared memory access, as noted by Hwu et al. [27], have been a major performance bottleneck in GPU performance. While this thesis focuses on using dynamic warp formation for improving control flow efficiency in a SIMD pipeline, we have also observed that it is possible to regroup threads, in a fashion similar to how we propose to do this for lane aware scheduling, to eliminate bank conflict in cache or shared memory access. The exact implementation and hardware cost of this idea should be explored in the future.
Does it strike anyone else that these tricks should get even more effective with smaller batch sizes?
And does it strike anyone else that smaller batch sizes become more cost-efficient when the register file becomes bigger (ala GT200 doubling it) or the units become more powerful (ala RV770: INT32 MULs & shifts for every ALU)?
Batch sizes, you mean warp sizes (nv speak) or wavefront sizes (ati speak) don't you?
Yeah, they're interchangeable terms here.

Batch sizes, you mean warp sizes (nv speak) or wavefront sizes (ati speak) don't you?
Nope, the converse. The graphs Fung provides are evidence enough (since there's less to gain with smaller batches), and the GPUSA stats I provided earlier reinforce this - even if they're a rather basic first impression.

Does it strike anyone else that these tricks should get even more effective with smaller batch sizes?
I don't think I agree with that. You're looking at existing workloads. If you take generic workloads, random divergence is going to bite you in the backside no matter your warp size as long as it's more than one (aka MIMD).

Nope, the converse. [...] (since there's less to gain with smaller batches)[...]then you have to weigh the cost of the extra control overhead versus dynamic warp formation.[...]
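To make the disagreement concrete, here's some quick, purely illustrative arithmetic (not a claim about any real workload): if each work item independently takes a branch with probability p, and a diverged warp has to run both paths serially, expected SIMD efficiency collapses towards 50% for any width above one.

[code]
def expected_efficiency(warp_size, p_taken):
    p_unanimous = p_taken ** warp_size + (1 - p_taken) ** warp_size
    # A unanimous warp runs one path at full occupancy; a diverged warp runs
    # both paths back to back, so only half its issue slots do useful work.
    return p_unanimous * 1.0 + (1 - p_unanimous) * 0.5

for width in (1, 4, 8, 16, 32, 64):
    print(width, round(expected_efficiency(width, 0.5), 3))
# 1 1.0 / 4 0.562 / 8 0.504 / 16 0.5 / 32 0.5 / 64 0.5
[/code]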