Jawed
Legend
Yeah, the hardware knows the size of the domain of shared registers, so it's a simple check.

This appears consistent with a simple VLIW design.
It's better than how some of the original VLIWs would simply read and write, heedless of hazards, but it's a simple check for the scheduler to pick up a potential conflict and inject a NOP into a wavefront's instruction stream, rather than try to piece through an instruction packet's read and write operands.
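Here is a rough sketch of why a known shared-register range makes the hazard check cheap. It is a toy model, not AMD's actual scheduler. The register-file size, the placement of shared registers at the top of the file, and the names `Packet`, `touches_shared` and `schedule` are all hypothetical. The scheduler tests each operand with one range comparison. When a packet that wrote a shared register is followed by any packet that touches a shared register, it conservatively injects a NOP. It never compares individual register numbers between the two packets.

```python
from dataclasses import dataclass, field

# Hypothetical layout: shared (global) registers occupy the top SHARED_COUNT
# slots of the register file. These sizes are illustrative assumptions only.
REG_FILE_SIZE = 128
SHARED_COUNT = 16
SHARED_BASE = REG_FILE_SIZE - SHARED_COUNT


@dataclass(frozen=True)
class Packet:
    """A VLIW instruction packet, reduced to the registers it reads and writes."""
    reads: frozenset = field(default_factory=frozenset)
    writes: frozenset = field(default_factory=frozenset)


NOP = Packet()  # empty packet: reads and writes nothing


def is_shared(reg: int) -> bool:
    # One comparison against the known shared-register domain.
    return reg >= SHARED_BASE


def touches_shared(p: Packet) -> bool:
    return any(is_shared(r) for r in p.reads | p.writes)


def writes_shared(p: Packet) -> bool:
    return any(is_shared(r) for r in p.writes)


def schedule(stream):
    """Insert a NOP after any packet that wrote a shared register when the
    next packet also touches a shared register (conservative: no per-operand match)."""
    out = []
    prev_wrote_shared = False
    for p in stream:
        if prev_wrote_shared and touches_shared(p):
            out.append(NOP)
        out.append(p)
        prev_wrote_shared = writes_shared(p)
    return out
```

The check is deliberately coarse. A write to shared register 120 followed by a read of shared register 125 still gets a NOP, even though there is no true dependency. Giving up that precision is what keeps the logic down to range comparisons.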
The in-pipeline registers, used to avoid read-after-write latency, have the data, but there's still no time to get that data in place if the underlying latency of these registers is, say, 4 cycles.

If there were forwarding within the cluster, this latency could be avoided, but that's 16 five-way bypass networks per SIMD and a tag check per ALU.
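A back-of-the-envelope model of the trade-off follows. The 4-cycle latency is the hypothetical figure from the post. The 16 lanes per SIMD and 5 ALUs per lane come from the "16 five-way" description. Treating forwarding as a zero-stall ideal is a simplifying assumption, and the function names are made up for illustration.

```python
LANES_PER_SIMD = 16  # thread processors per SIMD, per the post
ALUS_PER_LANE = 5    # five-way VLIW lane, per the post
REG_LATENCY = 4      # hypothetical register write-to-read latency, in cycles


def stall_cycles(distance: int, forwarding: bool) -> int:
    """Cycles a dependent packet must wait (idealized model).

    `distance` is how many cycles after the producer the consumer would issue.
    Without forwarding, the consumer waits until the register write lands.
    With an ideal bypass network, the result is forwarded and there is no wait.
    """
    if forwarding:
        return 0
    return max(0, REG_LATENCY - distance)


def bypass_cost():
    """Rough hardware cost of intra-cluster forwarding for one SIMD:
    one five-way bypass network per lane, and one tag check per ALU."""
    networks = LANES_PER_SIMD
    tag_checks = LANES_PER_SIMD * ALUS_PER_LANE
    return networks, tag_checks
```

Under these assumptions, a back-to-back dependent packet costs 3 stall cycles (or 3 NOPs) without forwarding. Removing them would take 16 bypass networks and 80 tag checks per SIMD, which is the hardware the post suggests AMD chose not to spend.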
Yeah, it's still too early to tell whether LDS is viable. Funny, coming up on 3 years after G80 and one year after RV770, and still no idea whether AMD has caught up.

Maybe they haven't settled on a final scheme for data sharing?
Global registers are an incremental addition to what is already there.
The LDS is an addition, but with minimal disruption to the already existing design.
Maybe AMD doesn't want to commit too much for a low-level detail that they might be revamping.