Bob said: A point that hasn't been addressed is how much additional hardware is needed to keep such a unified architecture utilized at 100% (or even 90%). If you need ~40% more transistors due to more FIFOs, register file ports, thread management, scoreboarding, etc., then a 60% efficient GPU with 40% more hardware would do just as well as a 100% efficient GPU. Not only that, but it'll have a higher peak performance, which opens the door for more optimizations.
If the 100% processor is defined as having a performance of 1, a 60% processor of the same size has a performance of 0.6, and a 60% processor which is 40% larger has a performance of 0.84. You'd need ~67% more transistors (1 / 0.6 ≈ 1.67 times the size) to get the same kind of speed.
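For anyone who wants to check the numbers, here is a quick sketch of that arithmetic. It assumes the same simple model as above, namely that performance is just efficiency times relative transistor count; the function names are made up for illustration and real chips won't scale this cleanly.

```python
def relative_performance(efficiency, relative_size):
    """Performance relative to a 100% efficient chip of size 1.

    Assumes performance scales linearly with size (an idealization).
    """
    return efficiency * relative_size


def size_needed(efficiency, target=1.0):
    """Relative size needed to reach `target` performance at a given efficiency."""
    return target / efficiency


same_size = relative_performance(0.6, 1.0)   # 60% efficient, same size
larger = relative_performance(0.6, 1.4)      # 60% efficient, 40% larger
breakeven = size_needed(0.6)                 # size needed to match the 100% chip
extra_pct = (breakeven - 1.0) * 100          # extra transistors, in percent

print(f"same size:  {same_size:.2f}")
print(f"40% larger: {larger:.2f}")
print(f"break-even: {extra_pct:.1f}% more transistors")
```

So under this model the 40% larger chip still falls about 16% short of the fully utilized one.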
Jawed said: I think it's fair to say everyone was expecting that this branch commonality would operate at the quad level in NV40, but it has turned out (through experiment) to operate at a coarser granularity. The loss of efficiency here is catastrophic.
It means that developers have avoided implementing shader code that performs dynamic per-pixel branching.
Forgive my somewhat poor memory, but isn't that basically what the comments in that ATI presentation about SM3.0 said? That branching in the NV40 was too slow to be useful?
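To illustrate why coarse branch granularity hurts so much, here is a toy model. Everything in it is an assumption for the sake of the example: pixels are laid out in a 1D row, a batch whose pixels disagree on the branch pays for both paths, the path costs (1 and 10) are invented, and the batch sizes are hypothetical rather than measured NV40 figures.

```python
def shading_cost(mask, batch, cheap=1, expensive=10):
    """Total cost when pixels are processed in lockstep batches.

    mask[i] is True if pixel i wants the expensive path. Every pixel in a
    batch pays for each path that at least one pixel in the batch takes.
    """
    total = 0
    for start in range(0, len(mask), batch):
        group = mask[start:start + batch]
        cost = 0
        if any(group):        # someone takes the expensive path
            cost += expensive
        if not all(group):    # someone takes the cheap path
            cost += cheap
        total += cost * len(group)
    return total


# 64 pixels, of which every 16th takes the expensive path.
mask = [i % 16 == 0 for i in range(64)]

ideal = shading_cost(mask, 1)  # true per-pixel branching
for batch in (1, 4, 64):
    cost = shading_cost(mask, batch)
    print(f"batch={batch:2d}  cost={cost:3d}  efficiency={ideal / cost:.2f}")
```

In this toy setup, per-pixel branching costs 100 units, batches of 4 cost 224, and a single batch of 64 costs 704, so the coarser the batch, the more pixels end up paying for a branch they never needed.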