Unified architectures require dynamic resource allocation to avoid pipeline stalls on an IMR, whereas allocation could be static on tilers thanks to the decoupled geometry pass, which would mean zero control-logic overhead for tile-based architectures
Agreed in theory, although none of my points had anything to do with the extra control logic
To simplify the reasoning, I assumed that even on an IMR the control logic would be less expensive than adding more units to achieve the same performance (i.e. imagine a design with identical VS and PS units that are statically allocated to one or the other at design/manufacturing time to save that control logic; I would expect that to always be slower than a unified design).
Regarding geometry vs pixel processing logic, the specifics are often overlooked:
- On an IMR, if the triangle setup unit is the bottleneck, then the PS/TMUs will stall. So if you have a very high-polygon-count object that is nearly entirely outside the frustum, or at least fails the depth test, you will pay the geometry processing cost of that object in full.
- On a TBR, only the *average* VS vs PS throughput matters, as you can do the pixel processing of the previous render/frame while doing the geometry processing of the current one. So if the geometry processing for the entire render/frame isn't the overall bottleneck, then that invisible object will basically come for free. This is an often overlooked advantage of TBRs for geometry processing.
- On both IMRs and TBRs, you ideally want to have all the fixed-function units busy as often as possible, so that they don't become a bottleneck at other points in time. For example, on a very naive unified shader TBR that did geometry processing and pixel processing sequentially with no overlap, if the geometry stage is triangle-setup limited, then the shader core will mostly stall for that duration. On the other hand, if geometry and pixel processing happen in parallel, the geometry part may essentially come "for free".
- When doing VS and PS overlap in a unified architecture (either IMR or TBR), it's very common for the VS not to use the TMUs while the PS might be TMU-limited. So ideally you want fine-grained overlap between VS and PS inside every individual shader core/cluster, so that the VS ALU instructions basically come for free when the PS is TMU-limited.
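The TBR decoupling argument above can be sketched with a toy throughput model. All the numbers are invented and the pipelines are idealised (a real IMR overlaps setup with shading of earlier triangles far more than this), but it shows why only the *average* geometry-vs-pixel balance matters on a tiler:

```python
# Toy model: per-draw (geometry_cost, pixel_cost) in arbitrary time units.
draws = [
    (1.0, 10.0),   # normal object
    (8.0, 0.0),    # high-poly object that is culled / depth-rejected
    (1.0, 12.0),   # normal object
]

# IMR (simplified): a setup-heavy draw stalls the PS/TMUs, so the
# geometry cost of the invisible object is paid in full.
imr_frame_time = sum(g + p for g, p in draws)

# TBR: the geometry pass for frame N overlaps the pixel pass for
# frame N-1, so the steady-state frame time is the max of the two
# totals rather than their sum.
tbr_frame_time = max(sum(g for g, _ in draws), sum(p for _, p in draws))

print(f"IMR: {imr_frame_time}")   # 32.0 -> pays for the invisible object
print(f"TBR: {tbr_frame_time}")   # 22.0 -> its geometry hidden behind pixel work
```

As long as total geometry time stays below total pixel time, the culled high-poly draw costs the tiler essentially nothing.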
The last point especially is very easy to get wrong if you try to get away with simple control logic, although it's hard to say how much it matters for typical workloads, as they'd need to be both VS-heavy and TMU-heavy for this to be significant; arguably part of GLBenchmark 2.5 does fit in that category, FWIW.
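That fine-grained VS/PS overlap can be illustrated with a toy issue-slot model (instruction counts are invented and the scheduler is idealised), assuming a core that can dual-issue one ALU op and one TMU op per cycle:

```python
# Toy issue-slot model of fine-grained VS/PS overlap in a unified core.
ps_alu_ops, ps_tmu_ops = 400, 1000   # PS is TMU-limited here
vs_alu_ops = 500                     # VS uses the ALU only, no TMU

# Coarse-grained scheduling: run all VS work, then all PS work.
coarse_cycles = vs_alu_ops + max(ps_alu_ops, ps_tmu_ops)

# Fine-grained interleaving: the TMU stream bounds the PS, and the
# ALU slots the PS leaves idle absorb the VS ALU instructions.
fine_cycles = max(ps_tmu_ops, ps_alu_ops + vs_alu_ops)

print(coarse_cycles)  # 1500
print(fine_cycles)    # 1000 -> the VS ALU work came "for free"
```

Simple control logic that only switches the whole core between VS and PS work gives you the coarse number; hiding the VS ALU instructions under the TMU-limited PS needs per-cycle interleaving.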
One thing I'm still curious about in Tegra 4's case is whether the PS ALUs are really FP20 and, if so, how they're getting DX9_3 conformance... I assume they've got a dedicated normalisation unit (à la NV4x for FP16, but higher precision here), which would help reduce precision issues in typical workloads, as normalisation is a frequent source of them.
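To get a feel for why a higher-precision normalisation unit would help, here's a rough sketch that rounds every intermediate of a normalize() to a given number of mantissa bits. Assuming "FP20" means roughly 13 explicit mantissa bits (an assumption; this toy ignores the exponent field and denormals entirely), compare it against near-FP32 precision, as a dedicated unit could provide:

```python
import math

def q(x, bits):
    # Round x to `bits` explicit mantissa bits. Rough model only:
    # ignores exponent range and denormals, so this is not a real FP20.
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)            # x = m * 2**e, 0.5 <= |m| < 1
    s = 2.0 ** bits
    return math.ldexp(round(m * s) / s, e)

def normalize(v, bits):
    # Normalise with every intermediate rounded to `bits`, the way a
    # shader ALU working entirely at that precision would.
    sq_sum = 0.0
    for c in v:
        sq_sum = q(sq_sum + q(q(c, bits) * q(c, bits), bits), bits)
    inv_len = q(1.0 / math.sqrt(sq_sum), bits)
    return [q(q(c, bits) * inv_len, bits) for c in v]

def unit_length_error(v, bits):
    n = normalize(v, bits)
    return abs(math.sqrt(sum(c * c for c in n)) - 1.0)

v = (0.3141592, -0.7182818, 0.5772157)
print(unit_length_error(v, 13))   # FP20-ish ALU: visibly off unit length
print(unit_length_error(v, 23))   # FP32-ish unit: far closer to 1.0
```

The gap between the two errors is the kind of thing a dedicated high-precision normalisation path could hide from an otherwise low-precision pixel pipeline.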