Frankly, I'm not sure shader state switching would be much of an issue with any traditional tiler - maintaining a pool of ready (pixel) threads the size of the (traditionally small) tile would be a natural solution to minimize thread-switching costs. IIRC, SGX53x did this with up to 16 threads, though I don't remember if there were any limits on the number of different shader programs that could participate in the scheme.
One of the classical selling points of Mali GPUs is "No renderer state change penalty"; all Mali cores from Mali55 to the T604 have the ability to switch render state intra-tile with zero cycles of overhead - including TMU state.
What I don't see as being equally trivial is the TMU state, and particularly the texture caches. The latter alone would pose a massive bandwidth multiplicity problem if those caches really offered no penalty for context switching. By 'no penalty' I'm referring to eviction of hot data and the associated extra cache misses. So I'm really curious to see how Mali achieves the advertised behavior.
Just to make clear, what I'm referring to is basically this: imagine a TBDR tile having (post-occlusion) content (no translucencies, for simplicity) of:
Code:
AAABBCCC
AABBBCCC
CBBBBCCC
CCCBBCCC
where A-, B- and C-marked zones are pixels belonging to drawcalls A, B and C respectively. The order I think would be natural for filling in the tile would be if each of these zones got processed without intermixing pixels from other zones, i.e. other drawcalls, or IOW, in a drawcall-by-drawcall order. Aside from minimising shader context switching, such a scheme would also optimise the utilisation of texture caches.
Don't you usually have a different drawcall for every model, in order to change the modelview matrix? If that's the case and what you're saying is true, TBDR would rarely help you, since a model will rarely occlude itself. Even if the texture binding stays the same, it could still have totally different texture coordinates from one pixel to the other, so I dunno. Being able to handle texture state changes from pixel to pixel shouldn't be a huge problem. Cache wouldn't be explicitly flushed, but you'd get potentially worse locality of reference - but that's going to happen no matter how you draw the tile.
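To put a rough number on the A/B/C tile example, here is a minimal Python sketch that counts how often consecutive pixels belong to different drawcalls under two fill orders. It only counts drawcall boundaries; it does not model any real texture cache, and the function names are made up for illustration.

```python
# The 8x4 tile from the example, one letter per pixel (drawcall A, B or C).
TILE = [
    "AAABBCCC",
    "AABBBCCC",
    "CBBBBCCC",
    "CCCBBCCC",
]

def scanline_order(tile):
    """Pixels in plain row-major order, ignoring drawcall membership."""
    return [px for row in tile for px in row]

def drawcall_order(tile):
    """Pixels grouped drawcall by drawcall, in first-seen drawcall order."""
    pixels = scanline_order(tile)
    drawcalls = list(dict.fromkeys(pixels))  # unique, order preserved
    return [px for dc in drawcalls for px in pixels if px == dc]

def count_switches(order):
    """Number of times consecutive pixels come from different drawcalls."""
    return sum(1 for prev, cur in zip(order, order[1:]) if prev != cur)

scan_switches = count_switches(scanline_order(TILE))
grouped_switches = count_switches(drawcall_order(TILE))
print(scan_switches, grouped_switches)
```

For this tile, row-major fill crosses a drawcall boundary 9 times, against 2 for the drawcall-by-drawcall order. How much each crossing actually costs in cache misses depends on the textures and coordinates involved, which this sketch deliberately leaves out.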
I'm not sure the fact that IMG suggests binning by opaqueness/alpha-op is related so much to shader context switching as it is to avoiding stalling the entire deferred-shading mechanism through the frag_kill-type of op that alpha-testing is.
I do think there are some render catch-up events in the tile rendering, at the very least when going between opaque, alpha, and alpha test (hence why IMG suggests binning by this), and quite likely when changing shaders. But for things like texture binding or uniform changes, I'm not so sure. So I expect the granularity to be a little coarser than per-draw call.
It's quite likely the case, but to reiterate: it's not the shader context switch that I see the major issue with - it's the extra stress on the texture caches that would be brought by any fill-in scheme other than drawcall-by-drawcall.
Somewhere I recall IMG material claiming that the USSEs can switch to completely independent thread contexts with unique program position et al, so it's not completely out of the question that it can switch shaders on a per-pixel level within the tile. It'd just have to tag the pixels with a shader number (and deal with the case where that gets exhausted).
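The "tag the pixels with a shader number" idea, including the exhaustion case, could look something like the Python sketch below. This is purely a guess at the bookkeeping, not a description of how USSE or any real hardware does it: `MAX_TAGS`, `tag_pixels` and the flush-and-restart policy are all assumptions made for illustration.

```python
MAX_TAGS = 4  # assumed capacity of the per-pixel tag field; not a real hardware figure

def tag_pixels(pixel_shaders, max_tags=MAX_TAGS):
    """Tag each pixel with a small shader number, flushing when tags run out.

    pixel_shaders: iterable of shader identifiers, one per pixel in fill order.
    Returns a list of batches; each batch is a list of (tag, shader) pairs and
    uses at most max_tags distinct shaders. A new batch stands in for a
    hypothetical "render catch-up" when the tag table is exhausted.
    """
    batches, current, table = [], [], {}
    for shader in pixel_shaders:
        if shader not in table:
            if len(table) == max_tags:
                # Tag space exhausted: close the batch and start a fresh table.
                batches.append(current)
                current, table = [], {}
            table[shader] = len(table)
        current.append((table[shader], shader))
    if current:
        batches.append(current)
    return batches

batches = tag_pixels(["s0", "s1", "s0", "s2"], max_tags=2)
print(len(batches))
```

With only two tag slots, the third distinct shader forces a second batch; a real design might instead evict the least-used tag or fall back to per-drawcall processing.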