Isn't the main benefit of TBDRs that you don't need to go off-chip as IMRs, meaning you need less bandwidth for a given workload?
I'm assuming you meant to say "as much as IMRs" there, not claim that you don't have to go off-chip at all. That's the benefit of the "TB" part, which saves on render-target bandwidth: standard workloads are reduced to a single stream out to the framebuffer instead of multiple writes per overdrawn pixel or read-modify-writes for alpha blending, and the depth/stencil buffer never has to be updated off-chip. (Note that depth buffer compression on modern IMRs helps reduce how many times the raw buffer has to be touched due to overdraw, but you can still count on at least once per pixel, I think.)
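As a rough back-of-the-envelope sketch of that render-target traffic difference (all numbers are made-up assumptions: RGBA8 color, 32-bit depth, an arbitrary average overdraw, and no compression or caching modeled):

```python
# Illustrative comparison of off-chip render-target traffic, IMR vs TBDR.
# Every constant here is an assumption for the sake of the example.
WIDTH, HEIGHT = 1280, 720
BYTES_COLOR = 4          # assumed RGBA8 framebuffer
BYTES_DEPTH = 4          # assumed 32-bit depth/stencil
OVERDRAW = 2.5           # assumed average fragments rasterized per pixel

pixels = WIDTH * HEIGHT

# IMR: each overdrawn fragment may write color, and depth is
# read-modify-written per fragment (ignoring compression/caches).
imr_bytes = pixels * OVERDRAW * (BYTES_COLOR + 2 * BYTES_DEPTH)

# TBDR: color/depth live in on-chip tile memory; only the final
# color is streamed out once per pixel, and depth never leaves the chip.
tbdr_bytes = pixels * BYTES_COLOR

print(f"IMR : {imr_bytes / 1e6:.1f} MB per frame")   # 27.6 MB
print(f"TBDR: {tbdr_bytes / 1e6:.1f} MB per frame")  # 3.7 MB
print(f"ratio: {imr_bytes / tbdr_bytes:.1f}x")       # 7.5x
```

Obviously real IMRs do much better than this thanks to depth compression and caches; the point is just where the single-stream saving comes from.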
Render-to-texture probably still goes off-chip and back on; I wouldn't expect any optimization to bypass that for the case where the texture is used in the same tile, which is probably not all that common anyway.
The "DR" part saves compute resources and bandwidth, mainly by preventing texture lookups for occluded pixels. Of course this only translates to bandwidth savings when the texture isn't in cache.
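A toy model of what the "DR" part buys, for opaque geometry (this is a sketch of the general idea, not any real GPU's algorithm): resolve visibility for the whole tile in on-chip memory first, then shade, and therefore fetch textures for, only the one surviving fragment per pixel.

```python
# Toy per-tile deferred shading / hidden surface removal.
# Fragments arrive as (x, y, depth, shader_input); visibility is
# resolved entirely in "on-chip" tile memory before any shading.
# Assumes opaque geometry (no blending) and smaller-is-nearer depth.
def shade_tile(fragments, tile_w, tile_h, shade):
    far = float("inf")
    depth = [[far] * tile_w for _ in range(tile_h)]
    winner = [[None] * tile_w for _ in range(tile_h)]

    # Pass 1: depth test only -- occluded fragments never reach the shader,
    # so they never trigger texture lookups.
    for x, y, z, data in fragments:
        if z < depth[y][x]:
            depth[y][x] = z
            winner[y][x] = data

    # Pass 2: shade (and fetch textures for) visible fragments only.
    shaded = 0
    for row in winner:
        for data in row:
            if data is not None:
                shade(data)
                shaded += 1
    return shaded

# Three overlapping fragments on one pixel -> only the nearest is shaded.
count = shade_tile([(0, 0, 0.9, "far"), (0, 0, 0.1, "near"),
                    (0, 0, 0.5, "mid")], 4, 4, lambda d: None)
print(count)  # -> 1
```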
On the flip side, more bandwidth is used for vertex processing than on an IMR, because the vertex stream has to be written back to main memory as binned data and then read back in during the tile processing stage. Some claim this is something like "doubling" the bandwidth, but in reality it's a lot less: the binned data doesn't include anything that was frustum or backface culled (the latter is usually somewhere near 50% of the triangles), it's compressed, and not all of it has to be read at every stage. Besides, with indexed vertexes on an IMR you can't stream all the vertexes straight to the GPU anyway: whatever you're indexing has to be resident in memory first, and static data has to be read from memory regardless. With a shared-memory device I think it'd look something like this:
IMR:
- CPU writes vertexes with all data to RAM
- CPU writes vertex indexes to GPU command FIFO
- GPU reads and shades vertexes with all data through the vertex cache, which fetches from RAM where necessary, then dispatches them for rendering
Tile based:
- CPU writes vertexes with all data to RAM
- CPU writes vertex indexes to GPU command FIFO
- GPU reads vertex clip-space coordinates from RAM, culls/clips, then for the vertexes that pass reads the rest of the vertex data, compresses it, and writes it to the tile bin (plus some additional bandwidth for maintaining the tiling data structures, which I'm sure are cached to some extent)
- GPU reads from tile bin to render the tile
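The cull-then-bin step in the tile-based flow above could be sketched like this (purely illustrative: a 32-pixel tile size and counter-clockwise front faces are assumptions, bounding-box binning is conservative, and none of the compression is modeled):

```python
# Sketch of the binning pass: cull in screen space, then record each
# surviving triangle in every tile its bounding box overlaps.
TILE = 32  # assumed tile size in pixels

def signed_area(tri):
    (x0, y0), (x1, y1), (x2, y2) = tri
    return (x1 - x0) * (y2 - y0) - (x2 - x0) * (y1 - y0)

def bin_triangles(tris, screen_w, screen_h):
    bins = {}  # (tile_x, tile_y) -> list of triangle indices
    for i, tri in enumerate(tris):
        # Backface/degenerate cull, assuming CCW front faces.
        if signed_area(tri) <= 0:
            continue
        xs = [v[0] for v in tri]
        ys = [v[1] for v in tri]
        # Trivial reject for fully off-screen triangles.
        if max(xs) < 0 or min(xs) >= screen_w or max(ys) < 0 or min(ys) >= screen_h:
            continue
        # Conservative bounding-box binning (may over-include tiles).
        for ty in range(max(0, min(ys)) // TILE, min(screen_h - 1, max(ys)) // TILE + 1):
            for tx in range(max(0, min(xs)) // TILE, min(screen_w - 1, max(xs)) // TILE + 1):
                bins.setdefault((tx, ty), []).append(i)
    return bins

tris = [[(10, 10), (60, 10), (10, 60)],   # CCW: kept, spans 4 tiles
        [(10, 60), (60, 10), (10, 10)]]   # same triangle, opposite winding: culled
bins = bin_triangles(tris, 256, 256)
print(sorted(bins))  # -> [(0, 0), (0, 1), (1, 0), (1, 1)]
```

Note how triangle 0's index gets written into four bins: that replication is exactly the extra binning bandwidth being discussed.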
There's some bandwidth increase as well where the tile binner has to create new vertexes, or where vertexes get duplicated because triangles are split across, or just included in, multiple tiles. It'd actually be interesting to see some figures on just how many additional vertexes are created by a tiler that splits triangles. I've heard that Mali, and maybe others, don't split triangles at all and instead render the whole triangle in each tile it's present in, relying on guard-band clipping. If true, I wonder what the actual cost of the guard band is: whether it scales with the number of scanlines or, even worse, has to reject individual pixels that fall outside of it (there's no way that'd be true for a tiler, right..?)
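On the duplication question, a quick toy estimate: with bounding-box binning, an interval of length w dropped at a uniformly random position overlaps w/T + 1 tile boundaries on average, so a w-by-h bounding box overlaps roughly (w/T + 1)(h/T + 1) tiles, and that's how many times its vertex data gets replicated. (The 32-pixel tile size is an assumption; real binners may be tighter than a bounding box.)

```python
# Toy estimate of binned-data amplification from triangles that
# overlap multiple tiles. T is an assumed tile size.
T = 32

def avg_tiles_overlapped(w, h, t=T):
    # Expected tile count for a w x h bbox at a uniform random offset.
    return (w / t + 1) * (h / t + 1)

for size in (8, 32, 128):
    print(f"{size:4d}px bbox -> ~{avg_tiles_overlapped(size, size):.2f} tiles")
# Small triangles cost ~1.5x; tile-sized ones ~4x; big ones blow up fast.
```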