I would assume that the tile size matches the ROP cache size. However, Nvidia hardware doesn't have dedicated ROP caches, so I'd assume the tile buffer resides in the L2 cache (where they usually keep the ROP outputs). Did you pixel count the tile sizes? My guess would be something between [32x32, 128x128], as that's close to the footprint of traditional ROP caches (at 4 bytes per pixel, 32x32 is 4 KB and 128x128 is 64 KB).
Some years ago I did ROP cache experiments with AMD GCN (7970) in order to optimize particle rendering. GCN has dedicated ROP caches (16 KB color, 4 KB depth). In my experiment I split the rendering into 64x64 tiles (= 16 KB at 4 bytes per pixel). This resulted in huge memory bandwidth savings (and a more than 100% performance increase), especially when the overdraw was large (lots of full-screen alpha-blended particles close to the camera). You can certainly get big bandwidth advantages on AMD hardware as well, as long as you sort your workload by screen locality before submitting it.
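The binning itself is trivial. A minimal HLSL sketch of the idea (simplified names, not the exact code I used back then): compute a 64x64 tile index per particle and use it as a sort key, then submit the particles in tile order.

// Hypothetical binning pass: one thread per particle.
// Produces a per-particle sort key so particles can be submitted tile by tile.
struct Particle
{
    float2 screenPos;   // particle center in pixels (already projected)
    float  radius;      // unused here, just for illustration
    float  pad;
};

StructuredBuffer<Particle> particles : register(t0);
RWStructuredBuffer<uint>   sortKeys  : register(u0);

cbuffer Constants : register(b0)
{
    uint2 tileCount;    // screen size / 64, rounded up
    uint  particleCount;
    uint  pad0;
};

[numthreads(64, 1, 1)]
void BinParticles(uint3 id : SV_DispatchThreadID)
{
    if (id.x >= particleCount)
        return;

    // 64x64 pixel tiles: matches the 16 KB GCN color cache at 4 bytes/pixel.
    uint2 tile = (uint2)(particles[id.x].screenPos / 64.0f);
    tile = min(tile, tileCount - 1);

    // Linear tile index as sort key: sorting by this groups particles that
    // touch the same ROP cache footprint next to each other.
    sortKeys[id.x] = tile.y * tileCount.x + tile.x;
}

A large particle can of course overlap several tiles; handling that (replication or per-tile scissoring) is left out of the sketch.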
It's hard to draw 100% accurate conclusions from the results. They don't yet prove whether Nvidia is just buffering some work + reordering it on the fly to reach a better ROP cache hit ratio, or whether they actually do hidden surface removal as well (saving pixel shader invocations in addition to bandwidth). This particular test shader doesn't allow the GPU to perform any hidden surface removal, since it increments an atomic counter (it has a side effect).
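To be concrete, the test shader is something along these lines (a reconstruction for illustration, not the author's exact code): every invocation bumps a global counter through a UAV, and that side effect is exactly what forbids the GPU from skipping the invocation.

// Reconstruction of the kind of shader used in the test: the UAV atomic is a
// side effect, so the GPU cannot cull the pixel shader invocation.
RWByteAddressBuffer fragmentCounter : register(u1);

float4 CountingPS(float4 pos : SV_Position) : SV_Target
{
    uint order;
    fragmentCounter.InterlockedAdd(0, 1, order);

    // Visualize rasterization order so the tile pattern becomes visible.
    return float4(frac(order / 65536.0f).xxx, 1.0f);
}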
To test HSR, you'd have to enable z-buffering (or stencil) and use the [earlydepthstencil] attribute in the pixel shader. This attribute allows the GPU to skip shading a pixel even when the shader has side effects (the DX documentation is incorrect about this). Submit the triangles in back-to-front order, so that with plain immediate mode rendering the early depth test wouldn't cull anything. I would be interested to see whether this results in zero overdraw on Maxwell/Kepler (both in this simple test with a few overlapping triangles and with higher triangle counts).
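A sketch of what I mean (my own illustration, not something I have run on the hardware in question): the same counter, but with [earlydepthstencil] forcing the depth/stencil test before the shader runs, so any HSR shows up directly as a lower counter value.

// HSR test variant (sketch): with [earlydepthstencil], the GPU is allowed to
// reject pixels before shading even though the shader has a UAV side effect.
// Submit triangles back-to-front with the depth test enabled, so a plain
// immediate mode rasterizer would shade every covered pixel; any reduction in
// the counter then comes from the hardware reordering/culling hidden surfaces.
RWByteAddressBuffer fragmentCounter : register(u1);

[earlydepthstencil]
float4 EarlyDepthCountingPS(float4 pos : SV_Position) : SV_Target
{
    uint order;
    fragmentCounter.InterlockedAdd(0, 1, order);
    return float4(frac(order / 65536.0f).xxx, 1.0f);
}

Comparing the final counter value against (covered pixels x overdraw factor) tells you how many invocations were skipped.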
It would also be interesting to know how many (vertex output) attributes fit in the buffer.
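One simple way to probe that (a sketch, sizes guessed): grow the vertex output struct and watch at which point the tile pattern or batch size changes.

// Attribute pressure test (sketch): each extra float4 adds 16 bytes per vertex
// to whatever on-chip buffer holds the binned vertex outputs. Increase
// NUM_EXTRA_ATTRIBUTES (D3D11 caps the signature at 32 four-component
// registers, SV_Position included) and pair it with a pixel shader that reads
// at least some of these, otherwise the compiler/driver may strip them.
#define NUM_EXTRA_ATTRIBUTES 8

struct VSOut
{
    float4 pos                         : SV_Position;
    float4 extra[NUM_EXTRA_ATTRIBUTES] : EXTRA;   // EXTRA0..EXTRAn
};

VSOut AttributeHeavyVS(float3 inPos : POSITION)
{
    VSOut o;
    o.pos = float4(inPos, 1.0f);

    [unroll]
    for (uint i = 0; i < NUM_EXTRA_ATTRIBUTES; ++i)
        o.extra[i] = float4(inPos, (float)i);   // dummy payload, just occupies space

    return o;
}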
The new (Nvidia and Oculus) multiview VR extensions would definitely benefit from separating the SV_Position part of the vertex shader into its own shader. This would also greatly benefit tiled rendering (do the tile binning first, execute the attribute shader later). I wouldn't be surprised if Nvidia already did something like this in Maxwell or Pascal, as both GPUs introduced lots of new multiview VR extensions.
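Roughly what I mean by the split (an illustration only; no current PC API exposes this to the developer, it would be driver/hardware work):

cbuffer ViewConstants : register(b0)
{
    float4x4 viewProj[2];   // e.g. two views for multiview VR
};

struct Vertex
{
    float3 position;
    float3 normal;
    float2 uv;
};

StructuredBuffer<Vertex> vertices : register(t0);

// Pass 1: position-only shader. Enough for clipping, culling and tile binning.
// Runs per view, but only touches the position data.
float4 PositionOnlyVS(uint vertexId : SV_VertexID,
                      uint viewId   : SV_InstanceID) : SV_Position
{
    return mul(viewProj[viewId], float4(vertices[vertexId].position, 1.0f));
}

// Pass 2: attribute shader. Runs later (per tile), only for vertices whose
// triangles survived binning/culling. The non-position outputs are view
// independent, so the two views could even share them.
// (Not a valid standalone VS on its own; this just shows which work moves where.)
struct Attributes
{
    float3 normal : NORMAL;
    float2 uv     : TEXCOORD0;
};

Attributes AttributeVS(uint vertexId : SV_VertexID)
{
    Attributes o;
    o.normal = vertices[vertexId].normal;
    o.uv     = vertices[vertexId].uv;
    return o;
}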
I just wish Nvidia would be as open as AMD regarding their GPU architecture.