There's definitely some weirdness in the 970 that David tested, though that is almost certainly related to some clusters being disabled. On "fully enabled" parts you don't see any of that weird hashed run-ahead of multiple tiles - it's all very balanced and it goes from one tile to the next.
From my testing on a GTX 970, I think the explanation for the weirdness is something like:
The screen is split into tiles of 16x16 px, which are assigned to 4 interleaved partitions. (My terminology is probably all wrong, but never mind.) Each partition corresponds to a GPC, and the number of tiles in each is proportional to the number of SMMs enabled in that GPC. The 970 has 13 SMMs in total, so the partitions are unequal. Each partition gets rasterised almost completely independently.
I don't know how to tell whether pixel shaders for each tile are restricted to running on the SMMs in the GPC corresponding to that tile's partition, but I guess it would make sense if they were.
On my device I believe the assignment pattern is
Code:
p = [0,1,2,3,0,2,3,0,1,3,0,1,2]
partition(x, y) = p[(x + y*2) % 13]
where x,y are the tile index starting from the top left of the screen. That gives partition sizes in the ratio 4:3:3:3.
On David's video, his looks more like
Code:
p = [0,1,2,3,0,1,2,0,1,3,0,1,2]
partition(x, y) = p[(x + y*2) % 13]
That gives the ratio 4:4:3:2. I assume that corresponds to a different arrangement of disabled SMMs in his device.
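As a quick sanity check on those ratios, here is a minimal Python sketch of the two lookup tables above. The names `P_MINE`, `P_DAVID`, `partition` and `partition_sizes` are made up for illustration, and the mapping itself is only what was inferred from watching the rasterisation order, not anything documented.

```python
from collections import Counter

# Lookup tables as guessed in the post (one entry per enabled SMM, 13 in total).
P_MINE = [0, 1, 2, 3, 0, 2, 3, 0, 1, 3, 0, 1, 2]
P_DAVID = [0, 1, 2, 3, 0, 1, 2, 0, 1, 3, 0, 1, 2]


def partition(table, x, y):
    """Partition index for the tile at column x, row y (origin at top left)."""
    return table[(x + y * 2) % len(table)]


def partition_sizes(table, num_partitions=4):
    """Tiles per partition over one full period of the table."""
    counts = Counter(table)
    return [counts[i] for i in range(num_partitions)]


if __name__ == "__main__":
    for name, table in (("mine", P_MINE), ("David's", P_DAVID)):
        print(name, partition_sizes(table))
        # Show a small patch of the screen, one digit per 16x16 px tile.
        for y in range(4):
            print("  " + "".join(str(partition(table, x, y)) for x in range(26)))
```

Because the index is `(x + y*2) % 13`, every row of 13 consecutive tiles visits each table entry exactly once, so the per-partition counts in the table are also the long-run ratios on screen.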
The smaller partitions finish quicker, so the pattern becomes clearly visible as the partitions diverge.
From the video, the GTX 1070 (3 GPCs) looks more like
Those partitions are equal over an infinite area, but don't fit uniformly into the ~512x512 px region that gets rasterised first, so the pattern becomes visible when the partition that's smaller in the first region starts the next region before the others do. Devices with 2 or 4 GPCs should have a much less visible pattern, since everything divides nicely there.
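One part of that argument is plain arithmetic and can be checked directly. The sketch below assumes the first region is exactly 512x512 px (the post only says "~512x512") and uses the 16x16 px tile size from earlier. The helper name `leftover_tiles` is made up, and divisibility is only a necessary condition for an even split, not the whole story.

```python
REGION_PX = 512  # assumed size of the region rasterised first ("~512x512" in the post)
TILE_PX = 16     # tile size from earlier in the post

tiles_per_side = REGION_PX // TILE_PX    # 32 tiles along each edge
tiles_in_region = tiles_per_side ** 2    # 1024 tiles in the region


def leftover_tiles(num_partitions):
    """Tiles left over if the region's tiles are split evenly between partitions."""
    return tiles_in_region % num_partitions


if __name__ == "__main__":
    for gpcs in (2, 3, 4):
        print(f"{gpcs} GPCs: {tiles_in_region} tiles, {leftover_tiles(gpcs)} left over")
```

With 3 partitions the 1024 tiles cannot be shared out equally, whatever the interleaving pattern is, whereas with 2 or 4 they can.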
(I'm mildly surprised they don't do something like "(x + y) % 3" on the 1070 to make thin vertical objects get distributed more evenly between the partitions.)
Vertices/triangles are fully buffered (with all attributes) on-chip, up to about 2k triangles (depending on the SKU and vertex output size) before a tile "pass" is run. Again, this gets a lot more complicated when not considering full-screen triangles, but I think keeping the original article high level makes sense.
It also looks compressed to me - I see it handling a lot more triangles per pass if I put duplicated values in the vertex shader outputs than if the values are all unique. So that makes it even more complicated to analyse.
(But I'm certainly not an expert so I'd be happy to learn if I'm misinterpreting all this stuff!)