Hilbert order? There was a paper at Graphics Hardware 2001 from McCool and others about implementing a recursive rasterization method that made use of Hilbert order. But it didn't go into the benefits of Hilbert order as an access pattern for textures or the framebuffer. The key reason for Hilbert order there was that the recursive algorithm could be implemented without a large stack, using an automaton that walks the whole framebuffer along a Hilbert space-filling curve.
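Just to illustrate the automaton idea (this isn't the algorithm from that paper, only a sketch using the standard Hilbert index-to-coordinate conversion), walking a power-of-two framebuffer in Hilbert order needs nothing more than a counter:

```cpp
#include <cstdint>
#include <utility>

// Rotate/flip a quadrant so the curve segments connect at each level.
static void hilbertRotate(uint32_t s, uint32_t &x, uint32_t &y, uint32_t rx, uint32_t ry) {
    if (ry == 0) {
        if (rx == 1) {
            x = s - 1 - x;
            y = s - 1 - y;
        }
        std::swap(x, y);
    }
}

// Map an index d in [0, side*side) to (x, y) on a Hilbert curve covering a
// side x side framebuffer (side must be a power of two).
std::pair<uint32_t, uint32_t> hilbertIndexToXY(uint32_t side, uint32_t d) {
    uint32_t x = 0, y = 0;
    for (uint32_t s = 1; s < side; s *= 2) {
        uint32_t rx = 1u & (d / 2);
        uint32_t ry = 1u & (d ^ rx);
        hilbertRotate(s, x, y, rx, ry);
        x += s * rx;
        y += s * ry;
        d /= 4;
    }
    return {x, y};
}

// Walking the whole framebuffer then needs no stack at all:
// for (uint32_t d = 0; d < side * side; ++d) { auto [x, y] = hilbertIndexToXY(side, d); ... }
```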
For the simulator I also use recursive rasterization, but as I gave up on implementing something that realistic (researching rasterization wasn't that interesting), the algorithm and the recursion live only on the emulation side; the simulator just requests n tiles of fragment quads per cycle and expects to get them on average. I think the different levels of tiles (down to the quad) are generated in Morton order now.
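For reference, Morton (Z) order is just the bit interleave of the tile coordinates, so generating one level of tiles (or the quads inside a tile) in that order amounts to walking a linear index and de-interleaving it. A minimal sketch, independent of the simulator's actual tile sizes:

```cpp
#include <cstdint>

// De-interleave the even bits of a 32-bit Morton code into a 16-bit coordinate.
static uint32_t compactBits(uint32_t v) {
    v &= 0x55555555u;
    v = (v | (v >> 1)) & 0x33333333u;
    v = (v | (v >> 2)) & 0x0F0F0F0Fu;
    v = (v | (v >> 4)) & 0x00FF00FFu;
    v = (v | (v >> 8)) & 0x0000FFFFu;
    return v;
}

// Decode a Morton index into (x, y) tile coordinates.
void mortonDecode2D(uint32_t code, uint32_t &x, uint32_t &y) {
    x = compactBits(code);
    y = compactBits(code >> 1);
}

// Generating a level of tiles in Morton order is just walking the index:
// for (uint32_t i = 0; i < tilesX * tilesY; ++i) { mortonDecode2D(i, tx, ty); ... }
```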
After lazily implementing round robin for 'better' shader workload distribution, I moved to a workload distribution based on tiles (as reported for ATI) and later to a Morton distribution of those tiles in memory and between the quad pipelines, which significantly reduced the extra texture bandwidth consumed and the imbalance in memory accesses.
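Something in the spirit of that distribution could look like the sketch below (the pipeline count is a placeholder and this isn't the simulator's actual assignment, just an illustration of interleaving screen tiles over quad pipelines by Morton index):

```cpp
#include <cstdint>

// Hypothetical parameter, not the simulator's real configuration.
constexpr uint32_t kNumQuadPipes = 4;

// Interleave the bits of x and y into a 2D Morton code.
static uint32_t spreadBits(uint32_t v) {
    v &= 0x0000FFFFu;
    v = (v | (v << 8)) & 0x00FF00FFu;
    v = (v | (v << 4)) & 0x0F0F0F0Fu;
    v = (v | (v << 2)) & 0x33333333u;
    v = (v | (v << 1)) & 0x55555555u;
    return v;
}

static uint32_t mortonEncode2D(uint32_t x, uint32_t y) {
    return spreadBits(x) | (spreadBits(y) << 1);
}

// Assign a screen tile to a quad pipeline by its Morton index, so each 2x2
// block of neighbouring tiles is spread over all four pipelines while the
// tiles themselves stay clustered in memory.
uint32_t tileToPipe(uint32_t tileX, uint32_t tileY) {
    return mortonEncode2D(tileX, tileY) % kNumQuadPipes;
}
```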
The textures were always implemented with multiple levels of tiling and stored using Morton order at each level (down to the texel level).
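As an illustration of that kind of layout (with made-up tile and block sizes, not the simulator's real ones), computing a texel address with two levels of tiling and Morton order inside each level looks roughly like this:

```cpp
#include <cstdint>

// Hypothetical sizes: 4x4-texel tiles grouped into 4x4-tile blocks, Morton
// order inside tiles and blocks, plain row-major at the top level for brevity.
constexpr uint32_t kTileSide  = 4;  // texels per tile side
constexpr uint32_t kBlockSide = 4;  // tiles per block side

static uint32_t spreadBits(uint32_t v) {
    v &= 0x0000FFFFu;
    v = (v | (v << 8)) & 0x00FF00FFu;
    v = (v | (v << 4)) & 0x0F0F0F0Fu;
    v = (v | (v << 2)) & 0x33333333u;
    v = (v | (v << 1)) & 0x55555555u;
    return v;
}

static uint32_t morton2D(uint32_t x, uint32_t y) {
    return spreadBits(x) | (spreadBits(y) << 1);
}

// Offset (in texels) of texel (x, y) in a texture widthInBlocks blocks wide.
uint64_t texelOffset(uint32_t x, uint32_t y, uint32_t widthInBlocks) {
    uint32_t inTile  = morton2D(x % kTileSide, y % kTileSide);
    uint32_t inBlock = morton2D((x / kTileSide) % kBlockSide,
                                (y / kTileSide) % kBlockSide);
    uint64_t block   = (uint64_t)(y / (kTileSide * kBlockSide)) * widthInBlocks
                     + (x / (kTileSide * kBlockSide));
    uint64_t texelsPerTile  = kTileSide * kTileSide;
    uint64_t texelsPerBlock = texelsPerTile * kBlockSide * kBlockSide;
    return block * texelsPerBlock + inBlock * texelsPerTile + inTile;
}
```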
About using the name 'threads' or whatever: it becomes really confusing. When discussing with the other people working on the simulator I end up talking about shader inputs (as we also have vertices and potentially any kind of input going into the shader units), fragments (as it's hard to keep in mind that there aren't just fragments), threads, quads and groups. I tried to avoid 'batches' because the other people working here weren't reading Beyond3D or similar sources that used 'batch' for 'fragments in flight in the shader', and the use of 'batch' for primitives and for the unit of work sent to the GPU with no internal state changes was already well established.
Now I tend to use 'quad' as the minimum work unit for fragment processing, with the shader 'thread' being a multiple of a quad. I then use 'thread', 'thread group' or just 'group' (tending to the latter to avoid confusion) for the group of quads that are scheduled together on a shader and take n cycles to complete. I would say that 'thread' for me is more the hardware concept of having a PC, a position in a schedule window and other related state that is shared by all the quads in a group. A group is just the workload assigned to a thread, which determines how the register storage is reserved. The available threads define the 'thread window' from which work is scheduled. You could have an architecture with the same number of threads but a different group size (R520 vs R580). Then another parameter would be how many of those quads in a group are processed in parallel in a shader (the number of ALUs).
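To make those terms concrete, this is roughly how the parameters relate to each other (a made-up sketch; the numbers in the comments are placeholders, not actual R520/R580 values):

```cpp
#include <cstdint>

struct ShaderConfig {
    uint32_t numThreads;       // hardware threads: a PC plus a slot in the schedule window
    uint32_t quadsPerGroup;    // group size: quads sharing one thread's PC and state
    uint32_t aluQuads;         // quads of a group processed in parallel (number of ALUs)
    uint32_t regsPerFragment;  // registers the program needs per fragment
};

// Register storage reserved when a group is assigned to a thread.
uint32_t regsPerThread(const ShaderConfig &c) {
    return c.quadsPerGroup * 4 /* fragments per quad */ * c.regsPerFragment;
}

// Cycles a thread occupies the ALUs each time it is picked from the window
// (assuming aluQuads divides quadsPerGroup).
uint32_t cyclesPerGroupIssue(const ShaderConfig &c) {
    return c.quadsPerGroup / c.aluQuads;
}

// Two architectures with the same thread window but different group sizes,
// in the spirit of the R520 vs R580 comparison (numbers invented):
// ShaderConfig a{128, 1, 1, 4};
// ShaderConfig b{128, 3, 3, 4};
```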
For very large 'batches' of quads that are scheduled together (what old ATI GPUs and NVidia GPUs still seem to use) I don't like the terms thread or group, because there isn't any real scheduling or true grouping. So I prefer to talk about queues (of quads), even if an implementation may be using and scheduling between a couple of those large queues.
But those are just my personal preferences.