Bob said:
Texture isn't the only client to memory, especially not in DX10.
Well, you got me. What other clients are you thinking of that are specific to a unified architecture, as opposed to merely DX10?
Consider this example: The application clears Z to 1.0f and sets the depth compare mode to LESS. It then draws a green triangle at z = 0.5f followed by a perfectly overlapping red triangle at the same Z value.
The final output *must* be green. If you process either pixels or triangles in a different order than they were issued by the application, then you'll start seeing red.
You can't always predict latency from texture, so you can't predict how long the green triangle may be delayed with respect to the red triangle. So although you can run the PS for the two triangles in arbitrary order, you do need to resolve the Z test in ROP in the correct order.
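To make that concrete, here's a minimal single-pixel sketch of the LESS depth test (all names made up); swap the two draw calls and the surviving colour flips to red:

```cpp
#include <cstdio>

// Minimal sketch of one pixel's ROP depth test (hypothetical names).
// Depth cleared to 1.0f, compare mode LESS, as in the example above.
struct Pixel { float depth = 1.0f; unsigned color = 0; };

void drawFragment(Pixel& p, float z, unsigned color) {
    if (z < p.depth) {        // LESS compare: strictly nearer wins
        p.depth = z;
        p.color = color;
    }
}

int main() {
    Pixel p;
    drawFragment(p, 0.5f, 0x00FF00);  // green triangle at z = 0.5
    drawFragment(p, 0.5f, 0xFF0000);  // red triangle, same z: fails LESS
    std::printf("%06X\n", p.color);   // prints 00FF00 -- must stay green
    return 0;
}
```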
That's a good argument.
I guess we'll have to ask ATI how they solved it.
Does DX actually require that the triangle should be green? It looks to me like a classic example of a trap that programmers fall into, where they assume that submission order is rendering order. But I'm out of my depth here...
Oldest-first is not always the optimal scheduling policy. You can severely thrash your texture cache this way.
You can also deadlock this way: If the PS runs really slow, your oldest thread is now a vertex thread. But vertices are blocked by the raster unit because there are no free resources to run PS on. So you can't just pick the oldest thread. You now need to walk the thread list and find a thread that can run. This can be a rather involved and expensive process.
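Something like this is what "walk the thread list" amounts to (a toy sketch, all names hypothetical):

```cpp
#include <deque>

// Hypothetical thread descriptor: oldest-first with a runnability check,
// so a blocked vertex thread at the head can't deadlock the machine.
enum class Kind { Vertex, Pixel };

struct Thread {
    Kind kind;
    bool resourcesFree;  // e.g. PS register/output space is available
};

// Pick the oldest *runnable* thread, not simply the oldest thread.
// Worst case this walks the whole list, which is the expense noted above.
Thread* pickThread(std::deque<Thread>& threads) {
    for (Thread& t : threads) {
        if (t.resourcesFree)
            return &t;
    }
    return nullptr;  // nothing can run this clock
}
```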
Which ATI is keeping secret as far as I can tell. It's simply alluded to in the patent.
But, anyway, since the post-vertex cache is of a fixed size, the scheduler knows there's no point finishing yet more vertex batches. So PS automatically gets priority. And with deterministic execution times in the ALU pipes, the scheduler can see the stall coming.
If all PS is extremely texture-heavy, then of course the GPU will reach a stage where it becomes texture-bandwidth bound. Obviously you don't want to increase the problem by texture cache thrashing, so the scheduler needs to take account of shader state too (i.e. which batches are on the same triangle) and schedule them contiguously as each batch gets its texture results.
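Purely as an illustration of that kind of policy (this is not ATI's actual scheduler, just a guess at the shape of it):

```cpp
// Illustrative priority scoring only -- not ATI's actual algorithm.
struct Batch {
    bool isVertex;
    int  stateId;          // shader state / triangle group; equal ids share textures
    bool texResultsReady;  // this batch's texture fetches have returned
};

struct SchedState {
    bool postVsCacheFull;  // no point producing yet more vertices
    int  lastStateId;      // what just ran, for texture-cache locality
};

int score(const Batch& b, const SchedState& s) {
    int p = 0;
    if (b.isVertex && s.postVsCacheFull)  p -= 100;  // starve VS, favour PS
    if (!b.isVertex && b.texResultsReady) p += 10;   // results arrived, run now
    if (b.stateId == s.lastStateId)       p += 5;    // keep the texture cache warm
    return p;
}
```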
You don't only need to single-thread the setup engine. You need to ensure that the setup engine's queue is filled up in order. This means a lot more buffering at the output of the shader, if you want to be able to still do useful work while the rasterizer is busy (like, for example, pixel shader work to unblock the rasterizer).
Well, I don't think buffer space is a particularly harsh constraint if you're talking about 20-50 (or so) vertices at a few hundred bytes per vertex.
Presumably, also, vertices and triangles are sequentially ID'd, in order to make predication work under DX10, so ordering isn't a particularly difficult nut to crack.
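With sequential IDs, in-order retirement is basically a reorder buffer, e.g. this hypothetical sketch:

```cpp
#include <cstdint>
#include <map>

// Hypothetical reorder buffer: vertices/triangles finish shading in any
// order, but drain into the setup queue strictly by sequence ID.
template <typename Prim>
class ReorderBuffer {
    std::map<uint64_t, Prim> pending_;  // keyed by issue-order ID
    uint64_t nextToRetire_ = 0;

public:
    void complete(uint64_t id, Prim p) { pending_.emplace(id, p); }

    // Returns true while the next-in-order primitive is ready to emit.
    bool retire(Prim& out) {
        auto it = pending_.find(nextToRetire_);
        if (it == pending_.end()) return false;  // still in flight
        out = it->second;
        pending_.erase(it);
        ++nextToRetire_;
        return true;
    }
};
```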
It's not a question of latency (although that does play a big role) but of bandwidth. If you want to just issue MADs back to back, you need three 128-bit reads per clock from arbitrary registers. RF banking can help, if you limit the number of threads you run.
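For a sense of scale (ALU count and clock are invented purely for illustration):

```cpp
// Register-file read bandwidth for back-to-back vec4 MADs.
// ALU count and clock are invented numbers, for illustration only.
constexpr double kAlus            = 16;     // vec4 ALUs
constexpr double kClockHz         = 500e6;  // 500 MHz
constexpr double kBytesPerOperand = 16;     // one 128-bit register
constexpr double kReadsPerMad     = 3;      // a*b + c

constexpr double kReadBandwidth =
    kAlus * kClockHz * kReadsPerMad * kBytesPerOperand;  // = 384e9 B/s
```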
But in a multi-pipe conventional GPU you have the same problem. Every clock you're loading/saving new registers, because on every clock you're shading different pixels.
So it's just a bigger version of an existing problem.
Consider a GPU with 2 unified shader pipes that can run either VS or PS threads. If you transform a triangle on one shader pipe, you don't want to pay the cost of transmitting all that data to the other shader pipe if you can help it. Instead, you want to keep everything on the same shader pipe and run the PS threads for that triangle. That way, you don't need to shuffle vertex attributes throughout the whole chip.
The other shader unit can, in the meantime, work on some other triangle.
You may want to transmit big triangles between shader pipes, though, so you don't starve the other pipes, but that transfer can be rather expensive.
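As a toy model of that trade-off (the threshold and names are invented):

```cpp
// Toy model: keep a triangle's pixel work on the pipe that transformed it,
// unless it's big enough that load-balancing repays the attribute transfer.
constexpr int kNumPipes        = 2;
constexpr int kBigTriangleArea = 1024;  // pixels; invented threshold

int choosePipeForPixels(int transformingPipe, int triangleAreaPixels,
                        const int pipeLoad[kNumPipes]) {
    if (triangleAreaPixels < kBigTriangleArea)
        return transformingPipe;  // attributes already local, no shuffle

    // Big triangle: worth paying the transfer to the least-loaded pipe.
    int best = 0;
    for (int p = 1; p < kNumPipes; ++p)
        if (pipeLoad[p] < pipeLoad[best]) best = p;
    return best;
}
```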
Xenos, by contrast, is always running different batches on consecutive clocks. It is continually swapping batches into and out of context. So in Xenos it's immaterial "which pipe did the triangle".
And, anyway, even current GPUs only run one instruction per fragment, before switching that fragment out of context and switching another fragment (from the same batch) into context. So the register-swap-frenzy is nothing new.
500 M tris/sec * 3 vertices * 32-bit float * 64 scalar attributes == 384 GB/sec of internal bandwidth. If you say you only want to run with 8 scalar attributes at full speed (i.e. position and some texcoords or colors), then your internal bandwidth is "only" 48 GB/sec. If you can somehow optimize this down to one vertex / triangle on average (which, btw, has a whole other set of issues), you're down to a more manageable 16 GB/sec. And you haven't even done any texturing yet!
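Spelling the arithmetic out, with the same numbers:

```cpp
#include <cstdio>

int main() {
    const double trisPerSec     = 500e6;  // 500 M triangles/sec
    const double bytesPerScalar = 4;      // one 32-bit float

    // 3 vertices/triangle, 64 scalar attributes each:
    double full   = trisPerSec * 3 * 64 * bytesPerScalar;  // 384 GB/s
    // 3 vertices, only 8 scalars (position + a few texcoords/colors):
    double slim   = trisPerSec * 3 * 8 * bytesPerScalar;   //  48 GB/s
    // Ideal 1 vertex/triangle reuse, 8 scalars:
    double reused = trisPerSec * 1 * 8 * bytesPerScalar;   //  16 GB/s

    std::printf("%.0f / %.0f / %.0f GB/s\n",
                full / 1e9, slim / 1e9, reused / 1e9);
    return 0;
}
```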
Maybe it's time for that nicely detailed Xenos diagram:
The way I see it you're talking about a dedicated point-to-point link here. No other data travels down this link. A 256-bit link running at twice the core clock is going to be in the right ball-park.
Don't current GPUs already have a link to do this kind of work? If not, why is this unique to a unified architecture?
Jawed