nAo said:
Jawed said:
If that determines the maximum number of concurrently executing pixel shader command threads that can be executing
I can't see any logical reason why the number of pixels output per clock should be linked with the number of threads the hw is running.
I was referring to the number of pixel shader command threads that generate a pixel for the ROPs on a given clock, as opposed to the number of command threads in flight.
Obviously there's a huge queue of pixel shader command threads backed up, ready to generate pixels on the next cycle. In theory there are hundreds or thousands of pixel shader command threads in flight. But per clock there are only 8 pixels being fed to the ROPs, according to the leak.
This is why I think there are 8 Unified Shaders (two quads), each (quad?) fronted by an interpolator (for pixels - obviously vertices are handled separately). Each US can execute any combination of, perhaps, two command threads:
2 vertex
1 vertex, 1 pixel
2 pixel
and at the same time issue one or more texture operations (depending on how many TMUs it has).
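To make the three combinations above concrete, here's a minimal Python sketch that enumerates the unordered thread mixes for one US. The two-slot limit is just my guess from above, and the names (THREAD_TYPES, SLOTS_PER_US, thread_mixes) are made up for illustration, not anything from the patent or the leak.

```python
from itertools import combinations_with_replacement

# Assumption from the post: a US runs two command threads at once,
# and each thread is either a vertex thread or a pixel thread.
THREAD_TYPES = ("vertex", "pixel")
SLOTS_PER_US = 2


def thread_mixes(slots=SLOTS_PER_US):
    """Return every unordered mix of thread types that fills the slots."""
    return list(combinations_with_replacement(THREAD_TYPES, slots))


for mix in thread_mixes():
    print(mix)
```

With two slots this gives exactly the three cases listed: 2 vertex, 1 vertex plus 1 pixel, 2 pixel.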
I think each US has 2 ALU units (one per command thread - the patent even mentions this as a scenario, using two arbiters), each of which can co-issue two vec-4 and one scalar operations (three co-issued ops in each of the two ALU units, 6 ALUs per US in total).
I dunno what you'd do with so much co-issued shader code. Just stumbling around here, in the dark, speculating for the sake of it.
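For what it's worth, the ALU arithmetic in that guess works out as below. Every number here is my speculation from this post (8 US, 2 ALU units per US, two vec-4 plus one scalar op co-issued per ALU unit), not a confirmed spec, and the constant names are invented for the sketch.

```python
# All figures are speculative, taken from the guess in this post.
US_COUNT = 8                  # assumed number of Unified Shaders
ALU_UNITS_PER_US = 2          # assumed: one ALU unit per command thread
VEC4_OPS_PER_ALU_UNIT = 2     # assumed co-issued vec-4 ops
SCALAR_OPS_PER_ALU_UNIT = 1   # assumed co-issued scalar ops


def alus_per_us():
    """Co-issued ops (counted as ALUs) inside a single US."""
    ops_per_unit = VEC4_OPS_PER_ALU_UNIT + SCALAR_OPS_PER_ALU_UNIT
    return ALU_UNITS_PER_US * ops_per_unit


def total_alus():
    """ALUs across the whole chip under the same assumptions."""
    return US_COUNT * alus_per_us()


print(alus_per_us(), total_alus())
```

That is 6 per US and 48 across all eight, if the guess holds.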
Obviously this configuration still leaves the question of what happens when two pixel threads in a US complete at the same time - presumably the Render Backend 350 in the patent (page 4):
http://www.beyond3d.com/forum/viewtopic.php?t=21708
will accept both pixels and queue them. Who knows, eh? You'd expect that the Render Backend would have to sync up the Interpolator-generated multisamples with their owning pixel, which makes for a natural queue that needs to be constructed.
The figure of 8 pixels per clock coming out of R500 is unexpectedly low, to be quite honest. It seems that R500 always produces multisamples for edge pixels, so 4xMSAA comes free, compensating somewhat for the "mere" 8 pixels per clock.
In other words, you can't directly compare the pixel fill rate of R500 to that of R420, for example, because R500's maximum pixel fill rate is the same with 0x and 4xMSAA.
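A rough sketch of what I mean, in Python. The 8 pixels per clock is the leaked figure; the idea that 4xMSAA costs nothing is my reading of it, and the function names are hypothetical. No clock speed is assumed, so everything is per clock.

```python
R500_PIXELS_PER_CLOCK = 8  # figure from the leak


def pixels_per_clock(msaa_samples):
    """Pixel rate under the assumption that multisamples come for free.

    Only the two cases discussed in the post are modelled:
    0 (no MSAA) and 4 (4xMSAA).
    """
    if msaa_samples not in (0, 4):
        raise ValueError("only 0x and 4xMSAA are modelled here")
    return R500_PIXELS_PER_CLOCK


def samples_per_clock(msaa_samples):
    """Samples written per clock: one per pixel without MSAA, else N per pixel."""
    return pixels_per_clock(msaa_samples) * max(msaa_samples, 1)


for mode in (0, 4):
    print(mode, pixels_per_clock(mode), samples_per_clock(mode))
```

So the pixel rate stays at 8 in both modes while the sample rate goes from 8 to 32, which is why a straight pixel fill rate comparison undersells it.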
Jawed