Jawed
Legend
In G80, pixel shader batches (warps) are 32-wide. Vertex batches (and presumably primitive batches?) are 16-wide.
How is this implemented? If G80 is, at heart, a 4-clock-per-instruction pipeline, what happens on the other two clocks of a batch?
If G80 is really a 2-clock per instruction pipeline, then why doesn't CUDA allow 16-object warps?
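For concreteness, the arithmetic behind those two questions can be sketched. This toy calculation assumes an 8-wide per-cluster SIMD running at the shader clock (the commonly reported figure for G80; treat it as an assumption, not a confirmed spec):

```python
def clocks_per_instruction(batch_width, simd_width=8):
    """Clocks needed to issue one instruction for a whole batch.

    simd_width=8 is an assumed per-cluster ALU width for G80,
    not a confirmed figure.
    """
    # Each clock processes simd_width elements of the batch,
    # so a batch takes ceil(batch_width / simd_width) clocks.
    return -(-batch_width // simd_width)  # ceiling division

print(clocks_per_instruction(32))  # 32-wide pixel warp  -> 4 clocks
print(clocks_per_instruction(16))  # 16-wide vertex batch -> 2 clocks
```

Under this assumption a 16-wide batch leaves two of the four clocks unaccounted for, which is exactly the puzzle: either those clocks are wasted, or the scheduler can issue a second batch (or a second instruction) into them.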
Generally, small vertex batches are seen as an advantage, because dynamic flow control suffers less slowdown when the objects in a batch take incoherent branches.
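The branching advantage of small batches can be illustrated with a toy model (my sketch, not G80's actual scheduler): a SIMD batch pays for every branch path that at least one of its elements takes, so wider batches are more likely to contain both outcomes and execute both paths.

```python
import random

def avg_divergence_cost(batch_width, taken_prob, n_batches=100_000, seed=0):
    """Average per-batch cost, in branch paths executed.

    Hypothetical model: a coherent batch (all elements agree) costs 1
    path; a divergent batch costs 2, since the SIMD unit must walk
    both sides of the branch with predication.
    """
    rng = random.Random(seed)
    total = 0
    for _ in range(n_batches):
        taken = sum(rng.random() < taken_prob for _ in range(batch_width))
        total += 1 if taken in (0, batch_width) else 2
    return total / n_batches

# With 5% of objects taking the branch, 32-wide batches diverge far
# more often than 16-wide ones, so their average cost is higher:
print(avg_divergence_cost(16, 0.05))
print(avg_divergence_cost(32, 0.05))
```

Under independent 5%-taken branches, roughly 56% of 16-wide batches diverge versus about 81% of 32-wide batches, which is the slowdown gap the smaller vertex batches would be buying.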
At the same time, vertex texturing (at least in DX9) is seen as a niche feature, so there's little incentive to make vertex shader execution tolerant of fetch latency (where bigger batches help). Yet in D3D10, latency-hiding becomes much more important due to the theoretical richness of GS/VS code.
So, how is G80 batching vertices? What effect is it having on throughput and why?
Is this the result of a trade-off based upon post-transform cache size? ROP fillrate (e.g. for Z-only passes)? Peak sampling rate? Or is it nothing more than a bias towards good dynamic branching performance?
Jawed