G80 Vertex Shading Batches - 16 vertices - how? why?

Jawed

In G80 pixel shader batches (warps) are 32-wide. Vertex batches (and presumably primitive batches?) are 16-wide.

How is this implemented? If G80 is, at heart, a 4-clock-per-instruction pipeline, what happens on the other two clocks of a batch?

If G80 is really a 2-clock per instruction pipeline, then why doesn't CUDA allow 16-object warps?
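For reference, a quick Python sketch of the arithmetic behind the question, assuming each cluster is an 8-wide SIMD of SPs and an instruction is simply repeated until the whole batch has gone through (the widths are the ones under discussion, not confirmed figures):

    # Assumed figures: 8 SPs per cluster (the SIMD width); batch widths as discussed above.
    SIMD_WIDTH = 8

    def clocks_per_instruction(batch_width):
        # Each instruction is repeated until every object in the batch has passed through the SIMD.
        return batch_width // SIMD_WIDTH

    print(clocks_per_instruction(32))  # 32-wide pixel batch: 4 shader clocks per instruction
    print(clocks_per_instruction(16))  # 16-wide vertex batch: 2 shader clocks, hence the question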

Generally, small vertex batches are seen as an advantage, because dynamic flow control suffers less of a slowdown when a batch requires incoherent branches.

At the same time, vertex texturing (at least in DX9) is seen as a minority interest, so there's little incentive to make vertex shader execution tolerant of fetch latency (which is where bigger batches help). Yet in D3D10, latency-hiding becomes much more important due to the theoretical richness of GS/VS code.

So, how is G80 batching vertices? What effect is it having on throughput and why?

Is this the result of a trade-off based upon post-transform cache size? ROP fillrate (e.g. for Z-only passes)? Peak sampling rate? Or is it nothing more than a bias towards good dynamic branching performance?

Jawed
 
G80 is 2 clocks per instruction for the SPs, not 4 (not for all instructions, granted, but most). Pixel batch size for regular rendering is a function of the rasteriser, I had assumed (8 quads at a time) :oops:

And I think threads per warp under CUDA is restricted because of register count, although I'm not 100% sure, it's been ages since I poked it. Arun might recall.
 
I've been wondering about this too. I have no idea if this is correct, but my guess is it's an attempt to have their cake and eat it too:

I think it's reasonable to assume that each 8-way SIMD unit has a fixed upper bound on the number of batches it can have "live" at any given time. Given that limit, allowing double-size batches lets you have more live threads at a time, improving your ability to hide latency. On the other hand, double-size batches have worse dynamic branching efficiency, and a higher chance of wasting cycles if there aren't lots of threads available.
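To put rough numbers on that trade-off, here's a little Python sketch (the cap on live batches is made up purely for illustration, not a known G80 limit):

    # Illustrative only: pretend each 8-way SIMD unit can track a fixed number of live batches.
    MAX_LIVE_BATCHES = 24  # made-up cap for the sake of the example

    def threads_in_flight(batch_width, live_batches=MAX_LIVE_BATCHES):
        return batch_width * live_batches

    print(threads_in_flight(16))  # 384 threads available to hide latency
    print(threads_in_flight(32))  # 768 threads, at the cost of coarser branching granularity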

David Kirk did an excellent job bluffing about unified architectures, but a lot of what he said is true. In particular, it's hard to make a unified architecture that's as efficient for a particular workload as a non-unified architecture. So looking at current apps designed around last gen GPUs, having 16-way geometry and 32-way pixel seems like an attempt to cater to both workloads:

Geometry
  • few(er) threads, apps designed for low geom : pixel ratios
  • little latency tolerance needed
  • relatively branchy (VS DB perf has been good since NV30/R300)

Pixel
  • lots of threads (relatively)
  • high latency tolerance needed
  • normally straight-line

With 16-way batches, G80 is actually a step backwards for VS DB efficiency, but with 32-way batches it's a huge improvement over G7x/R4xx PS DB efficiency and roughly on par with R5xx.
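To see why batch width matters so much for dynamic branching, a toy Python model: a batch ends up paying for the expensive side of a branch if even one of its objects takes it, so the odds of divergence grow quickly with width (the 10% per-object probability is just an example):

    # Toy model: a batch executes the "expensive" side of a branch if at least
    # one of its objects takes it; p_expensive is the per-object probability.
    def p_batch_diverges(batch_width, p_expensive=0.1):
        return 1 - (1 - p_expensive) ** batch_width

    for width in (4, 16, 32, 64):
        print(width, round(p_batch_diverges(width), 2))
    # 4  -> 0.34
    # 16 -> 0.81   (G80 vertex batch)
    # 32 -> 0.97   (G80 pixel batch)
    # 64 -> 1.0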


I have a crazier theory about these double-wide batches and the missing MUL: if and when it shows up I wouldn't be surprised if it only helps pixel threads, and only when executing a MUL instruction within a batch. In other words you couldn't issue a MAD in one batch in parallel with a MUL in another batch. Instead, 16 threads in the batch would execute a MUL on the MAD pipeline and the other 16 threads would execute the MUL on the SFU pipeline, so the 32-way batch would only consume 2 shader clocks of time. The four-clock scheduling time allowed for pixel threads would give the scheduler enough time to set up this special case.
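In cycle-counting terms, the speculation looks something like this (a Python sketch of the theory above, not a description of the real scheduler; the 8-wide MAD and SFU pipes are assumed widths):

    # Speculative cycle counts for a 32-wide pixel batch, assuming an 8-wide MAD
    # pipeline plus an 8-wide SFU pipeline that could hypothetically take MULs.
    MAD_WIDTH = 8
    SFU_WIDTH = 8

    def batch_clocks(batch_width, op):
        if op == "MUL":
            return batch_width // (MAD_WIDTH + SFU_WIDTH)  # batch split across both pipes
        return batch_width // MAD_WIDTH

    print(batch_clocks(32, "MAD"))  # 4 shader clocks
    print(batch_clocks(32, "MUL"))  # 2 shader clocks, if the theory holds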
 
G80 is 2 clocks per instruction for the SPs, not 4 (not for all instructions, granted, but most).
OK, I've misconstrued that.

Pixel batch size for regular rendering is a function of the rasteriser, I had assumed (8 quads at a time) :oops:
Or maybe it's related to interpolation-scheduling, since interpolation occurs on-demand?

And I think threads per warp under CUDA is restricted because of register count, although I'm not 100% sure, it's been ages since I poked it. Arun might recall.
I can't figure out the meaning of the implied relationship there :cry:

Jawed
 
Just to be sure, you guys are talking about the number of clocks between changes in the instruction being executed by an SP, right?

I would think the latency is longer than that, and throughput is still one instruction per clock, other than for special functions.
 
I can't figure out the meaning of the implied relationship there :cry:
I was just thinking that there's a limited upper bound on registers per thread (per warp), and maybe that's the reasoning for capping warps per cluster/multiprocessor. I'm honestly not sure, and a quick glance through Kirk's CUDA presentations and the programmer's guide doesn't bring an answer (that I can see, I've just skipped through both).
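For what it's worth, the usual occupancy arithmetic would look like the sketch below; the 8192-register figure is the one given for G80-class parts in the CUDA programming guide, while the per-thread register counts are just examples:

    # Registers as the limiter on warps per multiprocessor (illustrative).
    REGISTERS_PER_MP = 8192  # per-multiprocessor figure from the CUDA programming guide
    WARP_SIZE = 32

    def register_limited_warps(registers_per_thread):
        return REGISTERS_PER_MP // (registers_per_thread * WARP_SIZE)

    print(register_limited_warps(10))  # 25 warps fit by register count alone
    print(register_limited_warps(32))  # only 8 warps fit; heavy register use caps occupancy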
 
I have a crazier theory about these double-wide batches and the missing MUL: if and when it shows up I wouldn't be surprised if it only helps pixel threads, and only when executing a MUL instruction within a batch. In other words you couldn't issue a MAD in one batch in parallel with a MUL in another batch. Instead, 16 threads in the batch would execute a MUL on the MAD pipeline and the other 16 threads would execute the MUL on the SFU pipeline, so the 32-way batch would only consume 2 shader clocks of time. The four-clock scheduling time allowed for pixel threads would give the scheduler enough time to set up this special case.

Wow, I think that beats even Jawed's stuff for creativity :D It would be pretty cool if true, since there would be no need to compile for instruction co-issue, but isn't this an even hairier proposition? The compiler/scheduler would have to try to schedule MULs for the times the SFU pipeline is free. And what are the implications of variable 2- or 4-cycle pixel batches?

Actually, come to think of it, didn't Nvidia explicitly state that each individual G80 SP was capable of MAD+MUL co-issue?
 
Has anybody noticed that, according to many vertex shader tests of the G80 GTS and GTX (example), their VS performance differs by only ~12%? The performance difference should be about 50% (1.33x the SPs x 1.125x the frequency ≈ 1.5x).

A difference of 12% corresponds to the different clocks (1350MHz vs. 1200MHz ≈ 12%), and one of my friends vouched that a GTS at 1350MHz has the same VS performance as a GTX at 1350MHz.

It seems that both the GTS and GTX use the same (fixed) number of SPs for vertex processing, or that G80 isn't fully unified and only a limited number of SPs (the same for GTS and GTX) can be used for vertex shading... Is this a HW issue or a driver problem? Couldn't this be the reason for the lower GS performance?
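For reference, the scaling arithmetic from above, spelled out in Python with the stock GTS/GTX figures:

    # Expected VS scaling if vertex work scaled with SP count and shader clock.
    gtx_sps, gtx_mhz = 128, 1350
    gts_sps, gts_mhz = 96, 1200

    expected = (gtx_sps / gts_sps) * (gtx_mhz / gts_mhz)
    clocks_only = gtx_mhz / gts_mhz

    print(round(expected, 2))     # 1.5  -- what the SP counts and clocks predict
    print(round(clocks_only, 2))  # 1.12 -- roughly the ~12% actually measured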
 
OK, the question is why the GTX and a GTS at 1350MHz have the same score in the 3DM06 VS tests, and why G84 scores similarly to G80 (or even better)?
 
Hmm, that's interesting.

Playing with the RightMark results that trinibwoy has linked indicates that the GTX is ~49% faster than the GTS (theoretically it should be 50%), based on averaging all those tests.

Also, amusingly, if you take the theoretical scalar throughput of HD2900XT, which is 5*64*742 = 237,440 million scalar instructions per second (ignoring SF), HD2900XT is theoretically 37% faster than 8800GTX. In the RightMark tests, it averages 38% faster.
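Spelled out in the same way (Python, same peak-rate arithmetic, ignoring the SF and missing-MUL questions):

    # Peak scalar rates in millions of instructions per second (scalar ALUs x MHz).
    r600_peak = 5 * 64 * 742   # HD2900XT: 237,440
    g80_peak = 128 * 1350      # 8800GTX: 172,800

    print(round(r600_peak / g80_peak - 1, 2))  # 0.37, i.e. ~37% faster on paper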

So, it seems that the 3DMk06 VS tests are very impure tests, infected by other aspects of the architecture.

Peculiarly the static and dynamic-branching tests at the end there (the last two tests) don't behave very well on G80. Am I interpreting that correctly? Are those last two tests for branching?

Jawed
 
I think this is a question of what the 3DMk06 tests render on screen. I've got no idea what's rendered.

Also, it might be an artefact of setup throughput. If setup is bottlenecked through a single unit for all vertices, then maybe it's the GPU's setup rate that 3DMk06's complex test is really measuring?
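As a rough sanity check on that idea, a Python sketch assuming the oft-quoted one triangle per core clock setup rate (an assumption, not a measured figure):

    # If setup runs at one triangle per core clock, it can become the ceiling in a "VS" test.
    gtx_core_mhz = 575
    setup_ceiling = gtx_core_mhz * 1_000_000  # ~575M triangles/s

    print(setup_ceiling)
    # A test that pushes enough geometry hits this ceiling long before it saturates
    # 128 SPs, which would mask the SP-count difference between GTS and GTX.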

Jawed
 
Peculiarly the static and dynamic-branching tests at the end there (the last two tests) don't behave very well on G80. Am I interpreting that correctly? Are those last two tests for branching?

Yeah, performance drops on the static test for some reason. But I'd be a lot more concerned about R600's DB performance in that fractal shader on the following page.
 