Jawed
Legend
In G80, pixel shader batches (warps) are 32-wide. Vertex batches (and presumably primitive batches?) are 16-wide.
How is this implemented? If G80 is, at heart, a 4-clock-per-instruction pipeline, what happens on the other two clocks of a batch?
If G80 is really a 2-clock per instruction pipeline, then why doesn't CUDA allow 16-object warps?
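For concreteness, the arithmetic behind those two questions can be sketched. This toy calculation assumes an 8-wide per-cluster SIMD running at the shader clock (the commonly reported figure for G80; treat it as an assumption, not a confirmed spec):

```python
def clocks_per_instruction(batch_width, simd_width=8):
    """Clocks needed to issue one instruction for a whole batch.

    simd_width=8 is an assumed per-cluster ALU width for G80,
    not a confirmed figure.
    """
    # Each clock processes simd_width elements of the batch,
    # so a batch takes ceil(batch_width / simd_width) clocks.
    return -(-batch_width // simd_width)  # ceiling division

print(clocks_per_instruction(32))  # 32-wide pixel warp  -> 4 clocks
print(clocks_per_instruction(16))  # 16-wide vertex batch -> 2 clocks
```

Under this assumption a 16-wide batch leaves two of the four clocks unaccounted for, which is exactly the puzzle: either those clocks are wasted, or the scheduler can issue a second batch (or a second instruction) into them.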
Generally, small vertex batches are seen as an advantage, because dynamic flow control suffers less slowdown when the objects in a batch take incoherent branches.
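The branching advantage of small batches can be illustrated with a toy model (my sketch, not G80's actual scheduler): a SIMD batch pays for every branch path that at least one of its elements takes, so wider batches are more likely to contain both outcomes and execute both paths.

```python
import random

def avg_divergence_cost(batch_width, taken_prob, n_batches=100_000, seed=0):
    """Average per-batch cost, in branch paths executed.

    Hypothetical model: a coherent batch (all elements agree) costs 1
    path; a divergent batch costs 2, since the SIMD unit must walk
    both sides of the branch with predication.
    """
    rng = random.Random(seed)
    total = 0
    for _ in range(n_batches):
        taken = sum(rng.random() < taken_prob for _ in range(batch_width))
        total += 1 if taken in (0, batch_width) else 2
    return total / n_batches

# With 5% of objects taking the branch, 32-wide batches diverge far
# more often than 16-wide ones, so their average cost is higher:
print(avg_divergence_cost(16, 0.05))
print(avg_divergence_cost(32, 0.05))
```

Under independent 5%-taken branches, roughly 56% of 16-wide batches diverge versus about 81% of 32-wide batches, which is the slowdown gap the smaller vertex batches would be buying.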
At the same time, vertex texturing (at least in DX9) is seen as a niche feature, so there's little incentive to make vertex shader execution tolerant of fetch latency (where bigger batches help). Yet in D3D10, latency-hiding becomes much more important due to the theoretical richness of GS/VS code.
So, how is G80 batching vertices? What effect is it having on throughput and why?
Is this the result of a trade-off based upon post-transform cache size? ROP fillrate (e.g. for Z-only passes)? Peak sampling rate? Or is it nothing more than a bias towards good dynamic branching performance?
Jawed