Is this how shader instruction batching works?

Jawed

In NVidia's latest GPUs, a batch of approximately 1000 pixel fragments executes together.

There appear to be two options for batched execution:

1. In G70, for example, you might have 24 fragments being shaded by a single instruction, all at the same time (SIMD).

On the next clock cycle, the next 24 fragments will be shaded - by the same instruction.

This repeats until every fragment has been shaded by this instruction. That's 42 iterations.

Then the next instruction is loaded up and the GPU performs another 42 iterations on groups of 24 fragments.

2. The GPU processes the entire shader, one instruction at a time, for 24 fragments.

It then proceeds to run the entire shader for each of the next 41 groups of 24 fragments, until the entire batch is shaded.
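
To make the contrast concrete, here's a loop sketch of the two schemes (illustrative C-style code; the batch size, group size and shader length are assumptions for the example, not known hardware values). The two options differ only in which loop is outermost:

#include <cstdio>

constexpr int BATCH_SIZE = 1008; // ~1000 fragments, rounded to a multiple of 24
constexpr int GROUP_SIZE = 24;   // fragments shaded per clock
constexpr int NUM_INSTRS = 8;    // arbitrary shader length for the example

// Stand-in for issuing one shader instruction to a 24-fragment SIMD group.
void issue(int instr, int firstFragment) {
    std::printf("instr %d on fragments %d..%d\n",
                instr, firstFragment, firstFragment + GROUP_SIZE - 1);
}

// Option 1: one instruction sweeps the whole batch before the next is loaded.
void option1() {
    for (int i = 0; i < NUM_INSTRS; ++i)
        for (int f = 0; f < BATCH_SIZE; f += GROUP_SIZE) // 42 iterations
            issue(i, f);
}

// Option 2: each 24-fragment group runs the whole shader before the next group starts.
void option2() {
    for (int f = 0; f < BATCH_SIZE; f += GROUP_SIZE)
        for (int i = 0; i < NUM_INSTRS; ++i)
            issue(i, f);
}

int main() { option1(); }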

The conflict I have here is that 1. seems to require a vast amount of per-fragment state data to be kept - not just register values but also (potentially) the results of a texture operation if that is the instruction being executed across all 1000 fragments simultaneously.

It also raises the spectre of fairly disastrous performance when trying to shade batches of fewer than approximately 1000 fragments.

2. makes me ask "where does the 1000-fragment batch size come from, if the GPU can actually operate on groups of 24 fragments?" What am I missing?

So does anyone know how batches are processed?

Various evidence seems to point to 1. as the most likely execution scheme: the extremely poor behaviour of NV40 in handling per-fragment dynamic branching, and the two-level texture cache with L2 shared across all fragment pipes.

Do ATI's GPUs operate in nominally the same way (but with a much smaller batch size)?

Jawed
 
It's 1, but I'm not sure all quad pipelines are working on the same batch.

Jawed said:
The conflict I have here is that 1. seems to require a vast amount of per-fragment state data to be kept - not just register values but also (potentially) the results of a texture operation if that is the instruction being executed across all 1000 fragments simultaneously.
The result of a texture operation is just one register value. Yes, you need a large register file, and NVidia states that there's enough space for 4 float4 registers per fragment at full speed. If batch size is 1024 fragments, that's 64 KiB for PS registers alone.
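(Checking the arithmetic: 1024 fragments × 4 registers per fragment × 16 bytes per float4 = 65,536 bytes = 64 KiB.)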

But if you want to hide texturing latency, there's no way around a large register file.


R3xx/R4xx pipelines are split into a texturing and an ALU loop. I guess they could actually work in different ways.
 
AFAIK it's 256 quads in flight per quad engine.

Batches can be smaller, but there is an overhead per batch, so the setup engine is optimized to create batches as big as possible.

When working on very big triangles (as in fillrate benchmarks), the 6800's setup engine sends one big batch (1024 quads) to the 4 quad engines, but the 7800's setup engine sends 6 smaller batches (256 quads each), one to each quad engine. At least that's what my branching tests showed when I was working on the 7800 review. Because of that, with some branching cases the 7800 is way faster than the 6800.
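
To see why the smaller per-engine batches help with branching, here's a toy divergence model (my own illustration, not vendor-documented behaviour): a SIMD batch has to execute every branch path that at least one of its fragments takes, so the bigger the batch, the more likely it pays for both sides.

#include <vector>

// Toy model: a SIMD batch pays for each branch path that any fragment takes.
int batchCycles(const std::vector<bool>& takesBranch,
                int costTaken, int costNotTaken) {
    bool anyTaken = false, anyNotTaken = false;
    for (bool t : takesBranch) {
        if (t) anyTaken = true;
        else   anyNotTaken = true;
    }
    return (anyTaken ? costTaken : 0) + (anyNotTaken ? costNotTaken : 0);
}

With one 4096-fragment batch shared across all four quad engines (6800-style), a single divergent fragment drags the whole batch through both paths; with an independent 1024-fragment batch per quad engine (7800-style), only the engine that owns the divergent fragment pays.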
 
Side comment:
I wonder if one could get away with vastly shorter ALU pipelines by separating texturing pipelines from ALU pipelines, thus requiring less register space?
 
Chalnoth said:
Side comment:
I wonder if one could get away with vastly shorter ALU pipelines by separating texturing pipelines from ALU pipelines, thus requiring less register space?

Yes, but it's difficult to keep good performance with dependent texture reads.

I think Intel does that with the GMA. Each of the 4 pipelines has 2 MADs and a texture unit, but it looks like the texture unit is separate.

For example

TEXLD R0, T0, S0
TEXLD R1, T1, S1
TEXLD R2, T2, S2
MAD R3, R0, R1, R2
MAD R3, R3, R3, R3
MAD R3, R3, R3, R3
MAD R3, R3, R3, R3
MAD R3, R3, R3, R3
MAD R3, R3, R3, R3

takes 3 cycles with a GMA: presumably the two MAD units retire the six MAD instructions in three cycles while the separate texture unit services the three TEXLDs in parallel.
 
What's puzzling me is why the pipeline in NV40/G70 is apparently as long as it is.

If we assume that the effective minimum batch size is 256 fragments on each quad, and the pipeline is running one shader instruction repeatedly for the entire batch, what is the reason for making the pipeline apparently "so long"?

The implication is that the pipeline is 64 "cycles long".

Is that right? What on earth is the pipeline doing for 64 cycles?... Is this a side effect of the incredible variety of instruction combinations that may be executed concurrently?

Jawed
 
Most of the time those instructions are likely just sitting in a queue waiting to be executed. The pipeline is long to hide texture latency.
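
As a rough sanity check on that: if a quad engine issues one quad per clock, it needs roughly as many quads in flight as there are cycles of latency to hide, i.e. quads_in_flight ≈ latency_cycles. The 256 quads in flight per quad engine mentioned above would therefore cover on the order of 256 cycles of texture latency.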
 
Jawed said:
What's puzzling me is why the pipeline in NV40/G70 is apparently as long as it is.

If we assume that the effective minimum batch size is 256 fragments on each quad, and the pipeline is running one shader instruction repeatedly for the entire batch, what is the reason for making the pipeline apparently "so long"?

The implication is that the pipeline is 64 "cycles long".

Is that right? What on earth is the pipeline doing for 64 cycles?... Is this a side effect of the incredible variety of instruction combinations that may be executed concurrently?

Jawed

Waiting for memory....
Reading memory non-sequentially can cost on the order of 100 cycles at 500 MHz (around 200 ns). Of course you only pay that sort of penalty for the first read; the following sequential ones are almost an order of magnitude faster.

It's the difference between quoted latency figures and the real latency as seen by the CPU/GPU.
 
Tridam said:
Moreover the pipeline (+ loopback buffer) should be 256 cycles long not 64.

:oops: I divided down too far.

Sometime over the weekend, I will run a French->English translator over your review of the 7800GTX to try to understand your branching benchmarks.

Jawed
 
So, this article shows that 6800Ultra processes fragments in batches of 4096, SIMD across all four quad-pipelines, whereas 7800GTX processes in batches of 1024, MIMD (per quad-pipeline).

That's quite a significant change in the architecture going from 6 series to 7 series, but this seems to be the first article that discusses the point explicitly. Shame it's been obscured. Kudos to HardWare.fr!

I presume the performance gain for 7800GTX with small batches (64x16, 64x24, 64x32) is poor simply because the shader isn't long enough to show much performance difference between the branches.

Anyway, it seems to me that a 1024-fragment batch is still far too big to interest devs (but hey, I'm not a dev).

So far I've presumed that ATI's R300-or-later GPUs use a 256-fragment batch size, because ATI's evaluations led them to a tile size of 16x16 (i.e. 256 fragments), though the tile size is programmable in R420.

Would a batch size of 256 make it noticeably more worthwhile for devs to implement per-fragment branching? It kind of seems unlikely to me.

Jawed
 
It really depends upon what you want to do with dynamic branching.

If you want to use branching as an "early out" for lighting or shadowing, then dynamic branching can be a gain no matter how inefficient it is. See, for example, Humus' demo that uses an early-out for lighting, and bear in mind that using the stencil buffer for branching may not always be possible.
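
For illustration, here's a minimal sketch of that early-out pattern in C-style code (my own example, not Humus' actual shader): fragments facing away from the light skip the expensive per-light work entirely. On a SIMD-batched GPU the skip only pays off when a whole batch, or most of it, takes the cheap path.

struct Vec3 { float x, y, z; };

float dot3(Vec3 a, Vec3 b) { return a.x*b.x + a.y*b.y + a.z*b.z; }

// Early-out lighting: unlit fragments return immediately.
Vec3 shade(Vec3 n, Vec3 l, Vec3 albedo) {
    float ndotl = dot3(n, l);
    if (ndotl <= 0.0f)                    // facing away from the light:
        return Vec3{0.0f, 0.0f, 0.0f};    // take the cheap path
    // ...expensive specular/shadow evaluation would go here...
    return Vec3{albedo.x * ndotl, albedo.y * ndotl, albedo.z * ndotl};
}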
 