Jawed
In NVidia's latest GPUs, a batch of approximately 1000 pixel fragments executes together.
There appear to be two options for batched execution:
1. In G70, for example, you might have 24 fragments being shaded by a single instruction, all at the same time (SIMD).
On the next clock cycle, the next 24 fragments will be shaded - by the same instruction.
This repeats until every fragment has been shaded by this instruction. That's roughly 42 iterations.
Then the next instruction is loaded up and the GPU performs another 42 iterations on groups of 24 fragments.
2. The GPU processes the entire shader, one instruction at a time, for 24 fragments.
It then proceeds to run the entire shader for each of the remaining 41 groups of 24 fragments, until the entire batch is shaded (I've sketched both loop orderings below).
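To pin down the difference, here's how I picture the two loop orderings - purely illustrative C pseudocode, where the batch size, the 24-wide group, and the instruction/fragment types are assumptions made for the sake of the sketch, not claims about how the hardware is actually wired:

```c
/* Illustrative sketch only: types, sizes and execute() are placeholders. */
#define BATCH_SIZE 1008                      /* ~1000 fragments per batch   */
#define GROUP_SIZE 24                        /* fragments shaded per clock  */
#define NUM_GROUPS (BATCH_SIZE / GROUP_SIZE) /* 42 groups                   */

typedef struct { int opcode; } instruction_t;        /* placeholder */
typedef struct { float r, g, b, a; } fragment_t;     /* placeholder */

/* stub: apply one shader instruction to n fragments */
static void execute(const instruction_t *ins, fragment_t *frags, int n)
{
    (void)ins; (void)frags; (void)n;
}

/* Option 1: one instruction sweeps the whole batch before the next
 * instruction is loaded. */
static void shade_option1(const instruction_t *program, int num_instructions,
                          fragment_t batch[BATCH_SIZE])
{
    for (int i = 0; i < num_instructions; i++)      /* outer loop: instructions */
        for (int g = 0; g < NUM_GROUPS; g++)        /* inner loop: 42 groups    */
            execute(&program[i], &batch[g * GROUP_SIZE], GROUP_SIZE);
}

/* Option 2: each group of 24 runs the entire shader before the next
 * group starts. */
static void shade_option2(const instruction_t *program, int num_instructions,
                          fragment_t batch[BATCH_SIZE])
{
    for (int g = 0; g < NUM_GROUPS; g++)            /* outer loop: 42 groups    */
        for (int i = 0; i < num_instructions; i++)  /* inner loop: instructions */
            execute(&program[i], &batch[g * GROUP_SIZE], GROUP_SIZE);
}
```

Same work either way - the difference is only whether the instruction or the fragment group is the outer loop, which is exactly what decides how much per-fragment state has to stay live at once.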
The conflict I have here is that option 1 seems to require a vast amount of per-fragment state to be kept live - not just register values, but also (potentially) the results of a texture operation, if that's the instruction being executed across all ~1000 fragments simultaneously.
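Some back-of-the-envelope numbers on that state, using purely hypothetical figures (say four live FP32 vec4 temporaries per fragment):

```c
/* Hypothetical storage estimate for option 1: none of these figures are
 * measured, they just illustrate the scale of the problem. */
#include <stdio.h>

int main(void)
{
    const int fragments_in_flight = 1008;   /* ~1000-fragment batch              */
    const int live_temps          = 4;      /* assumed vec4 temporaries/fragment */
    const int bytes_per_temp      = 16;     /* 4 x 32-bit components             */

    const int bytes = fragments_in_flight * live_temps * bytes_per_temp;
    printf("register storage: %d KB\n", bytes / 1024);   /* ~63 KB */
    return 0;
}
```

And that's just the registers - any in-flight texture results would come on top of that.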
It also raises the spectre of fairly disastrous performance when trying to shade batches of fewer than approximately 1000 fragments.
Option 2 makes me ask "where does the 1000-fragment batch size come from, if the GPU can actually operate on groups of 24 fragments?" What am I missing?
So does anyone know how batches are processed?
Various evidence seems to point to option 1 being the most likely execution scheme: the extremely poor behaviour of NV40 when handling per-fragment dynamic branching, and the two-level texture cache, with L2 shared across all fragment pipes.
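On the dynamic branching point, here's my mental model of why scheme 1 hurts: with the whole batch sharing one instruction stream, both sides of a branch get issued for every fragment, and the non-participating fragments are simply masked off. Again, this is a hedged sketch with made-up structures, not a description of what NV40 actually does:

```c
/* Sketch of masked (predicated) execution across a shared instruction
 * stream. Types, sizes and execute_masked() are placeholder assumptions. */
#include <stdbool.h>

#define BATCH_SIZE 1008

typedef struct { int opcode; } instruction_t;                  /* placeholder */
typedef struct { float r, g, b, a; bool take_if; } fragment_t; /* placeholder */

/* stub: apply one instruction to the fragments whose mask bit is set */
static void execute_masked(const instruction_t *ins, fragment_t *frags,
                           const bool *mask, int n)
{
    (void)ins; (void)frags; (void)mask; (void)n;
}

/* if (cond) { then_block } else { else_block } over a whole batch:
 * every instruction of BOTH blocks is issued for all ~1000 fragments,
 * so the cost is roughly then_len + else_len even if only a handful of
 * fragments actually take one of the paths. */
static void shade_branch(const instruction_t *then_block, int then_len,
                         const instruction_t *else_block, int else_len,
                         fragment_t batch[BATCH_SIZE])
{
    bool mask[BATCH_SIZE];

    for (int f = 0; f < BATCH_SIZE; f++)
        mask[f] = batch[f].take_if;
    for (int i = 0; i < then_len; i++)
        execute_masked(&then_block[i], batch, mask, BATCH_SIZE);

    for (int f = 0; f < BATCH_SIZE; f++)
        mask[f] = !batch[f].take_if;
    for (int i = 0; i < else_len; i++)
        execute_masked(&else_block[i], batch, mask, BATCH_SIZE);
}
```

With a smaller batch, far fewer fragments would get dragged through both paths when only a few of them diverge.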
Do ATI's GPUs operate in nominally the same way (but with a much smaller batch size)?
Jawed