I wonder why there are no DB figures for the xt1900
It's a conspiracy! It turns out ATI DB performance was so bad, they paid off Mike to suppress the results, using the broken GL driver excuse!
(for the humour impaired, it's a joke!)
Each fp32 register consumes 16 bytes.
Thinking more about it... since it's now possible to hide texture latency via arithmetic ops, having a 16 KB register file per cluster should not be as bad as I thought.
nAo said: I wonder why there are no DB figures for the xt1900
So, you need 4x as many 32-bit registers then.
Maybe each virtual DX10 register does, but the low-level SP registers would not be 128-bit vectorized registers, but 32-bit scalar registers.
Why? It's a scalar architecture. IMHO they simply fetch 4 bytes x 16 per operand per clock per cluster from the register file.
Each fp32 register consumes 16 bytes.
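To make the scalar-vs-vec4 bookkeeping above concrete, here's a rough sketch of the sizes being argued about. The constant names are mine, and the 16 SPs per cluster figure is the one used elsewhere in this thread, not something taken from a vendor document.

```python
# Back-of-the-envelope sizes; all names are illustrative assumptions.
FP32_BYTES = 4          # one 32-bit scalar value
VEC4_COMPONENTS = 4     # x, y, z, w of a "virtual" DX10-style register
SPS_PER_CLUSTER = 16    # assumed, as discussed in this thread

# A vec4 fp32 register costs 16 bytes per fragment...
vec4_register_bytes = FP32_BYTES * VEC4_COMPONENTS

# ...which is the same storage as 4 scalar 32-bit registers.
scalar_registers_per_vec4 = vec4_register_bytes // FP32_BYTES

# On a scalar design, one operand for all SPs in a cluster is 4 bytes x 16.
operand_fetch_bytes = FP32_BYTES * SPS_PER_CLUSTER

print(vec4_register_bytes, scalar_registers_per_vec4, operand_fetch_bytes)
```

So both views agree on total storage; they only differ in how many register names you count and in the width of a single fetch.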
No, you're still thinking in terms of a non-scalar architecture.
If each fragment in flight is assigned 2x fp32s in the register file, that's 32 bytes.
There's 32 fragments per batch. That's 1KB.
So, 16KB would allow for only 16 batches in flight, per cluster.
Jawed
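Jawed's arithmetic can be checked with a few lines. This is only a sketch under the thread's own assumptions (16 bytes per fp32 register, 32 fragments per batch, 16KB per cluster); the function name and parameters are made up for illustration.

```python
def batches_in_flight(register_file_bytes, registers_per_fragment,
                      fragments_per_batch=32, bytes_per_register=16):
    """How many whole batches fit in the register file."""
    bytes_per_fragment = registers_per_fragment * bytes_per_register
    bytes_per_batch = bytes_per_fragment * fragments_per_batch
    return register_file_bytes // bytes_per_batch

# 2 registers per fragment: 32 bytes/fragment, 1KB/batch
two_regs = batches_in_flight(16 * 1024, registers_per_fragment=2)

# 4 registers per fragment: 64 bytes/fragment, 2KB/batch
four_regs = batches_in_flight(16 * 1024, registers_per_fragment=4)

print(two_regs, four_regs)  # 16 8
```

With 4 registers per fragment that is 8 batches of 32, i.e. 256 fragments in flight per cluster.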
That is what the register file is for.
You need to actually hold the contents of registers in memory somewhere.
Of course it doesn't, the SPs would stall. But my premise was that this shader uses 4 vec4 registers, so it MUST do many more ops to use all those input and output operands. That's why your scenario is highly unlikely.
What happens when the shader does a sequence of scalar ops across all fragments in flight? 256 fragments in flight (your number, not mine) will take 16 clocks for a scalar instruction, at 1350MHz. Does G80 hide the latency then?
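The 16-clock figure follows from 256 fragments spread over 16 SPs. Here's a small sketch of that, plus how many back-to-back scalar instructions it would take to cover a given latency. The 200-clock latency below is purely a hypothetical placeholder, not a measured G80 number, and the function names are mine.

```python
import math

def clocks_per_scalar_instruction(fragments_in_flight, sps_per_cluster=16):
    """Clocks to run one scalar instruction over every fragment in flight."""
    return math.ceil(fragments_in_flight / sps_per_cluster)

def instructions_to_cover(latency_clocks, fragments_in_flight, sps_per_cluster=16):
    """Independent scalar instructions needed to hide a latency of that many clocks."""
    per_instruction = clocks_per_scalar_instruction(fragments_in_flight, sps_per_cluster)
    return math.ceil(latency_clocks / per_instruction)

clocks = clocks_per_scalar_instruction(256)
nanoseconds = clocks / 1350e6 * 1e9   # at the 1350MHz shader clock

HYPOTHETICAL_LATENCY_CLOCKS = 200     # assumption for illustration only
needed = instructions_to_cover(HYPOTHETICAL_LATENCY_CLOCKS, 256)

print(clocks, round(nanoseconds, 2), needed)
```

Plug in whatever latency you believe in; the point is just that a single scalar op buys 16 clocks, so the shader needs a run of arithmetic between fetches to avoid stalling.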
OK, well you've decided to draw an arbitrary line at a 16KB register file, which is clearly incapable of hiding the latency in lots of situations.
Of course it doesn't, the SPs would stall. But my premise was that this shader uses 4 vec4 registers, so it MUST do many more ops to use all those input and output operands. That's why your scenario is highly unlikely.
Actually you failed to show such a situation, since your hypothesis can't be valid if you take my premise.
OK, well you've decided to draw an arbitrary line at a 16KB register file, which is clearly incapable of hiding the latency in lots of situations.
And it's wrong, good night.
I just picked the most obvious exception.
Jawed
What about a co-issued attribute interpolation? Or scalar special-function?
It seems each SP can compute a scalar madd per cycle, so it needs 3 FP32 as input and 1 FP32 as output.
Yeah, bedtime for me too.
If they store the same register duplicated 16 x N (N=2?) times in contiguous locations in the register file, they might need to split it into something like four 4 KB banks, each bank having a 512-bit data path.
wrote enough bs tonight, it's time to go to bed
Because branching doesn't work like we want under GLSL with ATI... They try to unroll loops and predicate branches in all public drivers. We have DX versions of the test, they used to behave, but there was a subtle change in the DX spec that is breaking the test for *both* vendors, so we need to figure out how to fix that...
It seems each SP can compute a scalar madd per cycle, so it needs 3 FP32 as input and 1 FP32 as output.
If they store the same register duplicated 16 x N (N=2?) times in contiguous locations in the register file, they might need to split it into something like four 4 KB banks, each bank having a 512-bit data path.
wrote enough bs tonight, it's time to go to bed
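A quick sanity check of the four-bank guess above, assuming 16 SPs per cluster and one scalar MADD per SP per clock as described in the post. Names are illustrative; this only checks that the numbers line up, it says nothing about how the real hardware is built.

```python
# Assumptions taken from the post: 4 banks x 4KB, 512-bit path per bank,
# 16 SPs, each doing one scalar MADD (3 reads + 1 write) per clock.
BANKS = 4
BANK_BYTES = 4 * 1024
BANK_WIDTH_BITS = 512
SPS_PER_CLUSTER = 16
FP32_BYTES = 4

total_register_file_bytes = BANKS * BANK_BYTES      # should come to 16KB
bytes_per_bank_access = BANK_WIDTH_BITS // 8         # bytes moved per bank per clock
operand_row_bytes = SPS_PER_CLUSTER * FP32_BYTES     # one operand for all 16 SPs
accesses_per_madd = 3 + 1                            # a, b, c in; result out

# One bank access carries exactly one operand row, and there is
# one bank available per access the MADD needs.
one_row_per_access = bytes_per_bank_access == operand_row_bytes
enough_banks = BANKS >= accesses_per_madd

print(total_register_file_bytes, bytes_per_bank_access, one_row_per_access, enough_banks)
```

This only works out if the three source operands and the destination land in four different banks on a given clock, which is where the allocation has to be smart.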
I was analyzing what I consider to be some kind of minimum register file bandwidth arrangement.
What about a co-issued attribute interpolation? Or scalar special-function?
Poor choice of words, my bad. I meant that enough space must be allocated in the register file to save the content of register X for the M pixels in flight (M probably being a multiple of 16).
Why would you need to dupe? You don't mean actual duplication of the contents of the register?
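One way to picture "space for register X across M pixels in flight" is a layout where each register gets one contiguous row of M scalar slots. The sketch below is my own guess at such a layout, purely for illustration; `register_offset` and its parameters are hypothetical and not a description of the actual hardware.

```python
def register_offset(reg_index, pixel_index, pixels_in_flight, bytes_per_value=4):
    """Byte offset of scalar register `reg_index` for one pixel,
    with every register stored as a contiguous row of M values."""
    if not 0 <= pixel_index < pixels_in_flight:
        raise ValueError("pixel_index out of range")
    return (reg_index * pixels_in_flight + pixel_index) * bytes_per_value

M = 32  # pixels in flight; assumed to be a multiple of 16

# Offsets of register 2 for the first 16 pixels: one 64-byte contiguous chunk.
row = [register_offset(2, p, M) for p in range(16)]
print(row[0], row[-1])  # 256 316
```

With this kind of layout a single wide read returns the same register for 16 neighbouring pixels, so no per-SP port is needed for that operand.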
What we can reasonably expect is to have the SPs within a cluster all working on the same instruction at any given clock cycle, so that you don't need a register file with a lot of ports, just one with a very wide data bus, as long as you're smart about how register space is allocated in the register file.
The problem is that the 16KB register file would have to have 16x3 read ports (16 SPs * 3-operand MADD per cycle) + 16 write ports, which is practically impossible to design from a circuit perspective. This might be feasible if there were 16 banks, each 3-read, 1-write. However, this depends on how the PDC shared memory works. In the worst case you'd need to get 3 shared-memory operands from 3 different banks to all 16 SPs.
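The port counts in that last post are easy to tabulate. A minimal sketch, using only the figures quoted there (16 SPs, 3 reads and 1 write per MADD); the function name and dictionary keys are made up for illustration.

```python
def port_counts(sps=16, reads_per_op=3, writes_per_op=1):
    """Ports needed by one monolithic register file vs. by each of `sps` banks."""
    monolithic = {"read": sps * reads_per_op, "write": sps * writes_per_op}
    per_bank = {"read": reads_per_op, "write": writes_per_op}
    return monolithic, per_bank

monolithic, per_bank = port_counts()
print(monolithic)  # {'read': 48, 'write': 16}
print(per_bank)    # {'read': 3, 'write': 1}
```

So it's 64 ports on one array versus 4 ports on each of 16 small arrays, or alternatively a few very wide ports if all SPs read the same register on the same clock.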