Basic,
Just had another (I think pretty good) idea:
Each pipe processes either a 2x2 block of pixels or, for 4xSSAA,
a 2x2 block of subpixels.
Now, each pixel pipe can issue 4 32-bit fp ops per cycle,
and executes 4 pixels p1, p2, p3, p4 concurrently as follows
(instructions i1, i2, ..., in):
Code:
cycle instruction
1 i1 (operating on p1.x, p2.x, p3.x, p4.x)
2 i2 (operating on p1.y, p2.y, p3.y, p4.y)
3 i3 (operating on p1.z, p2.z, p3.z, p4.z)
4 i4 (operating on p1.w, p2.w, p3.w, p4.w)
5 i5 (operating on p1.x, p2.x, p3.x, p4.x)
6 i6 (operating on p1.y, p2.y, p3.y, p4.y)
7 i7 (operating on p1.z, p2.z, p3.z, p4.z)
8 i8 (operating on p1.w, p2.w, p3.w, p4.w)
...
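The schedule above can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the names `expand` and `run_quad`, the register naming (`r0.x` etc.) and the one-register-file-per-pixel layout are hypothetical, not taken from any real ISA. It shows a vec4 instruction being lowered to scalar ops, with the swizzle and write mask resolved purely by picking register names, and each scalar op being applied to all 4 pixels in one cycle.

```python
import operator

COMPONENTS = "xyzw"

def expand(op, dst, src_a, src_b, write_mask="xyzw", swz_a="xyzw", swz_b="xyzw"):
    """Lower one vec4 instruction into scalar ops, one per written component.

    Swizzles and write masks cost nothing at run time: they only decide
    which scalar register names each scalar op reads and writes.
    Assumption: dst is a different register from both sources. With
    overlapping swizzles a real compiler would need temporaries.
    """
    scalar_ops = []
    for c in write_mask:
        i = COMPONENTS.index(c)
        scalar_ops.append((op, f"{dst}.{c}", f"{src_a}.{swz_a[i]}", f"{src_b}.{swz_b[i]}"))
    return scalar_ops

def run_quad(scalar_ops, regfiles):
    """Issue one scalar op per cycle; each cycle covers all 4 pixels."""
    cycles = 0
    for op, dst, a, b in scalar_ops:
        for regs in regfiles:  # p1..p4, one register file copy per pixel
            regs[dst] = op(regs[a], regs[b])
        cycles += 1
    return cycles

def make_regs(p):
    """Hypothetical test data: r0 = (p, p+1, p+2, p+3), r1 = (2, 3, 4, 5)."""
    regs = {}
    for i, c in enumerate(COMPONENTS):
        regs[f"r0.{c}"] = float(p + i)
        regs[f"r1.{c}"] = float(2 + i)
    return regs

regfiles = [make_regs(p) for p in range(4)]

# mul r2.xyz, r0.zyxw, r1  -> three scalar multiplies, no swizzle hardware
ops = expand(operator.mul, "r2", "r0", "r1", write_mask="xyz", swz_a="zyxw")
cycles = run_quad(ops, regfiles)
print(cycles, [regfiles[0][f"r2.{c}"] for c in "xyz"])
```

A 3-component op on the quad takes 3 issue cycles here, and the masked-out w component is never touched.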
This means:
- Since we are operating on individual components of the 4 pixels at a
time, the actual underlying ISA can be scalar (not SIMD). Read/Write
masks and arbitrary swizzles are trivial to implement since all that is
required is compile-time register renaming. Furthermore, a SIMD op
operating on 1, 2, or 3 SIMD register components will have a throughput
of 4 pixels in 1, 2, or 3 cycles respectively (in the absence of stalls).
- less i-cache bandwidth needed (smaller instructions)
- arithmetic instructions with latency of 1 to 4 cycles have their latency
hidden.
- 4 copies of the register file are required, but you only need a quarter
of the read and write b/w to each.
- DDX and DDY implementation can use the arithmetic units of a single
pipe, and need only interact with the register files of a single pipe.
In fact, DD[X|Y] is a simple subtraction followed by a multiply with a
constant scaling factor - an fmac unit can handle this.
- texture loads: produce 4 filtered texels at a time, and can share
LOD/anisotropy calculations across the 4 pixels.
- Better texture cache efficiency (the same TMU and cache handle
texture loads on addresses very close to each other), possibly
some computational optimizations possible from the 2x2 sample
block organization?
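To make the DDX/DDY and shared-LOD points concrete, here is a rough sketch. The quad layout (p1 p2 on the top row, p3 p4 on the bottom row), the function names and the LOD formula (log2 of the longer screen-space gradient, the usual isotropic mip selection) are my assumptions for illustration, not something fixed by the scheme. Coordinates are assumed to be in texel units.

```python
import math

def ddx(q, scale=1.0):
    """Horizontal derivative per row: (right - left) * scale.

    q = [p1, p2, p3, p4], assumed layout:  p1 p2
                                           p3 p4
    Both pixels of a row get the same value.
    """
    top = (q[1] - q[0]) * scale
    bottom = (q[3] - q[2]) * scale
    return [top, top, bottom, bottom]

def ddy(q, scale=1.0):
    """Vertical derivative per column: (bottom - top) * scale."""
    left = (q[2] - q[0]) * scale
    right = (q[3] - q[1]) * scale
    return [left, right, left, right]

def shared_lod(u, v):
    """One LOD for the whole quad, computed once and reused by all 4 pixels.

    Uses only the top-row / left-column differences (a simplification).
    """
    rho_x = math.hypot(ddx(u)[0], ddx(v)[0])
    rho_y = math.hypot(ddy(u)[0], ddy(v)[0])
    rho = max(rho_x, rho_y)
    return math.log2(rho) if rho > 0.0 else float("-inf")

# Texel-space coordinates of a quad that steps 2 texels per pixel
u = [0.0, 2.0, 0.0, 2.0]
v = [0.0, 0.0, 2.0, 2.0]
print(ddx(u), ddy(v), shared_lod(u, v))
```

Everything here reads only the 4 values held by one pipe, which is the point: no traffic to a neighbouring pipe's register files is needed.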
Now, each of the pipes can be made independent: they can process pixels from different triangles and/or with different shader programs, and need not operate on (or output) data in lockstep.
I need to think about how branching would work with this scheme...
Regards, Serge