Pavlos said:
In my previous post NUM_PIXELS_X and NUM_PIXELS_Y are not the dimensions of the final image but the dimensions of the pixel block you are working on (2x2). So I think the memory and the bandwidth requirements are not an issue since you need only 4 copies of the shader state.
Oh, sorry, I know I must have misunderstood that...
The other approach (the one I was referring to) is to shade all the pixels of this 2x2 block at the same time. Every instruction of the shader must then operate on a 2x2 grid of pixels, not a single pixel. So the instructions must take a 2x2 matrix as input and output a 2x2 matrix of values (scalars, vectors, matrices and whatever d3d defines). Much like the examples I gave in my previous post.
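To make the lockstep idea concrete, here is a minimal Python sketch of instructions that take 2x2 grids in and give 2x2 grids out. The function names and the `block[y][x]` layout are assumptions for illustration only, not taken from d3d or from the earlier post, and the values are plain floats rather than vectors.

```python
# A 2x2 block is stored as rows: block[y][x].
# Values are plain floats here; a real shader would also carry vectors.

def add(a, b):
    """Per-pixel add: two 2x2 grids in, one 2x2 grid out."""
    return [[a[y][x] + b[y][x] for x in range(2)] for y in range(2)]

def mul(a, b):
    """Per-pixel multiply on a 2x2 grid."""
    return [[a[y][x] * b[y][x] for x in range(2)] for y in range(2)]

def dsx(a):
    """Horizontal difference inside each row, replicated to both pixels."""
    return [[a[y][1] - a[y][0]] * 2 for y in range(2)]

def dsy(a):
    """Vertical difference inside each column, replicated to both rows."""
    row = [a[1][x] - a[0][x] for x in range(2)]
    return [row[:], row[:]]

# Hypothetical interpolants for one block.
u = [[0.0, 0.5], [0.0, 0.5]]
v = [[0.0, 0.0], [0.25, 0.25]]

t = add(mul(u, u), v)   # t = u*u + v for all four pixels at once
print(t)                # [[0.0, 0.25], [0.25, 0.5]]
print(dsx(t), dsy(t))
```

In this form dsx/dsy are trivial because all four values are available at the same instruction. The price is that every operand is four times as large.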
I've got good news and bad news.
The bad news is that I only have 8 SSE registers, each with 4 floating-point components. Most shader operations take two 4D vectors as input and produce one as output. If you do that operation on 2x2 pixels at once, you would need 12 registers (or 8 for 3D vectors, or 8-6 when two operands are equal). The problem is that for nearly every instruction I would have to load and store all registers. Some data can be kept in registers, but it won't be unusual for 256 bytes to be moved to and from the cache. For a simple add instruction that's unattractive...
The good news is that the Pentium has a feature to make up for its relatively low register count, namely register renaming. It lets two data-independent instructions that operate on the same registers execute in parallel by using different physical registers. For example, this means that several iterations of a very tight loop can be in flight at the same time if there is no data dependency between them!
So my best idea was to process the pixels in a 2x2 block sequentially, but only the part before the dsx/dsy. This way very little register loading/storing between instructions is required, and physically it can still execute independently in parallel. Once we reach the dsx/dsy instruction, we store the register which we wish to differentiate. Once the 2x2 block is done, we continue with the rest of the shader, starting with the dsx/dsy.
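A rough Python sketch of that two-phase scheme, under the assumption that the shader has a single dsx/dsy point and can be split into a `pre` part and a `post` part. All names here are made up for illustration, and the real thing would of course be generated SSE code rather than Python.

```python
def pre(inp):
    """Part of the shader before dsx/dsy; runs for one pixel at a time."""
    u, v = inp
    return u * u + v

def post(inp, value, ddx, ddy):
    """Rest of the shader, starting at the dsx/dsy instruction."""
    return value + abs(ddx) + abs(ddy)

def run_block(pre, post, inputs):
    """inputs is a 2x2 grid (inputs[y][x]) of per-pixel interpolants."""
    # Phase 1: pixels one after another; only the value that will be
    # differentiated is stored, everything else can stay in registers.
    saved = [[pre(inputs[y][x]) for x in range(2)] for y in range(2)]
    # Phase 2: derivatives are differences of the stored values,
    # then the remainder of the shader runs per pixel.
    out = [[None, None], [None, None]]
    for y in range(2):
        for x in range(2):
            ddx = saved[y][1] - saved[y][0]
            ddy = saved[1][x] - saved[0][x]
            out[y][x] = post(inputs[y][x], saved[y][x], ddx, ddy)
    return out

# Hypothetical (u, v) interpolants for one 2x2 block.
inputs = [[(0.0, 0.0), (0.5, 0.0)],
          [(0.0, 0.25), (0.5, 0.25)]]
print(run_block(pre, post, inputs))  # [[0.5, 0.75], [0.75, 1.0]]
```

Only one value per pixel crosses the dsx/dsy boundary through memory in this sketch; a shader with several derivative instructions would need one such split per instruction.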
As for the implementation, you can very easily convert the examples I gave (and the rest of the instructions) to a sequence of SSE instructions, and then you can use SoftWire to compile them at load time (are there any problems with that?).
SoftWire is, without meaning to give myself too much credit, extremely well suited to the situation. Its built-in automatic register allocator makes it possible to work with symbolic names instead of directly with registers (although that's still possible), and still get the performance of hand-written assembly. So it's very easy to redesign things without the trouble of remembering which registers hold what data. I also plan an automatic scheduler, so that in my design above dependencies can be avoided and parallel execution improved even more.
Also note that the only way to take full advantage of the SSE instructions is to shade 4 or more pixels at once (not sequentially), since shaders usually use scalars and three component vectors* and it's sub-optimal to use SSE for operations between two operands.
I've discussed this with Dio as well, but I'm not convinced it would give a big performance increase. I've even done some tests and they clearly showed that memory operations, even when the data is in the cache, have a considerable latency. In the worst case, using your implementation would result in three memory operations and only one arithmetic operation per pixel for a simple add instruction. In my implementation there's a much greater chance that the data is already in registers, so it translates to only one arithmetic operation.
As for the pixels in the group outside of the triangle you can set the corresponding is_active[][] flag (see my previous post) to zero and the shader will never touch them.
With my implementation that's even simpler. Every instruction can have its own control which is only needed at branch instructions (and dsx/dsy and writing results).
As far as I can see, the only shortcoming of this approach (also used by Pixar and many others) is the little overhead it introduces on every instruction, to check if each pixel is active.
I wouldn't call it a small overhead, since it's needed for every pixel and every shader instruction. That's an extra index calculation, memory lookup, compare and jump instruction. In my implementation the shader can just keep executing for the same pixel until a new branch or dsx/dsy is reached. Although it's more complex to implement, that requires only minimal extra overhead at shader execution time.
Of course, if you find something else I’m very interested to hear it. I’m facing the same problems with my RenderMan renderer and the corresponding instructions and I want the shaders to execute as fast as possible.
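A crude way to see the difference is to count how often per-pixel control has to be tested in each scheme. The snippet below is only a counting model with a made-up instruction list; it says nothing about what each check actually costs.

```python
# Ops at which the sequential scheme has to stop and deal with per-pixel
# control. This set is an assumption based on the description above.
BOUNDARY_OPS = {"branch", "dsx", "dsy", "write"}

def checks_per_instruction(program, num_pixels=4):
    """Lockstep model: is_active is tested for every pixel at every instruction."""
    return len(program) * num_pixels

def checks_per_segment(program, num_pixels=4):
    """Sequential model: control is only tested at boundary ops."""
    boundaries = sum(1 for op in program if op in BOUNDARY_OPS)
    return boundaries * num_pixels

# Hypothetical 8-instruction shader with one dsx and one final write.
program = ["mul", "add", "mul", "dsx", "add", "mul", "add", "write"]

print(checks_per_instruction(program))  # 32
print(checks_per_segment(program))      # 8
```

The gap grows with the number of plain arithmetic instructions between boundaries, which is the usual case for longer shaders.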
Well, I'm probably very demanding but I was actually asking how to do things even faster.
My main problem is that it takes so much to execute these dsx/dsy instructions correctly. In particular, having so many pixels of the 2x2 blocks fall outside the polygon seems very suboptimal. It's horrible for tiny polygons, but it's also bad for 'medium' or even 'large' polygons. I can draw cases of polygons with 50 pixels that need 30 pixels extra, or 200 pixels with 60 pixels extra. I don't know what the 'average' size for polygons is, but they are getting smaller with every generation of games and you could easily be computing 20% pixels 'too much'. I know some people who would kill for a performance increase like that.
* I’m not sure if d3d exposes only four component vectors, but if that’s the case then it will probably change in future versions. RenderMan doesn’t even define a four component vector and GLslang also defines a vec2 and vec3 datatype. Four component vectors aren’t very useful for shading.
There are also scalar SSE instructions, so on average I'm only losing 1/4 performance, but that's only on the arithmetic operations and is nothing compared to the extra memory instructions needed when processing pixels in parallel. Furthermore, I attempt to pack some interpolants together.
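The overhead is easy to measure for any given polygon. Here is a small Python sketch that counts the pixels a triangle covers and the extra pixels that get shaded once every touched 2x2 block is filled up. It samples at pixel centers and counts points exactly on an edge as inside, which is a simplification (no top-left fill rule); the function names are hypothetical.

```python
def edge(ax, ay, bx, by, px, py):
    """Signed area test: which side of edge a->b the point p lies on."""
    return (bx - ax) * (py - ay) - (by - ay) * (px - ax)

def quad_overhead(tri, width, height):
    """Return (covered pixels, extra pixels shaded because of 2x2 blocks)."""
    (x0, y0), (x1, y1), (x2, y2) = tri
    covered = 0
    quads = set()
    for y in range(height):
        for x in range(width):
            px, py = x + 0.5, y + 0.5  # sample at the pixel center
            w0 = edge(x1, y1, x2, y2, px, py)
            w1 = edge(x2, y2, x0, y0, px, py)
            w2 = edge(x0, y0, x1, y1, px, py)
            # Accept either winding order.
            inside = (w0 >= 0 and w1 >= 0 and w2 >= 0) or \
                     (w0 <= 0 and w1 <= 0 and w2 <= 0)
            if inside:
                covered += 1
                quads.add((x // 2, y // 2))  # 2x2 block this pixel belongs to
    shaded = 4 * len(quads)
    return covered, shaded - covered

# A small right triangle on a 4x4 pixel grid.
print(quad_overhead(((0, 0), (4, 0), (0, 4)), 4, 4))  # (10, 2)
```

For this small triangle, 2 extra pixels are shaded on top of the 10 covered ones. Running it on long thin triangles is a quick way to find the really bad cases.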
Anyway, I'm having exams now so I can't experiment with it all...
Your SRT renderer is very impressive!