Mintmaster
Veteran
I was just thinking about a few things regarding shaders. NVidia has repeatedly claimed that a PS 1.4 shader can ALWAYS be done with PS 1.1 using multiple passes. It isn't always easy to break a longer program into several smaller ones via multipass (storing intermediate values in the colour buffer), but I can see how it can be done pretty much all the time. PS 1.1 also has other limitations that can lengthen the overall shader, and the bandwidth requirements can go up quite a bit due to the extra colour-buffer reads and writes between passes.
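Just to put a rough number on that bandwidth point, here's a back-of-the-envelope sketch. The per-texel size and the "one write plus one read per pass boundary" model are my assumptions, not measured figures:

```python
# Rough per-pixel bandwidth overhead of splitting one shader into N passes,
# assuming each pass boundary costs one colour-buffer write (intermediates
# out) plus one read (intermediates back in). Purely illustrative numbers.

def extra_bandwidth_bytes(passes, bytes_per_texel=4):
    """Extra colour-buffer traffic per pixel, beyond a single-pass shader."""
    boundaries = passes - 1
    return boundaries * 2 * bytes_per_texel

# Splitting a shader into 3 passes with 32-bit render targets:
print(extra_bandwidth_bytes(3))  # 16 extra bytes of traffic per pixel
```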
Now let's apply this to R300. It supports multiple render targets. I'm not sure how many, but I think I read 4 somewhere. That means it can store 16 scalars or 4 vectors between passes, making it much easier to split up a long, independent calculation (it would be quite hard to think of a shader that would need more values than this ALL the time). Those values could be read back into the next shader program as texture inputs (of which there are plenty). Because we are moving towards high-level shading languages, the compiler can figure out where to put these breaks and do the multipass automatically. R300 also has a 160-instruction limit, so each segment can be almost that long, amortizing the extra multipass bandwidth over a long instruction sequence, which couldn't realistically be bandwidth limited anyway given its length.
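To make the compiler idea concrete, here's a toy sketch of how such a pass-splitter might work on a straight-line shader: greedily take up to the per-pass instruction limit, then back up the cut point until the set of values live across the cut fits in the render targets. The instruction representation and the exact limits are my assumptions for illustration, not anything ATI has described:

```python
from collections import namedtuple

# Toy instruction: one destination register, some source registers.
Instr = namedtuple("Instr", "dest srcs")

def live_across(instrs, cut):
    """Values defined before the cut and still needed at or after it.
    These are what the render targets would have to carry between passes."""
    defined = {i.dest for i in instrs[:cut]}
    used_later = {src for i in instrs[cut:] for src in i.srcs}
    return defined & used_later

def split_passes(instrs, max_instrs=160, max_live=4):
    """Greedily split into segments of at most max_instrs instructions,
    with at most max_live vec4 values crossing any segment boundary
    (one vec4 per render target)."""
    segments, start = [], 0
    while start < len(instrs):
        cut = min(start + max_instrs, len(instrs))
        # Back the cut up until the live set fits in the render targets.
        while cut < len(instrs) and len(live_across(instrs, cut)) > max_live:
            cut -= 1
            if cut == start:
                raise ValueError("no legal split point in this window")
        segments.append(instrs[start:cut])
        start = cut
    return segments
```

A real compiler would also reorder instructions to shrink the live set at the cut, but even this greedy version shows why a long dependent chain (one live value at every point) splits trivially.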
You may not even have to resend the geometry: you could likely store any remaining vertex shader output values in one of the render targets and do the later passes as a fullscreen quad. If the pixel shader programs are that long, though, it's unlikely the workload would be geometry limited anyway.
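The driver-side loop for that scheme might look something like the following sketch. The call names are stand-ins I've made up, not any real API; the point is just that only pass 0 touches the scene geometry, while every later pass binds the previous pass's render targets as textures and draws a screen-filling quad:

```python
# Hypothetical driver loop for compiler-generated multipass shading.
# Only the first pass rasterizes the real geometry; subsequent passes
# read the previous pass's MRT outputs as textures from a fullscreen quad,
# so the geometry is never resent.

def run_multipass(segments):
    calls = []  # record the command stream instead of issuing real API calls
    for n, seg in enumerate(segments):
        if n == 0:
            calls.append(("draw_geometry", seg))          # real scene geometry
        else:
            calls.append(("bind_previous_mrts", n - 1))   # prior results as textures
            calls.append(("draw_fullscreen_quad", seg))   # no geometry resend
    return calls
```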
This idea came to me when I saw the raytracing demo on the 9700 at SIGGRAPH, where they did something very similar to what I'm suggesting. I don't think many shaders will have a more complicated final goal than that, yet each intermediate shader was rather simple. All that's missing is the implementation by the driver team in the form of a compiler.
Any thoughts?