Are high instruction limits really needed for pixel shaders?

Mintmaster

Veteran
I was just thinking about a few things regarding shaders. NVidia has continually said that a PS 1.4 shader can ALWAYS be done with PS 1.1 using multiple passes. It isn't always easy to break a longer program up into several smaller ones through multipass (storing intermediate values in the colour buffer), but I can see how it can be done pretty much all the time. PS 1.1 also has other limitations that could make the overall shader longer, and the bandwidth requirements can go up quite a bit due to the extra colour buffer reads and writes between passes.

Now let's apply this to R300. It supports multiple render targets. I'm not sure how many, but I think I read 4 somewhere. This means it can store 16 scalars or 4 vectors between passes, making it much easier to split up a long, independent calculation (it would be quite hard to think of a shader that needs more values than this ALL the time). These values could be read back into the next shader program as texture inputs (of which there are plenty). Because we are moving towards high-level shading languages, the compiler can figure out where to put these breaks and do the multipass automatically. R300 also has a 160-instruction limit, so each segment can be almost this long, amortizing the extra multipass bandwidth over a long instruction sequence, which couldn't realistically be bandwidth limited anyway given its length.
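
Just to illustrate what I mean, here's a toy sketch (not anything a real driver does today, and the limits are only the figures I mentioned above): a compiler could walk the instruction list, cut it into chunks that fit the per-pass limit, and pull a cut back whenever the values live across it wouldn't fit in the render targets.

Code:
MAX_INSTRUCTIONS = 160   # per-pass instruction limit (the R300-style figure)
MAX_CARRIED = 16         # scalars we can park in 4 float4 render targets

def split_shader(instructions, live_out_at):
    # instructions: flat list of ops
    # live_out_at[i]: how many scalar values are still needed after op i
    passes, start = [], 0
    while start < len(instructions):
        end = min(start + MAX_INSTRUCTIONS, len(instructions))
        # pull the cut back until everything live at it fits in the targets
        while end > start and end < len(instructions) and live_out_at[end - 1] > MAX_CARRIED:
            end -= 1
        if end == start:
            raise ValueError("no cut point fits the render-target budget")
        passes.append(instructions[start:end])
        start = end
    return passes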

You may not even have to resend the geometry, as you could likely store any remaining vertex shader output values in one of the render targets and do the subsequent passes as a quad. If the pixel shader programs are that long, though, it is unlikely to be geometry limited anyway.

This idea came to me when I saw how they did raytracing on the 9700 at Siggraph, where they actually did a similar thing to what I'm suggesting. I don't think there will be many shaders with a more complicated final goal than that, yet each intermediate shader was rather simple. All that is missing is the implementation by the driver team in the form of a compiler.

Any thoughts?
 
Splitting work into many short shaders is the worst possible thing you could do for multipass performance. If it is possible to do the same thing without multipass, it will be much faster (unless you really tweak out the design so that there is no stall from switching passes... I don't think that's going to be possible with all, or even many, multipass algorithms).
 
Re: Are high instruction limits really needed for pixel shad

While this is true, the advantage that the NV30 will have is that for those programs longer than 160 (or 96) instructions and shorter than 1024 (well, actually less, depending on the number of constants used), all intermediate values will be available in registers rather than being written out to memory. So you get around the write and read penalties associated with this. Of course, as you pointed out, the cost isn't quite that bad since you probably won't need to multipass full-length shaders very often.
 
Actually, multipassing long shaders is less of a performance hit (percentage-wise) as it's already taking so long to execute the shader that there won't be much loss from the stall in switching to the next pass.

There may also be additional hits from multipassing if certain processing needs to be done during each pass (and would only be done once otherwise). I don't really know how common this sort of issue would be, though.
 
I think it depends on the geometry load. If the vertex shaders are long, and there is a large amount of geometry, then multipassing long pixel shaders results in multiple passes over a huge geometry database and evaluating expensive vertex shaders. It also eats up more AGP bus traffic, more frame buffer bandwidth, etc.
 
Re: Are high instruction limits really needed for pixel shad

CMKRNL said:
While this is true, the advantage that the NV30 will have is that for those programs > 160 (or 96) and less than 1024 (well, actually less, depending on number of constants used) all intermediate values will be available in registers rather than being written out to memory. So you get around the write and read penalties associated with this. Of course, as you pointed out the cost isn't quite that bad since you probably won't need to multipass full length shaders very often.

Write and read penalties will probably be the least of your performance worries if you are running extremely long pixel shaders, as execution time within the shader ALU will tend to dominate.

The penalties could start to add up if the splitting of the shader was done in an inefficient manner - you would generally want to break the shader into the largest program chunks possible to minimise the round trips to memory and keep the ratio of instructions to reads/writes high.
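
To put made-up numbers on that ratio: splitting a 320-instruction shader into two 160-instruction passes costs one intermediate round trip per pixel, while splitting it into eight 40-instruction passes costs seven.

Code:
TOTAL_OPS = 320                      # a made-up 320-instruction shader
def ops_vs_traffic(chunk_size, spill_bytes=64):
    # spill_bytes: assumes writing 4 float4 render targets at each cut
    passes = -(-TOTAL_OPS // chunk_size)         # ceiling division
    round_trips = passes - 1                     # spill + read back per cut
    return passes, round_trips * 2 * spill_bytes

print(ops_vs_traffic(160))   # (2, 128)  -> 320 ops per 128 B of traffic
print(ops_vs_traffic(40))    # (8, 896)  -> same ops, seven times the traffic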
 
If you're using a large number of ALU instructions and/or dependent texture lookups, then (on current hardware) it's proceeding at fractions of a pixel per clock anyway.

Of course, float intermediates are big and expensive in bandwidth - but if you're only rendering 1 pixel every 4+ clocks, and you have 512 bits of bandwidth per memory clock... obviously there is plenty to go round.
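
Back-of-envelope, with assumed numbers (4 float4 targets written and read back, and the 4-clocks-per-pixel figure above):

Code:
bus_bytes_per_clock = 512 // 8      # 512 bits per memory clock -> 64 B
clocks_per_pixel = 4                # "a pixel every 4+ clocks" for a long shader
spill = 4 * 16                      # write 4 float4 targets = 64 B per pixel
readback = 64                       # read them back in the next pass

print(bus_bytes_per_clock * clocks_per_pixel)   # 256 B of bandwidth per pixel
print(spill + readback)                         # 128 B of intermediate traffic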

That also obviously gives plenty of time for the geometry and rasterisation to be handled, unless your geometry is down to tiny triangles (which the developer shouldn't allow to happen anyway, to avoid geometry aliasing). Also, complex pixel shaders tend to replace geometry (John Carmack again: the use of bump maps in Doom3 in place of extra geometry).
 
Re: Are high instruction limits really needed for pixel shad


Write and read penalties will probably be the least of your performance worries if you are running extremely long pixel shaders, as execution time within the shader ALU will tend to dominate.


Yep, absolutely. My point was that given equivalent ALU processing power, that would be one of the advantages of the NV30 over R300 (minor as it may be in the big picture). Of course, at this stage we're not sure about the processing power of NV30. With an 8x2 architecture and enough constants to possibly require fewer 'short' passes, it may have another advantage there. As someone pointed out in another thread, what we really need here is how many ops/cycle the GPU can do to determine performance advantages of one design over another.

Another aspect of this that needs to be considered is what is the penalty for uploading a shader to the HW? Presumably they are all cached in VRAM, but it doesn't sound like they are fetched/dispatched out of VRAM. If that's the case and the programs are in some type of register space, what is the context switch latency, especially on long shaders? This is one area that the R300 may actually have an advantage in with its shorter shaders, e.g. NV30 only has to upload the size of the shader, but what if you had a 500-op shader and a 50-op shader that you switched between several times in a frame? Or perhaps developers will now have to sort their tris by shader to minimize state switch penalties.
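
The sorting idea is just the usual sort-by-state approach; roughly something like this (bind_shader and draw are stand-ins for whatever the real API provides):

Code:
from collections import defaultdict

def render_sorted(draw_calls, bind_shader, draw):
    # draw_calls: iterable of (shader_id, mesh) pairs
    buckets = defaultdict(list)
    for shader_id, mesh in draw_calls:
        buckets[shader_id].append(mesh)
    for shader_id, meshes in buckets.items():
        bind_shader(shader_id)        # one upload/switch per shader per frame
        for mesh in meshes:
            draw(mesh)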
 
Chalnoth said:
Actually, multipassing long shaders is less of a performance hit (percentage-wise) as it's already taking so long to execute the shader that there won't be much loss from the stall in switching to the next pass.

There may also be additional hits from multipassing if certain processing needs to be done during each pass (and would only be done once otherwise). I don't really know how common this sort of issue would be, though.

This is what I was saying. At 160 instructions, the execution time would be long enough that the multipass overhead would be next to insignificant - maybe only a few percent of execution time. I assume when you started this post with "Actually", you are reconsidering your first statement.
 
Re: Are high instruction limits really needed for pixel shad

CMKRNL said:
Yep, absolutely. My point was that given equivalent ALU processing power, that would be one of the advantages of the NV30 over R300 (minor as it may be in the big picture). Of course, at this stage we're not sure about the processing power of NV30.

As you say, there will be few advantages one way or the other except the performance of the pixel shader unit and texture lookup unit. But going on to assume nv30 is faster - well, that's not much to do with 'are high instruction limits really needed for pixel shaders' - is it relevant to this thread?

It could be that R300 is faster than nv30. We just don't know! All we know is nv30 has support for longer programs, and we are attempting to discuss if it is an advantage to have support for longer programs.

So far the conclusion appears to be that as long as you have support for a certain size, larger programs should be able to multipass for only a very limited performance cost.


CMKRNL said:
Or perhaps developers will now have to sort their tri's by shaders to minimize state switch penalties.

Undoubtedly, they will have to do this, wherever the shaders are stored and however big the instruction count is. Texture cache coherency is a reason many engines (e.g. Quake3) already sort by shader.
 
DemoCoder said:
I think it depends on the geometry load. If the vertex shaders are long, and there is a large amount of geometry, then multipassing long pixel shaders results in multiple passes over a huge geometry database and evaluating expensive vertex shaders. It also eats up more AGP bus traffic, more frame buffer bandwidth, etc

Because you can store quite a few values in between, it may be possible that resending the geometry isn't necessary. Think of what the pixel shader actually needs from the vertex shader: things like texture coordinates, texture samples, diffuse/specular colours, and I guess anything else that is new to DX9 (maybe extra variables for vertex to pixel shader communication?). If you can use as many of those values as needed in the first pass, you could store the remaining values in the render target.

Now, for subsequent passes you would only have to draw a 2D bounding box, using a write mask stored in one of the render targets from a previous pass. Assuming you fit all of the previous data in the render targets, that is all you need to continue the pixel shader, without the additional overhead of geometry. I could see complications with MSAA, but the drivers should be able to work around it.
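
Very roughly, the pass sequence I'm imagining looks like this (hypothetical driver behaviour, with made-up API names, and assuming at least two passes):

Code:
def run_split_shader(passes, geometry, mrt_a, mrt_b, colour_buffer, api):
    # Pass 1: real geometry; leftover interpolants/intermediates go to MRTs
    api.bind_render_targets(mrt_a)
    api.draw(geometry, shader=passes[0])
    src, dst = mrt_a, mrt_b
    # Later passes: previous targets become textures, and a screen-aligned
    # quad (masked to the object's pixels) carries the program on
    for i, p in enumerate(passes[1:]):
        last = (i == len(passes) - 2)
        api.bind_textures(src)
        api.bind_render_targets(colour_buffer if last else dst)
        api.draw_fullscreen_quad(shader=p)
        src, dst = dst, src           # ping-pong the intermediate buffers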
 
Mintmaster said:
This is what I was saying. At 160 instructions, the execution time would be long enough that the multipass overhead would be next to insignificant - maybe only a few percent of execution time. I assume when you started this post with "Actually", you are reconsidering your first statement.

Do you know how long it takes to execute a 160-instruction shader? Or how long a pipeline stall from a state change is?

I read not too long ago that the original GeForce had 600-800 stages, in one geometry pipeline and four pixel pipelines. That's probably about 100 stages in each pixel pipeline. Modern hardware probably has more. 160 instructions, though it probably wouldn't take 160 clocks, would still allow for a significant stall.

Granted, you shouldn't have to stall at every pixel, as the drivers should be able to batch geometry and send a number through at once. I just don't see how it could be insignificant in relation to the pipeline depth.
 
The hardware guys work pretty hard to make sure the pipeline stalls as infrequently as possible. There's no requirement that changing the pixel shader requires a complete pipeline flush. It all depends on the implementation.

If you assume you're staying in the same state for 10,000 clocks (that might be less than a hundred pixels with a 1000-instruction shader) then 100 clocks for a stall is clearly pretty insignificant.

In reality, most shaders will render a lot more pixels than this (advanced shading on a 10x10 quad is pretty pointless!) so even if the shader has to multipass and take the full hit ten times (which, because of the effort put into reducing pipeline stalls, may not happen) it will still be similarly insignificant.
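
Putting rough numbers on it (the 100-clock stall and 10,000-clock pass are just the guesses from above):

Code:
stall_clocks = 100                # guessed cost of a full pipeline flush
shading_clocks = 10_000           # time spent in one state, per pass
print(stall_clocks / shading_clocks)   # 0.01 -> ~1% per pass, and the ratio
                                       # stays the same however many passes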

In the future, as pixel shader throughput rises and pipelines get longer, stalls might become more important, but not in this generation.
 
The only thing that you have to keep in mind is that as we go into the future, complex shaders will be used on smaller and smaller surfaces. That is, more pieces of the image will have their own unique shaders.

Regardless, the real meat will be in the benchmarks. I'd like to see some highly-complex shaders that have optimized code for both the R300 and NV30, and see how both do. Of particular interest would be auto-multipass generated shaders.

It is very true that such benchmarks will not reflect actual games for a few years, and thus won't hold a huge amount of weight (since it'll be very hard to predict how games will render in a few years), but they may give us some idea of whether or not the increased shader size can really improve performance into the future (or, perhaps, just for high-end development, if nVidia wants to market the NV3x in the truly high-end).
 
Chalnoth said:
Mintmaster said:
This is what I was saying. At 160 instructions, the execution time would be long enough that the multipass overhead would be next to insignificant - maybe only a few percent of execution time. I assume when you started this post with "Actually", you are reconsidering your first statement.

Do you know how long it takes to execute a 160-instruction shader? Or how long a pipeline stall from a state change is?

I read not too long ago that the original GeForce had 600-800 stages, in one geometry pipeline and four pixel pipelines. That's probably about 100 stages in each pixel pipeline. Modern hardware probably has more. 160 instructions, though it probably wouldn't take 160 clocks, would still allow for a significant stall.

Granted, you shouldn't have to stall at every pixel, as the drivers should be able to batch geometry and send a number through at once. I just don't see how it could be insignificant in relation to the pipeline depth.

In that ATI education flash animation, they said the pipeline can execute a texture lookup, a texture address op, and a math op per clock. Suppose that the instruction sequence took 50 clocks to complete per pixel pipe. A measly 10000-pixel object (100x100, only about 1/80 of a 1024x768 screen) would then take 10000/8*50 = 62500 clocks to complete. If it takes 1500 clocks to set up between passes, that's only about 2.5% additional overhead.
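
The same arithmetic, spelled out (the 50 clocks and the 1500-clock setup cost are just my guesses):

Code:
pixels = 100 * 100                  # the 10,000-pixel object above
pipes = 8
clocks_per_pixel = 50               # assumed instruction time per pipe
shading = pixels / pipes * clocks_per_pixel     # 62,500 clocks
setup = 1500                        # guessed per-pass state-change cost
print(setup / shading)              # 0.024 -> about 2.5% overhead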

Remember the bigger picture - so long as we are concerned so much about performance, we are talking about real-time graphics. Such huge pixel shader lengths (>300) could not be used on the whole scene even at 640x480, so even a 10% decrease in performance for these sections of the screen would not contribute a whole lot to the overall framerate anyway.
 
While I think really huge pixel shader sizes aren't that important, since multipass can be used for them, vertex shaders could be different: you can't multipass a vertex shader. While it's possible to fall back on software to process a huge vertex shader, in quite possibly all cases the hardware vertex shader is going to be faster than software - and quite a bit faster too, especially if the CPU needs to be doing something else at the same time.
 
Not all gaming situations will allow for nice easy multipass.

One example might be a future game that in addition to using highly-complex shaders, also has hordes of enemies (similar to Serious Sam). Even if the hardware could effectively do an entire opponent before switching passes, it would have to do this hundreds of times in some scenes, which could easily slow down performance a significant amount.

But, I'd still like to see some benchmarks meant to stress this particular difference in the two architectures. Hopefully we'll have one by the end of the year.
 
Yep, the real hard case would be shaders on triangles that are meant to be transparent. It's always hard to multipass transparency.
 
I'd imagine that you'd render the first pass to an off-screen buffer for transparent triangles that need multipass.

Of course, that's far from efficient as you'd pretty much need a full-screen buffer for each triangle used in this way (until it is merged during the final pass with the primary framebuffer).
 
An offscreen buffer isn't such a bad idea. It would allow you to do some form of simulated refraction if you really wanted to. If everything was set up just right in your engine, you could use the 'back buffer' as a texture, and then play with texture coords to create an effect that would look a bit like refraction, but would not be completely accurate.
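
Something along these lines, in toy form (Python just to show the idea; the strength factor and the normal source are made up, and this is nowhere near physically correct refraction):

Code:
def fake_refraction_uv(screen_uv, normal_xy, strength=0.03):
    # screen_uv: where this pixel lands in the back-buffer copy (0..1)
    # normal_xy: the surface normal's x/y components, in [-1, 1]
    u = screen_uv[0] + normal_xy[0] * strength
    v = screen_uv[1] + normal_xy[1] * strength
    # clamp so we never sample outside the copied back buffer
    clamp = lambda x: min(max(x, 0.0), 1.0)
    return clamp(u), clamp(v)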
 