#1 |
|
Senior Member
Join Date: Mar 2002
Posts: 3,779
|
I was just thinking about a few things regarding shaders. Now NVidia has continually said that a PS 1.4 shader can ALWAYS be done with PS 1.1 using multiple passes. Now, it isn't always easy to break up a longer program into several smaller ones through multipass (storing intermediate values in the colour buffer) but I can see how it can be done pretty much all the time. PS 1.1 also has other limitations that could create a longer overall shader, and the bandwidth requirements can go up quite a bit due to the extra colour buffer reads and writes between passes.
Now let's apply this to R300. It supports multiple render targets. I'm not sure how many, but I think I read 4 somewhere. This means that it can store 16 scalars or 4 vectors between passes, making it much easier to split up a long, independent calculation (it would be quite hard to think of a shader that would need more values than this ALL the time). These values could be read back into the next shader program as texture inputs (of which there are plenty). Because we are moving towards a high-level shading language, the compiler can figure out where to put these breaks and do multipass automatically. R300 also has a 160 instruction limit, so each segment can be almost this long, thus amortizing the extra multipass bandwidth over this long instruction sequence, which couldn't realistically be bandwidth limited anyway due to its length.
You may not even have to resend the geometry either, as you could likely store any remaining vertex shader output values in one of the render targets and do the multipass as a quad. If the pixel shader programs are so long though, it is unlikely that it will be geometry limited anyway.
This idea came to me when I saw how they did raytracing on the 9700 at Siggraph, where they actually did a similar thing to what I'm suggesting. I don't think there will be many shaders with a more complicated final goal than that, yet each intermediate shader was rather simple. All that is missing is the implementation by the driver team in the form of a compiler. Any thoughts? |
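As a rough illustration of the splitting idea in the post above, here is a minimal Python sketch. The 160-instruction limit and the 16 scalars carried between passes are the figures quoted in the post (the poster was unsure of the render target count). The function name `plan_passes` and the naive "cut every N instructions" strategy are assumptions made for illustration only; a real compiler would have to pick break points where few values are live, and would spend some instructions reading intermediates back in.

```python
def plan_passes(total_instructions, max_per_pass=160,
                live_scalars=0, max_live_scalars=16):
    """Split a long shader into per-pass instruction counts.

    Hypothetical helper: cuts the program into chunks of at most
    `max_per_pass` instructions and checks that the values carried
    between passes fit into the render targets.
    """
    if total_instructions <= 0:
        return []

    # Fill as many full-length passes as possible, then one remainder pass.
    full, rest = divmod(total_instructions, max_per_pass)
    chunks = [max_per_pass] * full
    if rest:
        chunks.append(rest)

    # Intermediates only matter when there is more than one pass.
    if len(chunks) > 1 and live_scalars > max_live_scalars:
        raise ValueError("too many intermediate values to carry between passes")
    return chunks


print(plan_passes(500, live_scalars=12))
```

Under these assumptions a 500-instruction shader needs four passes, three full ones and a short tail, which is the "largest chunks possible" split that keeps the number of round trips to memory low.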
|
|
|
|
|
#2 |
|
Join Date: May 2002
Location: New York, NY
Posts: 12,678
|
Many short shaders in multipass is the worst possible thing you could do for performance when multipassing. If it is possible to do the same thing without multipass, it will be much faster (unless you really tweak out the design so that there is no stall from switching passes... I don't think that's going to be possible with all, or even many, multipass algorithms).
|
|
|
|
|
|
#3 |
|
Junior Member
Join Date: Jul 2002
Posts: 91
|
While this is true, the advantage that the NV30 will have is that for those programs > 160 (or 96) and less than 1024 (well, actually less, depending on number of constants used) all intermediate values will be available in registers rather than being written out to memory. So you get around the write and read penalties associated with this. Of course, as you pointed out the cost isn't quite that bad since you probably won't need to multipass full length shaders very often.
|
|
|
|
|
|
#4 |
|
Join Date: May 2002
Location: New York, NY
Posts: 12,678
|
Actually, multipassing long shaders is less of a performance hit (percentage-wise) as it's already taking so long to execute the shader that there won't be much loss from the stall in switching to the next pass.
There may also be additional hits from multipassing if certain processing needs to be done during each pass (and would only be done once otherwise). I don't really know how common this sort of issue would be, though. |
|
|
|
|
|
#5 |
|
Regular
Join Date: Feb 2002
Location: California
Posts: 4,732
|
I think it depends on the geometry load. If the vertex shaders are long, and there is a large amount of geometry, then multipassing long pixel shaders results in multiple passes over a huge geometry database and evaluating expensive vertex shaders. It also eats up more AGP bus traffic, more frame buffer bandwidth, etc.
|
|
|
|
|
|
#6 | |
|
Member
Join Date: May 2002
Location: Santa Clara
Posts: 584
|
Quote:
The penalties could start to add up if the splitting of the shader was done in an inefficient manner - you would generally want to break the shader into the largest program chunks possible to minimise the round trips to memory and keep the ratio of instructions to reads/writes high. |
|
|
|
|
|
|
#7 |
|
Senior Member
Join Date: Jul 2002
Location: UK
Posts: 1,758
|
If using some large number of ALU instructions and/or dependent texture lookups, then it's (on current hardware) proceeding at fractions of a pixel per clock anyway.
Of course, float intermediates are big and expensive in bandwidth - but if you're only rendering 1 pixel every 4+ clocks, and you've 512 bits of bandwidth per memory clock.... obviously there is plenty to go round. That also obviously gives plenty of time for the geometry and rasterisation to be handled, unless your geometry is down to tiny triangles (which the developer shouldn't allow to happen to avoid geometry aliasing). Also, complex pixel shaders tend to replace geometry (John Carmack again - the use of bump maps in Doom3 to replace geometry). |
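The bandwidth argument above can be checked with some back-of-the-envelope arithmetic. This is only a sketch: the 512 bits per memory clock and the 4 clocks per pixel come from the post, while the four render targets (from the first post), the 32-bit float channels, and the assumption that core and memory run at the same clock are mine. `bandwidth_fraction` is a made-up name.

```python
def bandwidth_fraction(clocks_per_pixel, bits_per_clock=512,
                       render_targets=4, channels=4, bits_per_channel=32):
    """Fraction of available memory bandwidth used by multipass intermediates.

    Assumes each pixel writes all its intermediates once and reads them
    back once in the next pass, and that core clock == memory clock.
    """
    bits_per_pixel = render_targets * channels * bits_per_channel
    bits_moved = 2 * bits_per_pixel              # one write plus one read back
    bits_available = clocks_per_pixel * bits_per_clock
    return bits_moved / bits_available


print(bandwidth_fraction(4))                     # all four float targets in use
print(bandwidth_fraction(4, render_targets=1))   # a single float target
```

Even the worst case here (four full float render targets written and read back) fits inside the bandwidth available while a 4-clock pixel is being shaded, and longer shaders only make the ratio more favourable.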
|
|
|
|
|
#8 |
|
Junior Member
Join Date: Jul 2002
Posts: 91
|
"Write and read penalties will probably be the least of your performance worries if you are running extremely long pixel shaders as execution time within the shader ALU will tend to dominate."
Yep, absolutely. My point was that given equivalent ALU processing power, that would be one of the advantages of the NV30 over R300 (minor as it may be in the big picture). Of course, at this stage we're not sure about the processing power of NV30. With an 8x2 architecture and enough constants to possibly require fewer 'short' passes, it may have another advantage there. As someone pointed out in another thread, what we really need here is how many ops/cycle the GPU can do to determine performance advantages of one design over another.
Another aspect of this that needs to be considered is what is the penalty for uploading a shader to the HW? Presumably they are all cached in VRAM, but it doesn't sound like they are fetched/dispatched out of VRAM. If that's the case and the programs are in some type of register space, what is the context switch latency, especially on long shaders? This is one area that the R300 may actually have an advantage in with its shorter shaders. E.g. NV30 will only have to upload the size of the shader but what if you had a 500 op shader and a 50 op shader that you switched between several times in a frame? Or perhaps developers will now have to sort their tris by shaders to minimize state switch penalties. |
|
|
|
|
|
#9 | |
|
Senior Member
Join Date: Mar 2002
Posts: 3,779
|
Quote:
|
|
|
|
|
|
|
#10 | ||
|
Senior Member
Join Date: Jul 2002
Location: UK
Posts: 1,758
|
Quote:
It could be that R300 is faster than nv30. We just don't know! All we know is nv30 has support for longer programs, and we are attempting to discuss if it is an advantage to have support for longer programs. So far the conclusion appears to be that as long as you have support for a certain size, larger programs should be able to multipass for only a very limited performance cost. Quote:
|
||
|
|
|
|
|
#11 | |
|
Senior Member
Join Date: Mar 2002
Posts: 3,779
|
Quote:
Now, for subsequent passes you would only have to draw a bounding 2D box with a write mask stored in one of the render targets from a previous pass. Assuming you fit all of the previous data in the render targets, these are all you need to continue the pixel shader, without the additional overhead of geometry. I could see complications with MSAA, but the drivers should be able to work around it. |
|
|
|
|
|
|
#12 | |
|
Join Date: May 2002
Location: New York, NY
Posts: 12,678
|
Quote:
I read not too long ago that the original GeForce had 600-800 stages, in one geometry pipeline and four pixel pipelines. That's probably about 100 stages in each pixel pipeline. Modern hardware probably has more. 160 instructions, though it probably wouldn't take 160 clocks, would still allow for a significant stall. Granted, you shouldn't have to stall at every pixel, as the drivers should be able to batch geometry and send a number through at once. I just don't see how it could be insignificant in relation to the pipeline depth. |
|
|
|
|
|
|
#13 |
|
Senior Member
Join Date: Jul 2002
Location: UK
Posts: 1,758
|
The hardware guys work pretty hard to make sure the pipeline stalls as infrequently as possible. There's no requirement that changing the pixel shader requires a complete pipeline flush. It all depends on the implementation.
If you assume you're staying in the same state for 10,000 clocks (that might be less than a hundred pixels with a 1000-instruction shader) then 100 clocks for a stall is clearly pretty insignificant. In reality, most shaders will render a lot more pixels than this (advanced shading on a 10x10 quad is pretty pointless!) so even if the shader has to multipass and take the full hit ten times (which, because of the effort put into reducing pipeline stalls, may not happen) it will still be similarly insignificant. In the future, as pixel shader throughput rises and pipelines get longer, stalls might become more important, but not in this generation. |
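The stall estimate in the post above is simple enough to write down. A minimal sketch, using the post's own numbers (10,000 clocks of shading work, 100 clocks per stall); the function name `stall_overhead` is hypothetical and the model ignores any overlap between a stall and useful work.

```python
def stall_overhead(work_clocks, stall_clocks, stalls=1):
    """Fraction of total time lost to pipeline stalls.

    Simplistic model: every stall costs `stall_clocks` in full and
    nothing else overlaps with it.
    """
    if work_clocks <= 0:
        raise ValueError("work_clocks must be positive")
    stalled = stall_clocks * stalls
    return stalled / (work_clocks + stalled)


# One 100-clock stall against 10,000 clocks of shading work.
print(f"{stall_overhead(10_000, 100):.2%}")
```

With those figures a single stall costs roughly 1% of the total time. Taking the hit several times per batch raises the fraction, which is why the argument leans on each state covering many more pixels than the 10,000-clock example.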
|
|
|
|
|
#14 |
|
Join Date: May 2002
Location: New York, NY
Posts: 12,678
|
The only thing that you have to keep in mind is that as we go into the future, complex shaders will be used on smaller and smaller surfaces. That is, more pieces of the image will have their own unique shaders.
Regardless, the real meat will be in the benchmarks. I'd like to see some highly-complex shaders that have optimized code for both the R300 and NV30, and see how both do. Of particular interest would be auto-multipass generated shaders. It is very true that such benchmarks will not be valid for a few years, and thus will not hold a huge amount of validity at all (since it'll be very hard to predict how games will render in a few years), but may give us some idea of whether or not the increased shader size can really improve performance into the future (or, perhaps just for high-end development, if nVidia wants to market the NV3x in the truly high-end). |
|
|
|
|
|
#15 | ||
|
Senior Member
Join Date: Mar 2002
Posts: 3,779
|
Quote:
Remember the bigger picture - so long as we are concerned so much about performance, we are talking about real-time graphics. Such huge pixel shader lengths (>300) could not be used on the whole scene even at 640x480, so even a 10% decrease in performance for these sections of the screen would not contribute a whole lot to the overall framerate anyway. |
||
|
|
|
|
|
#16 |
|
Monochrome wench
|
While I think really huge pixel shader sizes aren't that important, since multipass can be used for them, vertex shaders could be different: you can't multipass a vertex shader. While it's possible to fall back on software to process a huge vertex shader, in quite possibly all cases the hardware vertex shader is going to be faster than software. And quite a bit faster too, especially if the CPU needs to be doing something else at the same time.
|
|
|
|
|
|
#17 |
|
Join Date: May 2002
Location: New York, NY
Posts: 12,678
|
Not all gaming situations will allow for nice easy multipass.
One example might be a future game that in addition to using highly-complex shaders, also has hordes of enemies (similar to Serious Sam). Even if the hardware could effectively do an entire opponent before switching passes, it would have to do this hundreds of times in some scenes, which could easily slow down performance a significant amount. But, I'd still like to see some benchmarks meant to stress this particular difference in the two architectures. Hopefully we'll have one by the end of the year. |
|
|
|
|
|
#18 |
|
Monochrome wench
|
Yep, the real hard case would be shaders on triangles that are meant to be transparent. It's always hard to multipass transparency.
|
|
|
|
|
|
#19 |
|
Join Date: May 2002
Location: New York, NY
Posts: 12,678
|
I'd imagine that you'd render the first pass to an off-screen buffer for transparent triangles that need multipass.
Of course, that's far from efficient as you'd pretty much need a full-screen buffer for each triangle used in this way (until it is merged during the final pass with the primary framebuffer). |
|
|
|
|
|
#20 |
|
Monochrome wench
|
An offscreen buffer isn't such a bad idea. It would allow you to do some form of simulated refraction if you really wanted to. If everything was set up just right in your engine, you could use the 'back buffer' as a texture, and then play with texture coords to create an effect that would look a bit like refraction, but would not be completely accurate.
|
|
|
|
|
|
#21 |
|
Junior Member
|
Anything over 100 instructions probably won't go so fast even at low resolutions, and so we're not talking about games but offline rendering for, say, 3DStudio Max. In this case the perhaps 5% loss from multipassing isn't a big deal.
By the time the hardware is fast enough to run realtime games at high resolutions with 100+ shader instructions the hardware will support much more than 100 or 160 or whatever. |
|
|
|
|
|
#22 |
|
Junior Member
Join Date: Feb 2002
Posts: 87
|
Multipass vs. Single-pass:
Single pass is most preferred because it typically allows you to maintain the precision without any loss. But the longer it is, the slower it will be. Multi-pass, with DX9, can help alleviate the precision issues between passes. But since you do a lot of internal setup multiple times, you are slowed that way. So single pass is quicker; just how much quicker depends on the situation. |
|
|
|
|
|
#23 | |
|
Regular
Join Date: Feb 2002
Location: California
Posts: 4,732
|
Quote:
This seems overly complicated and I still think it will run slower (because of all the state change/setup) than just using a single pass with a long shader. |
|
|
|
|
|
|
#24 |
|
Senior Member
Join Date: Jul 2002
Location: UK
Posts: 1,758
|
It cannot even be said that on the same piece of hardware multipass is always slower than single pass if the shader program is sufficiently complex.
I have previously performed some research that indicates that there is no requirement that single pass is faster than multipass. Indeed, depending on obscure factors (the implementation, the FIFO sizes in the hardware, etc.) multipass can be significantly faster if pathological conditions are met. I do not know if these pathological conditions will be common. The proof of the pudding will be in the eating. In the same way, knowing that architecture X is specced at '2.4 gigatexels/s' and architecture Y can do '2.0 gigatexels/s' (in the theoretical numbers) doesn't mean that X will run Quake 3 faster than Y. Let's wait for the benchmarks. |
|
|
|
|
|
#25 | |
|
Senior Member
Join Date: Feb 2002
Posts: 2,019
|
Quote:
|
|
|
|
|