Old 30-Jul-2002, 03:16   #1
Mintmaster
Senior Member
 
Join Date: Mar 2002
Posts: 3,779
Default Are high instruction limits really needed for pixel shaders?

I was just thinking about a few things regarding shaders. Now NVidia has continually said that a PS 1.4 shader can ALWAYS be done with PS 1.1 using multiple passes. Now, it isn't always easy to break up a longer program into several smaller ones through multipass (storing intermediate values in the colour buffer) but I can see how it can be done pretty much all the time. PS 1.1 also has other limitations that could create a longer overall shader, and the bandwidth requirements can go up quite a bit due to the extra colour buffer reads and writes between passes.

Now let's apply this to R300. It supports multiple render targets. I'm not sure how many, but I think I read 4 somewhere. This means that it can store 16 scalars or 4 vectors between passes, making it much easier to split up a long, independent calculation (it would be quite hard to think of a shader that would need more values than this ALL the time). These values could be read back into the next shader program as texture inputs (of which there are plenty). Because we are moving towards a high-level shading language, the compiler can figure out where to put these breaks and do multipass automatically. R300 also has a 160 instruction limit, so each segment can be almost this long, thus amortizing the extra multipass bandwidth over this long instruction sequence, which couldn't realistically be bandwidth limited anyway due to its length.
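To make the splitting idea a bit more concrete, here is a minimal Python sketch of how a compiler might place the breaks. It is not ATI's or any driver's actual algorithm. The 160-instruction limit and the 16-scalar budget (4 render targets x 4 channels) are the figures guessed at above; the `Instr` type, the function names, the greedy strategy, and the assumption that every value is written exactly once are all made up for illustration.

```python
from dataclasses import dataclass

MAX_INSTRUCTIONS = 160   # per-pass limit quoted in the post
MAX_LIVE_SCALARS = 16    # 4 render targets x 4 channels, as guessed in the post


@dataclass(frozen=True)
class Instr:
    """Hypothetical one-output instruction: dest = op(srcs)."""
    dest: str
    srcs: tuple = ()


def live_after(program, cut):
    """Scalars written before `cut` that are still read at or after it."""
    written = {ins.dest for ins in program[:cut]}
    needed = {src for ins in program[cut:] for src in ins.srcs}
    return written & needed


def split_into_passes(program):
    """Greedy split: take the longest pass allowed, then back off until the
    values crossing the cut fit into the render targets."""
    passes, start = [], 0
    while start < len(program):
        cut = min(start + MAX_INSTRUCTIONS, len(program))
        while cut < len(program) and len(live_after(program, cut)) > MAX_LIVE_SCALARS:
            cut -= 1
            if cut == start:
                raise ValueError("no cut point fits the render-target budget")
        passes.append(program[start:cut])
        start = cut
    return passes


# A 400-instruction dependency chain: each instruction reads the previous one.
chain = [Instr("r0")] + [Instr(f"r{i}", (f"r{i - 1}",)) for i in range(1, 400)]
print([len(p) for p in split_into_passes(chain)])  # [160, 160, 80]
```

A real compiler would also reorder instructions to shrink the live set at each cut; the sketch only shows that the cut points can be found mechanically.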

You may not even have to resend the geometry either, as you could likely store any remaining vertex shader output values in one of the render targets and do the multipass as a quad. If the pixel shader programs are so long though, it is unlikely that it will be geometry limited anyway.

This idea came to me when I saw how they did raytracing on the 9700 at Siggraph, where they actually did a similar thing to what I'm suggesting. I don't think there will be many shaders with a more complicated final goal than that, yet each intermediate shader was rather simple. All that is missing is the implementation by the driver team in the form of a compiler.

Any thoughts?
Old 30-Jul-2002, 04:47   #2
Chalnoth
 
Join Date: May 2002
Location: New York, NY
Posts: 12,678
Default

Many short shaders in multipass is the worst possible thing you could do for performance when multipassing. If it is possible to do the same thing without multipass, it will be much faster (Unless you really tweak-out the design so that there is no stall from switching passes...I don't think that's going to be possible with all, or even many, multipass algorithms).
Old 30-Jul-2002, 04:51   #3
CMKRNL
Junior Member
 
Join Date: Jul 2002
Posts: 91
Default Re: Are high instruction limits really needed for pixel shad

While this is true, the advantage that the NV30 will have is that for those programs > 160 (or 96) and less than 1024 (well, actually less, depending on number of constants used) all intermediate values will be available in registers rather than being written out to memory. So you get around the write and read penalties associated with this. Of course, as you pointed out the cost isn't quite that bad since you probably won't need to multipass full length shaders very often.
Old 30-Jul-2002, 04:57   #4
Chalnoth
 
Join Date: May 2002
Location: New York, NY
Posts: 12,678
Default

Actually, multipassing long shaders is less of a performance hit (percentage-wise) as it's already taking so long to execute the shader that there won't be much loss from the stall in switching to the next pass.

There may also be additional hits from multipassing if certain processing needs to be done during each pass (and would only be done once otherwise). I don't really know how common this sort of issue would be, though.
Old 30-Jul-2002, 07:11   #5
DemoCoder
Regular
 
Join Date: Feb 2002
Location: California
Posts: 4,732
Default

I think it depends on the geometry load. If the vertex shaders are long, and there is a large amount of geometry, then multipassing long pixel shaders results in multiple passes over a huge geometry database and evaluating expensive vertex shaders. It also eats up more AGP bus traffic, more frame buffer bandwidth, etc
Old 30-Jul-2002, 07:47   #6
andypski
Member
 
Join Date: May 2002
Location: Santa Clara
Posts: 584
Default Re: Are high instruction limits really needed for pixel shad

Quote:
Originally Posted by CMKRNL
While this is true, the advantage that the NV30 will have is that for those programs > 160 (or 96) and less than 1024 (well, actually less, depending on number of constants used) all intermediate values will be available in registers rather than being written out to memory. So you get around the write and read penalties associated with this. Of course, as you pointed out the cost isn't quite that bad since you probably won't need to multipass full length shaders very often.
Write and read penalties will probably be the least of your performance worries if you are running extremely long pixel shaders, as execution time within the shader ALU will tend to dominate.

The penalties could start to add up if the splitting of the shader was done in an inefficient manner - you would generally want to break the shader into the largest program chunks possible to minimise the round trips to memory and keep the ratio of instructions to reads/writes high.
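As a rough illustration of why larger chunks are better, the following Python sketch counts memory round trips for different chunk sizes. The function names and the 480-instruction example are assumptions for illustration only; the model simply treats each pass boundary as one write plus one read of the intermediates.

```python
import math


def round_trips(total_instructions, chunk_size):
    """Intermediate write+read round trips needed between passes."""
    passes = math.ceil(total_instructions / chunk_size)
    return passes - 1


def instructions_per_round_trip(total_instructions, chunk_size):
    """Higher is better: more ALU work amortises each trip to memory."""
    trips = round_trips(total_instructions, chunk_size)
    return float("inf") if trips == 0 else total_instructions / trips


# Hypothetical 480-instruction shader split two different ways.
for chunk in (160, 40):
    print(chunk, round_trips(480, chunk), instructions_per_round_trip(480, chunk))
```

With 160-instruction chunks the hypothetical shader needs 2 round trips; with 40-instruction chunks it needs 11, so the ratio of instructions to reads/writes drops sharply.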
Old 30-Jul-2002, 13:25   #7
Dio
Senior Member
 
Join Date: Jul 2002
Location: UK
Posts: 1,758
Default

If using some large number of ALU instructions and/or dependent texture lookups, then it's (on current hardware) proceeding at fractions of a pixel per clock anyway.

Of course, float intermediates are big and expensive in bandwidth - but if you're only rendering 1 pixel every 4+ clocks, and you've 512 bits of bandwidth per memory clock.... obviously there is plenty to go round.

That also obviously gives plenty of time for the geometry and rasterisation to be handled, unless your geometry is down to tiny triangles (which the developer shouldn't allow to happen to avoid geometry aliasing). Also, complex pixel shaders tend to replace geometry (John Carmack again - the use of bump maps in Doom3 to replace geometry).
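The bandwidth argument above can be put into numbers with a small Python sketch. Only the "1 pixel every 4 clocks" and "512 bits per memory clock" figures come from the post. The sketch assumes 32-bit float channels, four-channel render targets, equal core and memory clocks, and that the pixel rate is for the whole chip; none of that is confirmed hardware behaviour.

```python
BITS_PER_FLOAT = 32  # assumed 32-bit float intermediates


def intermediate_bits_per_pixel(render_targets, channels=4, write_and_read=True):
    """Bits moved per pixel at one pass boundary (write out, then read back)."""
    bits = render_targets * channels * BITS_PER_FLOAT
    return bits * 2 if write_and_read else bits


def bandwidth_fraction(render_targets, clocks_per_pixel, bus_bits_per_clock=512):
    """Share of the memory bus consumed by intermediate traffic."""
    per_clock = intermediate_bits_per_pixel(render_targets) / clocks_per_pixel
    return per_clock / bus_bits_per_clock


print(bandwidth_fraction(render_targets=4, clocks_per_pixel=4))  # 0.5
print(bandwidth_fraction(render_targets=1, clocks_per_pixel=4))  # 0.125
```

Under these assumptions, even spilling all four float render targets at one pixel per four clocks uses about half the bus, and longer shaders (more clocks per pixel) shrink that share further.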
Old 30-Jul-2002, 14:01   #8
CMKRNL
Junior Member
 
Join Date: Jul 2002
Posts: 91
Default Re: Are high instruction limits really needed for pixel shad


Quote:
Originally Posted by andypski
Write and read penalties will probably be the least of your performance worries if you are running extremely long pixel shaders, as execution time within the shader ALU will tend to dominate.
Yep, absolutely. My point was that given equivalent ALU processing power, that would be one of the advantages of the NV30 over R300 (minor as it may be in the big picture). Of course, at this stage we're not sure about the processing power of NV30. With an 8x2 architecture and enough constants to possibly require fewer 'short' passes, it may have another advantage there. As someone pointed out in another thread, what we really need here is how many ops/cycle the GPU can do to determine performance advantages of one design over another.

Another aspect of this that needs to be considered: what is the penalty for uploading a shader to the HW? Presumably they are all cached in VRAM, but it doesn't sound like they are fetched/dispatched out of VRAM. If that's the case and the programs are in some type of register space, what is the context switch latency, especially on long shaders? This is one area where the R300 may actually have an advantage with its shorter shaders. E.g. NV30 will only have to upload the size of the shader, but what if you had a 500 op shader and a 50 op shader that you switched between several times in a frame? Or perhaps developers will now have to sort their tris by shaders to minimize state switch penalties.
Old 30-Jul-2002, 14:25   #9
Mintmaster
Senior Member
 
Join Date: Mar 2002
Posts: 3,779
Default

Quote:
Originally Posted by Chalnoth
Actually, multipassing long shaders is less of a performance hit (percentage-wise) as it's already taking so long to execute the shader that there won't be much loss from the stall in switching to the next pass.

There may also be additional hits from multipassing if certain processing needs to be done during each pass (and would only be done once otherwise). I don't really know how common this sort of issue would be, though.
This is what I was saying. At 160 instructions, the execution time would be long enough that the multipass overhead would be next to insignificant - maybe only a few percent of execution time. I assume when you started this post with "Actually", you are reconsidering your first statement.
Old 30-Jul-2002, 14:33   #10
Dio
Senior Member
 
Join Date: Jul 2002
Location: UK
Posts: 1,758
Default Re: Are high instruction limits really needed for pixel shad

Quote:
Originally Posted by CMKRNL
Yep, absolutely. My point was that given equivalent ALU processing power, that would be one of the advantages of the NV30 over R300 (minor as it may be in the big picture). Of course, at this stage we're not sure about the processing power of NV30.
As you say, there will be few advantages one way or the other except the performance of the pixel shader unit and texture lookup unit. But going on to assume nv30 is faster - well, that's not much to do with 'are high instruction limits really needed for pixel shaders' - is it relevant to this thread?

It could be that R300 is faster than nv30. We just don't know! All we know is nv30 has support for longer programs, and we are attempting to discuss if it is an advantage to have support for longer programs.

So far the conclusion appears to be that as long as you have support for a certain size, larger programs should be able to multipass for only a very limited performance cost.


Quote:
Originally Posted by CMKRNL
Or perhaps developers will now have to sort their tris by shaders to minimize state switch penalties.
Undoubtedly, they will have to do this, wherever the shaders are stored or how big your instruction count is. Texture cache coherency is a reason many engines (e.g. Quake3) already sort by shader.
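A small Python sketch of what sorting by shader buys you: it counts shader state changes for a draw list before and after sorting. The draw list, the shader names, and the `count_state_changes` helper are hypothetical, and the sketch only makes sense for opaque geometry, where draw order is free to change.

```python
def count_state_changes(draw_calls):
    """Count how often the bound shader changes while walking the draw list."""
    changes, current = 0, None
    for shader, _mesh in draw_calls:
        if shader != current:
            changes += 1
            current = shader
    return changes


# Hypothetical frame that alternates between a long and a short shader.
draws = [
    ("long_500op", "mesh_a"),
    ("short_50op", "mesh_b"),
    ("long_500op", "mesh_c"),
    ("short_50op", "mesh_d"),
]
sorted_draws = sorted(draws, key=lambda call: call[0])

print(count_state_changes(draws))         # 4
print(count_state_changes(sorted_draws))  # 2
```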
Old 30-Jul-2002, 14:48   #11
Mintmaster
Senior Member
 
Join Date: Mar 2002
Posts: 3,779
Default

Quote:
Originally Posted by DemoCoder
I think it depends on the geometry load. If the vertex shaders are long, and there is a large amount of geometry, then multipassing long pixel shaders results in multiple passes over a huge geometry database and evaluating expensive vertex shaders. It also eats up more AGP bus traffic, more frame buffer bandwidth, etc
Because you can store quite a few values in between, it may not be necessary to resend the geometry. Think of what the pixel shader actually needs from the vertex shader. Things like texture coordinates, texture samples, diffuse/specular colours, and I guess anything else that is new to DX9 (maybe extra variables for vertex to pixel shader communication?). If you can use as many of those values as needed in the first pass, you could store the remaining values in the render target.

Now, for subsequent passes you would only have to draw a bounding 2D box with a write mask stored in one of the render targets from a previous pass. Assuming you fit all of the previous data in the render targets, these are all you need to continue the pixel shader, without the additional overhead of geometry. I could see complications with MSAA, but the drivers should be able to work around it.
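Here is a toy CPU-side simulation in Python of the two-step scheme just described. It is only a sketch: 2D lists stand in for render targets, and `first_pass`, `quad_pass`, and the shading lambdas are hypothetical names, not a real graphics API.

```python
def first_pass(width, height, covered, shade_a):
    """Rasterise the real geometry once, storing an intermediate value and a
    coverage mask per pixel (stand-ins for two render targets)."""
    intermediate = [[0.0] * width for _ in range(height)]
    mask = [[False] * width for _ in range(height)]
    for x, y in covered:
        intermediate[y][x] = shade_a(x, y)
        mask[y][x] = True
    return intermediate, mask


def quad_pass(intermediate, mask, bbox, shade_b):
    """Later pass: draw only a 2D bounding box, read the stored value back
    like a texture, and skip pixels the object never touched."""
    x0, y0, x1, y1 = bbox
    out = [row[:] for row in intermediate]
    for y in range(y0, y1 + 1):
        for x in range(x0, x1 + 1):
            if mask[y][x]:
                out[y][x] = shade_b(intermediate[y][x])
    return out


# Two covered pixels on a 4x3 target; the second half of the "shader"
# runs without touching the geometry again.
covered = [(1, 1), (2, 1)]
stage_a = lambda x, y: float(x + y)
stage_b = lambda value: value * 10.0
inter, mask = first_pass(4, 3, covered, stage_a)
final = quad_pass(inter, mask, (1, 1, 2, 1), stage_b)
print(final[1][1], final[1][2])  # 20.0 30.0
```

The result matches running `stage_b(stage_a(x, y))` in a single pass for every covered pixel, which is the property the multipass split has to preserve.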
Old 30-Jul-2002, 14:49   #12
Chalnoth
 
Join Date: May 2002
Location: New York, NY
Posts: 12,678
Default

Quote:
Originally Posted by Mintmaster
This is what I was saying. At 160 instructions, the execution time would be long enough that the multipass overhead would be next to insignificant - maybe only a few percent of execution time. I assume when you started this post with "Actually", you are reconsidering your first statement.
Do you know how long it takes to execute a 160-instruction shader? Or how long a pipeline stall from a state change is?

I read not too long ago that the original GeForce had 600-800 stages, in one geometry pipeline and four pixel pipelines. That's probably about 100 stages in each pixel pipeline. Modern hardware probably has more. 160 instructions, though it probably wouldn't take 160 clocks, would still allow for a significant stall.

Granted, you shouldn't have to stall at every pixel, as the drivers should be able to batch geometry and send a number through at once. I just don't see how it could be insignificant in relation to the pipeline depth.
Old 30-Jul-2002, 16:11   #13
Dio
Senior Member
 
Join Date: Jul 2002
Location: UK
Posts: 1,758
Default

The hardware guys work pretty hard to make sure the pipeline stalls as infrequently as possible. There's no requirement that changing the pixel shader requires a complete pipeline flush. It all depends on the implementation.

If you assume you're staying in the same state for 10,000 clocks (that might be less than a hundred pixels with a 1000-instruction shader) then 100 clocks for a stall is clearly pretty insignificant.

In reality, most shaders will render a lot more pixels than this (advanced shading on a 10x10 quad is pretty pointless!) so even if the shader has to multipass and take the full hit ten times (which, because of the effort put into reducing pipeline stalls, may not happen) it will still be similarly insignificant.

In the future, as pixel shader throughput rises and pipelines get longer, stalls might become more important, but not in this generation.
Old 30-Jul-2002, 16:20   #14
Chalnoth
 
Join Date: May 2002
Location: New York, NY
Posts: 12,678
Default

The only thing that you have to keep in mind is that as we go into the future, complex shaders will be used on smaller and smaller surfaces. That is, more pieces of the image will have their own unique shaders.

Regardless, the real meat will be in the benchmarks. I'd like to see some highly-complex shaders that have optimized code for both the R300 and NV30, and see how both do. Of particular interest would be auto-multipass generated shaders.

It is very true that such benchmarks will not be valid for a few years, and thus will not hold a huge amount of validity at all (since it'll be very hard to predict how games will render in a few years), but may give us some idea of whether or not the increased shader size can really improve performance into the future (or, perhaps just for high-end development, if nVidia wants to market the NV3x in the truly high-end).
Old 30-Jul-2002, 16:38   #15
Mintmaster
Senior Member
 
Join Date: Mar 2002
Posts: 3,779
Default

Quote:
Originally Posted by Chalnoth
Quote:
Originally Posted by Mintmaster
This is what I was saying. At 160 instructions, the execution time would be long enough that the multipass overhead would be next to insignificant - maybe only a few percent of execution time. I assume when you started this post with "Actually", you are reconsidering your first statement.
Do you know how long it takes to execute a 160-instruction shader? Or how long a pipeline stall from a state change is?

I read not too long ago that the original GeForce had 600-800 stages, in one geometry pipeline and four pixel pipelines. That's probably about 100 stages in each pixel pipeline. Modern hardware probably has more. 160 instructions, though it probably wouldn't take 160 clocks, would still allow for a significant stall.

Granted, you shouldn't have to stall at every pixel, as the drivers should be able to batch geometry and send a number through at once. I just don't see how it could be insignificant in relation to the pipeline depth.
In that ATI education flash animation, they said the pipeline can execute a texture lookup, a texture address op, and a math op per clock. Suppose that the instruction sequence took 50 clocks to complete per pixel pipe. A measly 10000 pixel object (100x100, only about 1/79 of a 1024x768 screen) would then take 10000/8*50=62500 clocks to complete. If it takes 1500 clocks to set up between passes, that's only about 2.4% additional overhead.
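The arithmetic above as a small Python sketch, so the numbers can be varied. The 8 pipes, 50 clocks per pixel, 10000 pixels, and 1500 setup clocks are the guesses from this post, not measured values; the function names are made up.

```python
def shading_clocks(pixels, clocks_per_pixel, pipes=8):
    """Clocks to shade an object when `pipes` pixels are in flight at once."""
    return pixels / pipes * clocks_per_pixel


def multipass_overhead(pixels, clocks_per_pixel, setup_clocks, pipes=8):
    """Pass-switch setup cost as a fraction of the shading time."""
    return setup_clocks / shading_clocks(pixels, clocks_per_pixel, pipes)


print(shading_clocks(10_000, 50))            # 62500.0
print(multipass_overhead(10_000, 50, 1500))  # about 0.024
```

Note how the fraction scales: halving the object's pixel count doubles the relative overhead, which is where the disagreement over small objects comes from.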

Remember the bigger picture - so long as we are concerned so much about performance, we are talking about real-time graphics. Such huge pixel shader lengths (>300) could not be used on the whole scene even at 640x480, so even a 10% decrease in performance for these sections of the screen would not contribute a whole lot to the overall framerate anyway.
Old 30-Jul-2002, 16:41   #16
Colourless
Monochrome wench
 
Join Date: Feb 2002
Location: Somewhere in outback South Australia
Posts: 1,254
Default

While I'm thinking that really huge pixel shader sizes aren't that important since multipass can be used for them, vertex shaders could be different: you can't multipass a vertex shader. While it's possible to fall back on software to process a huge vertex shader, in quite possibly all cases the hardware vertex shader is going to be faster than software. And quite a bit faster too, especially if the CPU needs to be doing something else at the same time.
Old 30-Jul-2002, 16:53   #17
Chalnoth
 
Join Date: May 2002
Location: New York, NY
Posts: 12,678
Default

Not all gaming situations will allow for nice easy multipass.

One example might be a future game that in addition to using highly-complex shaders, also has hordes of enemies (similar to Serious Sam). Even if the hardware could effectively do an entire opponent before switching passes, it would have to do this hundreds of times in some scenes, which could easily slow down performance a significant amount.

But, I'd still like to see some benchmarks meant to stress this particular difference in the two architectures. Hopefully we'll have one by the end of the year.
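To get a feel for the horde scenario above, here is a back-of-the-envelope Python sketch that expresses pass-switch setup as a share of the frame budget. Every input is an assumption: the 1500-clock setup figure is borrowed from Mintmaster's estimate earlier in the thread, and the object count, switches per object, core clock, and frame rate are round numbers picked for illustration.

```python
def setup_share_of_frame(objects, switches_per_object, setup_clocks, core_hz, fps):
    """Fraction of one frame's clock budget spent on pass-switch setup."""
    setup_total = objects * switches_per_object * setup_clocks
    frame_budget = core_hz / fps
    return setup_total / frame_budget


# 300 enemies, 3 pass switches each, 1500 clocks per switch,
# hypothetical 300 MHz core at 60 fps.
print(setup_share_of_frame(300, 3, 1500, 300e6, 60))  # about 0.27
```

With these made-up inputs the setup cost is around a quarter of the frame, so whether multipass hurts depends heavily on how many objects need their own pass switches and how expensive a switch really is.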
Old 30-Jul-2002, 17:26   #18
Colourless
Monochrome wench
 
Join Date: Feb 2002
Location: Somewhere in outback South Australia
Posts: 1,254
Default

Yep, the real hard case would be shaders on triangles that are meant to be transparent. It's always hard to multipass transparency.
Old 30-Jul-2002, 17:38   #19
Chalnoth
 
Join Date: May 2002
Location: New York, NY
Posts: 12,678
Default

I'd imagine that you'd render the first pass to an off-screen buffer for transparent triangles that need multipass.

Of course, that's far from efficient as you'd pretty much need a full-screen buffer for each triangle used in this way (until it is merged during the final pass with the primary framebuffer).
Old 30-Jul-2002, 18:21   #20
Colourless
Monochrome wench
 
Join Date: Feb 2002
Location: Somewhere in outback South Australia
Posts: 1,254
Default

An offscreen buffer isn't such a bad idea. It would allow you to do some form of simulated refraction if you really wanted to. If everything was set up just right in your engine, you could use the 'back buffer' as a texture, and then play with texture coords to create an effect that would look a bit like refraction, but would not be completely accurate.
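A minimal Python sketch of the trick, assuming the back buffer can be read like a texture: the transparent surface looks up the already-rendered scene at a shifted coordinate. The `sample` and `fake_refraction` helpers are hypothetical, and in a real shader the offset would typically come from something like a bump map rather than being passed in directly.

```python
def sample(buffer, x, y):
    """Point sample of a 2D list with clamp-to-edge addressing."""
    height, width = len(buffer), len(buffer[0])
    return buffer[min(max(y, 0), height - 1)][min(max(x, 0), width - 1)]


def fake_refraction(back_buffer, x, y, offset):
    """Return the scene colour seen 'through' pixel (x, y), displaced by
    a per-pixel offset instead of a physically correct refraction ray."""
    dx, dy = offset
    return sample(back_buffer, x + dx, y + dy)


scene = [
    [0, 1, 2],
    [3, 4, 5],
    [6, 7, 8],
]
print(fake_refraction(scene, 1, 1, (1, 0)))  # 5
print(fake_refraction(scene, 2, 2, (3, 3)))  # 8 (clamped to the edge)
```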
Old 30-Jul-2002, 20:05   #21
Scott C
Junior Member
 
Join Date: Jun 2002
Posts: 43
Default

Anything over 100 instructions probably won't go so fast even at low resolutions, so we're not talking about games but offline rendering for, say, 3D Studio Max. In this case the perhaps 5% loss from multipassing isn't a big deal.

By the time the hardware is fast enough to run realtime games at high resolutions with 100+ shader instructions the hardware will support much more than 100 or 160 or whatever.
Old 30-Jul-2002, 20:08   #22
nooneyouknow
Junior Member
 
Join Date: Feb 2002
Posts: 87
Default

Multipass vs. Single-pass:

Single pass is most preferred because it typically allows you to maintain the precision without any loss. But the longer it is, the slower it will be.

Multi-pass, with DX9, can help alleviate the precision issues between passes. But since you do a lot of internal setup multiple times, you are slowed that way.

So, single pass is quicker; just how much quicker depends on the situation.
Old 30-Jul-2002, 20:50   #23
DemoCoder
Regular
 
Join Date: Feb 2002
Location: California
Posts: 4,732
Default

Quote:
Originally Posted by Mintmaster
Now, for subsequent passes you would only have to draw a bounding 2D box with a write mask stored in one of the render targets from a previous pass. Assuming you fit all of the previous data in the render targets, these are all you need to continue the pixel shader, without the additional overhead of geometry. I could see complications with MSAA, but the drivers should be able to work around it.
You still need the screen z values to do perspective correct texture sampling and shading. When I render with these hypothetical 2D bounding boxes, where do I get the Z from? Turn Z-buffer into texture and sample it?

This seems overly complicated and I still think it will run slower (because of all the state change/setup) than just using a single pass with long shader.
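For readers wondering why depth matters here, this Python sketch shows the standard perspective-correct interpolation of an attribute along a screen-space span, next to plain affine interpolation. The function names and the sample values are illustrative only; `w` is the per-vertex clip-space w (depth-like) term.

```python
def affine_lerp(a0, a1, t):
    """Plain linear interpolation in screen space."""
    return a0 + (a1 - a0) * t


def perspective_lerp(a0, w0, a1, w1, t):
    """Interpolate a/w and 1/w linearly in screen space, then divide."""
    numerator = affine_lerp(a0 / w0, a1 / w1, t)
    denominator = affine_lerp(1.0 / w0, 1.0 / w1, t)
    return numerator / denominator


# Texture coordinate going 0 -> 1 along an edge that recedes from w=1 to w=3.
print(affine_lerp(0.0, 1.0, 0.5))                 # 0.5
print(perspective_lerp(0.0, 1.0, 1.0, 3.0, 0.5))  # about 0.25
```

A flat 2D quad has a constant w, so it cannot reproduce the second result by itself; the per-pixel depth (or the already-interpolated coordinates) would have to be carried over from the first pass.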
Old 30-Jul-2002, 22:58   #24
Dio
Senior Member
 
Join Date: Jul 2002
Location: UK
Posts: 1,758
Default

It cannot even be said that on the same piece of hardware multipass is always slower than single pass if the shader program is sufficiently complex.

I have previously performed some research that indicates that there is no requirement that single pass is faster than multipass. Indeed, depending on obscure factors (the implementation, the FIFO sizes in the hardware, etc.) multipass can be significantly faster if pathological conditions are met. I do not know if these pathological conditions will be common.

The proof of the pudding will be in the eating. In the same way that knowing that architecture X is specced at '2.4 gigatexels/s' and architecture Y can do '2.0 gigatexels/s' (in the theoretical numbers) doesn't mean that X will run Quake 3 faster than Y. Let's wait for the benchmarks.
Old 31-Jul-2002, 04:07   #25
3dcgi
Senior Member
 
Join Date: Feb 2002
Posts: 2,019
Default

Quote:
This idea came to me when I saw how they did raytracing on the 9700 at Siggraph, where they actually did a similar thing to what I'm suggesting. I don't think there will be many shaders with a more complicated final goal than that, yet each intermediate shader was rather simple. All that is missing is the implementation by the driver team in the form of a compiler.
During the Siggraph demo they ran Quake 3 on the 9700 with raytraced shadows. It didn't look like all of the textures were on the models and the only things raytraced were the shadows. It looked to be running at around 10 fps and the speaker said there were about 500 passes to do the ray tracing. So it is definitely possible to do a lot of passes, but the performance is obviously affected. Still it didn't look too bad.
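Taking the rough figures above at face value (about 10 fps and about 500 passes per frame, both eyeballed from the demo), a quick Python sketch gives the implied pass rate. The function names are made up for illustration.

```python
def passes_per_second(fps, passes_per_frame):
    """Total passes the hardware gets through each second."""
    return fps * passes_per_frame


def ms_per_pass(fps, passes_per_frame):
    """Average wall-clock time available per pass, in milliseconds."""
    return 1000.0 / passes_per_second(fps, passes_per_frame)


print(passes_per_second(10, 500))  # 5000
print(ms_per_pass(10, 500))        # about 0.2
```

So the demo was averaging roughly 0.2 ms per pass, including whatever setup each pass switch cost.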
