Are high instruction limits really needed for pixel shaders?

Anything over 100 instructions probably won't run fast even at low resolutions, so we're not talking about games but offline rendering for, say, 3D Studio Max. In that case the perhaps 5% loss from multipassing isn't a big deal.

By the time the hardware is fast enough to run realtime games at high resolutions with 100+ shader instructions, the hardware will support much more than 100 or 160 or whatever.
 
Multipass vs. Single-pass:

Single pass is most preferred because it typically allows you to maintain precision without any loss. But the longer the shader is, the slower it will be.

Multi-pass, with DX9, can help alleviate the precision issues between passes. But since you do a lot of internal setup multiple times, you are slowed down that way.

So, single pass is quicker; just how much quicker depends on the situation.
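
To make the DX9 precision point concrete, here is a minimal D3D9-style sketch (my own illustration, assuming an initialized IDirect3DDevice9* device, a width/height, and an existing backBuffer surface; error checking omitted) of using a floating-point render target so intermediates aren't clamped to 8 bits between passes:

Code:
// A float render target so intermediates keep full precision between
// passes. Assumes an initialized IDirect3DDevice9* device plus width,
// height and backBuffer from the app; error checking omitted.
IDirect3DTexture9* intermediate = NULL;
device->CreateTexture(width, height, 1, D3DUSAGE_RENDERTARGET,
                      D3DFMT_A32B32G32R32F,   // 128-bit float, no 8-bit clamp
                      D3DPOOL_DEFAULT, &intermediate, NULL);

IDirect3DSurface9* rtSurface = NULL;
intermediate->GetSurfaceLevel(0, &rtSurface);

device->SetRenderTarget(0, rtSurface);    // pass 1: write full-precision results
// ... draw with the first half of the split shader ...

device->SetRenderTarget(0, backBuffer);   // pass 2: read them back losslessly
device->SetTexture(0, intermediate);
// ... draw with the second half ...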
 
Mintmaster said:
Now, for subsequent passes you would only have to draw a bounding 2D box with a write mask stored in one of the render targets from a previous pass. Assuming you fit all of the previous data in the render targets, these are all you need to continue the pixel shader, without the additional overhead of geometry. I could see complications with MSAA, but the drivers should be able to work around it.

You still need the screen-space Z values to do perspective-correct texture sampling and shading. When I render with these hypothetical 2D bounding boxes, where do I get the Z from? Turn the Z-buffer into a texture and sample it?

This seems overly complicated, and I still think it will run slower (because of all the state changes and setup) than just using a single pass with a long shader.
 
It cannot even be said that, on the same piece of hardware, multipass is always slower than single pass if the shader program is sufficiently complex.

I have previously done some research indicating that there is no guarantee that single pass is faster than multipass. Indeed, depending on obscure factors (the implementation, the FIFO sizes in the hardware, etc.), multipass can be significantly faster if certain pathological conditions are met. I do not know how common those pathological conditions will be.

The proof of the pudding will be in the eating. Knowing that architecture X is specced at '2.4 gigatexels/s' while architecture Y can do '2.0 gigatexels/s' (theoretical numbers) doesn't mean that X will run Quake 3 faster than Y. Let's wait for the benchmarks.
 
This idea came to me when I saw how they did raytracing on the 9700 at Siggraph, where they actually did something similar to what I'm suggesting. I don't think there will be many shaders with a more complicated final goal than that, yet each intermediate shader was rather simple. All that is missing is the implementation by the driver team in the form of a compiler.

During the Siggraph demo they ran Quake 3 on the 9700 with raytraced shadows. It didn't look like all of the textures were on the models, and the only things raytraced were the shadows. It looked to be running at around 10 fps, and the speaker said there were about 500 passes to do the ray tracing. So it is definitely possible to do a lot of passes, but the performance is obviously affected. Still, it didn't look too bad.
 
This is interesting... At 500 passes the instruction count is *off the hook*, yet it still managed 10 fps. That means that (major generalization) the 9700 could run instruction counts as high as the NV30's touted 1024 with more than playable frame rates. In fact, 10 passes would be a joke compared to 500; we could be talking some serious FPS.

At least that's my logical conclusion.
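
To put rough numbers on that conclusion (my own naive arithmetic, not anything from the demo): 500 passes at 10 fps works out to about 5,000 pass-renders per second, so a 10-pass effect could in principle hit something like 500 fps by the same measure. Per-frame fixed costs (geometry submission, state changes, presentation) don't shrink with pass count, so the real figure would be well below that, but the headroom is clearly huge.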
 
Colourless said:
Yep, the real hard case would be shaders on triangles that are meant to be transparent. It's always hard to multipass transparency
Funnily enough, it was actually quite easy to do this on Dreamcast because it had two mini-buffers on chip. Of course, it didn't have a massive amount of fillrate, so you wouldn't want to do too many passes on too many objects.
 
Multipass transparency isn't hard. You just use offscreen buffers (plural, now, it seems) the same size as the screen to store intermediates, and then use them as texture sources for the next pass.
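
For illustration, a hedged D3D9-style sketch of that arrangement, assuming two pre-created screen-sized render-target textures rt[0]/rt[1] with their level-0 surfaces in surf[0]/surf[1] (names are mine, error checking omitted):

Code:
// Ping-pong between two screen-sized offscreen buffers so each pass
// reads the previous pass's output as a texture. rt[]/surf[] are assumed
// pre-created D3DUSAGE_RENDERTARGET textures and their surfaces.
int src = 0, dst = 1;
for (int pass = 0; pass < numPasses; ++pass) {
    device->SetRenderTarget(0, surf[dst]);  // write this pass's intermediates
    device->SetTexture(0, rt[src]);         // read the previous pass's output
    // ... draw the transparent geometry with this pass's shader ...
    int tmp = src; src = dst; dst = tmp;    // previous output becomes next input
}
device->SetTexture(0, rt[src]);             // final intermediate feeds the blend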
 
Well, I just thought of a benefit of supporting very long shaders for real-time gaming.

Here's what I'm assuming: n pixel shader pipelines operating in lockstep (at any given time each pipe is running the same instruction of the same pixel shader), a 1024 max instruction count, and a single i-cache large enough to hold them all.

Say you have an app with 64-instruction shaders on average. Sixteen of these could be held in the cache. Furthermore, precomputed values for pixel shader "temp storage" and perhaps even texturing state could be placed in the i-cache.

So here's my hypothetical scenario. The vertex shader receives some geometry using vs A and ps A and starts crunching it. Next in the command sequence (presumably buffered to some extent) is a request to render some other geometry with vs B and ps B.

The pixel shader i-cache is filled with ps B and associated setup information while ps A is still running, leading to much lower shader-switch overhead.
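
To make that concrete, here's a toy C++ sketch of the double-banked arrangement; every name in it is hypothetical, standing in for hardware/driver machinery rather than any real interface:

Code:
#include <cstdint>

// Toy sketch: two i-cache banks, one executing while the other is
// filled for the next draw. All names are hypothetical illustrations.
struct ICacheBank {
    uint32_t instructions[1024]; // the full 1024-slot program space
    int      shaderId = -1;      // which shader currently occupies the bank
};

// Stand-ins for the async DMA hardware that would do the fill.
void beginDmaFill(ICacheBank& target, int shaderId) { /* start async copy */ }
void waitForDmaDone(ICacheBank& target) { /* block until fill completes */ }

ICacheBank bank[2];
int active = 0;

// Called when the command stream sees a draw using a new pixel shader.
void prefetchNextShader(int nextShaderId) {
    ICacheBank& standby = bank[1 - active];
    if (standby.shaderId != nextShaderId) {
        beginDmaFill(standby, nextShaderId); // fill behind the running shader
        standby.shaderId = nextShaderId;
    }
}

// Called when the pipes drain and the switch actually happens.
void switchShader() {
    waitForDmaDone(bank[1 - active]); // ideally already done: latency hidden
    active = 1 - active;              // flip banks; switch cost is near zero
}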

comments?

Regards,
Serge
 
psurge:
Could well work, possibly managed in the driver if the hardware supported specifying a non-zero execution address.

In fact something similar to this is possible using vertex shaders on Xbox. You can literally fill the memory with multiple shaders and not pay the normally large cost associated with shader loads.
 
ERP, thanks for the info. Since you seem to know about this stuff, could you describe how setup for the constant registers is handled in VSes? Isn't there a separate shader you run to set these up?

Also, it seems that switching vertex programs would be much faster than switching pixel shaders, since you don't have to manage setup for texturing engines, interpolators (texture cache flushes?).

On a somewhat related note, I think it would be beneficial to allow arbitrary assignment of pixels to pipes (so long as each pixel needs the same shading program executed for it), even if the pixels are from multiple triangles.
 
ERP, thanks for the info. Since you seem to know about this stuff, could you describe how setup for the constant registers is handled in VSes? Isn't there a separate shader you run to set these up?

I only really know the specifics of the NV2A.
The values are just shoved into the push buffer following a command that tells the chip how many there are and where to put them.
It is possible on NV2A to run a special vertex shader program that can write to the constant registers. Because of the nature of the registers and the way the instruction stream is executed, this sort of program will be slower, and can be a LOT slower, than a standard vertex program. There is no way to access this functionality in DX8, so I assume it is not common on DX8-class hardware.
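
For reference, the DX8 API side of that push-buffer traffic looks like the following sketch (assumes an initialized IDirect3DDevice8* device and the d3dx8 math types; fogStart/fogEnd are made-up app values, error checking omitted):

Code:
// Loading vertex shader constants in DX8; under the hood this is the
// "shove values into the push buffer" traffic described above.
D3DXMATRIX wvp; // world-view-projection, assumed already computed and
                // transposed the way the shader expects
device->SetVertexShaderConstant(0, &wvp, 4);       // 4 registers: c0..c3

float fogParams[4] = { fogStart, fogEnd, 0.0f, 0.0f };
device->SetVertexShaderConstant(4, fogParams, 1);  // 1 register: c4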

Also, it seems that switching vertex programs would be much faster than switching pixel shaders, since you don't have to manage setup for texturing engines, interpolators (texture cache flushes?).

A lot of people seem to believe that state changes are expensive, and this used to be true. However, most modern hardware is much less state-sensitive; on NV2A at least, very few state changes cause significant pipeline stalls. Literally removing all texture and pixel shader related state changes in our app made no measurable performance difference, and we only partially sort on material.
On a DX8 part my guess would be that vertex shaders would be more expensive to change than pixel shaders, but this is simply because of their size and the necessity of loading a number of constants (matrices etc.) before they can be used.


On a somewhat related note, I think it would be beneficial to allow arbitrary assignment of pixels to pipes (so long as each pixel needs the same shading program executed for it), even if the pixels are from multiple triangles.

While I agree (although just allowing it inside a tri would probably be enough), and it would alleviate the poor fillrate exhibited on small triangles, I imagine it would require a significant amount of logic to implement.
 
ERP said:
psurge:
Could well work, possibly managed in the driver if the hardware supported specifying a non 0 execution address.

In fact something similar to this is possible using vertex shaders on Xbox. You can literally fill the memory with multiple shaders and not pay the normally large cost associated with shader loads.

You don't think that the NV2* architecture loads the programs into registers on the chip? The programs are only 128 instructions long at most, and you could store one instruction per dword. That's a total of a measly 512 bytes. They're executed on every single vertex, so what benefit would there be to keeping the program in memory rather than on chip? You'd greatly increase the memory load doing that, and the only way you'd get an advantage is if you drew thousands of objects that had only a few vertices and each had their own shader, which I bet would never happen in a real game.


psurge, isn't your argument supporting small shader program sizes? It's quite possible that ATI has some on-chip memory to store more than one PS, but they just had some architectural (or maybe just practical) reason for limiting the pixel shader execution size. Even so, I don't think it would take that long for a chip to change shaders. You wouldn't need an entire pipeline flush if it was designed properly.
 
You don't think that the NV2* architecture loads the programs into registers on the chip? The programs are only 128 instructions long at most, and you could store one instruction per dword. That's a total of a measly 512 bytes. They're executed on every single vertex, so what benefit would there be to keeping the program in memory rather than on chip? You'd greatly increase the memory load doing that, and the only way you'd get an advantage is if you drew thousands of objects that had only a few vertices and each had their own shader, which I bet would never happen in a real game.

To clarify, the memory is on chip.
But more often than not vertex shaders are nearer 20 or 30 instructions than 128, so you can load a number of shaders into the available 128 instruction slots and then swap vertex programs at no cost.
If the shader isn't loaded, then normally the code for the shader is placed directly into the pushbuffer, and it's enough data that it can cause the front end of the vertex pipeline to starve.

In our app, for example, we load shaders only twice for the whole scene, a total of somewhere between 20 and 30 shaders, as opposed to having to load the shaders for each material change, which occurs 200-400 times a frame. FWIW it's a measurable performance difference (around 5%).
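
As a purely hypothetical sketch of the bookkeeping involved (loadProgramAt/selectProgram are stand-ins, not real XDK calls), the load-once-and-switch-by-address idea looks something like:

Code:
#include <cstdint>

// Pack several small vertex shaders into the 128 on-chip instruction
// slots once, then switch between them by start address alone.
void loadProgramAt(int startSlot, const uint32_t* code, int length) { /* one-time pushbuffer upload */ }
void selectProgram(int startSlot) { /* point execution at the slot */ }

struct LoadedShader { int startSlot; int length; };
LoadedShader table[32];
int nextSlot = 0, shaderCount = 0;

// Returns a handle, or -1 if the combined shaders exceed the slot budget.
int loadOnce(const uint32_t* code, int length) {
    if (nextSlot + length > 128) return -1;
    loadProgramAt(nextSlot, code, length);
    table[shaderCount] = { nextSlot, length };
    nextSlot += length;
    return shaderCount++;
}

// Per-material switch: no code upload, just a new start address.
void useShader(int handle) {
    selectProgram(table[handle].startSlot);
}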
 
Small note: VS shader instructions are most likely 64-bit, not 32-bit.

One thing I've seen in NVIDIA's optimization documents is that they say switching between a few small shaders is fast, while switching between many and/or long shaders is slow. That seems to imply that it's not just NV2A that can put more than one shader in internal memory, as long as they are small.
 
This could imply that having space for 1024 ops has a second advantage besides accommodating very complex shaders in a single pass. Technically, they should be able to store considerably more shaders in the same program space and switch between them with very low latency. It's unclear what the exact program space on the R300 is. Is the 64 color/64 alpha ops what is in hardware, or only what is exposed due to DX9? It's certainly possible that they support 256 or something internally, but I would have thought they would tout it in their marketecture documents if that were the case.
 
Mintmaster,

Err, not sure if I understand what you meant, but... we are talking about loading programs into on-chip caches (like the i-cache in Athlons/P4s). Loading into a register file doesn't make sense to me.

I'm not arguing that support for long shaders is necessary. I'm just commenting that when you do support such long shaders, you may get benefits for the relatively short ones likely to be used in the near future (since everyone has been saying that 1024 instruction slots is massive overkill for real-time applications). It wasn't meant as an argument for or against NV/ATI.

Regarding your other points: yes, ATI may have space to hold more than one long shader on-chip at a time; I don't know, I haven't seen any R300 design docs. With space for 160, I think they will be able to load several "game" pixel programs at a time. Furthermore, if you can effectively load a pixel program while the current one is running, you may well be able to hide most of the latency of loading a pixel program with space for just two of them (assuming you have the spare bandwidth to get the program from video memory without stalling anything else).
 
I made some educated guesses about the NV30 instruction set, did a more accurate calculation of the instruction size, and noticed that it didn't even fit in 64 bits. I managed to get it down to 66-67 bits, and that was really a squeeze. Maybe they've made some limitations to the instruction set (de-RISCifications like not allowing all registers to be used as any argument), but otherwise they've probably crossed the 64-bit border.

(And I haven't even counted the pack/unpack instructions, since I have no idea of the syntax.)
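
For the curious, here's the flavor of that field-by-field budget as a compilable C++ sketch; the field widths are my own assumptions for a hypothetical NV30-style three-operand encoding, not the real layout, and they land in the same neighborhood as the 66-67 bit estimate:

Code:
// Back-of-the-envelope bit budget for one shader instruction, using
// assumed field widths for a hypothetical three-operand ISA.
constexpr int opcode    = 7;              // ~100 distinct operations
constexpr int destReg   = 2 + 6;          // register file select + index
constexpr int writeMask = 4;              // .xyzw component mask
constexpr int srcReg    = 2 + 6 + 8 + 1;  // file + index + swizzle + negate
constexpr int total     = opcode + destReg + writeMask + 3 * srcReg;
static_assert(total == 70, "three source operands already push past 64 bits");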

Btw, if someone wonders why I think this is interesting: it's an occupational injury I've got. We've done several application-specific RISC processors, and I've been involved in the instruction sets. So now I go into "instruction set analyzing" mode whenever I see one. :D
 