The DX9 situation

Chalnoth said:
But if you're just talking about which parts of the 4D vector to operate on, wouldn't that only take 4 bits, not 8?

No, but for a simple write mask (on/off per component), yes. Swizzling lets you select x, y, z, or w for each of the 4 positions. So there are 4 possible choices (x, y, z, w), which takes 2 bits to encode, and there are 4 positions, so 4x2 = 8. Lots of combinations are possible when swizzling, like "xxwy", "xxxx", "yxyx", "wzyx", etc., which is why the bit cost is so high.
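To make the cost concrete, here's a minimal C sketch of one possible way to pack a swizzle into those 8 bits (the layout and helper function are purely illustrative, not any real hardware format):

#include <stdio.h>

/* Encode a swizzle string such as "wzyx" as one byte: position 0 goes in bits 0-1,
   position 1 in bits 2-3, and so on, with x=0, y=1, z=2, w=3. */
static unsigned char encode_swizzle(const char *s) {
    unsigned char code = 0;
    for (int i = 0; i < 4; i++) {
        unsigned char sel = (s[i] == 'x') ? 0 : (s[i] == 'y') ? 1 :
                            (s[i] == 'z') ? 2 : 3;
        code |= (unsigned char)(sel << (2 * i));
    }
    return code;
}

int main(void) {
    printf("xyzw -> 0x%02X\n", encode_swizzle("xyzw"));  /* identity swizzle */
    printf("wzyx -> 0x%02X\n", encode_swizzle("wzyx"));
    printf("xxxx -> 0x%02X\n", encode_swizzle("xxxx"));
    return 0;
}

In this particular layout the identity swizzle "xyzw" comes out as 0xE4.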

arjan de lumens, good point, but even with that it's still quite a bit of bandwidth for devices that are already bandwidth limited. I'm not saying it's impossible; it's just that you move closer and closer to a CPU, and with that come a CPU's inefficiencies.

K-
 
Hrm, just realized something else!

If the instruction size is indeed 128-bit, then the vertex programs on the NV30 might not fit into on-chip caches anyway.

Namely, 256 128-bit instructions come to 4KB. And 1024 128-bit instructions (for the PS... but perhaps those won't be full 128-bit? I dunno) would be 16KB. That's a fair amount of on-chip storage to dedicate to shader instructions.

So, I guess my thought earlier that infinite-length would be unmanageable just got shot out the window. Perhaps the NV30's limited sizes just make for more efficient caching? Regardless, if instructions can spill over into video memory when they don't fit on-chip, I see no problem in simply unlocking the limitation on the size of the shaders.

This also raises the question of what kind of enhancements the R300 and NV30 have in order to handle shaders that spill over into main graphics memory. I would be very interested to see how both processors handle large shaders, and it will be even more interesting to see a comparison between hardware that can execute data-dependent branches.
 
Quick look at VS.2.0

There's one large instruction with 3 sources:
mad DST, SRC1, SRC2, SRC3
It uses most of the instruction space.


==== DST ====

dest registers: r# / oPos / oPts / oFog / oT# / oD# => 5 bits
dest mask: 4 bits


==== SRCn ====

I'm not sure whether all registers can be read by all instructions (the integer regs, for instance). But if we assume that v#, r#, c#, and c[base+A.?] can be used, that's 16+16+1+4 = 37 "regs" => 6 bits. (Yes, I counted c# as a single "reg", and c[base+A.?] as four, one per address component.) But let's use those 6 bits as efficiently as possible.

00rrrr // rrrr = r# number
01vvvv // vvvv = v# number
10aabb // aa = address component to use for indexing, bb=msb's of base
1100bb // bb = msb's of const address
1101** // other fun stuff
111*** // other fun stuff

It's only possible to use one constant per instruction (even if it's possible to use it in more than one place). So the 6 lsb's of the base address can be stored in one place for all SRCn. If more than one SRCn refers to the constant array, then use bb from the first (or any of them, they should be equal).

The swizzle needs to say which register component to use for each ALU input component => 4x2 bits.

You can have a sign on each SRCn => 1 bit.

This coding is used for all SRCn.
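Just to show that the 6-bit table above decodes unambiguously, here's a small C sketch of a decoder (the enum labels and function are my own illustration, nothing real):

#include <stdio.h>

typedef enum { SRC_TEMP, SRC_INPUT, SRC_CONST_REL, SRC_CONST_ABS, SRC_OTHER } src_kind;

/* Decode the 6-bit source-register code from the table above. */
static src_kind decode_src(unsigned code, unsigned *num, unsigned *addr_comp, unsigned *base_msb) {
    if ((code & 0x30) == 0x00) { *num = code & 0x0F; return SRC_TEMP;  }         /* 00rrrr: r#          */
    if ((code & 0x30) == 0x10) { *num = code & 0x0F; return SRC_INPUT; }         /* 01vvvv: v#          */
    if ((code & 0x30) == 0x20) {                                                  /* 10aabb: c[base+A.?] */
        *addr_comp = (code >> 2) & 0x3;
        *base_msb  = code & 0x3;
        return SRC_CONST_REL;
    }
    if ((code & 0x3C) == 0x30) { *base_msb = code & 0x3; return SRC_CONST_ABS; }  /* 1100bb: c#          */
    return SRC_OTHER;                                                             /* 1101**, 111***      */
}

int main(void) {
    unsigned num = 0, ac = 0, msb = 0;
    printf("%d\n", decode_src(0x05, &num, &ac, &msb));  /* 000101 -> r5                            */
    printf("%d\n", decode_src(0x2B, &num, &ac, &msb));  /* 101011 -> indexed constant, aa=10, bb=11 */
    return 0;
}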


==== TOTAL ====
Right now we have (5+4) + 3*(6+4*2+1) + 6 = 60 bits
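For what it's worth, that 60-bit layout packs neatly into a 64-bit word. A hypothetical C bit-field sketch (field names and ordering are my own, not any real format):

#include <stdint.h>

typedef struct {
    uint64_t dest_reg   : 5;  /* r# / oPos / oPts / oFog / oT# / oD#               */
    uint64_t dest_mask  : 4;  /* xyzw write mask                                   */
    uint64_t src1_reg   : 6;  /* encoded as in the SRCn table above                */
    uint64_t src1_swiz  : 8;  /* 4 components x 2 bits                             */
    uint64_t src1_neg   : 1;
    uint64_t src2_reg   : 6;
    uint64_t src2_swiz  : 8;
    uint64_t src2_neg   : 1;
    uint64_t src3_reg   : 6;
    uint64_t src3_swiz  : 8;  /* reusable as an opcode field (see the trick below) */
    uint64_t src3_neg   : 1;
    uint64_t const_base : 6;  /* shared lsb's of the constant base address         */
} vs_instr;                   /* 9 + 3*15 + 6 = 60 bits, padded to 64              */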


What happens with the other instructions?
Those instructions don't have a SRC3, so we can do a trick in the coding: use one of the "other fun stuff" codes in SRC3 to denote that it isn't used. In that case, the 8 swizzle bits of SRC3 become free to encode which instruction it is.

The rest of the coding is trivial, and you really don't need more than 60 bits. The trick to reduce it to 59 bits is left to the reader as an exercise. :D But since it's reasonable to round up to 64 bits, you could pull the two 'bb' bits out of the SRCn code and put them together with the other base bits. That makes it a 62-bit instruction, but with easier decoding.

This should be simple enough for a RISC processor.
 
Interesting overview...

Anyway, the more functionality or registers, the bigger the instruction size :)

Oh, and on the pixel shader side things get really interesting with co-issue!

K-
 
Mintmaster said:
I always thought it's sort of neat when two minds independently have the same idea. Then again, I suppose this is not exactly inventing the wheel. I would hope that ATI and NVidia are already doing this.

The way I see it there are two ways to accomplish the goal:
-Start making hardware that executes infinitely long shaders and take a rather large speed decrease across the board now (a fully programmable graphics card without limits is going to take up resources that would normally go toward faster execution), and use the next few generations of cards to make it viable
-Make an API/compiler that decomposes infinitely long shaders into smaller multi-pass shaders that work on all of today's and tomorrow's hardware

The first requires hardware to be built, which won't happen for at least another 6-12 months (and it would have to have been planned about a year ago), and then it requires the LCD (lowest common denominator) to rise up to that point... very long waits. It also means taking a performance hit in the short term for a long-term benefit. Consumers likely won't "get" this and just won't buy the cards that do it, making it very impractical.

The second requires no extra hardware, can be done now, and requires no wait for the LCD to rise (beyond a ps1.0-class card). As cards get better and faster at executing longer shaders, fewer and fewer passes are needed and speed increases; if you don't use longer-than-normal shaders, you won't take the hit. This would essentially solve the LCD problem in the functionality department.

Seems to be a fairly easy decision to make.
 
Ilfirin said:
The way I see it there are two ways to accomplish the goal:
-Start making hardware that executes infinitely long shaders and take a rather large speed decrease across the board now (a fully programmable graphics card without limits is going to take up resources that would normally go toward faster execution), and use the next few generations of cards to make it viable
-Make an API/compiler that decomposes infinitely long shaders into smaller multi-pass shaders that work on all of today's and tomorrow's hardware

The second choice only makes sense for pixel shaders. To my knowledge, there is currently no way to do the same effectively for vertex shaders.

And yes, it may be beneficial from a performance perspective to go ahead and do software-side multipass (because of the bandwidth required to pass shader instructions to the chip). However, as another poster pointed out, it may be possible to significantly reduce the bandwidth needed for shader ops by forcing all pipelines to operate on the same instructions. This is not very farfetched when you consider that we do not, as yet, have any dynamic operations in the pixel shader (other than pixel-out conditions). Personally, I'm not sure that dynamic flow control in the pixel shader is a good idea, either. Regardless, this is something for the hardware companies to figure out going forward.

Right now, for the pixel shader, we definitely need either auto-multipass software or infinite-length shader programs. But as far as the vertex shader is concerned, I believe the only option is to go for infinite-length programs. To my knowledge, the only alternative is to do some pre-processing on the CPU until the remaining work fits within the vertex program limits, which may be very challenging to do well with dynamic programs.

Further out, what we might see is the option to output vertex shader results to video memory (buffer size set by software, dependent on the number of vertices processed each pass) instead of passing them straight to the pixel shader, and then read them back in for the next vertex shader pass. This may or may not improve memory bandwidth performance, but it may be somewhat challenging for developers to manage properly.
 
Ilfirin said:
The second requires no extra hardware, can be done now, and requires no wait for the LCD to rise (beyond a ps1.0-class card). As cards get better and faster at executing longer shaders, fewer and fewer passes are needed and speed increases; if you don't use longer-than-normal shaders, you won't take the hit. This would essentially solve the LCD problem in the functionality department.

Seems to be a fairly easy decision to make.

There's one problem with this... pre-DX9 hardware will not be flexible enough to handle arbitrarily complex shaders, no matter how many passes they are broken into. Some operations will be impossible to reproduce effectively without floating point support, and if intermediate results are limited to just 32bpp between passes, the lack of precision will become a big issue as well. So really, until ps2.0 hardware becomes the LCD (probably at least a couple of years away still), long shaders probably won't be very useful except for special cases (like previewing and rendering out movie frames).
 
GraphixViolence said:
Ilfirin said:
The second requires no extra hardware, can be done now, and requires no wait for the LCD to rise (beyond a ps1.0-class card). As cards get better and faster at executing longer shaders, fewer and fewer passes are needed and speed increases; if you don't use longer-than-normal shaders, you won't take the hit. This would essentially solve the LCD problem in the functionality department.

Seems to be a fairly easy decision to make.

There's one problem with this... pre-DX9 hardware will not be flexible enough to handle arbitrarily complex shaders, no matter how many passes they are broken into. Some operations will be impossible to reproduce effectively without floating point support, and if intermediate results are limited to just 32bpp between passes, the lack of precision will become a big issue as well. So really, until ps2.0 hardware becomes the LCD (probably at least a couple of years away still), long shaders probably won't be very useful except for special cases (like previewing and rendering out movie frames).

It won't look as nice, but by pre- and post-processing each step you can get close, approximate data. Infinitely long shaders might be extremely slow and not very pretty on DX8-class hardware, but they work, and that's all that really matters from a programming point of view. 320x240 looks really ugly too, but that's what you're going to have to run Doom III at if you have a GF1. Same scenario here: it will run on your DX8 cards, but it probably won't be very nice to look at.

And yes, it may be beneficial from a performance perspective to go ahead and do software-side multipass (because of the bandwidth required to pass shader instructions to the chip). However, as another poster pointed out, it may be possible to significantly reduce the bandwidth needed for shader ops by forcing all pipelines to operate on the same instructions. This is not very farfetched when you consider that we do not, as yet, have any dynamic operations in the pixel shader (other than pixel-out conditions). Personally, I'm not sure that dynamic flow control in the pixel shader is a good idea, either. Regardless, this is something for the hardware companies to figure out going forward.

You're kinda missing the point: do you want to wait for a piece of hardware that runs unlimited-length shaders, or do you want it to work on every DX8-and-up card? With the second choice (EDIT: the second choice from my post earlier, i.e. a new API/language/compiler), once hardware became available that handled unlimited-length shaders, the compiler would simply not decompose the shader at all and just feed it to the card.

As far as the vertex shader part goes, yes, that is correct - you cannot multi-pass vertex shaders, as there is no means of storing intermediate data.

But when you have pixel shaders of infinite length, the need for extremely long vertex shaders is alleviated a bit. I rarely use all 128 instructions now, and when I do go over that limit I can usually figure out a different way of writing it and get it done in half the instructions. Not that we don't need longer vertex shaders, but I don't think we need infinite ones by any means, and certainly not as badly as we do for pixel shaders.

EDIT:
We're all on the same page as far as the term "infinite" goes, right? i.e., none of us mean pixel shaders with endless loops? Just making sure... when I say infinitely long shaders, I mean there's no limit on how long they can be, but any given shader is still finite.
 
Ilfirin said:
EDIT:
We're all on the same page as far as the term "infinite" goes, right? i.e., none of us mean pixel shaders with endless loops? Just making sure... when I say infinitely long shaders, I mean there's no limit on how long they can be, but any given shader is still finite.

Right... I don't mean infinite-length loops. Additionally, there will always be limits imposed on shader length by available memory, but those are far looser than the limits we have today.
 
Ilfirin said:
It won't look as nice, but by pre- and post-processing each step you can get close, approximate data. Infinitely long shaders might be extremely slow and not very pretty on DX8-class hardware, but they work, and that's all that really matters from a programming point of view. 320x240 looks really ugly too, but that's what you're going to have to run Doom III at if you have a GF1. Same scenario here: it will run on your DX8 cards, but it probably won't be very nice to look at.

I don't think it's that simple... I'm sure there are some effects that simply wouldn't work unless you actually wrote a new fall-back shader, and possibly even created new textures, to run on older hardware. For example, what if you wanted to use high dynamic range environment maps in your scene... without support for floating point texture formats (or at least integer formats with more than 8 bits per channel, which also requires DX9), your hardware wouldn't even be able to read in the textures, so you would have to create new ones. There's no way a compiler could ever do those kinds of things for you, even if you were willing to accept an "ugly" image.
 
GraphixViolence said:
Ilfirin said:
It won't look as nice, but by pre- and post-processing each step you can get close, approximate data. Infinitely long shaders might be extremely slow and not very pretty on DX8-class hardware, but they work, and that's all that really matters from a programming point of view. 320x240 looks really ugly too, but that's what you're going to have to run Doom III at if you have a GF1. Same scenario here: it will run on your DX8 cards, but it probably won't be very nice to look at.

I don't think it's that simple... I'm sure there are some effects that simply wouldn't work unless you actually wrote a new fall-back shader, and possibly even created new textures, to run on older hardware. For example, what if you wanted to use high dynamic range environment maps in your scene... without support for floating point texture formats (or at least integer formats with more than 8 bits per channel, which also requires DX9), your hardware wouldn't even be able to read in the textures, so you would have to create new ones. There's no way a compiler could ever do those kinds of things for you, even if you were willing to accept an "ugly" image.

In this case you would simply convert the HDR texture map to a normal 32-bit integer texture map at load time and use that wherever you would normally use the HDR map.
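For instance, a minimal load-time conversion sketch in C might be nothing more than an exposure scale and a clamp (the exposure constant and the simple mapping are assumptions for illustration, not a prescribed conversion):

#include <stdint.h>

/* Clamp to [0,1] and quantize to 8 bits; everything above 1.0 is simply lost. */
static uint8_t to_byte(float v) {
    if (v < 0.0f) v = 0.0f;
    if (v > 1.0f) v = 1.0f;
    return (uint8_t)(v * 255.0f + 0.5f);
}

/* Convert one float-RGB HDR texel into a packed A8R8G8B8 value at load time. */
static uint32_t hdr_texel_to_argb8(float r, float g, float b, float exposure) {
    return 0xFF000000u |
           ((uint32_t)to_byte(r * exposure) << 16) |
           ((uint32_t)to_byte(g * exposure) << 8)  |
            (uint32_t)to_byte(b * exposure);
}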

The problem arises with the maximum number of simultaneous textures. If you were decomposing a long shader into a bunch of small ps1.0 shaders, you would be writing the intermediate steps to the render target, which would be fed as input into the next stage. This applies not just to the shader as a whole, but to the lifetime of each variable. So if you have only one temporary variable that you modify throughout the shader and then use at the output (e.g. out = x * Tex0;), you would first have to evaluate 'x' to its final value, which might take many, many passes, output that to a render target, and feed that render target back in as an input to the rest of the shader, which would be evaluated the same way to a final value... actually, let me give a very crappy example (pseudo-code):

float4 Tex0 = lookup(Texture0);
float x = dot(L, H);
for (int i = 0; i < 16; i++) {
    x *= x;
}
Out = x * Tex0;

x will be calculated over a few passes on ps1.0 hardware, to a final value of x^(x^16); the output of this stage is a grayscale texture that is then modulated with the original texture.

In this case, the shader could very well be decomposed into multi-pass DX8 shaders, but only one pass would be required for the actual directly related work (modulating a texture with a specular value), whereas it might take quite a few passes to calculate 'x'. This shader would decompose to:

*Stage 0 - calculate x*
over the course of many passes calculate the final value of x and store it in texture 'TexX'

*Stage 1 - calculate final image*
input: Texture0, TexX;
float4 Tex0 = lookup(Texture0);
float4 x = lookup(TexX);
Out = Tex0*x;
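To make the pass count concrete, here's a rough CPU-side simulation of that decomposition, with the "render targets" reduced to plain float arrays and (pessimistically) one squaring per pass. The structure is mine for illustration; a real implementation would be a sequence of ps1.x draw calls into textures:

#include <stdio.h>

#define NUM_PIXELS 4

int main(void) {
    float texX[NUM_PIXELS] = {0.9f, 0.8f, 0.7f, 0.6f};    /* Stage 0 input: x = dot(L, H) per pixel */
    float tex0[NUM_PIXELS] = {1.0f, 0.5f, 0.25f, 0.125f}; /* the base texture                       */
    float out[NUM_PIXELS];

    /* Stage 0: sixteen passes, each one squaring the previous render target. */
    for (int pass = 0; pass < 16; pass++)
        for (int p = 0; p < NUM_PIXELS; p++)
            texX[p] *= texX[p];

    /* Stage 1: one final pass modulates the base texture with TexX. */
    for (int p = 0; p < NUM_PIXELS; p++)
        out[p] = tex0[p] * texX[p];

    for (int p = 0; p < NUM_PIXELS; p++)
        printf("pixel %d: %g\n", p, out[p]);
    return 0;
}

With that one-op-per-pass assumption, it's seventeen passes for a shader whose only "real" work is a single modulate.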

What was the point of this? I'm getting to that. This was with only one variable updated over the course of the shader - I'm very unsure how well it would work, if at all, when you have a whole bunch of variables whose lifetime spans the entire shader (by lifetime I mean the time between when a variable is initialized and when it's filled with its final value). It should work... but it would require a whole bunch of extra passes. Another example:

float4 Tex0 = lookup(Texture0);
float x = dot(L, H);
float y = dot(L, H);
float z = dot(L, H);
float w = dot(L, H);
for (int i = 0; i < 16; i++) {
    x *= x;
}
for (i = 0; i < 16; i++) {
    y *= y;
}
for (i = 0; i < 16; i++) {
    z *= z;
}
for (i = 0; i < 16; i++) {
    w *= w;
}
Out = x * y * z * w * Tex0;

(I have no idea why someone would want to do this, it's just for example's sake)

This would decompose the same way, but you get a problem at the end: 5 textures (x, y, z, w, Tex0), while DX8.0 cards only support 4 simultaneous textures. Hence you would have to decompose it even further and evaluate (x*y*z*w) in one pass, then (result * Tex0) in another.

My point (sorry for being so roundabout... I'm very tired right now) is that there is very little (if any) relation between the number of instructions in the long shader and the number of passes times the number of instructions of the short shaders. You might end up with more passes of a few instructions each than there are instructions in the long shader... in other words, things might decompose in very roundabout ways into a lot of passes; a card that can execute a 512-instruction shader might take a lot more than 2 passes to execute a 1024-instruction shader.

Just kinda showing one bad side of what I have been talking about in this thread... now for the other side (what GraphixViolence has been talking about).

Given the example shader above, it would be very bad if all the intermediate results for the variables were stored in an integer format and rounded off. Say (taking the first shader) L·H = 2, so x = 2; x should end up as 2^65536, or about 2.0035e+19728, but with a 32-bit integer (8 bits per channel) render target, what actually gets written is 255... quite a big difference, eh? And hence the second problem: you might end up with DRASTICALLY different results on cards that don't have sufficient output color depth. To the point where you don't even want it running on those cards (in this example even 128-bit would still be very much insufficient)... but it would still run :)
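A tiny C sketch of that saturation, assuming the value is clamped and re-quantized to 8 bits after every pass (the clamp-and-round model is an assumption, not any particular card's behaviour):

#include <stdio.h>

int main(void) {
    double x = 2.0;                          /* pretend dot(L, H) came out as 2   */
    for (int pass = 0; pass < 16; pass++) {
        x *= x;                              /* the squaring done in this pass    */
        if (x > 1.0) x = 1.0;                /* clamp to the render target range  */
        x = (int)(x * 255.0 + 0.5) / 255.0;  /* re-quantize to 8 bits per channel */
    }
    printf("stored result: %d/255\n", (int)(x * 255.0 + 0.5));  /* prints 255, not 2^65536 */
    return 0;
}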

With these downsides, why would one still want this rather than wait for hardware, you might ask? Well, the shaders shown here are examples of the rarest of cases. You are never going to need to raise something to the power of itself to the power of 16, and even if you did, you would still have precision problems on DX9-class hardware... or on any hardware for many, many decades (if ever).

If a hardware company decided to go down the path of executing limitless shaders in hardware TODAY (if they started 6-12 months ago, that's a different story), we wouldn't see that card for at least 1.5 years, and then we would have to wait 2-4 years for the LCD to rise to that level before really using it. That's 3.5 years minimum and up to 5.5 years. Given the development time of games today, if you started working on your game the day the card was released, that card would probably be the LCD by the time you shipped; but that is still a lot of unneeded waiting.

In short:
We need the flexibility NOW, across everything from DX8 cards to DX9 and beyond, BUT just because we have the flexibility doesn't mean we have to be careless and generally stupid when writing the shaders.

**** Post intentionally neglects cube maps, 3D textures, and anything beyond DX8-level pixel shader functionality.

**** In case this hasn't been established, the output of this process wouldn't be a shader, but something similar to a D3DX effect file
 
Ilfirin said:
float4 Tex0 = lookup(Texture0);
float x = dot(L, H);
for (int i = 0; i < 16; i++) {
    x *= x;
}
Out = x * Tex0;

Yes, this is nitpicky, but it seems to me that this is actually x^(2^16), i.e. x^65536, not x^(x^16). (Wouldn't that be more something like x = pow(x, x) in the loop?)

To the point where you don't even want it running on those cards (in this example even 128-bit would still be very much insufficient)... but it would still run :)

Regarding this, as long as you're going to display the output, 128-bit should be enough for every calculation I am aware of.

For example, for basic color considerations on a 12-bit DAC, you only need roughly 14-16 bits of mantissa to maintain color accuracy over a large number of passes. You could even get away with just 12 bits of mantissa if everything is done in one pass (Assuming higher internal precision, of course). If you want to consider dynamic range, you cannot forget the ability of monitors to display that data. I seriously, seriously doubt that any monitor we have today will be able to properly display even the dynamic range made available with 64-bit fp color.

At least for normal color data, I see no reason to go higher than 64-bit color, though it may be useful to go to 128-bit if you're doing lots of passes... since today's hardware seems to have 10-bit DACs, I think you'd probably see some benefit from going to 128-bit after about four passes, assuming no error correction. With some forms of error correction, that number of passes can increase dramatically (high-frequency noise or centering errors about zero should do the trick).
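As a rough illustration of that (and nothing more; the per-pass blend factor, pass count, and round-to-nearest model are all arbitrary assumptions), here's a C sketch that re-quantizes the mantissa after every simulated pass and reports when the drift exceeds one step of a 10-bit DAC:

#include <stdio.h>
#include <math.h>

/* Quantize a positive value to roughly 'bits' bits of mantissa (round to nearest). */
static double quantize(double v, int bits) {
    if (v <= 0.0) return 0.0;
    int e;
    double m = frexp(v, &e);                 /* v = m * 2^e, m in [0.5, 1) */
    double scale = ldexp(1.0, bits);
    return ldexp(round(m * scale) / scale, e);
}

int main(void) {
    const double dac_step = 1.0 / 1023.0;    /* one step of a 10-bit DAC      */
    int mantissas[] = {10, 12, 16, 23};      /* fp16 has 10 bits, fp32 has 23 */
    for (int i = 0; i < 4; i++) {
        int bits = mantissas[i], pass, exceeded = 0;
        double exact = 0.6, quant = 0.6;
        for (pass = 1; pass <= 64; pass++) {
            exact *= 0.93;                   /* some per-pass blend factor    */
            quant  = quantize(quant * quantize(0.93, bits), bits);
            if (fabs(exact - quant) > dac_step) { exceeded = 1; break; }
        }
        if (exceeded)
            printf("%2d-bit mantissa: drift exceeds a 10-bit DAC step after %d passes\n", bits, pass);
        else
            printf("%2d-bit mantissa: drift stays below a 10-bit DAC step for 64 passes\n", bits);
    }
    return 0;
}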

For non-color data, I'm sure people can find ways to make full use of 128-bit storage formats.
 
The problem with auto-multipassing any but the simplest of shaders is that eventually the extra work needed to make the multipass work (recalculating temporary values, generating dozens or hundreds of intermediate textures, etc.) makes the "hardware accelerated" shader run slower than just emulating everything on the CPU.

And an auto-multipasser will not be intelligent enough to handle the arbitrary biasing and scaling needed to keep all intermediate values within the limits of a fixed-precision format (in the general case this is undecidable, essentially an instance of the halting problem).

Basically, auto-multipass has two rather serious holdups:

1) Any shader that is easily decomposable by the compiler should be easily decomposable by a competent programmer (and, with more knowledge about the application, the programmer should be able to do a better job). Having the compiler do the work will save some time, but it probably won't be a huge savings, all things considered.

2) Any shader that isn't easily decomposed will probably run faster on the CPU than it would on the GPU, and all headaches imposed by fixed precision pipelines are simply avoided.

You're kinda missing the point: do you want to wait for a piece of hardware that runs unlimited-length shaders

We already have hardware that can run unlimited length shaders -- a CPU. Unless you're interested in hardware accelerating production renderers, the limitations imposed by GPU pixel shaders (especially on NV30) will _never_ affect you.
 
Chalnoth said:
Ilfirin said:
float4 Tex0 = lookup(Texture0);
float x = dot(L, H);
for (int i = 0; i < 16; i++) {
    x *= x;
}
Out = x * Tex0;

Yes, this is nitpicky, but it seems to me that this is actually x^(2^16), i.e. x^65536, not x^(x^16). (Wouldn't that be more something like x = pow(x, x) in the loop?)

You're probably right :) I did all the double-checking with x = 2, where the two happen to give the same number... like I said, I was pretty damn tired when I wrote that :)
 