The DX9 situation

alexsok

According to Reactor Critical:
We have already reported that Microsoft has decided to implement numerous versions of Pixel and Vertex shaders in its upcoming DirectX 9 in order to let developers use more than what the base target platform of this API offers. Now we have even more information, and we are going to share everything with you.

After DirectX 9 Beta 2 appeared on the Web, a lot of new details emerged. Together with Pixel and Vertex Shaders 2.0, Microsoft will also introduce 2.1 and 3.0 versions of its shaders. The former is set to bring pixel and vertex shaders even closer together: dynamic and static (based on external constants) branching is introduced, along with loops, sub-shaders and numerous other extensions.

According to what we have heard, shaders 3.0 are intended for the NV30, because the CineFX specs fully meet the 3.0 requirements, while 2.1 is aimed at the updated version of the Matrox Parhelia that is due this winter. We want to stress that next spring both 3Dlabs and ATI will launch updated versions of their current top products: a new version of the P10 VPU and the chip code-named R350.

We still have to find out whether 3.0 shaders require support for 2.1 shaders; moreover, we have no information about the shaders to be supported by the R350 and the next-generation 3Dlabs VPU.

Keep in mind: every piece of information is unofficial at this point and may not be final.

Microsoft are making a mess out of everything again... :cry:
 
Sounds like typical Reactor Critical BS. The NV30 does not allow branching at all in pixel shaders. You can verify this by downloading Cg Toolkit beta 2, and writing branching code in Cg with the fp30 output profile. There are no loops or subroutine calls either.

Now, in vertex shaders it is a different story: VS2.0 contains loops, procedure calls, and static branching. The only thing missing in DX9 VS2.0 is dynamic (data-dependent) branching.
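
To make the static/dynamic distinction concrete, here is a rough C++ analogy (purely illustrative, not shader code; all names are made up): a static branch depends only on a constant the application sets before the draw call, so every element takes the same path, while a dynamic branch depends on data computed per vertex or per pixel.

// Illustrative C++ analogy of the two kinds of flow control.
// "Constants" plays the role of shader constants set by the application.
struct Constants {
    bool use_second_light;   // set once per draw call (external constant)
};

float shade(const Constants& c, float n_dot_l, float n_dot_l2)
{
    float result = n_dot_l;

    // Static branching (allowed in VS2.0): the condition is an external
    // constant, known before the shader runs, so every vertex takes the
    // same path and the driver can resolve it up front.
    if (c.use_second_light)
        result += n_dot_l2;

    // Dynamic (data-dependent) branching (a 3.0 feature): the condition
    // depends on a value computed per element, so different vertices or
    // pixels may take different paths.
    if (result > 1.0f)
        result = 1.0f;

    return result;
}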


What the NV30 does allow is conditional assignment. It appears that there are now condition code registers (HC, H4), and you can write code like:

Set H4 with the result of comparing register R0 with register R1

Later, you can say (pseudo-syntax):

MOV R2, R0 (GT.xxxx)

which means "move R0 into R2 if and only if the HC register's X component contains a 'greater than' status".


The Pixel Shader 3.0 profile is actually BEYOND the NV30's capabilities. Cg can only support loops, for example, if it can unroll them, and loops with dynamic early exit or dynamic iteration count simply won't work.
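
To illustrate what "unrollable" means (a hedged sketch in C++, not actual Cg output): a loop whose trip count is fixed at compile time can be flattened into the straight-line code the hardware actually runs, while a loop with a data-dependent exit cannot.

// A loop with a compile-time-known trip count...
float sum_fixed(const float v[4])
{
    float s = 0.0f;
    for (int i = 0; i < 4; ++i)   // trip count is a constant: unrollable
        s += v[i];
    return s;
}

// ...is equivalent to the straight-line code a compiler can emit:
float sum_unrolled(const float v[4])
{
    return v[0] + v[1] + v[2] + v[3];
}

// A dynamic early exit has no such expansion without real branching:
float sum_until(const float v[4], float limit)
{
    float s = 0.0f;
    for (int i = 0; i < 4; ++i) {
        s += v[i];
        if (s > limit)            // data-dependent exit: cannot be unrolled away
            return s;
    }
    return s;
}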


PS3.0 basically doesn't exist in hardware and won't be supported by NVidia or ATI. If PS3.0 is the result of an NVidia campaign, it looks like they succeeded beyond their wildest dreams, because MS adopted a standard that even NVidia's own hardware can't handle.
 
Yeah, I read the DX9 beta 2 specifications myself, and it's written there that PS 3.0 has support for static & dynamic flow control, which the CineFX architecture doesn't support.

And RC themselves said that PS 3.0 goes beyond the capabilities of the NV30, and then they suddenly changed their mind (the reason for that is a reader of theirs who sent them the "correction"; you can read that on their Russian mirror).

So basically, PS 3.0 is implemented for future hardware to support?

Also, how hard would it be to implement the features of PS 3.0 in NV35 for example? Would that require a major change in the architecture or not?

BTW, thx for the lengthy reply, it's appreciated! :D
 
that does make some sense... about PS/VS 3.0 being aimed at hardware we don't have yet... I expect it's hardware we'll have in our hands by this time next year though, and MS probably knows this.

makes sense, as that buys them more time before DX10 and avoids the need for DX9.0, 9.1, 9.2 etc. to keep adding support for new graphics chips as they come out.
 
PS3.0 is simply the most "natural" evolution. Just add dynamic branching capability, and make PS3.0 and VS3.0 "equal" so that there is not much difference except what stream they are iterating over. Note that VS3.0 has up to 4 texture samplers in the vertex shader too!

After VS3.0/PS3.0 there is only one more thing you can do with flow control: get rid of all loop/depth restrictions (e.g. allow unbounded loops) and provide a stack so that procedure call depth can go as high as you want (within reason).

After that, the only thing standing between a shader and a general-purpose CPU is the shader's ability to address memory (e.g. read from pointers, write to pointers, allow shaders to "share" info between cycles). Given the parallel/streaming nature of GPUs, I don't think they should remove this restriction for a while.
 
MS do not want to release another major revision to the API until Longhorn; that's why they are introducing PS/VS3.0 with a little extra future-proofing. AFAIK there will not be a DX9.1 either - with VS/PS 2.0 and 3.0 in it, DX9 will be kinda like DX9 and 9.1 all rolled into one.
 
Reactor Critical have posted a correction to their news (which comes as no surprise):

According to the published information, the NV30 does not support texture lookups in the vertex shader, which is one of the main features 3.0 can provide. Moreover, the NV30 does not support branching in pixel shaders. In this case we doubt the NV30 will support all of the 3.0 functionality, and we also want to beg your pardon for the misleading information that CineFX corresponds to Pixel Shaders 3.0.

As for Pixel Shader 2.1, we want to point out that this "standard" is quite "flexible" in terms of functionality, because support for it can be declared either if a chip supports 1024 instructions without branching or 512 instructions with static branching.

According to Microsoft, Pixel Shaders 3.0 require support for Pixel Shaders 2.1.
 
I'd just like to say that I'm not so certain that dynamic-length loops are all that great for 3D hardware, for performance reasons. Once you have dynamic-length loops, you cannot always prevent out of order execution, which means, to me, that you're going to need lots more transistors to keep things in line.

If you ask me, what we need instead are completely unlimited programs (i.e. totally virtualized memory...obviously there will be size limitations based on video memory, or perhaps even AGP memory), and, as you stated, the ability to access textures within the vertex shader (could be used, for example, to create a custom form of displacement mapping).

As for dynamic-length loops, pretty much all of those that I know of are based on something like, "Loop until the error is small enough." In a situation like this, you'd just figure out how many iterations are required to always (or usually) keep the error below a certain value, and always execute that number of iterations. The alternative would be to always execute the maximum number of iterations (set in the high-level shader program), and let the program choose the result it wants. This would cover any algorithms that aren't based on reducing the error with successive passes.
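
A hedged sketch of that approach (illustrative C++, not from the thread): a reciprocal square root refinement that would normally run "until the error is small enough" can instead run a fixed, worst-case number of Newton-Raphson steps.

#include <cmath>

// "Loop until the error is small enough": dynamic iteration count.
// (Assumes x is roughly in (0, 3) so the crude initial guess converges.)
float rsqrt_dynamic(float x, float eps)
{
    float y = 1.0f;
    while (std::fabs(y * y * x - 1.0f) > eps)
        y = y * (1.5f - 0.5f * x * y * y);   // Newton-Raphson step
    return y;
}

// Shader-friendly version: pick an iteration count that keeps the error
// below the target for the expected inputs, and always execute that many.
float rsqrt_fixed(float x)
{
    float y = 1.0f;
    for (int i = 0; i < 4; ++i)              // fixed, unrollable trip count
        y = y * (1.5f - 0.5f * x * y * y);
    return y;
}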

In my mind, this is the only way to prevent out of order execution. Personally, I'd rather see, say, a doubling of the pixel pipelines than risk problems with out of order execution.
 
Chalnoth said:
If you ask me, what we need instead are completely unlimited programs (i.e. totally virtualized memory...obviously there will be size limitations based on video memory, or perhaps even AGP memory), and, as you stated, the ability to access textures within the vertex shader (could be used, for example, to create a custom form of displacement mapping).

URGH... think bandwidth... completely virtualized memory would result in quite extensive extra data flow; the instruction dataflow alone can be pretty bad given the huge potential throughput of the vertex shader, not to mention the dataflow when you think about doing this with the registers.

Watch out for bringing the GPU too close to the CPU because things will get just as slow... we want to keep the GPU high speed and good at the task it is really designed for - we don't need another CPU.

K-
 
Those problems will only crop up when those super-long programs are put to use. It will most certainly be more than possible to write a shader, on hardware that has no limitations, that produces terrible performance, but it is also possible to write shaders that run just as well as they would on hardware that does have limitations (everything else being the same, of course).

As I see it, we really need limitless-length shaders now to unify programming across foreseeable future 3D hardware, as well as to really put the optimizations in the court of the programmer. For example, a game developer may want to use just one uber-realistic shader and tons of simple ones. That uber-realistic shader (particularly if it's a vertex shader) may be undoable on today's hardware, making such a scenario impossible.

After the introduction of fully-virtualized hardware, future hardware would have the task of improving performance when the internal caches are overrun, as well as improving performance with algorithms that later gain significant usage.
 
It's not so much that we need hardware that can execute shaders of infinite length, as it is that we need a compiler/API that decomposes shaders of infinite length into multi-pass DX8/9/++ shaders. This isn't even anything you have to wait for; if you have the time, there isn't anything stopping you from writing your own HLSL compiler that does this.

Virtualized memory is an absolutely necessary item, as it has been for the last decade. I am very disappointed that it is just now (read: soon) showing up on consumer-level cards, and only then from one company (http://www.3dlabs.com/product/wildcatvp/index.htm). There are so many things wrong with the way memory and textures are handled right now...
 
I'm not sure that really counts as a consumer-level card. We can't even be sure if a consumer-level version will be coming out any time in the near future.

Anyway, you know what? You're right. I just realized that with virtualized instructions in the pixel shader, you would essentially need to pass at least the instructions that don't fit into cache every pixel. Even if the instructions were only 16 bits, this could be quite a bit more than the pixel bandwidth required (texture reads, pixel out) after just a dozen or so instructions overflowed.

I do kind of wonder at the instruction size, though. Quick discussion on that.

I believe that all you need to reference for a VS instruction is an operation (8-9 bits should be plenty), a constant register for source (10 bits for the NV30) and a temporary register for the destination (4 bits for the NV30).

That should all fit in a 32-bit instruction word, but I suppose I'd have to go back and look at the PS/VS specs to be sure that all instructions could fit... but even if you need to add an additional constant, as long as only 8 bits are needed for the operation, you're still within 32 bits.

The PS have fewer constant/temp registers, so those may fit within 32 bits even more easily.
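
A purely illustrative packing of those bit budgets (an assumption for the sake of argument, not any real NV30 or DX9 encoding):

#include <cstdint>

// Hypothetical 32-bit VS instruction word using the numbers from the post:
// 9 bits of opcode, 10 bits for a constant-register source, 4 bits for a
// temporary-register destination.
struct VsInstruction {
    uint32_t opcode    : 9;   // operation ("8-9 bits should be plenty")
    uint32_t src_const : 10;  // constant register index
    uint32_t dst_temp  : 4;   // temporary register index
    uint32_t unused    : 9;   // 9 bits left over in the 32-bit word
};

static_assert(sizeof(VsInstruction) == 4, "fits in one 32-bit word");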
 
Chalnoth, I agree; I really should get my vertex shader article done and published. The dataflow of the instructions is quite big. About instruction size: you can have up to 3 source arguments and 1 destination, you need full swizzling (costs 8 bits per source!) on the sources and write masks on the destination, and you have 256 constants that have to be addressable in at least 5 different ways (direct and 4 relative base cases) for 2 arguments. Anyway, I did a rough calculation in my vertex shader article (not ready... sigh) and got to 76 bits per instruction (that's for a 3-source instruction, and usually instructions have a fixed static size), or about 10 bytes per instruction. The more registers and the more functionality there are, the bigger the instruction gets, and instruction overflows are a huge drain on bandwidth. Yes, it can be done, but it would probably make big shaders even slower.
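
Just to make that arithmetic visible (a hedged sketch using only the field sizes named above; the exact split in the article may differ):

// Bit budget for one 3-source vertex shader instruction, per the post.
constexpr int kSources        = 3;  // up to 3 source arguments
constexpr int kSwizzleBits    = 8;  // full swizzle, per source
constexpr int kConstIndexBits = 8;  // 256 constants
constexpr int kAddrModeBits   = 3;  // direct + 4 relative base cases = 5 ways
constexpr int kWriteMaskBits  = 4;  // destination write mask

// Swizzles alone cost 3 * 8 = 24 bits before the opcode, register selects
// and modifiers are counted, so the total quickly lands in the 70-80 bit range.
constexpr int kSwizzleTotal   = kSources * kSwizzleBits;               // 24
constexpr int kAddressedConst = 2 * (kConstIndexBits + kAddrModeBits); // 22

static_assert(kSwizzleTotal + kAddressedConst + kWriteMaskBits > 32,
              "the stated fields alone overflow a 32-bit instruction word");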

Obviously if we assume that vertex performance is very low then you might get away with acceptable bandwidth.

I hope you're not planning to virtualise the temporary registers, because then you'll end up having to handle 32 vector values... yeah, let's write those out and read them back in later...

K-
 
The leaked document is titled:
D3D 9.0 Shader API Specification - Version 2.1
I guess that's where the 2.1 number is coming from. :D

PS2.0 and VS2.0 seem like the common ground that every next-gen card will support.

When vendors approached Microsoft about including extra features, MS finally realized the error in their shader versioning system:
There is no single compatibility path possible!

Card A supports feature A above PS2.0, while Card B supports feature B above PS2.0.
If they declare Card A as PS2.1, then there's no way to expose feature B.

So they declared a heap of extra features above 2.0, each of which can be exposed through different capability flags.

The 3.0 shader versions are (mostly) 2.0 plus the sum of all these extra features.

It's likely that there won't be a fully VS3.0/PS3.0 compatible card for a while...
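
For what it's worth, this is roughly how the mechanism surfaces in the shipped D3D9 API: a base shader version plus separate capability bits above 2.0. A minimal sketch, assuming the final d3d9.h names (which this beta-era discussion predates):

#include <windows.h>
#include <d3d9.h>

// Sketch: check the common 2.0 ground, then probe individual capability
// bits for the "extra features above 2.0".
bool CheckShaderCaps(IDirect3D9* d3d)
{
    D3DCAPS9 caps;
    if (FAILED(d3d->GetDeviceCaps(D3DADAPTER_DEFAULT, D3DDEVTYPE_HAL, &caps)))
        return false;

    // Common ground every next-gen card should report.
    const bool ps20 = caps.PixelShaderVersion  >= D3DPS_VERSION(2, 0);
    const bool vs20 = caps.VertexShaderVersion >= D3DVS_VERSION(2, 0);

    // Features above 2.0 are exposed through separate caps, e.g. predication
    // support and a raised instruction slot count.
    const bool predication = (caps.PS20Caps.Caps & D3DPS20CAPS_PREDICATION) != 0;
    const bool longShaders = caps.PS20Caps.NumInstructionSlots >= 512;

    return ps20 && vs20 && predication && longShaders;
}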
 
Kristof said:
The dataflow of the instructions is quite big. About instruction size... you can have up to 3 source arguments, 1 destination, you need full swizzling (costs 8 bits per source!) on the sources and write masks on the destination. You have 256 constants that have to be addressable in at least 5 different ways (direct and 4 relative base cases) for 2 arguments.

Why does swizzling cost anything in instruction size? Additionally, while this may be moving out on a limb, shouldn't it be possible to have some instructions that require operations on source temp registers (i.e. constants are moved to temp registers before processing), in order to reduce instruction size?

And why do you need constants to be addressable in different ways? Don't you just read them?
 
Ilfirin said:
It's not so much that we need hardware that can execute shaders of infinite length, as it is that we need a compiler/API that decomposes shaders of infinite length into multi-pass DX8/9/++ shaders.

I totally agree. I proposed this earlier, actually:
http://www.beyond3d.com/forum/viewtopic.php?t=1877&highlight=

Sure, longer instruction counts are useful, and will make a performance difference in some situations, but I think these shader-decomposition compilers are a far greater priority than making a chip that allows super-huge or even infinite instruction counts. Multiple render targets will help a lot with this, allowing more temporaries between passes.

I always thought it's sort of neat when two minds independently have the same idea. Then again, I suppose this is not exactly inventing the wheel. I would hope that ATI and NVidia are already doing this.
 
Chalnoth said:
Why does swizzling cost anything in instruction size? Additionally, while this may be moving out on a limb, shouldn't it be possible to have some instructions that require operations on source temp registers (i.e. constants are moved to temp registers before processing), in order to reduce instruction size?

And why do you need constants to be addressable in different ways? Don't you just read them?

The vertex engine has to know which components to swizzle. How else would you encode that except in the actual instruction? Same with constants. All that stuff needs to be encoded in the instruction so the vertex shader engine can do the right thing.
 
Chalnoth said:
Why does swizzling cost anything in instruction size? Additionally, while this may be moving out on a limb, shouldn't it be possible to have some instructions that require operations on source temp registers (i.e. constants are moved to temp registers before processing), in order to reduce instruction size?

And why do you need constants to be addressable in different ways? Don't you just read them?

You need to know which parts of the 4D vector to operate on, so swizzling is part of the instruction and it eats up bits; the same goes for modifiers like negating an input, etc. Moving constants to temps does not seem like a good idea; it sounds like a waste of clock cycles.

Relative addressing is really nice for loops and skinning operations: basically you have a base address read from one of the 4 address register entries, and you read relative to that base address within the constant range. The trouble is that you need bits to indicate whether you are doing direct or relative addressing, and if you do relative addressing you need to specify which base address to use.
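
A rough C++ analogy of the two addressing modes (illustrative only; sizes and names are placeholders): direct addressing bakes the constant index into the instruction, while relative addressing adds a base taken from an address register component, which is what makes indexed bone matrices for skinning practical.

#include <array>

using Vec4 = std::array<float, 4>;
using ConstantFile = std::array<Vec4, 256>;   // 256 constants, as above

// Direct addressing: the index (here 42) is encoded in the instruction.
Vec4 fetch_direct(const ConstantFile& c)
{
    return c[42];
}

// Relative addressing: the instruction encodes an offset plus which address
// register component supplies the base, e.g. a per-vertex bone index loaded
// into a0.x.  In shader-assembly terms this is c[a0.x + offset].
Vec4 fetch_relative(const ConstantFile& c, int a0_x, int offset)
{
    return c[a0_x + offset];
}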

The more functionality, the more encoding bits you need for the instruction. And the bigger the instructions, the more costly a load system that allows unlimited numbers of instructions becomes.

K-
 
Kristof said:
You need to know which parts of the 4D vector to operate on, so swizzling is part of the instruction and it eats up bits; the same goes for modifiers like negating an input, etc. Moving constants to temps does not seem like a good idea; it sounds like a waste of clock cycles.

Yes...moving constants to temp may hurt performance (more likely, it would be hardware-accelerated, and wouldn't be a problem, I should hope...more of just a way of keeping instruction sizes small)

But if you're just talking about which parts of the 4D vector to operate on, wouldn't that only take 4 bits, not 8?

Regardless, you're probably right that it requires more than 64 bits, for one extremely simple reason.

A constant takes up 128 bits, right (4 32-bit floats)? And the NV30 apparently shares space for its constants and instructions, supporting 256 constants or 256 instructions (1024 each in the PS, I believe). That would make for 128-bit instruction words, too.
 
Some thoughts on infinite-length shader programs:

If you run N pipelines in lockstep, you should be able to route the same shader instructions to all the pipelines all the time. Also, if each pipeline has M cycles of latency, you could just feed the same instruction into the N pipelines for M cycles until you get the data in the pipelines ready for the next instruction. This way, you would be able to load a shader instruction from memory once and then use it N*M times before you need to load the next one. This does not take into account the effect of having an instruction cache, which would help loops and smaller shader programs as well.

The resulting memory traffic for e.g. vertex shading? Assuming M = 6 cycles, instruction word size = 96 bits, 50% cache hit rate and 300 MHz clock rate: 300 MBytes/sec.
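
Spelling that arithmetic out (a sketch using only the assumptions stated above):

// One new instruction word is fetched every M cycles; the same word feeds
// all N pipelines, so N drops out.  Only cache misses go to memory.
constexpr double kClockHz        = 300e6;       // 300 MHz
constexpr int    kCyclesPerFetch = 6;           // M = 6
constexpr double kBytesPerInstr  = 96.0 / 8.0;  // 96-bit instruction word
constexpr double kMissRate       = 0.5;         // 50% cache hit rate

constexpr double kBytesPerSec =
    (kClockHz / kCyclesPerFetch) * kBytesPerInstr * kMissRate;
// = 50e6 * 12 * 0.5 = 300e6 bytes/sec, i.e. the 300 MBytes/sec quoted.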

Of course, data-dependent conditional jumps would make computations like this a fair bit more difficult - especially if the pipelines are allowed to fall out of lockstep.

edit: corrected error in calculation
 