Radeon 9700 and conditional assignment in PS?

Supposedly conditional assignment is a useful feature that allows things like unrolling loops, as an alternative to flow control. I was just wondering whether the 9700 supports a newer form of conditional (as opposed to the 8500), or whether it can execute all branches and choose the most efficient result (supposedly like the NV30). Humus posted on another forum that the 8500 supports the conditional "if so else"; does the 9700 offer this, or a different conditional command such as "if so then"?

Secondly, does the 9700 support 160 instructions in its PS, or is it actually fewer, without counting vector and scalar operations separately?
 
Conditional Assignment is only supported by NV30.

R300 supports 160 instructions in a pass, but there is no flow control of any kind (same as the NV30, although the NV30 has the conditional assignment I wrote about above).

Dunno about the 8500 or what the R9700 offers in this regard...

Maybe Humus can enlighten us, I would appreciate it! :D
 
alexsok said:
Conditional Assignment is only supported by NV30.

Conditional assignment in PS1.4:
Code:
cmp r0, r1, r2, r0

Basically it's:

Code:
if (r1.x >= 0)
     r0.x = r2.x
if (r1.y >= 0)
     r0.y = r2.y
if (r1.z >= 0)
     r0.z = r2.z
if (r1.w >= 0)
     r0.w = r2.w

Of course, arbitrary write masks are available to perform the operation in only some of the channels (also new in PS1.4).
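For instance, with a write mask the same cmp only touches the masked channels (register names purely for illustration):

Code:
cmp r0.xy, r1, r2, r0

which is basically:

Code:
if (r1.x >= 0)
     r0.x = r2.x
if (r1.y >= 0)
     r0.y = r2.y
// r0.z and r0.w are left untouched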

AFAIK, R9700 has no improvement over this, but it's already quite usable.
 
Humus said:
Conditional assignment in PS1.4:
Code:
cmp r0, r1, r2, r0

AFAIK, R9700 has no improvement over this, but it's already quite usable.

So what are the condition codes mentioned in the NV30 PS documents that are supposed to accomplish many of the things branching would?

Are they the same thing, only extended for PS2.0 and beyond?

I'm kinda confused here...
 
Condition codes work differently. Basically, the PS will have a set of flags, like a zero flag, a negative flag, etc. A condition code is inserted into an instruction to make it execute only if one of the flags is set.

Let's say you have this C code:

if ( x == 0 )
{
y = 2;
z = 3;
}

You can execute this without branching by testing x against 0, and only executing the next two lines if the zero flag is set.

[not real PS]

cmp x , 0            // compare x against 0 and update the flags
set (ZFLAG) y , 2    // executes only if the zero flag is set
set (ZFLAG) z , 3    // likewise

The PS will simply skip the two set instructions if the zero flag isn't set. I think the original ARM processor had something similar :)
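To make that concrete: on the original ARM nearly every instruction could carry a condition, so the example above would look roughly like this (generic ARM-style assembly, purely for illustration):

Code:
CMP   r0, #0      @ compare x against 0, setting the flags
MOVEQ r1, #2      @ y = 2, executed only if the zero flag is set
MOVEQ r2, #3      @ z = 3, likewise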
 
But that's constant-level branching. That could pretty much always be done just by using a different program anyway... it's just a memory usage/performance optimization.

Data-dependent branching just isn't possible (at least, not without using multiple passes) unless there is some hardware support for it.

What the NV30 apparently allows are conditions that can change during the execution of the program and that modify which values are written. In other words, the exact same amount of code is always executed, but the results of parts of that code may be thrown out depending on the condition codes.
 
CMP is available in PS1.2 and above, and it is a dynamic selection.

What were you referring to as constant-level branching?
 
In the ideal case, instruction predicates should result in less code being executed than if you actually branched on the data, ex:

if (x > y) x = y;
else y = x;

becomes:
SUBC TMP, X, Y;     // TMP = X - Y, updating the condition codes
MOV X (GT), Y;      // written only where X - Y > 0, i.e. x > y
MOV Y (LE), X;      // written only where X - Y <= 0, i.e. x <= y

which extends nicely to vectors

versus:
sub TMP, X, Y;
ble TMP, 0, SETY;   // branch if x <= y
mov X, Y;           // fall through: x > y, so x = y
j DONE;
SETY:
mov Y, X;           // x <= y, so y = x
DONE:

which must operate on a scalar at a time (except possibly on MIMD machines).

edit: obviously this is just a hypothetical example. The best way to implement the example conditional assignment in fragment shaders would be:

MIN X, X, Y;
MIN Y, X, Y;
 
gking said:
edit: obviously this is just a hypothetical example. The best way to implement the example conditional assignment in fragment shaders would be:

MIN X, X, Y;
MIN Y, X, Y;
What's the point of that second MIN? Obviously MOV Y, X would do the same job ;)
 
It wouldn't preserve the same semantics. If X > Y, using a MOV for the second MIN would cause Y to be modified when it shouldn't be.
 
The values at the end would be identical, so writing all components of Y really wouldn't cause any side effects (in fact, MIN overwrites all components of Y, too).
 
That's true, but I could imagine scenarios where you don't want side effects.

In any case, your original point is right: there are scenarios where predicates yield less code, especially if there are common subexpressions or algebraic factoring possibilities.

In fact, a real compiler would determine that after your if statement X and Y contain the same values, and would use copy propagation to simply replace X with Y everywhere it could, optimizing the whole thing down to just

MIN Y, X, Y

with X "dead" after the assignment (freeing X's register to be reallocated).
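Sketched out (MUL R0, X, T0 is just a hypothetical later use of X):

Code:
// before copy propagation
MIN X, X, Y;
MIN Y, X, Y;
MUL R0, X, T0;

// after: X and Y are known to be equal, so uses of X become Y
// and the first MIN is dead code
MIN Y, X, Y;
MUL R0, Y, T0;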
 
The condition codes can be emulated on the Radeon.

Essentially, NV30 has a special register that, when updated, indicates whether the result of the current opcode was less than, greater than, or equal to zero (or undefined). Since this register is only updated when the programmer requests it, it's possible to forward-propagate results to improve shader performance.

Radeon 8500/9700 have instructions (CMP and CND) which provide exactly the same functionality; however, they won't perform as well as the NV30 in some shaders (condition-code updating and masking is free on NV30).

Condition codes (aka instruction predicates) are a feature that, in some cases, can provide a nice performance boost -- they don't really enable any new types of effects.
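For completeness, CND is the other PS1.4 select -- IIRC it keys off src0 being greater than 0.5. In the same style as Humus's example above:

Code:
cnd r0, r1, r2, r3

which is basically:

Code:
if (r1.x > 0.5)
     r0.x = r2.x
else
     r0.x = r3.x
// ...and the same per channel for y, z and w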
 
I asked a similar question on OpenGL.org and received an answer from Matt Craighead, a guy working at nVidia:

Right, the most obvious use for condition codes is to emulate branching. (Real branching _can_ be faster if it lets you skip a lot of instructions, and the branch is either almost always taken or almost always not taken.)
It can get annoying to write out all the code for using condition codes. Fortunately, the Cg compiler should be able to help you out here and compile your "if" statements with no trouble.

You can think of condition codes as being the rough equivalent to "CMOV" instructions in x86.

- Matt
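In other words, an "if" becomes a flag update plus conditionally-executed writes, much like x86's CMOV turns a compare-and-branch into a conditional move. In the same [not real PS] style Humus used above, something like this:

Code:
// source
if (x > 0)
    color = a;
else
    color = b;

could compile to something like:

Code:
cmp x , 0                // update the flags from x
set (GTFLAG) color , a   // kept only where x > 0
set (LEFLAG) color , b   // kept only where x <= 0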
 
gking said:
Condition codes (aka instruction predicates) are a feature that, in some cases, can provide a nice performance boost -- they don't really enable any new types of effects.

I think this is the most important thing.

This is one of the key advantages that moving to HLSLs provides: one can write the shader code and the driver can optimize it for the current architecture.
It allows different hardware implementations while maintaining compatibility.

Sticking to assembly would only lead to an x86-type fiasco.
(The most complicated task of today's processors is converting x86 opcodes into the internal (RISC) opcodes the processor actually implements.)
 
Hyp-X said:
I think this is the most important thing.

This is one of the key advantages that moving to HLSLs provides: one can write the shader code and the driver can optimize it for the current architecture.
It allows different hardware implementations while maintaining compatibility.

But compilers are not magical beings that produce the perfect instruction sequence. That is why, in some cases, people still program in assembly (even on RISC machines). You may not want to optimize all your programs (or shaders, in this case), but the ones that are critical for performance you certainly should, even for different architectures.

Hyp-X said:
Sticking to assembly would only lead to an x86-type fiasco.
(The most complicated task of today's processors is converting x86 opcodes into the internal (RISC) opcodes the processor actually implements.)

The process of translating x86 instructions into the internal micro-ops that both Intel and AMD have used since the Pentium Pro days is not a problem at all. The problem for performance is the same as on any other CPU: how to exploit ILP. The translation just adds more stages to the pipeline, but as fetch/decode and execution are more or less decoupled in modern architectures, this is not a problem; the added latency only affects the misprediction penalty.

In fact, in the P4, which uses a trace cache, translation is completely decoupled from execution, as the CPU core fetches already-translated ops.
 
RoOoBo said:
But compilers are not magical beings that produce the perfect instruction sequence. That is why, in some cases, people still program in assembly (even on RISC machines). You may not want to optimize all your programs (or shaders, in this case), but the ones that are critical for performance you certainly should, even for different architectures.

No, compilers cannot compile perfectly optimally, particularly in a runtime environment.

However, there are significant advantages. It is for this reason that I fully support 3DLabs' approach to HLSLs: standardize the HLSL, and let the hardware developers design the machine and the compiler.

The main reason is simply this: video hardware is changing at a breakneck rate. If we get bogged down in a standardized instruction set, then that instruction set will hold progress back, just as has happened with the x86 architecture. While it is true that you can, for example, squeeze a little bit more performance out of x86 by going straight to assembly, the truth is that our processors would be running one heck of a lot faster if the HLLs had been standardized instead of the processor instruction set.

One other example: what would you rather have in three years, a 1GHz GPU running on an equivalent of the x86 instruction set, or a 1GHz GPU running on an equivalent of a RISC instruction set? Which would be faster? Obviously the more advanced one would be.

I'm really hoping that DX10 takes a "hands off" approach to assembly programming, and goes all HLSL. I also hope that 3DLabs' proposal to standardize the HLSL, not the assembly, goes through for OpenGL 2.0.

RoOoBo said:
The process of translating x86 instructions into the internal micro-ops that both Intel and AMD have used since the Pentium Pro days is not a problem at all.

Yes, it is a problem. Particularly for a GPU, having to decode would require far more precious transistors. On a GPU, those transistors could be put to much more effective use than the same transistors in a CPU. And, just as you stated before, a compiler can't be quite as optimal as programming straight to the assembly. Don't you think the internal translator in those CPUs reduces performance?
 
Chalnoth said:
Yes, it is a problem. Particularly for a GPU, having to decode would require far more precious transistors.
Why would you do this in the hardware at all? There is a driver between the API and the hardware, you know.
 
Writing a compiler that generates near-optimal shader code is a lot more tractable a problem than doing it for CPUs. Of course, a fully optimizing compiler is equivalent to the Halting Problem and thus impossible, but the nature of GPUs and shader programs makes analysis easier.

Hand-coding assembly could in fact be counterproductive on GPUs. A good HLSL compiler would do instruction scheduling, knowing the functional units and hazards of each particular platform. A DX9 assembly hacker would not, most of the time, be privy to the actual internal architecture of the vertex or pixel shader pipeline, and any assembly he "optimized" for one particular GPU could run badly on others due to the pipeline differences.

Theoretically, the driver could treat the DX9 code as a kind of intermediate representation, build up all the usual compiler data structures, and try to do optimizations, but in that case what the assembly programmer is writing is virtual code anyway, kind of like Java bytecode or .NET CLR. The instructions he is writing don't map to any real hardware; they are translated and compiled by the driver into an internal instruction set.

Thus I conclude that handwriting DX9 shaders in assembly is not really going to ensure deterministic optimization by the author any more than using the HLSL would. And at least the HLSL is easier to write, maintain, and optimize.


DirectX9 assembly is more like .NET CLR or Java bytecode than a real instruction set. And given the constantly evolving architectures, hand-optimizing for each platform and architecture is, I think, a waste of time.

HLSL all the way baby.
 