[H]'s take on SM3 pros/cons

It doesn't necessarily need that many vertices, and you can combine displacement mapping with parallax/offset mapping. HL2 is using DM for deformable terrain, and they don't use anywhere near that tessellation level.
 
But I suppose that the terrain isn't displaced by a 1K or even 512*512 texture; what's more, the texel-to-pixel ratio is surely closer to the lightmaps' resolution than to the color/bump maps'. I've used 256*256 maps for displacing terrain on some of my early 3ds max images as well, but I consider that level of use to be a bit different :)
 
The problem with all of the SM3 discussions is that people expect the phrase "Impossible to do without SM3", and in the majority of cases that's not true. In fact, the vast majority of shaders being used in PS2.0 games can be done with PS1.1-1.4 in 1-2 passes. The major difference is better precision. Take early PS2.0 games, for example: they still use a cube map to normalize instead of 3 instructions, since you get 2x the throughput.
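As a hedged illustration of that cube-map trick in HLSL (the sampler and function names below are mine, not from the post):

Code:
// Cube map whose texels store the normalized direction, remapped from [-1,1] to [0,1].
samplerCUBE gNormCube;   // hypothetical sampler name

// Cube-map normalization: one texture lookup plus a scale/bias.
float3 NormalizeCube(float3 v)
{
    return texCUBE(gNormCube, v).xyz * 2.0 - 1.0;
}

// ALU normalization for comparison: roughly dp3 + rsq + mul on PS2.0 hardware.
float3 NormalizeALU(float3 v)
{
    return normalize(v);
}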

SM3, like SM2, boils down to: a) more efficiency at some operations, b) easier programming, and c) a subset of new algorithms becoming "feasible".

Nothing is strictly impossible. With multipass, even a DX7 card can compute any mathematical function that SM3.0 can do.
 
Are you saying the F-buffer will/can be exposed in DX9.0c?

I don't know... "exposed" is a big word, but I think that this is the technique that ATI will use internally to compile long shaders. But this is basically the same thing, isn't it? In fact I can't even imagine how the F-Buffer would be exposed in an API; I see it as a transparent layer for the programmer.

I usually like Tim Sweeney, but he has managed to confuse the 3D enthusiast community quite well by deciding to call offset bump mapping "virtual displacement mapping". Most people only remember the "displacement" part, and I've already seen people claiming things like UE3 is pushing 300 million polygons...

It is not Sweeney's fault; I've heard the same kind of nonsense about Doom III. It comes from PolyBump™... oops, sorry, it comes from the Detail Preserving Simplification technique.

Then there's offset bump mapping, which is an advanced form of bump mapping, and which Tim decided to call virtual displacement mapping. It basically needs a height map, a normal map, and a few extra instructions, so it's not much beyond standard bump mapping.
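A hedged sketch of what those "few extra instructions" amount to, in the simple single-step offset variant (the sampler, uniform, and function names are mine, not any particular engine's code):

Code:
sampler2D gHeightMap;   // grayscale height map (hypothetical name)
sampler2D gNormalMap;   // tangent-space normal map (hypothetical name)
float     gScale;       // height scale, e.g. 0.04
float     gBias;        // height bias, e.g. -0.02

float4 ParallaxPS(float2 uv     : TEXCOORD0,
                  float3 viewTS : TEXCOORD1) : COLOR   // view vector in tangent space
{
    // Read the height and turn it into a small texture-coordinate offset
    // along the view direction.
    float  h     = tex2D(gHeightMap, uv).r * gScale + gBias;
    float2 uvOff = uv + h * normalize(viewTS).xy;

    // From here on it is ordinary normal mapping, just with shifted coordinates.
    float3 n = tex2D(gNormalMap, uvOff).xyz * 2.0 - 1.0;
    return float4(n * 0.5 + 0.5, 1.0);   // visualize the normal; lighting omitted
}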

Offset bump mapping, parallax mapping (the official name AFAIK), virtual DM: three names for the same technique. You can't blame Sweeney; it seems that everyone has their own nickname for it ;)
 
Zeross said:
It is not Sweeney's fault; I've heard the same kind of nonsense about Doom III. It comes from PolyBump™... oops, sorry, it comes from the Detail Preserving Simplification technique.

I'm quite sure that there was no mention of displacement mapping regarding Doom 3... The only thing id said was 'renderbump', which isn't technical enough either, though.

But you're right that 3 different names for the same technique is already too much, and we're surely going to see more :)
 
Laa-Yosh said:
I'm quite sure that there was no mention of displacement mapping regarding Doom 3... The only thing id said was 'renderbump', which isn't technical enough either, though.

No need to check, you're right for sure ;), I was just saying that it's not the first time you can see people being misled: I remember seeing some posts on non-technical forums stating that Doom III models are 250,000 triangles, while in fact those high-res meshes are just used to build the normal maps for the in-game models, which are between 2,500 and 5,000 triangles. But people tend to forget the part about the normal map, and all they remember is "Doom 3 characters... 250,000 polygons" :D
 
Zeross said:
As far as I'm concerned, I think that THE biggest advantage of PS3.0 over PS2.0 is that this model is not "fragmented" with things like PS2.0, PS2.0a, PS2.0b. At least with PS3.0 you know for sure that you have:
-512 instructions
-no restriction on these 512 instructions (ALU or texture instructions, possibly dependent)
-dsx dsy instructions
-centroid sampling
-MRT

All of these were available under different profiles for PS2.0, but it was so messy that developers chose to stick to straight 2.0. PS3.0 gives developers a clear target for what features to expect.

Wanna bet?
 
Bet what? I don't see him making a bettable claim.

One example is that any conditionals in shaders compiled with 3.0 can use predicates, which can save 1-2 instructions on average for a guaranteed performance win.

For example, take a very simple conditional

Code:
z = x > y ? a + b : c + d;

translation (without registers assigned)

under SM3.0

Code:
setp_gt p0, x, y
(p0) add z, a, b
(!p0) add z, c, d

under SM2.0

Code:
sub t, x, y
add z1, a, b
add z2, c, d
cmp z, t, z1, z2

Savings: 33% out of 4 instructions. On a 10 instruction shader, it would be 11%. On larger shaders, you'd most likely switch to dynamic branches.

This means any PS2.0 HLSL shader with conditionals (say, from HL2), when compiled with the PS3.0 profile (and no other code changes by the developer, just a recompile), can get a 10%-33% performance boost.

So you don't think developers are going to use the D3DX effects framework to compile 3.0 versions?
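As a hedged sketch of what that looks like in a D3DX effect (.fx) file, with the technique and function names being mine: the same HLSL function is simply compiled against both profiles.

Code:
// One HLSL pixel shader, compiled for two different profiles in the same effect.
float4 MainPS(float2 uv : TEXCOORD0) : COLOR
{
    // Toy conditional standing in for a real shader body.
    float v = (uv.x > uv.y) ? uv.x + uv.y : uv.x - uv.y;
    return float4(v, v, v, 1.0);
}

technique SM2
{
    pass P0 { PixelShader = compile ps_2_0 MainPS(); }
}

technique SM3
{
    pass P0 { PixelShader = compile ps_3_0 MainPS(); }   // same source, 3.0 profile
}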
 
DemoCoder said:
Bet what? I don't see him making a bettable claim.

One example is that any conditionals in shaders compiled with 3.0 can use predicates, which can save 1-2 instructions on average for a guaranteed performance win.

I'm sure you know that the number of static instructions is generally not a good way to estimate performance...

under SM3.0

Code:
setp_gt p0, x, y
(p0) add z, a, b
(!p0) add z, c, d

1 IPC Arch: 3+ cycles
2 IPC Arch: 2+ cycles
2+ IPC Arch: 2+ cycles (assuming the predicate can't be used in the same cycle, which is generally the case for all microarchitectures I am aware of)

under SM2.0

Code:
sub t, x, y
add z1, a, b
add z2, c, d
cmp z, t, z1, z2

1 IPC Arch: 4 cycles
2 IPC Arch: 3 cycles
3 IPC arch: 2 cycles

So it really depends on the architecture which one is going to be faster. I can tell you that predication adds more than insignificant complications to an architecture. Predication is not a panacea, as numerous studies have shown. The concept is great, but the implementations leave a lot to be desired.

Aaron Spink
speaking for myself inc
 
aaronspink said:
1 IPC Arch: 3+ cycles
2 IPC Arch: 2+ cycles
2+ IPC Arch: 2+ cycles (assuming the predicate can't be used in the same cycle, which is generally the case for all microarchitectures I am aware of)

Yes, but it still saves work. On a 2-or-more-IPC architecture, you've saved a shader unit cycle, which is freed up to schedule other non-dependent ops.

1 IPC Arch: 4 cycles
2 IPC Arch: 3 cycles

Apples to apples, it's still a win vs CMP.

Let's look at how they would be scheduled

Code:
Shader1: 1. SETP  2. ADD  3. ADD
Shader2: 1. ****  2. ***  3. ***

* = can dual issue another operation from your shader

Code:
Shader1: 1. SUB  2. ADD  3. CMP
Shader2: 1. ADD  2. ***  3. ***

In the worst case, predication would be equivalent, but in the average or best case, you have 3 slots on SU2 you can fill with other ops, whereas in the bottom case you have only 2 open dual-issue opportunities.

Predication is not a panacea, as numerous studies have shown. The concept is great, but the implementations leave a lot to be desired.

Predication has been an issue on other architectures in comparison to *real branches*. It has been an issue for compilers, such as on the Itanium, to use it appropriately in conjunction with branch prediction. The studies to which you refer are studies of predication on ILP CPUs in the context of compiler issues and real branches.

But on the GPU we are not comparing predication to real branches; in this thread, we are comparing it to the CMP operator, which is just a CMOV instruction. I fail to see how the issues of write disablement vs. conditional move are covered by those "numerous studies".
 
DemoCoder said:
Yes, but it still saves work. On a 2-or-more-IPC architecture, you've saved a shader unit cycle, which is freed up to schedule other non-dependent ops.
I was actually talking about IPC per pipe, but the general gist is there.

But on the GPU we are not comparing predication to real branches; in this thread, we are comparing it to the CMP operator, which is just a CMOV instruction. I fail to see how the issues of write disablement vs. conditional move are covered by those "numerous studies".

I'm not talking about in comparison to real branches...

I'm talking about real hardware implementation. If an architecture supports predication, then it opens itself up to additional hazard cases that don't exist without predication. In addition, the write-kill functionality is not simple to implement, and any case where you have a feedback path presents even more issues. There are some cases where full predication provides a win vs CMOV, but they are rare.

CMOV is a limited case of predication, as you state, but it is the limit that allows for a much simpler implementation. Things get very complicated in hardware when you start letting more than 1 thing potentially write the same location. It can be done, but you will end up with non-optimal scheduling and additional overheads.

If you are simple single-issue, then the differences between CMOV and predication disappear, but when you are multi-issue (which I assume the upcoming and future GPUs are), predication starts to make less sense.

Aaron Spink
speaking for myself inc
 
Can you give a specific example? I just don't see how a write disable is going to cause a huge issue on the GPU. It's no different from write masking on destination registers; the only difference is that it's equivalent to a NULL mask, and data dependent.

Predicates on the GPU can be treated as just conditional write masks, and as such, the performance implications should be about the same vs the CMOV case or destination write masks.
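A hedged illustration of that point in ps_3_0 assembly (the register numbers are arbitrary, not from the thread): a static destination write mask and a predicated write differ mainly in whether the written components are fixed at compile time or decided per pixel.

Code:
add r0.xy, r1, r2      // static write mask: only the .xy components of r0 are written
(p0) add r0, r1, r2    // predicated write: r0 is written only where p0 is true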

p.s. I was also talking about IPC per pipe, that's why I demonstrated two shader units in my example.
 
lol you guys got really carried away. My bet challenge was on the very last line, which was something to the effect that developers have a clear target with 3.0 (implying the world will therefore be rosy post-SM3).

Seeing how everyone loves to differentiate, I am merely stating the obvious.

Whatever nV does, ATI will do them one differently (no, I didn't mean better, just differently).

And of course the corollary:

Whatever ATI does, nV will do them one differently (no, I didn't mean better, just differently).

To think the 3D developer world will be united around SM3 does not reflect past history, and it ignores the fundamental motivations these IHVs have to be different.
 
Scarlet, you misunderstood the meaning of "clear target". The point is, SM3.0 is a *known quantity* because it requires many features which were optional in PS2.x, so developers don't have to write shaders against a combinatorial set of different card capabilities. It's a much clearer target to develop for, versus trying to target 2.x, where we have 2.a (NVidia) and 2.b (ATI) and potentially a bunch of other variations.
 
This article is a good laugh on a Monday morning.

Of course it is Tuesday so I thank you guys even more. Tuesdays are such a pain in the arse :)
 
This means any PS2.0 HLSL shader with conditionals (say, from HL2), when compiled with the PS3.0 profile (and no other code changes by the developer, just a recompile), can get a 10%-33% performance boost.

Are you stating that HLSL 2.0 shaders can be recompiled to 3.0 and get the benefits of 3.0 without the developer writing directly to it?

If so, this is significant - it means that new hardware like the 6800 that supports 3.0 can speed up shaders written for 2.0. I assume that for users to have this ability they have to wait for DX9.0c. Any dates on that? I am assuming it will be available when 6800 Ultras are in stores on Memorial Day (according to an interview with NVidia).
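If the shaders are compiled offline rather than through the effects framework, the same idea applies on the command line; a hedged sketch assuming the DX9 SDK's fxc compiler (the file and entry-point names are hypothetical):

Code:
rem Same HLSL source compiled once per profile; only the /T target changes.
fxc /T ps_2_0 /E MainPS /Fo water_ps20.pso water.hlsl
fxc /T ps_3_0 /E MainPS /Fo water_ps30.pso water.hlsl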
 