DemoCoder said:
zeckensack said:
Speed isn't a question of SM2 or SM3. It doesn't matter, throughput-wise, whether you can do a "real" loop or have to unroll it umpteen times.
I beg to differ.
I admit that wasn't very accurate.
DemoCoder said:
First, with respect to loops, if flow control is used inside the loop, there could be a big difference throughput-wise.
I know dynamic branching has been regarded by some as an optimization technique. I don't take that for granted right now. hardware.fr's results weren't encouraging IMO. That may be driver-related, it may be because they used a branch condition that varied at too high a spatial frequency (they didn't disclose their actual shader code), but it may also be natural behaviour for a SIMD architecture. I'd rather wait until I can test it myself.
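To put a concrete face on the concern, this is the kind of shader pair I have in mind — just an HLSL sketch with made-up names (specMap, shadowMap and friends), not something I've compiled or benchmarked. Whether the ps_3_0 version actually saves anything should depend entirely on how coherent the NdotL > 0 condition is across neighbouring fragments:

    // Constants and samplers assumed to be set by the application.
    sampler2D specMap;
    sampler2D shadowMap;
    float4 ambient;

    // Stand-in for the "expensive" part of the shader.
    float4 expensiveTerm(float2 uv)
    {
        return tex2D(specMap, uv) * tex2D(shadowMap, uv).r;
    }

    // ps_3_0: a dynamic branch that can (in theory) skip the fetches
    // wherever the surface faces away from the light.
    float4 psLight_sm3(float2 uv : TEXCOORD0,
                       float3 N  : TEXCOORD1,
                       float3 L  : TEXCOORD2) : COLOR
    {
        float NdotL = dot(normalize(N), normalize(L));
        float4 c = ambient;
        if (NdotL > 0)                          // dynamic flow control
            c += NdotL * expensiveTerm(uv);
        return c;
    }

    // ps_2_0: no dynamic flow control, so the compiler flattens this;
    // the fetches and the multiply are paid for on every fragment.
    float4 psLight_sm2(float2 uv : TEXCOORD0,
                       float3 N  : TEXCOORD1,
                       float3 L  : TEXCOORD2) : COLOR
    {
        float NdotL = dot(normalize(N), normalize(L));
        return ambient + max(NdotL, 0) * expensiveTerm(uv);
    }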
Secondly, the non-unrolled loop may fit in an instruction cache whereas the unrolled one might not, and this could have throughput implications.
Right. But then, I don't know how much an instruction cache really matters on GPUs. These things have enormous local memory bandwidths at their disposal. If you go compute-heavy instead of texture-fetch-heavy, some of that should be available for instruction fetch.
*shrugs*
Really, I don't know.
Third, if you don't have enough slots to fully unroll, what then?
Multipass.
512 instructions is a lot of code for a single shader, though. Executing them all would push the NV40 "ultra"'s fillrate right back into S3 Virge territory.
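For the loop/unroll side of it, roughly what I mean, again as a hedged HLSL sketch with invented names: in ps_3_0 the iteration count can sit in a constant register, while a ps_2_0 profile has to unroll a fixed count, and whatever doesn't fit into the slot limit becomes another pass with additive blending:

    // Shared constants (values picked arbitrarily for the sketch).
    int       numTaps;      // set by the application at runtime
    float     texelSize;
    sampler2D baseMap;

    // ps_3_0: a "real" loop; the tap count lives in an integer constant register.
    float4 psBlur_sm3(float2 uv : TEXCOORD0) : COLOR
    {
        float4 sum = 0;
        for (int i = 0; i < numTaps; i++)
            sum += tex2D(baseMap, uv + float2(i * texelSize, 0));
        return sum / numTaps;
    }

    // ps_2_0: the tap count has to be fixed at compile time so the loop can be
    // fully unrolled; if the unrolled code blows the slot limit, the app draws
    // the geometry again for the remaining taps and accumulates via blending.
    #define TAPS 8
    float4 psBlur_sm2(float2 uv : TEXCOORD0) : COLOR
    {
        float4 sum = 0;
        for (int i = 0; i < TAPS; i++)          // unrolled by the compiler
            sum += tex2D(baseMap, uv + float2(i * texelSize, 0));
        return sum / TAPS;
    }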
Fourth, some features, like the aL register, will eat up throughput when they are emulated.
What's the aL register?
Finally, the other features of SM3.0 impose non-trivial throughput issues. How do you emulate vertex textures? Gradients? FP filtering and blending? Indexable constant registers? Predicates? (CMP eats up more slots.)
Point-sampled VTs aren't all that different from another vertex attribute stream. I mean, they're not the same, and they will require different code, but it's not impractical. Linearly filtered VTs would be a different story.
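Roughly like this — an HLSL sketch with hypothetical names, where the SM2 path assumes the application resamples the very same texture into an extra per-vertex float whenever it changes:

    sampler2D heightMap;
    float     displaceScale;
    float4x4  worldViewProj;

    // vs_3_0: point-sampled vertex texture fetch.
    float4 vsDisplace_sm3(float4 pos : POSITION, float2 uv : TEXCOORD0) : POSITION
    {
        float height = tex2Dlod(heightMap, float4(uv, 0, 0)).r;
        pos.y += height * displaceScale;
        return mul(pos, worldViewProj);
    }

    // vs_2_0: the very same value arrives as one more vertex attribute,
    // filled in on the CPU from the same texture.
    float4 vsDisplace_sm2(float4 pos : POSITION, float height : TEXCOORD1) : POSITION
    {
        pos.y += height * displaceScale;
        return mul(pos, worldViewProj);
    }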
Gradient: prepare an adequately sized luminance-only mipmapped texture where each mip level encodes a step between 0 and 1. Use a trilinear filter. Do a dependent fetch from that texture with whatever quantity you need a gradient for; the value you read back tells you which mip level the hardware picked, and hence how fast that quantity changes across the screen. You can do two fetches from a 1D luminance mipmap if you need separate x/y gradients.
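Spelled out as a sketch (HLSL, made-up names, sizes picked arbitrarily — back-of-the-envelope, not tested code):

    // "Gradient probe": a luminance texture whose mip level i is filled with
    // the constant value i / (LEVELS - 1). With trilinear filtering, a fetch
    // returns the LOD the hardware selected, scaled to [0,1], and that LOD is
    // roughly log2 of the texel footprint of whatever you used as the texcoord.
    #define LEVELS 9                     // 256x256 base level
    sampler2D gradProbe;                 // trilinear, mip levels filled as above

    float estimateDeriv(float2 q)        // q: the quantity you want a gradient for
    {
        float lod = tex2D(gradProbe, q).r * (LEVELS - 1);
        return exp2(lod) / 256.0;        // approx. magnitude of dq per pixel
    }
    // For separate per-component estimates, use a 1D probe and fetch twice:
    // once with q.x, once with q.y.

Note it only gives you a magnitude, not a signed derivative, which is all I'd expect out of a trick like this anyway.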
FP filter/blend: no way
Constant indexing: that's available in SM2 vertex shaders, no? Pretty near impossible in SM2 fragment shaders AFAICS. But what's wrong with using point-sampled 1D textures?
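Something along these lines, sketched in HLSL with hypothetical names:

    // SM2 stand-in for an indexable constant array: bake the table into a
    // point-sampled 1D texture, one texel per entry, and fetch by index.
    sampler1D tableTex;        // N entries, POINT filtering, CLAMP addressing
    float     numEntries;      // N, uploaded as an ordinary constant

    float4 tableLookup(float index)      // index computed per fragment, 0 .. N-1
    {
        float u = (index + 0.5) / numEntries;   // aim at texel centres
        return tex1D(tableTex, u);
    }

The obvious catch is precision: you only get whatever the texture format stores, so for general-purpose constants you'd probably want a floating-point texture format.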
Predicates: you got it.
I would say that speed is the major question of SM2 vs SM3. Contrary to the PR, it is not a question of what's visually possible to do. There was a SIGGRAPH paper that proved even OpenGL 1.0 was universal (i.e. it can compute anything, given enough passes). Anything you can render with SM3, you can render with SM2; the question is how fast it will run.
So the crucial question boils down to: how large (and important) is the class of algorithms that run more efficiently on SM3 than on SM2? We don't yet have a lot of information in this area.
I can agree with that. What I'm concerned about is that very often, "advanced" technology, as indicated by a higher version number, is understood as "faster" without much reflection. I was browsing the Rage3D boards a few hours ago, and it always takes a while to snap out of it.
With the no-pixel-shaders vs PS1.x debate, one could make a valid argument that you could collapse rendering passes and at least save bandwidth and geometry load, even if nothing else changed. That advantage has diminished with PS1.x vs PS2, and isn't there at all with PS2 vs PS3. If you spend dozens of cycles per fragment (which you can do with SM2), it plain doesn't matter anymore.
So we're really left with structural features. I don't know enough yet to make a blanket statement about what's going to happen. Branching, while offering a great opportunity to skip unneeded computation, also comes with a penalty. As for the other features you noted, however interesting and useful they may be, do you expect a shader to really be bound by their performance? Isn't the bulk of any shader still DOT3s and MADs? Serious questions, in case you wondered.
I'd really like to see performance figures taken with real-world models and real-world high-level shaders, run through the two compiler profiles, preferably on the same architecture (NV40). It may turn out that there's a 200% speedup in some cases. It may just as well be nothing, or worse.
Note that none of the above makes me want an NV40 any less. I guess it's just the sort of reservation against the latest fashion that naturally comes with old age 8)
edited bits: spelling, propose textures as a replacement for indexed constant storage.