two interesting slides about 9800XT from [H]

OpenGL guy said:
and leave it at that.

ps. nice pic at DH. ;)

43.jpg

No, that is Joe Chien, a software director from Silicon Valley. The dude on the left some of you may know from the forums as OpenGL Guy.
 
Should we regard the address op the R3x0 performs separately as a 'stand-alone' instruction, compared to the instruction set of the NV3x? Probably not.

But, if we don't, we should also take temp regs and constants into account. And those cost the NV3x quite a few extra clocks. For the ATi, you would want to use as many regs and constants as possible. They come for free, provided we don't count the address op and note that it gets written in the same clock pulse anyway.

While the NV3x wins on some special calculations, it has to spend more time shuffling things around, expanding calculations to use as few temp regs as possible and loading constants.

In the end, just about any program you write will execute in fewer clock pulses on an ATi. And it has twice as many pipelines.
 
Chalnoth said:
And if this is what ATI is basing their performance comparison on, then it is severely flawed. Their description of nVidia's number of operations per clock is very different from the description they apply to ATI's hardware. Similarly, it doesn't take into account other functions. According to David Kirk, the NV3x can do a sin/cos in 2 cycles, while ATI takes 7-8. If this comparison is on a per-pipeline basis, then nVidia could do sin/cos functions in half the time on a per-clock basis.
"How useful is sincos() as a dedicated instruction?" is something that I find myself asking.

If you have a spare sampler (pretty likely in most shaders) and spare texture instruction slots (also likely) you could get sin or cos (or some other function or combination of functions) with a texture lookup - either from a floating point texture with no filtering, or (probably preferably) from a high-precision integer texture with linear filtering. Then you could get full advantage from the parallel nature of texture lookups on R3xx - in a shader with lots more ALU ops than texture ops you could effectively get sin or cos in 0 cycles - this sounds better than 2 to me.

Using a 1-D, 1-component, 2048 entry texture for either sin or cos, I expect the accuracy could be pretty good for most purposes.

A 2048 entry 16-bit fixed point table would introduce an additional error (at the sampling points) of about 3e-5 when compared to a 32-bit float implementation (such as on a current CPU). I haven't bothered to work out the maximum error at the linearly interpolated intermediate points yet, but I reckon it's probably pretty useable.

As a reference the maximum permitted absolute error for the sincos instruction in a pixel shader is 0.002, so there's probably some wiggle room - it seems like a workable method.

You could also get sin and cos from one lookup if you use a 2 component texture, just as with the sincos instruction.
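
For anyone who wants to check those numbers, here is a minimal Python sketch that emulates the lookup on the CPU. It assumes the table spans one full period, that texels store sin scaled and biased from [-1, 1] into unsigned 16-bit codes with round-to-nearest, and that filtering is an ideal linear blend with wrap addressing. Texel-centre offsets and the limited precision of real filter weights are ignored. The names (build_sin_table, sample_sin, max_abs_error) are made up for illustration and are not part of any ATI or D3D API.

```python
import math

TABLE_SIZE = 2048            # entries covering one full period [0, 2*pi)
BITS = 16                    # unsigned fixed-point texel depth (assumed)
MAX_CODE = (1 << BITS) - 1   # largest texel code, 65535 for 16 bits


def build_sin_table(size=TABLE_SIZE, max_code=MAX_CODE):
    """Quantize sin over one period into unsigned fixed-point texels."""
    table = []
    for i in range(size):
        s = math.sin(2.0 * math.pi * i / size)
        # Scale-and-bias [-1, 1] into [0, max_code], round to nearest code.
        table.append(round((s * 0.5 + 0.5) * max_code))
    return table


def sample_sin(table, x, max_code=MAX_CODE):
    """Emulate a wrapped, linearly filtered 1-D texture fetch at angle x."""
    size = len(table)
    u = x / (2.0 * math.pi) * size        # texel-space coordinate
    i0 = math.floor(u)
    frac = u - i0
    t0 = table[i0 % size]                 # wrap addressing
    t1 = table[(i0 + 1) % size]
    texel = t0 + (t1 - t0) * frac         # linear filter between neighbours
    # Undo the scale-and-bias, as a shader would with a single MAD.
    return texel / max_code * 2.0 - 1.0


def max_abs_error(table, samples=20000):
    """Largest |lookup - math.sin| over evenly spaced angles in one period."""
    worst = 0.0
    for k in range(samples):
        x = 2.0 * math.pi * k / samples
        worst = max(worst, abs(sample_sin(table, x) - math.sin(x)))
    return worst


if __name__ == "__main__":
    table = build_sin_table()
    print(f"max abs error: {max_abs_error(table):.3e}")
```

Under these assumptions the reported maximum should land in the region of 1.5e-5, the same order as the figure quoted above and comfortably inside the 0.002 limit. Most of it is quantization: the interpolation term for 2048 entries is only about 1e-6.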

- Andy.
 
OpenGL guy said:
I'll just say your analysis and conclusion are severely flawed and leave it at that.

We would rather you didn't, and instead expounded on whether these are accurate.
 
andypski said:
A 2048 entry 16-bit fixed point table would introduce an additional error (at the sampling points) of about 3e-5 when compared to a 32-bit float implementation (such as on a current CPU). I haven't bothered to work out the maximum error at the linearly interpolated intermediate points yet, but I reckon it's probably pretty useable.

As a reference the maximum permitted absolute error for the sincos instruction in a pixel shader is 0.002, so there's probably some wiggle room - it seems like a workable method.
Just did a quick check, and the error over the whole range with linear interpolation stays at around 3e-5, so doing it this way seems fine compared to the macro implementation.

So there you have it - for a really accurate sin/cos on an R300 you can use a texture and get it in 0 cycles (sometimes).

Definitely better than 2 ;)
 
andypski said:
andypski said:
A 2048 entry 16-bit fixed point table would introduce an additional error (at the sampling points) of about 3e-5 when compared to a 32-bit float implementation (such as on a current CPU). I haven't bothered to work out the maximum error at the linearly interpolated intermediate points yet, but I reckon it's probably pretty useable.

As a reference the maximum permitted absolute error for the sincos instruction in a pixel shader is 0.002, so there's probably some wiggle room - it seems like a workable method.
Just did a quick check, and the error over the whole range with linear interpolation stays at around 3e-5, so doing it this way seems fine compared to the macro implementation.

So there you have it - for a really accurate sin/cos on an R300 you can use a texture and get it in 0 cycles (sometimes).

Definitely better than 2 ;)

LMAO - thanks for the update Andy.... I can see it now in a follow-up interview with DK

Q: "Do you have any further comments on instruction lengths for certain functions on the NV3x and R3x0 hardware, following the revelation that it is not 7 or 8 instructions as you guessed but 0 compared to your 2"....

A: "Ah but that is a cheating hack which will use up texture instructions... as they aren't doing multitexturing it will significantly affect performance. With our improved CineFX 2.1 architecture due in NV40 we will not only cope with Sin/Cos faster but give you an additional texture lookup for free"

;)
 
andypski said:
Just did a quick check, and the error over the whole range with linear interpolation stays at around 3e-5, so doing it this way seems fine compared to the macro implementation.
Is that with a 16-bit integer texture? Is filtering supported with that texture format on the R3xx?
 
OpenGL guy, you look like you had one too many in that photo. Remember, no drinking and drivering ;) .
 
Chalnoth said:
Is that with a 16-bit integer texture? Is filtering supported with that texture format on the R3xx?
Yes and yes.

With an 8-bit texture the maximum error would be around 0.004 (or about double the permitted spec error for sincos), which might still be usable in many cases, but since 16-bit textures are no problem for the hardware it doesn't really make much sense to sacrifice the accuracy. It would be better to cut down on the table size to get better caching characteristics - you can probably go down to 256 (or even 128) entries without killing the accuracy too much.
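
A rough way to see the bit-depth versus table-size trade-off is to sweep both on the CPU. The sketch below is an illustration under the same simplifying assumptions as any offline check would need: one full period per table, sin scaled and biased into unsigned codes with round-to-nearest, ideal linear filtering with wrap addressing, and no modelling of texel-centre offsets or hardware filter precision. lut_max_error is a hypothetical helper name, and SPEC_LIMIT is just the 0.002 figure quoted earlier in the thread.

```python
import math

SPEC_LIMIT = 0.002  # absolute sincos error limit quoted in the thread


def lut_max_error(size, bits, samples=20000):
    """Worst-case |lookup - sin| for a linearly filtered fixed-point table."""
    max_code = (1 << bits) - 1
    # Quantized table: [-1, 1] mapped to [0, max_code], rounded to nearest.
    table = [round((math.sin(2.0 * math.pi * i / size) * 0.5 + 0.5) * max_code)
             for i in range(size)]
    worst = 0.0
    for k in range(samples):
        x = 2.0 * math.pi * k / samples
        u = x / (2.0 * math.pi) * size          # texel-space coordinate
        i0 = int(u)                             # u is never negative here
        frac = u - i0
        # Linear blend of the two neighbouring texels, wrapping at the end.
        texel = table[i0 % size] * (1.0 - frac) + table[(i0 + 1) % size] * frac
        approx = texel / max_code * 2.0 - 1.0   # undo scale-and-bias
        worst = max(worst, abs(approx - math.sin(x)))
    return worst


if __name__ == "__main__":
    for bits in (8, 16):
        for size in (128, 256, 2048):
            err = lut_max_error(size, bits)
            verdict = "within" if err <= SPEC_LIMIT else "over"
            print(f"{bits:2d}-bit, {size:4d} entries: "
                  f"max error {err:.2e} ({verdict} {SPEC_LIMIT})")
```

With these assumptions the 8-bit rows should sit near 0.004 whatever the table size, since quantization dominates, while the 16-bit rows should stay under the 0.002 limit even at 128 entries, where the interpolation term (roughly 3e-4) becomes the larger part of the error.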

- Andy.
 
ps. nice pic at DH. ;)

43.jpg

No, that is Joe Chien, a software director from Silicon Valley. The dude on the left some of you may know from the forums as OpenGL Guy.

You can sure tell that all the available money at ATI is going into R&D from the look of ATI's booth in the photo :)

...or maybe GL guy has been "bad" and we're looking at his new office at ATI world headquarters. The good news is that in case of a fire, he'll already be in the fire escape...
 