Can someone tell me why ATI's PS 3.0 is better than Nvidia's?

Bouncing Zabaglione Bros. said:
Anything that makes extensive use of SM3.0, especially branching.
Apart from the branching, does ATI have another advantage? I'm really curious, how well does R520 handle texldd, texldl, gradients, dependent reads, lots of live registers? Are arbitrary swizzles free now, or just resolved by the compiler? G70 has the higher raw arithmetic throughput, especially with MUL, DP and sin/cos. ATI really has to have high efficiency to make up for that with shaders that have no use for branching.
 
Dave Baumann said:
IIRC the difference in GROMACS performance between G70 and R520 is attributed to the register space that R520 has.
It would be interesting to see how many live registers you can have before performance degrades because not enough threads can be kept in flight.
 
Xmas said:
It would be interesting to see how many live registers you can have before performance degrades because not enough threads can be kept in flight.
As best I can fathom from one of Eric's messages, there is enough space for 32 registers per pixel, per thread. When you exceed that, the thread count starts dropping, but that still doesn't give us much indication as to when performance starts dropping as well.
 
Dave Baumann said:
As best I can fathom from one of Eric's messages, there is enough space for 32 registers per pixel, per thread. When you exceed that, the thread count starts dropping, but that still doesn't give us much indication as to when performance starts dropping as well.

Run GPUBench. You will see that the R520 performance is perfectly linear with the number of instructions, regardless of the number of GPRs used, up to 32. There is no fall off, since fatter threads cover more latency, and so fewer are required. As long as the product of the 'thread cycle count' times 'the number of threads' is larger than the number of cycles of latency you are hiding, all is well and GPRs are free.
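The "GPRs are free as long as enough threads stay in flight" argument above can be sketched numerically. All the figures below (register file capacity, memory latency) are made-up illustrative numbers, not R520 specs:

```python
# Hypothetical numbers -- a sketch of the latency-hiding argument, not actual
# R520 specifications.
REGISTER_FILE_SLOTS = 512   # assumed per-pipe register capacity (illustrative)
MEMORY_LATENCY = 200        # cycles of fetch latency to hide (assumed)

def gprs_are_free(regs_per_thread, cycles_per_thread):
    """GPR usage is 'free' while threads_in_flight * thread_cycle_count
    still covers the memory latency."""
    threads_in_flight = REGISTER_FILE_SLOTS // regs_per_thread
    covered = threads_in_flight * cycles_per_thread
    return covered >= MEMORY_LATENCY

# A fat 32-register thread running a long shader still hides the latency:
print(gprs_are_free(32, 20))   # 16 threads * 20 cycles = 320 >= 200 -> True
# The same register count with a very short shader does not:
print(gprs_are_free(32, 10))   # 16 threads * 10 cycles = 160 < 200 -> False
```

This is exactly the product Marco describes: once 'thread cycle count' times 'number of threads' drops below the latency being hidden, performance falls off.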
 
Weird how I keep reading online that older cards won't be able to handle newer games... blah, blah... my FX5900 flashed to an FX5950 Ultra can run any game today (probably the upcoming games too) smoothly... until my FX5900 runs some of my favorite games (max details) at less than 45 FPS, I'm not upgrading...
 
This board is packed with developers and professionals; can't anyone attempt to lay out the whole truth about the branching stuff?

The layman here will try, so you may count the bullets in my feet afterwards...*ahem*.

1. There's no doubt that dynamic branching performance is excellent on R520 and lackluster on NV4x/G7x.

2. As long as dynamic branching is NOT a requirement, does the driver itself decide whether to use dynamic or static branching in the end (unrolling the loop if the hardware supports it)?

3. Is dynamic branching really, on all occasions (except the cases where it becomes a necessity), an absolute blessing for SIMD architectures, and never a panacea? (Take blessing/panacea in a relative sense.)

4. Can anyone predict how often dynamic branching will be a necessity and how often those cases will make it into games?

5. Does a dynamic branching performance advantage really compensate for higher ALU throughput, on as objective an average as possible? How many instructions per shader are we really talking about, and what size of render targets anyway? Are there any dynamic branches in tech demos that cover the entire screen, or just a fraction of it?
 
Xmas said:
I'm really curious, how well does R520 handle texldd, texldl, gradients, dependent reads, lots of live registers? Are arbitrary swizzles free now, or just resolved by the compiler?

Gradients should be single-cycle AFAIK. Dependent texture reads I would assume are faster than previous generation due to the improved cache. I haven't analysed that myself though. Arbitrary swizzles are free.
 
Ailuros said:
2. As long as dynamic branching is NOT a requirement, does the driver itself decide whether to use dynamic or static branching in the end (unrolling the loop if the hardware supports it)?

Yes, it can. For really short branches, predication will be faster on ATI too, though a short branch around an expensive texture lookup can still be beneficial. If you're branching over, say, two ALU instructions, the driver will most likely replace the branch with predication. If there's a loop whose iteration count is known at compile time, and the unrolled loop fits within the instruction slot limit, then the driver will most likely choose to unroll it.
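The compiler's choice above comes down to a cost comparison. Here is a toy model of that trade-off; every cycle count in it is an assumption for illustration, not a measured hardware figure:

```python
# Toy cost model for branch vs. predication -- all cycle counts are
# illustrative assumptions, not measured hardware figures.
BRANCH_OVERHEAD = 6   # assumed cost of the control-flow instructions themselves

def predication_cost(then_cycles, else_cycles):
    # Predication always executes both sides and masks out one result.
    return then_cycles + else_cycles

def branch_cost(then_cycles, else_cycles, taken_fraction):
    # With coherent branching, on average only the taken side runs,
    # plus the fixed control-flow overhead.
    avg_body = taken_fraction * then_cycles + (1 - taken_fraction) * else_cycles
    return BRANCH_OVERHEAD + avg_body

# Branching over two cheap ALU instructions: predication wins.
print(predication_cost(2, 0) < branch_cost(2, 0, 0.5))    # True
# Branching around an expensive texture lookup: the real branch wins.
print(predication_cost(40, 0) > branch_cost(40, 0, 0.5))  # True
```

With a small body, the fixed branch overhead dominates, so predication is cheaper; with an expensive body like a texture lookup, skipping it outright pays for the overhead many times over.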
 
FX5900 said:
Weird how I keep reading online that older cards won't be able to handle newer games... blah, blah... my FX5900 flashed to an FX5950 Ultra can run any game today (probably the upcoming games too) smoothly... until my FX5900 runs some of my favorite games (max details) at less than 45 FPS, I'm not upgrading...

Have you tried the F.E.A.R. demo?
 
Xmas said:
How do you figure from those results?
R520 has 1.5x G70's DEP-RAND bandwidth. As the dependent access is completely random, the caches probably don't get used. Do you think the cache behaviour would change the results, or that the page faults kill the bandwidth enough that some other bottleneck doesn't reveal itself?
 
zgemboandislic said:
Have you tried the F.E.A.R. demo?

Don't bother with the green goblin with his über nv35 hardware, it can still run UT2003 perfectly...
 
Ailuros said:
5. Does a dynamic branching performance advantage really compensate for higher ALU throughput, on as objective an average as possible? How many instructions per shader are we really talking about, and what size of render targets anyway? Are there any dynamic branches in tech demos that cover the entire screen, or just a fraction of it?

A quick example I just tried: a Mandelbrot set algorithm rendered on moving mid-size triangles. The G7x architecture really likes it (lots of MADs, scalar and vec2 instructions). I compute 129 iterations (-> ~400 instructions):

7800GTX: 35.6 MPix/s
X1800XL (my XT is back at ATI): 13.1 MPix/s

Now I use a loop with a conditional break to exit early when more iterations aren't needed:

7800GTX: 17.7 MPix/s
X1800XL: 29.4 MPix/s


So yes, the dynamic branching advantage can compensate for the ALU throughput, but only in specific cases. There is no objective average here.
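For readers who want to see the two shader variants side by side, here is a CPU sketch of the same idea in Python. This is not the poster's shader code, just the structural difference between the unrolled fixed-count version and the early-out version:

```python
# Sketch of the two Mandelbrot shader variants (CPU version, for illustration).
MAX_ITER = 129

def mandelbrot_fixed(cx, cy):
    """Always runs all 129 iterations, like the unrolled ~400-instruction
    shader: the escape iteration is recorded, but work continues regardless."""
    zx = zy = 0.0
    escaped_at = MAX_ITER
    for i in range(MAX_ITER):
        zx, zy = zx * zx - zy * zy + cx, 2.0 * zx * zy + cy
        if escaped_at == MAX_ITER and zx * zx + zy * zy > 4.0:
            escaped_at = i          # remember, but keep iterating anyway
    return escaped_at

def mandelbrot_break(cx, cy):
    """Breaks out as soon as the point escapes -- the dynamic-branching
    version: pixels that escape quickly do far less work."""
    zx = zy = 0.0
    for i in range(MAX_ITER):
        zx, zy = zx * zx - zy * zy + cx, 2.0 * zx * zy + cy
        if zx * zx + zy * zy > 4.0:
            return i                # early out
    return MAX_ITER

# Both variants agree on the result; only the amount of work differs.
print(mandelbrot_fixed(2.0, 0.0) == mandelbrot_break(2.0, 0.0))   # True
```

On a SIMD GPU, the break only helps when neighbouring pixels in a batch escape at similar iteration counts, which is why the gain shows up on the G7x as a loss (incoherent batches) but on R520 as a big win.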
 
It just occurred to me that I don't particularly understand why dynamic branching is now getting so much attention. I mean, branching and looping are fundamental programming concepts, so why wasn't this built into the earliest shader models, back in the GF3 days?

Or was it just a matter of transistor budget?
 
Shader Model 3 was the first time Dynamic Branching was mandated as a requirement in the pixel pipeline.

Previous shader models were very limited in their programming constructs and capabilities; DX8 pixel shaders were limited to 8 instructions, which doesn't really leave much room for looping and branching.
 
If dynamic branching works well, it opens up the possibility of more general algorithms, which are closer to how developers think things ought to work. It allows them to implement things the way CGI does, without having to reinvent the wheel. And it offers the possibility of using real libraries. So it's easier and faster to develop for, and offers more possibilities.

As for the speed: it depends entirely on what you do and how you do it. It's just an additional tool, not a speed-up by itself.
 