two interesting slides about 9800XT from [H]


I don't see how this "maximum FP operations per clock" should be interpreted.
Either they count only arithmetic operations, in which case it's 32:8, or they count tex ops too, in which case it's 40:12. I don't know how they get 40:8. Of course that's all massively in favour of ATI; you'll probably never find a complex shader where the vector/scalar separation doubles performance.
 
I believe they're counting the number of operations which could be executed by the pipeline's ALUs and address processors. If you look at Dave's R3xx pipeline diagram, there are exactly 4 ALUs and 1 address processor, matching ATI's claim of 5 ops/clock/pipeline.
 
Luminescent said:
I believe they're counting the number of operations which could be issued/executed by the pipeline's ALUs and address processors. If you look at Dave's R3xx pipeline diagram, there are exactly 4 ALUs and 1 address processor, matching ATI's claim of 5 ops/clock/pipeline.
But that doesn't explain why they claim 40 (24 without co-issue):8 instead of 32 (16):8 or 40 (24):12. Either they count tex ops, in which case it's 12 for NV35, or they don't count tex ops, in which case it's 32 (16):8.
 
Xmas said:
Luminescent said:
I believe they're counting the number of operations which could be issued/executed by the pipeline's ALUs and address processors. If you look at Dave's R3xx pipeline diagram, there are exactly 4 ALUs and 1 address processor, matching ATI's claim of 5 ops/clock/pipeline.
But that doesn't explain why they claim 40 (24 without co-issue):8 instead of 32 (16):8 or 40 (24):12. Either they count tex ops, in which case it's 12 for NV35, or they don't count tex ops, in which case it's 32 (16):8.
This is how I see it (I know all of you are sharp, so pardon me if I'm being overly simple):

ATI seems to multiply the number of ops possible per clock (in one pipeline) directly by the number of pipelines. Therefore 1 full vec ALU op + 1 mini vec ALU op + 1 full scalar ALU op + 1 mini scalar ALU op + 1 texture addressing op = 5 ops/clock/pipe x 8 pipelines = 40 ops/clock. At 412 MHz, this would give you a total of 40 ops/clock x 4.12x10^8 clocks/s, or 16.48 Gops/s.
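
To make the arithmetic above easy to check, here is a minimal Python sketch of the same multiplication. The dictionary keys and function names are just illustrative labels I made up; the only inputs are the figures from the post (5 issue slots per pipe, 8 pipes, 412 MHz), and it says nothing about how ATI actually arrived at its slide.

```python
# Hypothetical names; the numbers are the ones quoted in the post above.
OPS_PER_PIPE = {
    "full vector ALU": 1,
    "mini vector ALU": 1,
    "full scalar ALU": 1,
    "mini scalar ALU": 1,
    "texture address": 1,
}
PIPELINES = 8
CORE_CLOCK_HZ = 412_000_000  # 412 MHz, as stated in the post


def peak_ops_per_clock(ops_per_pipe, pipelines):
    """Theoretical peak: every issue slot in every pipe busy each clock."""
    return sum(ops_per_pipe.values()) * pipelines


def peak_ops_per_second(ops_per_clock, clock_hz):
    """Scale the per-clock peak by the core clock."""
    return ops_per_clock * clock_hz


ops_clock = peak_ops_per_clock(OPS_PER_PIPE, PIPELINES)
gops = peak_ops_per_second(ops_clock, CORE_CLOCK_HZ) / 1e9
print(f"{ops_clock} ops/clock, {gops:.2f} Gops/s")
```

This is a theoretical ceiling only: it assumes every slot is filled on every clock, which real shaders rarely manage.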
 
Xmas said:
But that doesn't explain why they claim 40 (24 without co-issue):8 instead of 32 (16):8 or 40 (24):12. Either they count tex ops, in which case it's 12 for NV35, or they don't count tex ops, in which case it's 32 (16):8.

Maybe they have some reason to believe that NV35 can only manage 8 FP32 ops/clock. Can the 2 mini ALUs in one of NV35's pipes manage an FP32 instruction each, or do they have to combine? In the thread where Dave posted the pipeline diagrams, it was mentioned that they have to combine for an FP32 MAD. What about other instructions?
 
If this stuff is at all true, then the NVIDIA drivers are much better, as the ATI part seems to have more than 3x the power of the NV.

I don't believe what I just wrote, though; I believe the water is muddied somewhere.
 
Sxotty said:
If this stuff is at all true, then the NVIDIA drivers are much better, as the ATI part seems to have more than 3x the power of the NV.

The FX runs at a higher clock than the Radeon, and not all games are 100% shader-limited, if any are.
 
Sxotty said:
If this stuff is at all true, then the NVIDIA drivers are much better, as the ATI part seems to have more than 3x the power of the NV.

I don't believe what I just wrote, though; I believe the water is muddied somewhere.
That's true only for apps with near 100% shader-bound performance. And in the few of those that exist today, the 3x performance differential seems to hold true.
 
The 5900 has 135 million transistors. Obviously some things are going to be implemented as well as or better than on ATI's part. I think ATI has a better hardware balance than Nvidia.
 
rwolf said:
The 5900 has 135 million transistors. Obviously some things are going to be implemented as well as or better than on ATI's part. I think ATI has a better hardware balance than Nvidia.

Since when does transistor count relate to performance? Even in the x86 world, the Athlon smacked the early P4s, which were equipped with more transistors.

A lot of those transistors might not even be used!
 
Yeah, I wasn't really serious; I was just pointing out the irony.

The more you slam the hardware of your competition, the better their driver team must be to make it even remotely competitive.
 
The first slide is correct as far as we know, but it is the theoretical maximum. The second one is a bit biased.

The 9800 will almost certainly never execute the maximum number of instructions, but the FX could, as long as all the pixels in a quad are used and there is a sequence of texop, texop, FP or FX. (3*4 = 12)

All in all, the ATI part won't perform all the ops possible, but it doesn't much mind what sequence they're in. While a shader-limited program (FX, FX, FX...) would only execute 2*4 = 8 ops on the FX, it would run just as fast as any other program on an ATI part. And the ATI part will normally perform (depending on the actual instructions) 2 or 3 ops per pipe, 4 in special cases, so that would be 2.5*8 = 20 operations for just about any real-life shader program.

Although an FX does do some things (like SIN and COS) quite a bit faster than the Radeon, it can at most be half as fast when shader programs are used.

And with only texture ops (DX8), they're equally fast. The higher clock speed of the FX gives it an edge, but it is hampered by the fact that at the edges of triangles not all 4 pixels in a quad are used.
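
The per-clock figures in this post (3*4, 2*4, 2.5*8) can be lined up side by side with a small Python sketch. The function and variable names are made up for illustration, the 5*8 peak is the number from ATI's slide, and the 2.5 ops/pipe average is the post's own estimate rather than a measurement. Clock speed is deliberately left out, so this compares per-clock throughput only.

```python
def ops_per_clock(ops_per_unit, units):
    """Ops issued per clock given ops per pipe (or per pixel) and unit count."""
    return ops_per_unit * units


# FX: figures from the post, counted per pixel of a 4-pixel quad.
fx_peak = ops_per_clock(3, 4)      # texop, texop, FP-or-FX sequence
fx_alu_only = ops_per_clock(2, 4)  # shader-limited (FX, FX, FX...) program

# Radeon: figures from the post, counted per pipe across 8 pipes.
ati_peak = ops_per_clock(5, 8)       # theoretical maximum from the slide
ati_typical = ops_per_clock(2.5, 8)  # the post's estimate of 2 to 3 ops/pipe

print("FX peak / ALU-only:", fx_peak, fx_alu_only)
print("ATI peak / typical:", ati_peak, ati_typical)
print("typical ATI vs ALU-only FX, per clock:", ati_typical / fx_alu_only)
```

On these assumptions the typical Radeon case comes out at 2.5 times the ALU-only FX case per clock, before the FX's higher clock is taken into account.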
 
The shader compiler, however, could reorder the instructions so that they do utilize the maximum throughput of the pipeline.
 
rwolf said:
The 5900 has 135 million transistors. Obviously some things are going to be implemented as well as or better than on ATI's part.

Because of its higher transistor count? LOL...
 
It seems to me that "5 ops per clock" is merely referring to a pipeline that is capable of one tex, one mad, and separate vec/scalar ops.

So, five ops per clock could be done with:
1 tex
1 scalar multiply
1 vector multiply
1 scalar add
1 vector add

I don't think that this is anything we didn't already know. Benchmarks elsewhere on this message board have shown that the R3xx can do a separate multiply and add just as fast as a MAD, and we also knew about the separation of scalar and vector ops.

And if this is what ATI is basing their performance comparison on, then it is severely flawed. Their description of nVidia's number of operations per clock is very different from the description they apply to their own hardware. Similarly, it doesn't take other functions into account. According to David Kirk, the NV3x can do a sin/cos in 2 cycles, while ATI takes 7-8. If this comparison is on a per-pipeline basis, then nVidia could do sin/cos functions in half the time on a per-clock basis.
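
As a rough check on the "half the time" estimate, here is a small Python sketch of sin/cos throughput per clock. It assumes the cycle counts are per pipeline, and it borrows the pipe counts used elsewhere in this thread (4 for the NV3x, 8 for the R3xx); neither assumption comes from the David Kirk figures themselves, and the names are purely illustrative.

```python
def sincos_per_clock(pipes, cycles_per_sincos):
    """Sustained sin/cos results per clock if every pipe does nothing else."""
    return pipes / cycles_per_sincos


# Assumed pipe counts (taken from other posts in this thread, not from Kirk).
NV3X_PIPES = 4
R3XX_PIPES = 8

nv = sincos_per_clock(NV3X_PIPES, 2)        # 2 cycles per sin/cos
ati_best = sincos_per_clock(R3XX_PIPES, 7)  # 7 cycles per sin/cos
ati_worst = sincos_per_clock(R3XX_PIPES, 8) # 8 cycles per sin/cos

print("NV3x sin/cos per clock:", nv)
print("R3xx sin/cos per clock:", ati_worst, "to", round(ati_best, 2))
print("NV3x advantage per clock:", round(nv / ati_best, 2), "to", nv / ati_worst)
```

Under these assumptions the NV3x comes out somewhere between 1.75x and 2x faster per clock on sin/cos, which is in line with the estimate above.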
 
Chalnoth said:
And if this is what ATI is basing their performance comparison on, then it is severely flawed.
I'll just say your analysis and conclusion are severely flawed and leave it at that.
 