My take on ATI and nVidia

FUDie

Here's the way I see the current products vs. the previous generation.

9700 vs. 8500
----------------
- improved AF (trilinear plus better on rotations)
- faster AA (multisample with gamma correction and compression)
- big step forward in bandwidth due to 256-bit bus
- overall efficiency improvements
- new shaders
- 8x1 architecture

GeForce FX vs. GeForce 4
------------------------------
- a step backwards in AF (new modes are only there for benchmarks not for improving image quality)
- apparently the same AA (no gamma correction, ineffective compression)
- improved bandwidth due to faster memory, but still using 128-bit bus
- no efficiency improvements (that I have seen)
- new shaders, but old shaders still exist as well
- 4x2 architecture?

To be honest, it really looks to me like the GeForce FX is just a revamped GeForce 4. Where are the new features? The color compression seems poor (notice the large drop in single texture fillrate with AA enabled). The new shaders are there, but they seem slow.

Some things I still can't understand about the GeForce FX. Why is the multitexture fillrate so inefficient? 89% of theoretical maximum is all that is achieved compared to 98-99% for the 9700 and GeForce 4. Why are the new shaders so slow? Are the new drivers really running in FP mode for PS 2.0 shaders?
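To make the efficiency figures concrete, here's a rough back-of-the-envelope sketch (Python, purely illustrative). The clock speeds are the commonly quoted stock values and the unit counts are my assumptions, so treat the numbers as illustration of the arithmetic, not as measurements.

```python
# Rough sketch of the fillrate-efficiency arithmetic being discussed.
# Clocks are the commonly quoted stock values; unit counts are assumptions.

def multitexture_peak_mtexels(pipes, tmus_per_pipe, core_mhz):
    """Theoretical multitexture fillrate in Mtexels/s."""
    return pipes * tmus_per_pipe * core_mhz

# GeForce FX Ultra at ~500 MHz, 8 texture units total (4x2 or 8x1 gives the same peak):
gffx_peak = multitexture_peak_mtexels(4, 2, 500)   # 4000 Mtexels/s
gffx_measured = 0.89 * gffx_peak                   # ~3560 Mtexels/s at 89% efficiency

# Radeon 9700 Pro at ~325 MHz, 8 pipes x 1 TMU:
r9700_peak = multitexture_peak_mtexels(8, 1, 325)  # 2600 Mtexels/s
r9700_measured = 0.98 * r9700_peak                 # ~2550 Mtexels/s at 98% efficiency

print(gffx_peak, gffx_measured, r9700_peak, r9700_measured)
```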

It seems like nVidia guessed wrong this time around and ATI guessed right...

-FUDie
 
I think a wait-and-see approach is needed with the GeForce FX architecture. With so many engineers working at nVidia and all the intellectual property they acquired from the 3dfx deal, I can see them having a very interesting roadmap. With NV30, all we observe is a bland foundation. With each refresh, more exotic techniques might be implemented. Today the NV30 looks like something from a Detroit car plant, and maybe a few tomorrows from now the NV45 will look like something out of Area 51.
 
FUDie said:
9700 vs. 8500
----------------
- improved AF (trilinear plus better on rotations)

Yes, the 8500 had no AF at 45°; now the 9700 can almost reach 2x AF at 22.5°.
A big improvement, but I'm hoping for more in the follow-up products.

- faster AA (multisample with gamma correction and compression)

Gamma correction is not a speed feature, it's about quality (the good thing is that it's free).
The R300 has the best AA of any video card so far. (A big +)

- big step forward in bandwidth due to 256-bit bus

Another big +.

- overall efficiency improvements

That's something they were missing in the 8500 vs. GF4 comparisons; now they have it. (Like the 4-way memory controller...)

- new shaders

Not that new; it's a natural step from PS 1.4.
Remember, they were already closer to PS 2.0 than nVidia.

- 8x1 architecture

This is why the 256-bit interface is important.
If you look at the R9500 Pro, it could be 4x2 judging by the performance results, as it doesn't have the bandwidth to take advantage of its 8x1 design.

GeForce FX vs. GeForce 4
------------------------------
- a step backwards in AF (new modes are only there for benchmarks not for improving image quality)

There's nothing wrong with their "old" image quality, so they didn't have to change it.
OTOH, the new modes are there to be competitive speed-wise.
I don't see it as bad.

- apparently the same AA (no gamma correction, ineffective compression)

I wouldn't judge the effectiveness of the compression, but the AA modes are disappointing indeed (only 4x OGMSAA, no gamma).

- improved bandwidth due to faster memory, but still using 128-bit bus

Frankly I wouldn't care how they improve bandwidth as long as they do it and the product is competitive (It's still a big :?: right now.)

- no efficiency improvements (that I have seen)

That wouldn't be a bad thing - efficiency was already quite good in the GF4.
What I find worrying is the drop in efficiency (e.g. CPU usage).

- new shaders, but old shaders still exist as well

Again, it's an evolutionary step from their previous architecture.
They made the fp texture shaders programmable, while keeping the register combiners.
The problem is the fp shader performance - something went very wrong there.

- 4x2 architecture?

It doesn't quite have the bandwidth to support 8x1, so it's sensible.
But, why don't they admit it???

To be honest, it really looks to me like the GeForce FX is just a revamped GeForce 4. Where are the new features?

There are new features (dynamic branching in the VS; floating-point PS with predication, partial derivatives, etc.).
Are they useful? Now that's a different question!

The color compression seems poor (notice the large drop in single texture fillrate with AA enabled). The new shaders are there, but they seem slow.

Some things I still can't understand about the GeForce FX. Why is the multitexture fillrate so inefficient? 89% of theoretical maximum is all that is achieved compared to 98-99% for the 9700 and GeForce 4. Why are the new shaders so slow? Are the new drivers really running in FP mode for PS 2.0 shaders?

Mostly valid points

It seems like nVidia guessed wrong this time around and ATI guessed right...

Yep, they guessed wrong multiple times:
1. Needlessly improved shaders (no one will support them)
2. FP16 vs. FP32 instead of just FP24
3. No AA improvements
4. The 0.13µm low-k process
5. Suicide marketing campaign
 
All I can say is Nvidia needs to fire their PR department, or replace the monkeys that work there. Oops, I hope no one here works there, lol; if you do, please quit screwing up.

I still think the GFX is a fine card, and I think Nvidia is doing fine too - just not dominating, which is good for everyone but Nvidia. And that's not really their fault so much as ATI's: when ATI decided it was time to make an effort to put out good stuff, they really did, so you can't blame that on Nvidia.
 
If you look at the R9500 Pro, it could be 4x2 judging by the performance results, as it doesn't have the bandwidth to take advantage of its 8x1 design.

I'll show you a case tomorrow where that most definitely is not true!
 
Hyp-X said:
If you look at the R9500 Pro, it could be 4x2 judging by the performance results, as it doesn't have the bandwidth to take advantage of its 8x1 design.
Not quite true. Even if you are short on bandwidth, 8x1 is still better because you can reject pixels faster.
There's nothing wrong with their "old" image quality, so they didn't have to change it.
OTOH, the new modes are there to be competitive speed-wise.
I don't see it as bad.
It is bad because it doesn't improve the users' experience. Why not improve the speed of the old method?
I wouldn't judge the effectiveness of the compression, but the AA modes are disappointing indeed (only 4x OGMSAA, no gamma).
GeForce FX AA performance is very lackluster compared to the 9700.
- no efficiency improvements (that I have seen)
That wouldn't be a bad thing - efficiency was already quite good in the GF4.
What I find worrying is the drop in efficiency (e.g. CPU usage).
So? They can't do better? I find it hard to believe that the GeForce 4 is the pinnacle of efficiency. Yes, it is good, but there's always room for improvement.

-FUDie
 
DaveBaumann said:
If you look at the R9500 Pro, it could be 4x2 judging by the performance results, as it doesn't have the bandwidth to take advantage of its 8x1 design.

I'll show you a case tomorrow where that most definitely is not true!

Woohoo!! A second B3D R9500Pro review tomorrow!!!! :!: :!:

(Or maybe something else? 8) )

Of course the test case doesn't seem like such a mystery: if running the single-textured fillrate test in 16-bit doesn't do the trick (perhaps because the GFfx forces a 32-bit framebuffer, as theorized?), you just underclock the core until you reach a point where the fillrate either stays stuck at exactly 4x the clock rate or finally heads higher.

The mystery is which will happen...
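To spell out the underclocking inference in that first paragraph, here's a tiny sketch (Python). The "measured" fillrates below are purely hypothetical; only the logic matters.

```python
# Sketch of the underclocking inference: if single-texture fillrate keeps
# tracking 4 x core clock as the clock drops, the card behaves like 4 pipes;
# if it rises above that line at some point, more than 4 pipes are at work.
# The measured values below are purely hypothetical.

def pixels_per_clock_implied(measured_mpixels, core_mhz):
    """Pixels per clock implied by a single-texture fillrate result."""
    return measured_mpixels / core_mhz

hypothetical_runs = [(500, 2000), (400, 1600), (300, 1200)]  # (core MHz, Mpixels/s)
for clock, fill in hypothetical_runs:
    print(clock, "MHz ->", pixels_per_clock_implied(fill, clock), "pixels/clock")

# If every run prints ~4.0, the fillrate is pinned at 4 pixels/clock;
# a value meaningfully above 4 at the lower clocks would point to a wider design.
```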
 
Dave H said:
Woohoo!! A second B3D R9500Pro review tomorrow!!!! :!: :!:

Hmmm...maybe he was referring to this?

http://www.beyond3d.com/articles/3dmark03/tests/index.php?p=3

[graph: GT4 results from the linked article]


This is the behaviour that many of us were expecting from "benchmarks with heavy shaders applied." That is, they become pixel rate (shader) limited, but not bandwidth limited.

Note that the 9500 Pro keeps up nicely with the 9700 non-pro. Same pixel rate...the extra bandwidth of the 9700 doesn't help.

This is also why we generally expected GeForceFX to dominate these tests because of its much higher pixel rate.

Of course, that "assumption" of FX's higher pixel rate was based on a 8x1 architecture....which is looking more and more to not be the case...
 
Of course, that "assumption" of FX's higher pixel rate was based on a 8x1 architecture....which is looking more and more to not be the case...

Except that 8x1 vs. 4x2 doesn't necessarily speak to the question of pixel shader throughput AFAICS.
 
But 4x2 does impose a pixel fillrate "wall" on the card, where it might not be shader limited.

Right, but unless I'm missing something the 9500Pro/9700 aren't going to be running into the fillrate wall at 28 Mpixels/sec running GT4!
 
Dave H said:
But 4x2 does impose a pixel fillrate "wall" on the card, where it might not be shader limited.

Right, but unless I'm missing something the 9500Pro/9700 aren't going to be running into the fillrate wall at 28 Mpixels/sec running GT4!

you mean 2.8, right? :)

I agree, the Radeon 9500 Pro/9700 aren't going to have any problems.
All I was saying is that the card COULD (if it is 4x2) be fillrate limited in some situations, rather than shader limited. Not that it appears to be the case in GT4. I don't think I was clear on this.
 
I guess that reminds me...

Where does the prospect of the FX being a 4x1 architecture stand at this point in time? I've been really busy as of late, so I haven't been keeping up on current events. I recall a thread insinuating that the FX's performance levels were more indicative of a 4x1 than an 8x1... but that's the last I heard.

Is this still a remote possibility?
 
Just an opinion, but I'd hazard that the FX is 8x1, but it runs at half rate when doing full-precision math. There's some logic to this: the half-rate full precision is consistent with the 4x2-like behaviour, i.e. you can do two texture fetches (which run at full rate) in the time it takes to execute 1 ALU op (which runs at half rate).

Also, half precision runs at full rate (i.e. 1 op/pipe/clock). To build 1 full-speed full-precision ALU from half-precision units basically requires 4 ALUs. If they had 4 half-precision units (and could actually use them) then I'd guess they would be shouting quite loudly about it, but if half precision only runs at the equivalent of 8 pipes then they only have enough resources for half-rate (1 op/pipe/2 clocks) full precision.
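To put numbers on that idea, here's a minimal sketch (Python) of how an 8x1 with half-rate FP32 math would end up looking like a 4x2 in shader tests. The figures are just the hypothesis above restated as arithmetic, not known NV30 behaviour.

```python
# Sketch of the "8x1 at half-rate FP32" hypothesis vs. a plain 4x2 design.
# Purely illustrative; rates are per clock across the whole chip.

def alu_ops_per_clock(pipes, ops_per_pipe_per_clock):
    return pipes * ops_per_pipe_per_clock

def tex_fetches_per_clock(pipes, tmus_per_pipe):
    return pipes * tmus_per_pipe

# Hypothetical 8x1 running FP32 math at half rate, FP16 at full rate:
print(alu_ops_per_clock(8, 0.5))    # 4 FP32 ops/clock -> looks like 4x2 in FP32 shader tests
print(alu_ops_per_clock(8, 1.0))    # 8 FP16 ops/clock
print(tex_fetches_per_clock(8, 1))  # 8 texture fetches/clock

# A plain 4x2 for comparison: same ALU and texture throughput as the FP32 case above.
print(alu_ops_per_clock(4, 1.0))    # 4 ops/clock
print(tex_fetches_per_clock(4, 2))  # 8 texture fetches/clock
```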

Of course at this point this could all just be smoke and mirrors on NV's part...

John.
 
Hi DaveH,

Except that 8x1 vs. 4x2 doesn't necessarily speak to the question of pixel shader throughput AFAICS.

Yes, this is true. ;) I am indeed going with an "assumption" that theoretical pixel shader throughput is directly tied to the pixel rate. (Number of "pixel pipes" times clock-speed).

You certainly can't predict the absolute throughput of "shaded pixels" based on the pixel rate. It depends on the complexity of the shader program itself, and the details of the shading architecture. However, I would find it very unlikely for shader throughput not to be proportional to theoretical pixel rate, at least on these architectures, where each "pixel pipe" is apparently a "pixel/fragment shading pipe."

So while I can't say for certain that the 9500 Pro works as an 8x1 (at least when performing shader ops), I would be extremely surprised to find out that it isn't, based on the evidence shown in the 3DMark03 shader test.
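Here's a crude sketch (Python) of the proportionality I'm assuming. The clocks are the commonly quoted stock values and the shader length is made up, but it shows why the 9500 Pro and the 9700 non-pro land on the same shaded-pixel rate while the extra bandwidth does nothing.

```python
# Crude "shaded-pixel rate" model: throughput ~ (pipes x clock) / shader cycles.
# Stock clocks assumed; the 30-cycle shader cost is a made-up figure
# just to show the proportionality.

def shaded_mpixels_per_sec(pipes, core_mhz, cycles_per_pixel):
    return pipes * core_mhz / cycles_per_pixel

cycles = 30  # hypothetical per-pixel shader cost
print("9500 Pro:", shaded_mpixels_per_sec(8, 275, cycles))  # ~73 Mpixels/s
print("9700:    ", shaded_mpixels_per_sec(8, 275, cycles))  # same pipes, same clock -> same rate
print("9700 Pro:", shaded_mpixels_per_sec(8, 325, cycles))  # ~87 Mpixels/s, scales with clock only
```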
 
I meant 4x2 for the R9500 Pro with the same shader performance (2 shader ops per pipe).
If it could only do 4 shader ops/cycle, it would clearly show.

I was talking about 4x2 vs. 8x1 as having the same texel/shader rate.
The biggest difference is in the single-texturing, one-op case, but it needs the bandwidth to support it.

OTOH, I'm pretty sure it's possible to do a benchmark where 8 pipelines are visible. (No blend, no texturing, no Z-check should show it very clearly!)

The same should apply to GFFX.

Maybe it has 8 pipes and is just bandwidth limited. (But it's suspicious, as it fails to show more than 2000 Mpixels/s in even the simplest tests.)

Also, the shader ops/pipe is still a mystery for the FX.
 
OTOH, I'm pretty sure it's possible to do a benchmark where 8 pipelines are visible. (No blend, no texturing, no Z-check should show it very clearly!)

The same should apply to GFFX.

The X-bit Labs 9500 Pro review initially stated it to be a 4x2 architecture, until I pointed out that if you run the 3DMark fillrate test in 16-bit you get more pixels per clock than a 4-pipe card can handle.

I know people have tried the test you speak about on a GFFX but still have not got more than 4 pixels per clock, or so they say.
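The arithmetic behind the 16-bit test is worth spelling out; a quick sketch (Python) follows. The clocks are the usual stock values, and the point is simply that a 4-pipe card cannot exceed 4 x core clock in pixels/s.

```python
# Logic behind the 16-bit single-texture fillrate test: a 4-pipe card cannot
# exceed 4 x core clock in pixels/s, so any result above that cap implies
# more pipes. Stock clocks assumed.

def four_pipe_cap_mpixels(core_mhz):
    return 4 * core_mhz

r9500pro_clock = 275  # MHz (stock)
gffx_clock = 500      # MHz (stock Ultra)

print(four_pipe_cap_mpixels(r9500pro_clock))  # 1100 Mpixels/s: anything above this means >4 pipes
print(four_pipe_cap_mpixels(gffx_clock))      # 2000 Mpixels/s: the ceiling the GFFX never seems to break
```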
 
I know people have tried the test you speak about on a GFFX but still have not got more than 4 pixels per clock, or so they say.

As a last ditch effort they could try underclocking the core while keeping the memory clock constant. Although it would seem GFfx really can't sustain a throughput of >4 pixels/clock under any circumstances.

Now, this isn't necessarily a bad thing, and it doesn't mean their marketing of NV30 as having 8 pixel pipelines is necessarily false. First, look at the extremes one needs to go to to even test this: obviously there is no realistic in-game situation (except for something like a z-only pass) where a GFfx would even have the opportunity for sustained pixel throughput of >4/clock.

To the extent that the advent of pixel shaders with dynamic branching puts us in a situation where neighboring pixels can take a different number of cycles to render, it might make sense to talk about throughput more like one would on a modern CPU--separate throughputs for dispatch, execution resources, and commit. Thus it could be that GFfx can dispatch and/or execute up to 8 pixels/clock, but can only commit (i.e. write back to the framebuffer) 4. Such a design seems reasonable, given how rare it would be that GFfx actually gets the chance to sustain >4 pixels/clock. OTOH, I'm not clear that there's necessarily a need to dispatch (i.e. begin rendering) more than 4 pixels/clock either, and if by "8 pixel pipelines" Nvidia means only that the GFfx can be executing shaders on 8 different pixels at one time...that may be pushing it too far.

My guess: there are 8 texturing units, and each can texture a different pixel at any given time. But some stage of the pipeline is limited to 4 pixels/clock: either dispatch, commit, or both. Probably commit, in the sense of "write to the framebuffer": that way, the z-only pass could still proceed at 8 pixels/clock (no framebuffer write, after all, only z-buffer write). So for all intents and purposes the fixed-function pipeline is 4x2, but it can do texture lookups and z-pass at a throughput of 8 pixels/clock.

Obviously all hideous and uninformed speculation.
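In that spirit, here's a toy sketch (Python) of the "limited stage" idea: throughput is the minimum of the per-stage rates. The rates are the speculative ones from the guess above, not known NV30 figures.

```python
# Toy model of the "limited stage" guess: per-clock throughput is the minimum
# of the dispatch, texture/execute, and commit rates. Speculative rates only.

def pixels_per_clock(dispatch, execute, commit):
    return min(dispatch, execute, commit)

# Ordinary colour rendering: commit (framebuffer write) capped at 4/clock.
print(pixels_per_clock(dispatch=8, execute=8, commit=4))  # 4

# Z-only pass: no colour write, so the commit cap doesn't apply.
print(pixels_per_clock(dispatch=8, execute=8, commit=8))  # 8
```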
 
My guess: there are 8 texturing units, and each can texture a different pixel at any given time. But some stage of the pipeline is limited to 4 pixels/clock: either dispatch, commit, or both. Probably commit, in the sense of "write to the framebuffer": that way, the z-only pass could still proceed at 8 pixels/clock (no framebuffer write, after all, only z-buffer write). So for all intents and purposes the fixed-function pipeline is 4x2, but it can do texture lookups and z-pass at a throughput of 8 pixels/clock.

Interesting guess.

One implication being (correct me if I'm wrong) that the FX should be very "efficient" if viewed as a 4x2 architecture. Very crudely stated, it's almost like "double buffering" the pixel pipelines.
 