My take on ATI and nVidia

Re: 4x2 vs. 8x1

Can someone explain these results to me (Rightmark 3D Pixel Shading)? I'd like to know why the FX is getting almost exactly half the 9700's score. Is this an 8x1 vs. 4x2 thing (b/c pixel shaders are attached to each pixel pipe, as I understand it), or am I misinterpreting the results? It's odd that the FX gives the exact same numbers for both FP16 and FP32, so maybe the test is wrong, or I just don't know what it's meant to reveal. Perhaps it all boils down to unoptimized drivers?

TIA.
 
The FX's shaders are less than half the speed even though its clock is >50% faster, so it is not as simple as the FX only having 4 pipes. I think there must be a lot of operations that take more than 1 cycle in the FX shaders.
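To put numbers on it (assuming the usual 500MHz NV30 vs. 325MHz R300 clocks): 500/325 = ~1.54, a ~54% clock advantage. If the only difference were 4 pipes vs. 8, the FX should do 4 x 500 = 2000 against the 9700's 8 x 325 = 2600 (millions of pixel ops/sec), i.e. run at ~77% of the 9700's shader rate - not under 50%.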
 
Can someone explain these results to me (Rightmark 3D Pixel Shading)? I'd like to know why the FX is getting almost exactly half the 9700's score. Is this an 8x1 vs. 4x2 thing (b/c pixel shaders are attached to each pixel pipe, as I understand it), or am I misinterpreting the results? It's odd that the FX gives the exact same numbers for both FP16 and FP32, so maybe the test is wrong, or I just don't know what it's meant to reveal. Perhaps it all boils down to unoptimized drivers?

Brent posted similar PS test results (from ShaderMark) in a thread here just after the initial reviews came out. Hang on a sec...

Ok, here's the graph of those results.

At this point unoptimized drivers still have to be the primary theory. It's interesting that the 9700 Pro won the ShaderMark tests by an average ratio of roughly 3:1 (eyeballing it), while it wins the RightMark tests by roughly 2:1. I wonder if this is due to differences in the benchmarks or if it's the result of the new "3DMark03-only" GFfx drivers.

The fact that the FP16 and FP32 results are exactly the same is more or less proof that one of the two is not supported under the current PS 2.0 drivers. (Impossible to tell which, although of course we'd hope it's FP16...) Note that according to Carmack's .plan both are functioning (and at the expected 2:1 performance ratio) in OpenGL under Nvidia's proprietary extensions.

The fact that switching FP precisions isn't yet enabled in DX is more evidence of just how raw the drivers are. Which is strange, given the length of time the driver team should have had to work on them. Still, it means we probably shouldn't make anything of these results just yet.
 
The fact that the FP16 and FP32 results are exactly the same is more or less proof that one of the two is not supported under the current PS 2.0 drivers.

I was sent a message yesterday stating that NV30 has 8 FP16 shader execution units and 8 texture samplers, but the number of rendering pipelines remains at 4. So its max pixel output, regardless of the situation, is 4 pixels. In FP32 the FP units are combined, meaning that it can only execute 4 FP32 instructions per clock. This means that to optimise the compiler for FP16 you've got to be calculating two FP16 instructions per pipe, which may be a little difficult to manage.

This explanation is what would follow if what I was told is true, and it does fit with current shader performance numbers. Marketing-wise, if there is no possibility that you'll actually get more than 4 pixels output per clock then it shouldn't be described and sold as an 8x1...
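To see why that pairing is "difficult to manage", here's a minimal sketch - my illustration, not anything from nVidia or from the message - of the dependency check a driver's shader compiler would have to pass before dual-issuing two FP16 instructions down one pipe:

#include <stdio.h>

/* A register-to-register shader op: dst = src0 (op) src1 */
typedef struct { int dst, src0, src1; } Op;

/* Two FP16 ops can share one pipe in one clock only if the second neither
   reads the first's result nor clobbers a register the first still needs. */
int can_pair(Op a, Op b)
{
    if (b.src0 == a.dst || b.src1 == a.dst) return 0; /* read-after-write  */
    if (b.dst == a.src0 || b.dst == a.src1) return 0; /* write-after-read  */
    if (b.dst == a.dst)                     return 0; /* write-after-write */
    return 1;
}

int main(void)
{
    Op mul  = { 0, 4, 5 };  /* r0 = t0 * c0 */
    Op add  = { 1, 0, 6 };  /* r1 = r0 + c1  -- reads the mul's result */
    Op mul2 = { 2, 7, 8 };  /* r2 = t1 * c2  -- independent */
    printf("mul+add  pairable: %d\n", can_pair(mul, add));  /* prints 0 */
    printf("mul+mul2 pairable: %d\n", can_pair(mul, mul2)); /* prints 1 */
    return 0;
}

With straight-line PS 2.0 code written with an 8x1 part in mind, finding an independent partner for every instruction, every clock, is far from guaranteed - hence "difficult to manage".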
 
Well, that also confirms the fp16/fp32 ideas we've mentioned before when discussing this, but it boggles my mind that nvidia marketing has fallen low enough to call 4 rendering pipelines "8x1". It took me a while to wrap my mind around other realities, though, so I guess it is just me. :-?

What are the 125 million transistors for? Instruction/value storage? Branching logic?
 
Perhaps for a lot of redundant circuits :)

It looks like NV30 has register combiners, FP16 units, and FP32 units. Note that it is not possible to just combine two FP16 units into one FP32 unit; only a little logic can be shared between FP16 and FP32 units. Furthermore, it looks like NV30 still has fixed-function T&L pipelines, since its fixed T&L performance is very good while its vertex shader performance is not.
 
Yup, I've already mentioned that I believe it has a static T&L engine. Running at 500MHz you can see why it would have good T&L performance, and I also believe this explains why Quadro FX is looking so good.
 
DaveBaumann said:
This explanation is what would follow if what I was told is true, and it does fit with current shader performance numbers. Marketing-wise, if there is no possibility that you'll actually get more than 4 pixels output per clock then it shouldn't be described and sold as an 8x1...

Most people don't know what 8x1 is anyway, and those who do, know enough to check the benchmarks. Compared to GF4 Ti4800SE, this is really small potatoes, IMO.

On the whole, if the FX is still 4x2, still has a static TnL unit, still has a 128-bit memory bus, still doesn't have gamma-corrected AA (or true RG AA either), still... well, you get the point: it just seems pretty lackluster.
 
DaveBaumann said:
The fact that the FP16 and FP32 results are exactly the same is more or less proof that one of the two is not supported under the current PS 2.0 drivers.

I was sent a message yesterday stating that NV30 has 8 FP16 shader execution units and 8 texture samplers, but the number of rendering pipelines remains at 4. So its max pixel output, regardless of the situation, is 4 pixels. In FP32 the FP units are combined, meaning that it can only execute 4 FP32 instructions per clock. This means that to optimise the compiler for FP16 you've got to be calculating two FP16 instructions per pipe, which may be a little difficult to manage.

This explanation is what would follow if what I was told is true, and it does fit with current shader performance numbers. Marketing-wise, if there is no possibility that you'll actually get more than 4 pixels output per clock then it shouldn't be described and sold as an 8x1...

:oops:

I'm confused. And a little shocked. Maybe I should just start asking questions:
  • I know this is from a reliable source or else you wouldn't have posted it. But is this "official" Nvidia information yet? I guess they'll have to break the bad news in their developer docs soon anyways...
  • Just to get it totally straight...well, I can see a few ways your description could possibly be interpreted:
    • FP16 execution units and texture samplers are divvied up 2 to a pipe. Thus in order to make use of the second in any pair you have to be executing two (independent, duh) instructions on the same pixel, or sample two textures for the same pixel.
    • They're still divvied up 2 to a pipe, but can be allocated to two different pixels making their way through the same pipeline. Wait--that doesn't seem to make sense w.r.t. the texture samplers, because there'd be no reason to do that, as all pixels in the pipeline undergo the same PS program, so there wouldn't appear to be any benefit. Still might make some sense w.r.t. the execution units, as a way to avoid instruction dependencies.
    • Some vaguely hippie-ish "sea of processing units" arrangement. Except the FP16 execution units must be somehow organized in pairs considering they team up to form the FP32 units. Plus there would seem to be no need for such an arrangement considering, as I noted before, all the pixels should be running the same shader program at a time. Unless when one pixel has an instruction/lookup predicated out to a NOP it could donate its units to the cause...but that seems awful complicated.
  • Given that interpretation #1 seems most plausible (to my limited understanding, at least)...WTF??!? That's just a perfectly standard 4x2 organization!! How on earth can they market that as "8 pixel pipelines"??

    Taking a shader-oriented view, then, this would mean that where the R300 has 8 pipes each capable of 1 texture sample, 1 address calc. and 1 color op per clock, the NV30 has 4 pipes each capable of 2 samples, 2 color ops and...how many address calculations? Or is it 2 samples and 2 ops, whether color or address?
  • And even so...the FP32 and FP16 performance still should not be identical under intense pixel shading, as they are in the RightMark benches at Digit-Life. It may be "difficult to manage" 2 FP16 instructions per pipe as you suggest (2 independent ones if it's per pixel as it seems), but you would still expect to have some success, unless the scheduler isn't even trying in current drivers. Golly, even then you might expect a slight performance difference due to reduced bandwidth, although I suppose this would only apply to changes in the precision of FP textures and render targets, not the pipeline itself.

    Still, there must be something wrong if FP16 mode gives you absolutely no performance benefit compared to FP32; otherwise why bother with FP16 mode at all? And Carmack implied the performance difference is close to the full expected 2:1.

    Something is being exposed in the Nvidia GL extensions that isn't in the ARB extensions or DX9 with current drivers...

:?
 
DaveBaumann said:
I was sent a message yesterday stating that NV30 his 8 FP16 shader exectution units, 8 texture samplers, but the number of rendering pipelines remain at 4. So, its max pixel output, regardless of the situation is 4 pixels - in FP32, the FP units are combined, meaning that it can only execute 4 FP32 instructions per clock. This means that to optimsise the compiler for FP16 you've got to be calculating two FP16 instructions per pipe, which may be a little difficult to manage.

Dave, I think it's about time Beyond3D put some real pressure on nVidia to get this sorted out.

I have noticed that this issue has been brought up in other places as well. See over at opengl.org:

http://www.opengl.org/discussion_boards/ubb/Forum3/HTML/008757-3.html

Here a poster called 'pixelpipes' might be on to something, if he/she does indeed have a GeForce FX. Note that cass from nVidia is dodging the issue in that thread.

Anyway, I agree that the info you have seems to fit with the real-world benchmarks.
 
Heh, just had another look at that thread. Well, what do you know:

Cass said:
Do you have any tests that do Z/stencil only (no color buffer writes)? If so, how many pixels per clock do you get for those tests?
Cass


pixelpipes said:
Normally I have Z test disabled, but Z write enabled, and of course also color write.
Enabling Z test will invoke the 'early out' tests, which are done per tile, thus screwing the measurement.

I tried it with Z write DISabled, and the result is the same. (equivalent to NV25 with appropriate GPU clock ratio boost)

If you are hinting at memory bandwidth limitation, I don't see the logic here. With 1GHz memory and a 128-bit bus, you have 4 Gpix/sec whether you are writing only RGBA (32 bit) or only stencil/Z (24+8). But disabling Z write didn't increase performance.

But here is the strange thing:
With color write DISabled, Z write ENabled, and stencil test that does both read and write, the performance doubles. (glStencilFunc(GL_NOTEQUAL,0,-1);glStencilOp(GL_INCR_WRAP_EXT,GL_KEEP,GL_INCR_WRAP_EXT))
I have no explanation for this. Do you?
Is it some special optimization intended for the stencil shadow path?

Interesting: Hallo Doom III :p
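His bandwidth figure checks out, by the way: 128 bits x 1GHz effective = 16GB/s, and 16GB/s / 4 bytes per pixel = 4 Gpix/s. For anyone wanting to reproduce the doubled case, here's roughly the state he describes - my reconstruction of the test setup, not his actual code (GL_INCR_WRAP_EXT comes from the EXT_stencil_wrap extension):

/* color writes off, Z writes on, stencil doing both a read and a write */
glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
glEnable(GL_DEPTH_TEST);   /* depth test must be on for Z writes to occur... */
glDepthFunc(GL_ALWAYS);    /* ...GL_ALWAYS keeps the early-out from kicking in */
glDepthMask(GL_TRUE);
glEnable(GL_STENCIL_TEST);
glStencilFunc(GL_NOTEQUAL, 0, ~0u);                       /* read: pass where stencil != 0 */
glStencilOp(GL_INCR_WRAP_EXT, GL_KEEP, GL_INCR_WRAP_EXT); /* write: wrap-increment */
/* ...then fill the screen with large quads and time it */

Which is, of course, exactly the state a stencil shadow volume pass runs in - hence the Doom III grin.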
 
That is interesting. The only time they mentioned 8 pixels per clock at Dusk-Till-Dawn was during the Stencil Shadows presentation; the rest of the time they were talking about 2x2 configurations and "second texture comes at full speed"...
 
Dave H., why do you think #3 is unlikely?

A pixel pipe would manage the final stencil/Z tests, framebuffer blending, etc. Each pipe would then have two pixel shader units it could schedule work to. This, IMHO, makes sense with previous statements saying NV30 is optimized for longer shader programs (i.e. more work per pixel).

Cheers
Gubbi
 
A pixel pipe would manage the final stencil/Z tests, framebuffer blending, etc. Each pipe would then have two pixel shader units it could schedule work to. This, IMHO, makes sense with previous statements saying NV30 is optimized for longer shader programs (i.e. more work per pixel).

I meant it as allowing e.g. 3 (or more) shader units to be assigned to one pixel if they were open and the dependencies worked out. I'm not quite sure how what you described is different from having the 2 units be "part of" the pipe itself.

But I'm probably missing something.
 
So, based on pixelpipes' comments, my working theory (stated in a way that makes sense to me, so I can also understand any corrections made to it) :p :

8 "output" pipelines, each capable of outputting a 32-bit value, dynamically allocated based on pixel shader demands. Whether this is determined by pipeline characteristics (seems likely they'd be designed this way) only, or if granularity/addressing concerns (128-bit is 4 * 32, and see my comments on addressing for color compression for an idea of what might be another factor), which could conceivably not change for 16-bit rendering, are a factor, it might be interesting to theorize.

Behavior:

  • When writing Z buffer and stencil only:
    Equivalent to 8x1.
  • When writing color only: Either
    Equivalent to 8x1 (when not limited by texture reads and the theoretical 8 texture samplers, i.e., with no texture filtering...like for professional applications) or
    Equivalent to 4x? because 4 of the pipelines are only capable of writing Z/stencil values (this may be due to a driver choice if efficient on-the-fly adaptation to variations between color only and color+Z+stencil is not achievable or even beneficial given my above theories).
  • When outputting color and Z-buffer/stencil values for the same pixel: equivalent to 4x2 by traditional thought, regardless of relative bandwidth limitations.

The focus on pack/unpack functionality in the shader instruction set seems to make sense with the above to me.

WRT the fp16/fp32 limitations, I'm wondering how much of the previous discussion fits?

And another question: how significant is the transistor budget taken up by register storage, after replication across each pipeline?
 
Why would Nvidia try to use such an elaborate method? Wouldn't just going to 8x1 make more sense? I'm as much a fan of "smart" design as the next person, but sometimes you just can't make up for the simplicity and efficiency of BRUTE FORCE! :devilish:

Also, I'm still confused as to why the FX has a hardwired TnL unit. I thought those went out in favor of VS emulation years ago?
 
DaveBaumann said:
...NV30 has 8 FP16 shader execution units...[snip]...the FP units are combined, meaning that it can only execute 4 FP32 instructions per clock.

I really don't get this 8 FP16 units = 4 FP32 units, thang.

Sure, it's half the data, so I can appreciate that you can transfer twice as many FP16 data items down a bus as FP32, but you can't do a FP32 multiply by cobbling together 2 FP16 multiplies. Hell, you can't even do that for integer multiplies [1]. So that just sounds like pure BS to me.

I reckon the FX is arranged thus:
o 4 pipes, each with 2 fragment shaders.
o Each fragment shader is FP32 (it can also work on FP16 data; just extend it).
o Each fragment shader is supplied by its own texture sampler (hence 8).
o Only enough bandwidth down the pipe for 1 FP32 fragment or 2 FP16 fragments per clock.

i.e. it's been optimised for dual texturing operations at FP16 or lower.
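As a toy data structure, just to make the counting concrete (all names here are made up, not from nVidia):

/* 4 pipes x 2 shader units = the "8" in the marketing, but... */
typedef struct { int runs_fp32; } ShaderUnit;           /* FP32-capable; handles FP16 too */
typedef struct { ShaderUnit unit; int sampler; } Slot;  /* each unit has its own sampler  */
typedef struct { Slot slot[2]; } Pipe;

Pipe nv30[4];  /* 8 shader units and 8 samplers in total, yet each pipe can
                  only push 1 FP32 fragment (or 2 FP16 fragments) per clock */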

Edit: Disable smilies

[1] A simple integer multiplier is constructed out of n*m 1-bit adder cells, where n and m are the widths of your inputs, so a 32-bit multiply ~= 4 16-bit multiplies.
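To make footnote [1] concrete, here's the schoolbook decomposition in C. Note that it takes four 16x16 partial products (plus shifted adds) to build one 32x32 multiply - which is exactly why you can't glue two FP16 multipliers together into one FP32 multiplier for free:

#include <stdint.h>

/* one 32x32 -> 64-bit multiply built from four 16x16 -> 32-bit ones */
uint64_t mul32_via_16(uint32_t a, uint32_t b)
{
    uint32_t al = a & 0xFFFFu, ah = a >> 16;
    uint32_t bl = b & 0xFFFFu, bh = b >> 16;
    uint64_t ll = (uint64_t)al * bl;  /* low  x low  */
    uint64_t lh = (uint64_t)al * bh;  /* low  x high */
    uint64_t hl = (uint64_t)ah * bl;  /* high x low  */
    uint64_t hh = (uint64_t)ah * bh;  /* high x high */
    return ll + ((lh + hl) << 16) + (hh << 32);
}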
 
Well, if it has two 12-bit x 24-bit multipliers, it can use them to perform an FP32 multiplication. I doubt that a 125M transistor chip has no room for the bandwidth required by eight 32-bit FP units.
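(To spell that out: FP32 has a 24-bit mantissa counting the implicit leading 1, so split one mantissa into 12-bit halves and you get a x b = (a x b_hi) x 2^12 + a x b_lo - two 12x24 partial products plus one shifted add.)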
 
pcchen said:
Well, if it has two 12-bit x 24-bit multipliers, it can use them to perform an FP32 multiplication. I doubt that a 125M transistor chip has no room for the bandwidth required by eight 32-bit FP units.

Agreed, but that's a lot of pipeline you've just turned from 64 bits wide to 128 bits wide.
 
Why such an architecture? First, let's explain the 4 color writes design choice.

Because nVidia expected their FP32 performance to be sufficient. So, considering they can only do 4 FP32 operations per clock, it makes a lot of sense to use a 4x2 design.

The problem, now, is that their FP32 performance is way below R300 FP24 performance. Way, way below. Here are the theoretical numbers (pipes x core MHz, i.e. millions of FP ops/sec):
R300: 8 x 325 = 2600
NV30: 4 x 500 = 2000

That's about 23% slower (2000/2600 = ~77%), in a perfect theoretical situation. In practice, early benchmarks show it could be even worse than that.

The idea behind 4 color outputs probably is that 8 color outputs would largely go to waste when you're using FP32. The problem with 4 color outputs, however, is that it's not optimal when using FP16.

As Wavey says, figuring out which instructions aren't dependent on one another isn't easy at all; it's actually very hard. On the plus side for nVidia, however, ATI advises developers to put independent instructions between dependent ones, because the R300 architecture *also* benefits from this.
So, even though the reasons for nVidia and ATI aren't exactly the same, both agree on the advice, and neither will try to educate developers in a way which would ruin the other's architecture. And that's a good thing.
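A trivial illustration of that advice, written as C for readability (think of each statement as one shader instruction; the function itself is just a placeholder):

float shade(float a, float b, float c, float d, float e, float f, float *v)
{
    float x = a * b;   /* if "y = x + c" came next, it would stall on x */
    float u = d * e;   /* independent op slotted into the gap instead   */
    float y = x + c;   /* by now x is ready                             */
    *v = u + f;
    return y;
}

On NV30, if Dave's info is right, an independent pair like (x, u) is also exactly what would let two FP16 ops share one pipe in a single clock.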

But it's probably near optimal with FP32, since it's basically the same thing as with the NV20 4x2 design.
Too bad FP32 is so slow and nVidia is forcing FP16 nearly everywhere :) Poor nVidia!
Also, the huge 3DMark 2003 score increase could be attributed to *hardcoding* which instructions are independent, instead of using the default algorithm... So the GFFX would use FP16 nearly everywhere, and its efficiency in 3DMark 2003 would be near perfect. Makes sense suddenly, doesn't it?
The real question is whether the IQ difference between 90% FP16 / 10% FP32 and 100% FP24 is really that huge... Why hasn't anyone used 3DMark's screenshot utility yet? Are people too cheap to buy the Pro version? :)

Now, why 8 Z/Stencil writes/clock?
The first reason, obviously, is the performance gain with the Z pass: such a pass doesn't even need color writes, and it needs a LOT of Z writes. A fast Z pass also lets the pixel shading engine get to work sooner, since it sits idle during the Z pass.
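For reference, a Z-only pass is just this - a minimal GL sketch of the Doom III-style pre-pass, with draw_scene() standing in for whatever submits the geometry:

void draw_scene(void);  /* placeholder: submits all opaque geometry */

/* pass 1: lay down depth only -- this is where 8 Z writes/clock pays off */
glEnable(GL_DEPTH_TEST);
glDepthFunc(GL_LESS);
glDepthMask(GL_TRUE);
glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
draw_scene();

/* later passes: shade with the depth test only -- no Z writes, and no
   shading of pixels that will end up hidden */
glDepthFunc(GL_EQUAL);
glDepthMask(GL_FALSE);
glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
draw_scene();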

But then, why can't the NV30 do 8 color writes when there's no Z Write?
Well, a first explanation would be that most people running that test use FP32. But let's assume such a test was done at FP16 and also suggested a 4x2 configuration (a verification would be nice, too...).
Such a case would be rare, but it does exist (HUDs might do that, for example), so it could simply be that it's not a hardware limit and nVidia just didn't implement it in the drivers yet. Who knows... But let's assume not.

Let's see... 4 Color Writes, and 8 Z Writes? Where would this be highly optimal?
I'm going to give you a hint: what writes more different Z values than Color values?
You guessed it: MultiSampling.

"But wait!" , I hear you say. "If the GFFX could already do 2 Z/pipeline without MSAA, and it can do 4Z/pipeline with MSAA, why doesn't it have a native 8x MSAA mode?"
Because the 4Z/pipeline with MSAA thing probably includes the 2 Z/pipeline without MSAA. But then... Why can't the GFFX use the 4Z/pipeline without MSAA, to do the Z Pass even faster when there's no MSAA? Why does it only use 2Z/pipeline?
Well, a first explanation is bandwidth limitation. As pixelpipes says, the limit is 4GP/s when writing 32 bits/pixel on the GFFX.
An interesting question, thus, would be whether the GFFX is capable of doing 8GP/s when only writing a 16-bit Z value per pixel. If it is, then it all makes sense - they've simply been sufficiently smart to use the MSAA Z output capabilities in all situations.
Or anyway, probably. They could be using something even more complex, but that's unlikely.
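(Checking the arithmetic on that: with the stock 500MHz DDR memory - 1GHz effective - on a 128-bit bus, you get 16GB/s. 16GB/s / 4 bytes = 4GP/s at 32 bits/pixel, matching pixelpipes' figure, and 16GB/s / 2 bytes = 8GP/s at 16 bits/pixel, so the bandwidth for the 16-bit Z case is at least there.)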

My conclusion, thus, is quite simple:
1. The GFFX is 4x2.
2. The GFFX is capable of using its 4 Z/pipeline capability (or at least half of that capability) even when it isn't using MSAA, which is a really good design idea (although the benefits are all gone when using 4x FSAA - but using 4x FSAA in Doom 3 on an NV30 is kinda unrealistic anyway).
3. The GFFX design is smarter than the NV25's design, and more efficient. It is thus unfair to say it's a 4x2 "just like the NV25" - but it's still less powerful than the R300's design.

Please note that this conclusion is only as correct as my reasoning. As always, I'd appreciate being corrected if I'm wrong - after all, if I write this type of thing, it's in the hope of learning more...


Uttar

P.S.: About the NV30 T&L unit. Couldn't it be that the T&L unit is simply using the integer path, which is also usable in the shaders but used a lot less there because FP is available and developers love FP? While I don't think nVidia would hesitate to change FP32 into FP16 through driver "optimizations" to gain performance, I'd be surprised if they dared to change anything to integer while the program asks for FP...
 