5700, 5600, 9600 'Pure' DX9 Performances

Dave Baumann

OK, in lieu of the full B3D 5700 Ultra preview, here are a few ShaderMark and fill-rate tester comparison numbers to tide you over for the time being. Note that the ShaderMark I use is the full version, so we can see the effect on these shaders of the new GeForce FX PS_2_a DirectX9 HLSL compiler target.

(WRT the preview: I am working on one and have done a substantial amount, but I didn't get everything I wanted into it and found a few little IQ niggles. Rather than hitting the NDA time and then doing a few updates, there was enough left to do to warrant holding back and publishing the entire thing in one go.)

Test System: Dave's current test system
Radeon 9600 XT with Catalyst 3.8, GeForce FX 5600/5700 Ultra with 52.16 drivers

ShaderMark 2.0

Code:
        9600 XT  5700 Ultra                5700 Ultra % Diff from 9600 XT
           2_0   2_0  2_0 PP  2_a  2_a PP  2_0    2_0 PP   2_a   2_a PP
shader  2  152   67   79      67   79      -56%   -48%    -56%   -48%
shader  3  105   44   56      44   56      -58%   -47%    -58%   -47%
shader  4  107                             -100%  -100%   -100%  -100%
shader  5  86    38   45      38   45      -56%   -48%    -56%   -48%
shader  6  112   42   63      42   63      -63%   -44%    -63%   -44%
shader  7  97    39   54      39   54      -60%   -44%    -60%   -44%
shader  8  85                              -100%  -100%   -100%  -100%
shader  9  77    19   30      20   31      -75%   -61%    -74%   -60%
shader 10  155   83   84      83   84      -46%   -46%    -46%   -46%
shader 11  135   68   72      68   72      -50%   -47%    -50%   -47%
shader 12  92    35   45      35   45      -62%   -51%    -62%   -51%
shader 13  74    20   34      20   33      -73%   -54%    -73%   -55%
shader 14  74    27   44      27   42      -64%   -41%    -64%   -43%
shader 15  89    26   46      26   46      -71%   -48%    -71%   -48%
shader 16  50    21   31      22   32      -58%   -38%    -56%   -36%
shader 17  9     3    5       3    5       -67%   -44%    -67%   -44%
shader 18  68    13   26      14   26      -81%   -62%    -79%   -62%
shader 19  17                              -100%  -100%   -100%  -100%
shader 20  36                              -100%  -100%   -100%  -100%
shader 21  40                              -100%  -100%   -100%  -100%
shader 22  23                              -100%  -100%   -100%  -100%
shader 23  40                              -100%  -100%   -100%  -100%

ShaderMark 2.0 - 5600 Ultra (Flip-chip) & 5700 Ultra (fps, with the 5700's % gain over the 5600)

Code:
            2_0               2_0 PP            2_a               2_a PP
            5600 5700         5600 5700         5600 5700         5600 5700  
shader  2   35   67    91%    36   79    119%   35   67    91%    36   79    119%
shader  3   25   44    76%    27   56    107%   25   44    76%    27   56    107%
shader  5   20   38    90%    21   45    114%   20   38    90%    21   45    114%
shader  6   25   42    68%    27   63    133%   25   42    68%    27   63    133%
shader  7   22   39    77%    24   54    125%   22   39    77%    24   54    125%
shader  9   13   19    46%    16   30    88%    13   20    54%    17   31    82%
shader 10   39   83    113%   39   84    115%   39   83    113%   39   84    115%
shader 11   34   68    100%   34   72    112%   34   68    100%   34   72    112%
shader 12   18   35    94%    22   45    105%   18   35    94%    22   45    105%
shader 13   13   20    54%    17   34    100%   14   20    43%    17   33    94%
shader 14   18   27    50%    19   44    132%   18   27    50%    19   42    121%
shader 15   19   26    37%    21   46    119%   19   26    37%    21   46    119%
shader 16   12   21    75%    14   31    121%   13   22    69%    15   32    113%
shader 17   2    3     50%    3    5     67%    2    3     50%    3    5     67%
shader 18   7    13    86%    10   26    160%   7    14    100%   10   26    160%
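
(For reference, the percentage columns in both tables are just straightforward relative differences - a quick worked example using the shader 2 numbers from above, nothing more than that:)

Code:
// How the percentage columns above are derived, using the shader 2 row
// (fps figures taken straight from the tables).
#include <cstdio>

int main() {
    const double r9600xt = 152.0;  // 9600 XT, 2_0
    const double fx5600u = 35.0;   // 5600 Ultra, 2_0
    const double fx5700u = 67.0;   // 5700 Ultra, 2_0

    // First table: 5700 Ultra relative to the 9600 XT (negative = slower).
    printf("5700U vs 9600XT: %+.0f%%\n", (fx5700u - r9600xt) / r9600xt * 100.0);  // -56%
    // Second table: 5700 Ultra relative to the 5600 Ultra (positive = faster).
    printf("5700U vs 5600U:  %+.0f%%\n", (fx5700u - fx5600u) / fx5600u * 100.0);  // +91%
    return 0;
}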

Marko Dolenc's Fillrate Tester:

Code:
                                                  5700 % Diff from
Mpixels/s                     9600 XT  5600    5700    9600XT  5600
FFP       Pure fillrate       1910.1   1563.2  1840.1  -4%     18%
FFP       Z pixel rate        1893.4   1559.4  1833.3  -3%     18%
FFP       Single texture      1650.8   1397.0  1545.1  -6%     11%
FFP       Dual texture        928.0    750.7   785.0   -15%    5%
FFP       Triple texture      573.0    346.3   382.2   -33%    10%
FFP       Quad texture        420.3    235.6   265.4   -37%    13%
PS 1.1    Simple              977.2    398.0   469.3   -52%    18%
PS 1.4    Simple              977.2    368.5   434.6   -56%    18%
PS 2.0    Simple              977.2    250.9   295.9   -70%    18%
PS 2.0 PP Simple              977.2    250.9   439.8   -55%    75%
PS 2.0    Longer              493.7    151.0   126.0   -74%    -17%
PS 2.0 PP Longer              493.7    151.0   222.4   -55%    47%
PS 2.0    Longer 4 Registers  493.7    123.3   123.8   -75%    0%
PS 2.0 PP Longer 4 Registers  493.7    151.0   292.6   -41%    94%
PS 2.0    Per Pixel Lighting  110.3    26.9    51.7    -53%    92%
PS 2.0 PP Per Pixel Lighting  110.3    28.0    62.7    -43%    124%

It would appear from this lot that the configuration of the 5700 Ultra, in terms of its fixed-function pipelines, is the same as the 5600's - 4x1 in single/no texturing circumstances, but 2x2 in any multitexturing case - so it seems like only two of the pipelines are able to loop back, and it is those two that have the shader functionality as well.
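
Dividing the fixed-function rows by the core clocks makes the point clearer. A quick sketch - the 400 MHz and 475 MHz core clocks are the commonly quoted figures for the 5600 Ultra (flip-chip) and 5700 Ultra, so treat them as assumptions rather than measurements:

Code:
// Per-clock fill rates for the fixed-function tests, from the table above.
#include <cstdio>

int main() {
    struct Row { const char* test; double fx5600, fx5700; };  // Mpixels/s
    const Row rows[] = {
        {"Pure fillrate ", 1563.2, 1840.1},
        {"Single texture", 1397.0, 1545.1},
        {"Dual texture  ",  750.7,  785.0},
        {"Triple texture",  346.3,  382.2},
        {"Quad texture  ",  235.6,  265.4},
    };
    const double clk5600 = 400.0, clk5700 = 475.0;  // MHz, assumed

    printf("%-16s %12s %12s\n", "FFP test", "5600 px/clk", "5700 px/clk");
    for (const Row& r : rows)
        printf("%-16s %12.2f %12.2f\n", r.test, r.fx5600 / clk5600, r.fx5700 / clk5700);
    return 0;
}

Per clock the two chips look essentially identical: just under 4 pixels per clock untextured, but only around 2 or fewer as soon as a second texture is applied.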

We can see that the DX9 PS2.0 performance of the 5700 Ultra is significantly improved over the 5600 Ultra, well beyond the clock speed increase in some cases. This would suggest that NVIDIA have done exactly as they did with NV35 over NV30, and changed the integer units to smaller float units that help out with common PS2.0 instructions (note that the simple PS2.0 fill-rate test is the only one whose gain over the 5600 is just their clock rate difference) - the difference being that this time the 52.16 drivers are better able to make use of the extra functionality, where the 44.03 drivers couldn't really show it with NV35 initially. Note that the DX8 performance also stays in line with the clock rate increase.

Note that in the fill-rate tester results a couple of the PS2.0 results actually drop in comparison to the 5600 Ultra - the PS2.0 Longer case is 17% behind and the Longer 4 Registers case is the same performance (despite the extra FP performance in the 5700 Ultra). This is likely down to differences in register space.
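
Putting rough numbers on those last two points (same 400/475 MHz clock assumption as before, figures from the tables above):

Code:
// 5700 Ultra vs 5600 Ultra: which gains track the clock bump and which don't.
#include <cstdio>

static double gain(double new_val, double old_val) { return (new_val / old_val - 1.0) * 100.0; }

int main() {
    printf("clock increase:          +%.0f%%\n", gain(475.0, 400.0));   // ~+19% (assumed clocks)

    // DX8 and the simple PS2.0 fill-rate tests track the clock...
    printf("PS1.1 simple:            +%.0f%%\n", gain(469.3, 398.0));   // ~+18%
    printf("PS2.0 simple:            +%.0f%%\n", gain(295.9, 250.9));   // ~+18%
    // ...while the partial-precision long shader gains far more than clock alone.
    printf("PS2.0 PP longer:         +%.0f%%\n", gain(222.4, 151.0));   // ~+47%

    // Register space: at full precision the register-heavy cases go nowhere
    // (or backwards), but at _pp - half the register footprint - they nearly double.
    printf("PS2.0 longer (FP32):     %+.0f%%\n", gain(126.0, 151.0));   // ~-17%
    printf("Longer 4 regs (FP32):    %+.0f%%\n", gain(123.8, 123.3));   // ~0%
    printf("Longer 4 regs (PP):      %+.0f%%\n", gain(292.6, 151.0));   // ~+94%
    return 0;
}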

Despite the improved PS2.0 performance they are still quite significantly behind ATI's "Pure" DX9 performance, and in the ShaderMark tests the new MS 2_a compiler target for the FX series isn't making much of a difference - this likely suggests that the compiler optimiser now in the 52.16 drivers is already getting close to the performance of the HLSL-compiled code in the first place, if not their optimal performance for shader assembly reordering. Despite the 5700 being a new chip, entirely designed and built after DX9 was finalised, they haven't altered the FX architecture at all to improve some of the missing areas - still no float buffer support and still no MRTs, etc.
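
(For anyone wondering what "the new MS 2_a compiler" amounts to on the application side: it is just a different target profile string handed to the HLSL compiler. A minimal sketch - the little shader below is a made-up example, not one of the ShaderMark shaders:)

Code:
// Compile the same HLSL source for the generic ps_2_0 profile and for the
// GeForce FX-oriented ps_2_a profile (D3DX, DX9 SDK Summer 2003 update).
#include <d3dx9.h>
#include <cstdio>
#include <cstring>

static const char* kSrc =
    "float4 main(float2 uv : TEXCOORD0) : COLOR0\n"
    "{\n"
    "    return float4(uv, 0.0f, 1.0f) * 0.5f;\n"   // made-up example shader
    "}\n";

static void compileFor(const char* profile) {
    LPD3DXBUFFER code = 0, errors = 0;
    HRESULT hr = D3DXCompileShader(kSrc, (UINT)strlen(kSrc),
                                   NULL, NULL,          // no #defines, no #include handler
                                   "main", profile,
                                   0, &code, &errors, NULL);
    if (SUCCEEDED(hr))
        printf("%s: %u bytes of byte code\n", profile, (unsigned)code->GetBufferSize());
    else if (errors)
        printf("%s: %s\n", profile, (const char*)errors->GetBufferPointer());
    if (code)   code->Release();
    if (errors) errors->Release();
}

int main() {
    compileFor("ps_2_0");   // generic DX9 pixel shader target
    compileFor("ps_2_a");   // target tuned for the GeForce FX instruction set
    return 0;
}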
 
I don't think anyone really expected the 5700 to be outgunning ATi in the DX9 stakes, and there have been some impressive gains against the 5600. The 5700 is shaping up to be a pretty decent mainstream card for nVidia, within the boundaries of the FX architecture (and its associated problems).
 
PaulS said:
I don't think anyone really expected the 5700 to be outgunning ATi in the DX9 stakes, and there have been some impressive gains against the 5600. The 5700 is shaping up to be a pretty decent mainstream card for nVidia, within the boundaries of the FX architecture (and its associated problems).
Too bad this should have been out last September and not this October. This is still not enough. All the gains are nice, but they are nowhere near what they should be.
 
not that the Shadermark I use if the full version


(not that the simple PS2 test is only at difference over the 5600, which is their clock rate differences

I presume these "not"s should be "note"s. The first one got me - I thought "why the hell has he not paid for the full version, cheapskate?" :)
 
I don't remember there being such a big difference between the 5800 and 5900, so I think the 5600U was crippled even more than we thought. I've always wondered why the 5600U performed at much less than half of the 5900's shader performance (even taking clock speeds into account), but I think the 5700 corrects that.

Anyway, it just shows how powerful the 9600 is at DX9. I'm sort of curious as to how the 9600 XT stacks up to the 9500/9700 non-Pro in shader speed.
 
I don't remember there being such a big difference between the 5800 and 5900

I think you'll find there is now. I think the first set of drivers the 5900 Ultra reviews were conducted with didn't access the two FP-replacement FX units efficiently, masking its performance. If this test were conducted on current drivers I think it would show something different, as they now use those units more optimally and the shader compiler will also be targeting the two extra FP units - mix in the fact that the relative clocks went in opposite directions (5800 -> 5900 = down, 5600 -> 5700 = up) and I think that explains the differences.
 
DaveBaumann said:
...still no float buffer support...
Hum.... OK, now I'm confused! On page 22 of the release notes (pdf here), under Release 50 Enhancements, DirectX Graphics, there's "Floating point render targets"... Am I confusing things?
 
DaveBaumann said:
Despite the improved PS2.0 performance they are still quite significantly behind ATI's "Pure" DX9 performance, and in the ShaderMark tests the new MS 2_a compiler target for the FX series isn't making much of a difference - this likely suggests that the compiler optimiser now in the 52.16 drivers is already getting close to the performance of the HLSL-compiled code in the first place, if not their optimal performance for shader assembly reordering. Despite the 5700 being a new chip, entirely designed and built after DX9 was finalised, they haven't altered the FX architecture at all to improve some of the missing areas - still no float buffer support and still no MRTs, etc.


I don't think you can draw that conclusion from the data. I've been consistently arguing in these forums over the past year for improved compiler technology, and that the instruction scheduling issues are non-trivial. It seems a majority of people subscribed to the view just months ago that there were no improvements left in the NV3x drivers to increase PS2.0 speed that didn't involve hacks to IQ.

The more likely story all along is that NVIDIA underestimated the difficulties their architecture would pose for their driver team, and they didn't really have enough in-house compiler technology talent (which is why they've been hiring for lots of compiler positions this past year). That's why it took so long to get a driver with a good optimizer that doesn't generate buggy code.

It was clear, if you looked at the output of the Cg compiler, that NVIDIA's first compiler efforts were straightforward "get something working" efforts. It would more than likely take them about half a year to get the first optimizing compiler done.

Therefore, I feel it is premature to conclude from this first major in-compiler optimization effort that there is no room left for improvement. We don't really know. Perhaps these drivers just represent the implementation of a better register allocator and instruction scheduler. Both of those problems are NP-complete, so there is a lot of room for tweaking; but more than that, there are lots of other optimizations that they are perhaps not doing yet.
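
To give a flavour of what I mean - a toy illustration, nothing to do with NVIDIA's actual driver - here is how much the ordering of independent operations alone can change the number of live temporaries, which is exactly the quantity the NV3x register file is so sensitive to:

Code:
// Toy "register pressure" demo: the same computation emitted in two orders.
// Each instruction defines one value and uses earlier ones; a value's register
// can be reused once it has no later use (a crude stand-in for liveness analysis).
#include <algorithm>
#include <cstdio>
#include <set>
#include <string>
#include <vector>

struct Instr { std::string def; std::vector<std::string> uses; };

static int peakLive(const std::vector<Instr>& code) {
    int peak = 0;
    std::set<std::string> live;
    for (size_t i = 0; i < code.size(); ++i) {
        // Free any source whose value is never used again.
        for (const std::string& u : code[i].uses) {
            bool usedLater = false;
            for (size_t j = i + 1; j < code.size() && !usedLater; ++j)
                usedLater = std::count(code[j].uses.begin(), code[j].uses.end(), u) > 0;
            if (!usedLater) live.erase(u);
        }
        live.insert(code[i].def);
        peak = std::max(peak, (int)live.size());
    }
    return peak;
}

int main() {
    // Fetch everything up front, then combine: four temporaries live at once.
    const std::vector<Instr> naive = {
        {"a", {}}, {"b", {}}, {"c", {}}, {"d", {}},
        {"ab", {"a", "b"}}, {"abc", {"ab", "c"}}, {"abcd", {"abc", "d"}},
    };
    // Interleave fetch and combine: never more than two temporaries live.
    const std::vector<Instr> scheduled = {
        {"a", {}}, {"b", {}}, {"ab", {"a", "b"}},
        {"c", {}}, {"abc", {"ab", "c"}},
        {"d", {}}, {"abcd", {"abc", "d"}},
    };
    printf("naive order:       peak %d live temps\n", peakLive(naive));
    printf("scheduled order:   peak %d live temps\n", peakLive(scheduled));
    return 0;
}

Same arithmetic, same instruction count - but on hardware that slows down as the live register count rises, as the NV3x appears to, the second ordering is the difference between running well and not.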

And of course, per my arguments in other threads, there are probably optimizations that they cannot do at all, because they have to deal with the output of MS's compiler instead of doing the compilation themselves from the source.


I don't think this is the end of the story, I think it is merely the beginning. As GPUs/VPUs become even more complex (vs3.0/ps3.0 and beyond), issues related to compiler optimization will move to the forefront.

I predict on NV40/R420 release, these cards will run at only a fraction of their true potential due to early drivers being developed for correctness. Neither ATI nor NVidia is going to hold back the release of their new cards to wait for new drivers for new architectures to mature fully.

That said, I still doubt that the NV3x can keep up with the R3x0 in any PS2.0 capacity, but that isn't to say there isn't more room for improvement. At a time when most people thought the NV3x was "maxed out", we see new compiler technology deliver quite nice performance increases this late in the game. It is only natural to speculate "OK, they pulled it off this time, but obviously there can't be any more; besides, I don't trust NV and they might use IQ hacks to go further." But let's wait and see. It takes a while to go from scratch, having no compiler, to having a mature one. Ditto for initial driver releases.
 
Remi said:
Hum.... OK, now I'm confused! On page 22 of the release notes (pdf here), under Release 50 Enhancements, DirectX Graphics, there's "Floating point render targets"... Am I confusing things?

Well, I didn't see any floating point formats supported on a NV34 (52.16 driver). No floating point textures or render targets. Perhaps NV34 just doesn't support it.
 
DemoCoder said:
Therefore, I feel it is premature to conclude from this first major in-compiler optimization effort that there is no room left for improvement. We don't really know. Perhaps these drivers just represent the implementation of a better register allocator and instruction scheduler. Both of those problems are NP-complete, so there is a lot of room for tweaking; but more than that, there are lots of other optimizations that they are perhaps not doing yet.

DC, sorry, but I categorically did not state that there was no more room for improvement. All I said was that it appears the compiler optimiser is getting very close (only a few FPS variance in a few tests) to the HLSL PS_2_a compiler. The PS_2_a compiler is likely only to be as good as the input NVIDIA have given MS so far anyway.

I'm not entirely sure how you concluded that I said there could be no more optimisation. Clearly, the fact that they are still a few FPS behind in a couple of cases shows they have a little work left to completely match the PS_2_a compiler, let alone be completely optimal for their shader architecture.
 
pcchen said:
Remi said:
Hum.... OK, now I'm confused! On page 22 of the release notes (pdf here), under Release 50 Enhancements, DirectX Graphics, there's "Floating point render targets"... Am I confusing things?

Well, I didn't see any floating point formats supported on a NV34 (52.16 driver). No floating point textures or render targets. Perhaps NV34 just doesn't support it.

Same here on the NV38 :( - why do they write something in the release notes which isn't actually there, or which must be activated with a registry key?

Thomas
 
pcchen said:
Remi said:
Hum.... OK, now I'm confused! On page 22 of the release notes (pdf here), under Release 50 Enhancements, DirectX Graphics, there's "Floating point render targets"... Am I confusing things?

Well, I didn't see any floating point formats supported on a NV34 (52.16 driver). No floating point textures or render targets. Perhaps NV34 just doesn't support it.

It's probably exposed via a FourCC code, and not a D3DFMT (DIRECTXDEV talk mainly). Which sucks IMO as I still have to have two distinct (setup) paths for R3X0 and the FX...
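
Something like this is what the two paths end up looking like, by the way (a rough sketch - the FourCC code below is just a made-up placeholder, not whatever NVIDIA actually uses):

Code:
// Probe for an FP16 render target the "standard" way (a real D3DFMT, the
// R3x0 path) and then via a vendor FourCC code (placeholder value here).
#include <d3d9.h>
#include <cstdio>

int main() {
    IDirect3D9* d3d = Direct3DCreate9(D3D_SDK_VERSION);
    if (!d3d) return 1;

    // Standard path: FP16 RGBA texture usable as a render target.
    HRESULT viaFmt = d3d->CheckDeviceFormat(
        D3DADAPTER_DEFAULT, D3DDEVTYPE_HAL, D3DFMT_X8R8G8B8,
        D3DUSAGE_RENDERTARGET, D3DRTYPE_TEXTURE, D3DFMT_A16B16G16R16F);

    // Vendor path: same query with a FourCC format ('FP16' is a placeholder).
    D3DFORMAT fourCC = (D3DFORMAT)MAKEFOURCC('F', 'P', '1', '6');
    HRESULT viaFourCC = d3d->CheckDeviceFormat(
        D3DADAPTER_DEFAULT, D3DDEVTYPE_HAL, D3DFMT_X8R8G8B8,
        D3DUSAGE_RENDERTARGET, D3DRTYPE_TEXTURE, fourCC);

    printf("D3DFMT_A16B16G16R16F render target: %s\n", SUCCEEDED(viaFmt) ? "yes" : "no");
    printf("FourCC 'FP16' render target:        %s\n", SUCCEEDED(viaFourCC) ? "yes" : "no");

    d3d->Release();
    return 0;
}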
 
DaveBaumann said:
DC, sorry, but I categorically did not state that there was no more room for improvement. All I said was that it appears the compiler optimiser is getting very close (only a few FPS variance in a few tests) to the HLSL PS_2_a compiler. The PS_2_a compiler is likely only to be as good as the input NVIDIA have given MS so far anyway.

Ok, maybe I misread what you were trying to say. Yes, there is a limit as to how far they can optimize the assembly code they are given from the HLSL compiler, although I wouldn't use the PS_2_a compiler as any kind of benchmark, since it still doesn't appear to do many things it should be doing. I tried a few tests with PS_2_0 vs PS_2_a and didn't see very much difference in the output.
 
Dave,

I don't know if you ran any tests with MSAA enabled, but if so I'd be interested to see how the tested cards behave in Marko Dolenc's fillrate tester with the various AA modes.
 
According to an employee from NVIDIA, there won't be MRT support in future driver releases; they use MET (multiple element textures) instead - it seems to be a hardware issue.
 
MRTs and floating point textures are not required for DirectX. The GeForceFX
series do not have MRTs, so they are not supported.

The floating point textures in the GeForceFX do not support wrap mode, nor
MIP maps. Wrapping and MIP maps are required attributes of floating point
textures for DirectX, and so the caps bit is not exported for GeForceFX.


The new drivers will not expose floating point textures.

We are working with Microsoft on a mechanism to expose our floating point
texture format implementation.


-Doug Rogers
NVIDIA Developer Relations

Just an FYI for those that don't watch the dxdev list.
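
The app-side check for the MRT situation Doug describes is just a caps query, for what it's worth (minimal sketch):

Code:
// D3DCAPS9::NumSimultaneousRTs reports how many render targets can be bound
// at once; a part without MRT support should report 1 here.
#include <d3d9.h>
#include <cstdio>

int main() {
    IDirect3D9* d3d = Direct3DCreate9(D3D_SDK_VERSION);
    if (!d3d) return 1;

    D3DCAPS9 caps;
    if (SUCCEEDED(d3d->GetDeviceCaps(D3DADAPTER_DEFAULT, D3DDEVTYPE_HAL, &caps)))
        printf("Simultaneous render targets: %u\n", (unsigned)caps.NumSimultaneousRTs);

    d3d->Release();
    return 0;
}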
 
DemoCoder said:
I don't think you can draw that conclusion from the data. I've been consistently arguing in these forums over the past year for improved compiler technology, and that the instruction scheduling issues are non-trivial. It seems a majority of people subscribed to the view just months ago that there were no improvements left in the NV3x drivers to increase PS2.0 speed that didn't involve hacks to IQ.

It certainly appears to me, though, that the PS2.0 shader performance improvement in the 5700 has much less to do with compiler tuning and much more to do with ripping out the integer units in the chip, if that is indeed the case. I don't think anyone would suggest that improving compilers isn't a worthwhile endeavor, and had we discussed this two weeks after the launch of nV30 I'd have agreed there was a lot of room for improvement. But sitting where we are now, after nVidia's third official launch of nV30, with several silicon respins and pcb revisions in between, along with months of work on compiler optimization and other driver "optimizations" of all kinds, I think I have a sound position in saying that 98% of the blood has been squeezed from this turnip.

The problem with nVidia's approach to compiler optimization is that it has been entirely lopsided, held out as a panacea that will produce the desired results "given time," and more or less used as a marketing ploy to try and explain away very large performance deficits of nV3x relative to its competition, deficits pertinent to advanced API functionality and much less dependent on traditional factors like bandwidth, TMU's, etc. Compiler optimization in this regard is only good up to a point and no further. To get beyond that point requires a change in the hardware, IMO, which we've evidently seen in the 5700, at least to some extent (beyond the clocks and in the core, I mean.)

As far as being skeptical of nVidia's approach to "optimization" of all kinds goes, nVidia's been entirely responsible for whatever skepticism there is about that in the general public. nVidia is the sole creator of that perception.

To illustrate the point consider how little ATi has talked about compiler optimization in the last year, and yet it's certain they're in no less need of optimized compilers than is nVidia or anybody else. Their product performance has been so much better than nVidia's, however, that they simply have had no need to make a PR issue of it.

The more likely story all along is that NVIDIA underestimated the difficulties their architecture would pose for their driver team, and they didn't really have enough in-house compiler technology talent (which is why they've been hiring for lots of compiler positions this past year). That's why it took so long to get a driver with a good optimizer that doesn't generate buggy code.

I think you are jumping the gun here and making some assumptions that are presently unwarranted. It's too early to appreciate the bugginess or lack thereof here, and it is also too early to assume that IQ hasn't been sacrificed in some areas, as well. I'm not saying your assessment is incorrect--just that I will feel better about it as time goes on and these things are exposed, one way or the other. I think though that what nVidia underestimated more than anything in this situation was actually R3x0 (ATi.)

It was clear, if you looked at the output of the Cg compiler, that NVIDIA's first compiler efforts were straightforward "get something working" efforts. It would more than likely take them about half a year to get the first optimizing compiler done.

The platform target goals of Cg are pretty broad, too, which I think might have something to do with it. Perhaps nVidia underestimated the size of the task. Heh...:) Here's hoping nVidia's judgement will markedly improve so that the company no longer underestimates so many things in the future.

Therefore, I feel it is premature to conclude from this first major in-compiler optimization effort that there is no room left for improvement. We don't really know. Perhaps these drivers just represent the implementation of a better register allocator and instruction scheduler. Both of those problems are NP-complete, so there is a lot of room for tweaking; but more than that, there are lots of other optimizations that they are perhaps not doing yet.

This all seems very lopsided to me and it certainly appears as if you might be expecting compilers to work miracles. I think "decent improvement" is a reasonable expectation, but I also think that each successive attempt at squeezing performance out of optimization will be a matter of greatly diminishing returns.

I don't think this is the end of the story, I think it is merely the beginning. As GPUs/VPUs become even more complex (vs3.0/ps3.0 and beyond), issues related to compiler optimization will move to the forefront.

I'm looking forward to the end of the nV3x story, myself...:) My sincerest prayer is that it is not merely continued with nV4x. Again, everyone knows the importance of compilers--how could any company design a .13 micron 125M transistor API-compliant (well...) gpu and not clearly understand that? nVidia is simply doing the best it can with what it can make--which is what everybody does. The difference in this case is that nVidia's using the concept of compilers (old as the hills and twice as dusty) as a PR tool to try and frame the issue of its performance deficit in such a way as to have it appear less critical than it actually is. I'll bet you that if it was ATi behind, instead of nVidia, that you'd have heard scarcely a peep about compilers out of nVidia all year long.

The crux of the matter is that it is not for lack of a good compiler that nV3x suffers in comparison to R3x0, but to many more things which are more important and fundamental than the compiler but which nVidia can do nothing about at the present time. The compiler is what they can change presently and so that is what they talk about.

I predict on NV40/R420 release, these cards will run at only a fraction of their true potential due to early drivers being developed for correctness. Neither ATI nor NVidia is going to hold back the release of their new cards to wait for new drivers for new architectures to mature fully.

As far as I know, that has always been the case with 3d chip development. Nothing new to see here in that general regard.

That said, I still doubt that the NV3x can keep up with the R3x0 in any PS2.0 capacity, but that isn't to say there isn't more room for improvement. At a time when most people thought the NV3x was "maxed out", we see new compiler technology deliver quite nice performance increases this late in the game. It is only natural to speculate "OK, they pulled it off this time, but obviously there can't be any more; besides, I don't trust NV and they might use IQ hacks to go further." But let's wait and see. It takes a while to go from scratch, having no compiler, to having a mature one. Ditto for initial driver releases.

Honestly, I never cared about whether nV3x was "maxed out" or not...:) I am not sure what you mean by "pulled off"...Do you mean the fact that a compiler optimization has managed to provide some performance increase? Well, wouldn't you expect that? I would, and most people would, but nVidia has unnecessarily complicated the matter by injecting doubt into the situation because of all its other "performance optimizations" done this year which certainly bumped up the frame-rate benchmark tickers, but were more often than not done at the expense of IQ in some regard.

I'm still waiting on nVidia to reinstate full trilinear support, for instance. I can't get very interested in alleged "compiler optimization performance increases" which are based on benchmark numbers while they are still doing that kind of thing in their drivers without any apology whatsoever (it's far too late to apologize for it--what they need to do is remove such hacks.) Do the new compiler optimizations actually affect more software than benchmarks? Traditionally, nVidia's been far more concerned with benchmark performance as opposed to application performance, and the company has made a rather poor name for itself in the last year by "optimizing" for specific benchmarks in order to inject product-performance illusions into its target markets. I just can't hop on the "they pulled it off" bandwagon just yet, sorry. I'm going to have see a lot more data, over time, before I'm persuaded of any kind of legitimacy for whatever claims nVidia is making here.
 