NV40: Surprise, disappointment, or just what you expected?

Joe DeFuria said:
radar1200gs said:
You evidently, since your argument is that consumers will ignore a budget 6800 because of power consumption, heat and useless features. I would suggest history does not back your assertions up.

Um, where did I say consumers will ignore a budget 6800?!

Joe DeFuria said:
Chalnoth said:
I definitely expect that at the very least, the NV4x parts for the mid-low range of the market will be the parts to buy this fall.

I would have to disagree.

From what we can tell, nvidia has basically achieved "parity" with the R3xx core, in terms of performance and bandwidth utilization (pipe for pipe and clock for clock).

That is, a 4 pipe NV4x at, say, 300 MHz would be about equal in performance to a 4 pipe RV3x at 300 MHz, given the same memory bandwidth.

However, it also appears (can't tell for sure yet) that it has taken nVidia more transistors to do this, and more power consumption. To be frank, having support for SM 3, as little as it may possibly mean in the high end, means almost zero in the low end, AFAIC.

In the lower end space, cost is obviously key.

So I see two advantages for ATI here:

1) ATI appears, with a 4 pipe R3xx core, to have a cost/performance advantage over an nVidia 4 pipe variant of the NV40 core.

2) ATI may have access to a more cost effective process to boot. (ATI will be using 0.11 for their coming low end RV370...I don't think nVidia has 0.11 in the cards for at least a few quarters yet.)
 
Ummm...right.

Chalnoth insists that the budget 6800 parts would be "the" parts to buy, and that the market would go "hands down" to nVidia.

I disagreed.

So, um, where again did I say consumers would "ignore" a low end 6800? And didn't I in fact say that I felt there would be good competition, but that I thought ATI had the "edge"?

Oh, yes, I did:

In a previous post in this thread, I said:
I think it will be a battle in both areas, and I believe ATI will have the edge in both as well. Not "hands down" to ATI...but certainly not hands-down to nVidia.
 
Joe DeFuria said:
To be frank, having support for SM 3, as little as it may possibly mean in the high end, means almost zero in the low end, AFAIC.

Looking at the FX 5200, it was actually the direct opposite with regard to the high end, the low end and features. Although there it was DX8 vs DX9, which of course makes it a bit different.

And of course, "as little as it may possibly mean" is your opinion :)
 
Joe DeFuria said:
Chalnoth said:
And why would you think that ATI would be the only one to drop to .11 micron?
This spring? Because ATI said in their last conference call that their spring line up would include a 0.11u part.
No, late this year (hopefully in the fall).

I think it will be a battle in both areas, and I believe ATI will have the edge in both as well. Not "hands down" to ATI...but certainly not hands-down to nVidia.
I really don't think so. It seems to me that the NV4x core has a significant advantage over the R3xx core (for the number of pipes and frequency), and given that ATI probably will not release an R4xx core for the mid-low range until next year, nVidia will have a significant advantage here.

Note that I expect performance to be roughly similar for most current titles (given that the limits will be mostly bandwidth/fillrate related), but nVidia should pull ahead in titles released this year (titles that are more shader-heavy). How much will they pull ahead by? Well, that depends. Remember that the NV40 has been shown to be more shader-efficient than the R3xx core even on immature drivers. If nVidia can get a much better driver compiler by the fall, the NV4x's shader throughput might increase by 50%, giving these cards a huge advantage.

The R3xx core, on the other hand, just won't get much more efficient by that time. Only with a new core in the mid-low range can ATI stand a chance in shader performance.
 
Joe DeFuria said:
That is, a 4 pipe NV4x at, say, 300 MHz would be about equal in performance to a 4 pipe RV3x at 300 MHz, given the same memory bandwidth.

However, it also appears (can't tell for sure yet) that it has taken nVidia more transistors to do this, and more power consumption. To be frank, having support for SM 3, as little as it may possibly mean in the high end, means almost zero in the low end, AFAIC.

I disagree about the per-clock performance. NVidia has two shader units per pipe @ FP32 precision. Clock for clock, when drivers mature, this will yield up to (note: up to, but certainly > 1.0) 2x the per-clock shader performance of the R300.

NVidia's card has much more transistors because
a) FP32
b) 2x shader units
c) VS3.0
d) PS3.0
e) Video processor
f) some features people don't know about yet


You cannot conclude from the fact that they have way more transistors alone that they achieved parity with the R300 "inefficiently". First, they exceed parity with the R300 on a per clock basis. Second, the card is chock full of new features.
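
To put a number on that "up to": here is a toy Python model (my own sketch with a made-up instruction mix, not NV40's actual issue rules) of one pipe with one ALU versus two. Whether the second unit buys anything in a given cycle depends entirely on whether there is an independent instruction available to pair with.

def cycles_single_issue(shader):
    # One ALU: every instruction occupies its own cycle.
    return len(shader)

def cycles_dual_issue(shader):
    # Two ALUs, issuing in program order: an instruction pairs with the next
    # one only if neither reads or writes a register the other writes.
    cycles, i = 0, 0
    while i < len(shader):
        op, dst, srcs = shader[i]
        if i + 1 < len(shader):
            op2, dst2, srcs2 = shader[i + 1]
            if dst not in srcs2 and dst2 not in srcs and dst != dst2:
                cycles, i = cycles + 1, i + 2   # the pair dual-issues
                continue
        cycles, i = cycles + 1, i + 1           # issues alone
    return cycles

# (op, destination, sources): a made-up five-instruction shader.
shader = [
    ("MUL", "r0", ["v0", "c0"]),
    ("ADD", "r0", ["r0", "c1"]),   # reads the r0 just written, so it can't pair up
    ("MUL", "r1", ["v1", "c2"]),   # independent, so it pairs with the ADD above
    ("DP3", "r2", ["r0", "r1"]),   # the MOV below reads r2, so no pairing here
    ("MOV", "oC0", ["r2"]),        # final output write
]

s, d = cycles_single_issue(shader), cycles_dual_issue(shader)
print(s, d, s / d)   # 5 cycles vs 4 cycles: a 1.25x gain, not 2x

For this little shader the second unit is only worth 1.25x, because the dependent chain limits how often two instructions can go out together; a shader full of independent work would approach the full 2x.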
 
Bjorn said:
Looking at the FX 5200, it was actually the direct opposite with regard to the high end, the low end and features. Although there it was DX8 vs DX9, which of course makes it a bit different.

Yes, it makes a world of difference IMO. "DX9" support is an inflection point, unlike PS 3.0.
 
DemoCoder said:
I disagree about the per-clock performance. NVidia has two shader units per pipe @ FP32 precision. Clock for clock, when drivers mature, this will yield up to (note: up to, but certainly > 1.0) 2x the per-clock shader performance of the R300.

Are you really saying that nVidia's drivers are so bad (or the architecture is so "twitchy") that it's only realizing about 50% of its shading potential?

I (and everyone else) seem to be under the impression that the benchmarks are showing pretty much what's to be expected based on its architecture.

NVidia's card has much more transistors because
a) FP32
b) 2x shader units
c) VS3.0
d) PS3.0
e) Video processor
f) some features people don't know about yet

DC, I'm not saying the "Extra" transistors are not put to any good use in terms of a high end product.

I'm saying they don't particularly strike me as useful / marketable (worth the extra cost / power consumption) in a low end product.
 
DemoCoder said:
I disagree about the per-clock performance. NVidia has two shader units per pipe @ FP32 precision. Clock for clock, when drivers mature, this will yield up to (note: up to, but certainly > 1.0) 2x the per-clock shader performance of the R300.
It may not be capable of keeping both shader units full at all times, depending on what restrictions there are on which instructions can be issued together. I'm not sure we can expect 2x the shader performance per clock, but I think 1.5x is reasonable to expect by the end of this year.

You cannot conclude from the fact that they have way more transistors alone that they achieved parity with the R300 "inefficiently". First, they exceed parity with the R300 on a per clock basis. Second, the card is chock full of new features.
They exceed parity with the R300 on a per pipeline per clock basis, with very close to twice the transistors. I'd say that's pretty impressive, and I think it bodes very, very well for the lower-version parts (for heat, for cost, and for performance).
 
DemoCoder said:
I disagree about the per-clock performance. NVidia has two shader units per pipe @ FP32 precision. Clock for clock, when drivers mature, this will yield up to (note: up to, but certainly > 1.0) 2x the per-clock shader performance of the R300.

I don't believe that for a second.

There is no way Nvidia will be able to milk that much performance through drivers (unless they are greatly holding back, which I doubt).

ATI have only marginally increased shader speed through drivers (and it's been a year and a half now).
 
Joe DeFuria said:
Are you really saying that nVidia's drivers are so bad (or the architecture is so "twitchy") that it's only realizing about 50% of its shading potential?

It has nothing to do with being twitchy or "bad". Simply put, dealing with ILP is hard. This is hard to appreciate if you haven't tried to write compilers to do it. Every lesson NVidia learned in writing the driver compiler for the NV3x still applies, but explicit instruction level parallelism adds whole new challenges that did not exist in the NV3x. I would not call adding parallelism to the pipe "twitchy". It is not the same as the NV3x register issue: you don't lose performance below a floor (e.g. another 1/2 performance drop for every 2 registers), you just fail to perform as well as you could with optimal code, i.e. you have a steady baseline lower than what is possible on average.

Even today's CPUs fall way below their maximum performance because of C compilers. A friend of mine took a gene sequencing algorithm that had been compiled with the best known compiler, the best options and even feedback profiling, rewrote it in assembly, and was able to achieve a *10x* speedup by exploiting SSE2, MMX, cache and alignment issues that the compiler couldn't.

Anyone designing a multi-core chip, multi-CPU system, or Itanium-like EPIC VLIW will face similar issues. Go read Intel's and HP's papers on their compilers to see how far they have progressed. Look at when the PlayStation 2 first arrived: most games didn't even take advantage of the second VU. The PS2 got a huge boost in performance once compilers and tools came out that made it easy to distribute work to the two units.

The first NV40 driver is likely to have been written for completeness, with simple naive or greedy algorithms. They had deadlines to meet with this chip, and what you are seeing is immature drivers. DX9.0c isn't even out yet, and the DX driver doesn't even support the Video Processor yet. (BTW, I had one of the first early-access DX9 drivers for the R300 from the ATI/MS beta program, and it had similar issues with missing or unimplemented features.) Immature drivers are a constant of this industry at launch.


Instruction scheduling is now much more important for NVidia, both in the pixel pipeline and perhaps even more so in the MIMD vertex pipeline with its texture fetches.

Let's just leave it at that. If you don't want to believe me, fine. We can revisit this in 6 months to see how it panned out. I predict minimum 25% ALU bound shader boost with better optimizer.
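
For the curious, here is the kind of win I mean, as a toy list scheduler in Python (my own illustration with made-up instructions, not NVidia's driver compiler): each cycle it issues up to two ready instructions, reordering across the program so independent work from separate dependency chains can pair up where a naive in-order pass could not.

def list_schedule(shader, width=2):
    # Each cycle, issue up to `width` instructions whose dependencies on
    # earlier instructions (RAW, WAW and WAR) are already satisfied.
    remaining, done, schedule = list(range(len(shader))), set(), []
    while remaining:
        issued = []
        for idx in list(remaining):
            op, dst, srcs = shader[idx]
            ready = all(
                j in done
                for j in range(idx)
                if shader[j][1] in srcs      # RAW: j writes one of our sources
                or shader[j][1] == dst       # WAW: j writes our destination
                or dst in shader[j][2]       # WAR: j still needs our destination
            )
            if ready:
                issued.append(idx)
                remaining.remove(idx)
                if len(issued) == width:
                    break
        done.update(issued)
        schedule.append([shader[i][0] + " " + shader[i][1] for i in issued])
    return schedule

# Two interleaved dependency chains feeding a final DP3 (made-up code).
shader = [
    ("MUL", "r0", ["v0", "c0"]),
    ("ADD", "r0", ["r0", "c1"]),
    ("MUL", "r1", ["v1", "c2"]),
    ("ADD", "r1", ["r1", "c3"]),
    ("DP3", "r2", ["r0", "r1"]),
]

for cycle, ops in enumerate(list_schedule(shader), 1):
    print(cycle, ops)
# 1 ['MUL r0', 'MUL r1']
# 2 ['ADD r0', 'ADD r1']
# 3 ['DP3 r2']

With a purely in-order pass (like the toy model a few posts up) these same five instructions take four cycles, because the two chains block adjacent pairing; reordered, they fit in three. That is the sort of gap a better optimizer claws back without any hardware change.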
 
Nick Spolec said:
ATI have only marginally increased shader speed through drivers (and it's been a year and a half now).
nVidia is not ATI. The NV4x architecture is quite a bit more complex than the R3xx architecture, and so has much more room to grow with future driver improvements.
 
Nick Spolec said:
DemoCoder said:
I disagree about the per-clock performance. NVidia has two shader units per pipe @ FP32 precision. Clock for clock, when drivers mature, this will yield up to (note: up to, but certainly > 1.0) 2x the per-clock shader performance of the R300.

ATI have only marginally increased shader speed through drivers (and it's been a year and a half now).

ATI only has a co-issue architecture in the pixel shader, at least according to the public architecture documents. They had no conflicting resource constraints (register pressure, etc.). Much simpler to optimize for.
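
As a rough illustration (this simplifies the co-issue rule as I read the public R3xx docs, so take the exact constraint with a grain of salt): the R3xx pixel shader can pack one 3-component colour op and one scalar alpha op into the same slot, so the optimizer mostly just splits work across that vec3/scalar boundary instead of hunting for arbitrary independent pairs.

def can_coissue(op_a, op_b):
    # op = (name, writemask); co-issue means exactly one colour (.rgb) op
    # paired with one scalar alpha (.a) op in the same instruction slot.
    return {op_a[1], op_b[1]} == {"rgb", "a"}

print(can_coissue(("DP3", "rgb"), ("RCP", "a")))    # True: vec3 + scalar pair
print(can_coissue(("MUL", "rgb"), ("ADD", "rgb")))  # False: two colour ops compete

Two full shader units per pipe, by contrast, can in principle pair any two independent instructions, which is exactly why the scheduling problem gets harder.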
 
I thought that ATI already had one full and one half shader unit per pipe. If so, and assuming that their compiler is mature enough to exploit it, that would suggest that Nvidia already does some optimizing, or that they have a far more efficient design.
 
DemoCoder said:
Let's just leave it at that. If you don't want to believe me, fine. We can revisit this in 6 months to see how it panned out. I predict minimum 25% ALU bound shader boost with better optimizer.

Well, a 25% shader boost isn't anything like approaching 2X. For nVidia's sake, I hope you're right. And I would bet that it won't be 25% across the board, but rather certain tests getting major boosts and others not much.
 
Joe DeFuria said:
DemoCoder said:
Let's just leave it at that. If you don't want to believe me, fine. We can revisit this in 6 months to see how it panned out. I predict minimum 25% ALU bound shader boost with better optimizer.

Well, a 25% shader boost isn't anything like approaching 2X. For nVidia's sake, I hope you're right. And I would bet that it won't be 25% across the board, but rather certain tests getting major boosts and others not much.

DemoCoder said:
Clock for clock, when drivers mature, this will yield up to (note: up to, but certainly > 1.0) 2x the per-clock shader performance of the R300.

Well, if you read my original post it says "UP TO" *TWICE*. Can I qualify my comment any more than that and still not get misinterpreted? I never said an across-the-board 2x increase. That can only happen in rare circumstances. The increase will be > 1, < 2, and probably fall between 1.25 and 1.50.
 
DemoCoder said:
Well, if you read my original post it says "UP TO" *TWICE*.

Don't get your panties in a knot.

I know you said UP TO *TWICE*. Again, 25% is nothing like approaching twice.

Can I qualify my comment any more than that and still not get misinterpreted?

Sure you can. Can you discuss things without accusing others of misinterpreting what you said?

I never said an across-the-board 2x increase.

Nor did I say you did. Should I start CAPPING and *HIGHLIGHTING* what I actually said?
 
nelg said:
I thought that ATI already had one full and one half shader unit per pipe. If so, and assuming that their compiler is mature enough to exploit it, that would suggest that Nvidia already does some optimizing, or that they have a far more efficient design.

If I recall correctly, ATI's half-unit only does some of the DX8 PS 1.4 modifier options (e.g. d2, d4, d8, x2, x4, inverse, etc.). This certainly can be exploited, but the options are more limited. Same with NV3x's register combiners: they could be exploited for limited purposes.

For example, if you have code like

a = 2 * b + c

then the half unit can come into play and do the "2*" operation for free, saving one MUL instruction. On the NV3x, I believe the constant "2" had to be put into a constant register and the expression took two instructions:

MUL r0, c0, r1
ADD r0, r0, r2

whereas on ATI's card this could be treated as a single instruction with a source modifier:

ADD r0, r1_x2, r2
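
To make that concrete, here is a toy peephole pass in Python (a hypothetical instruction encoding of my own, not ATI's actual compiler, and it assumes the MUL's temporary isn't used anywhere else) that spots a multiply-by-two feeding an ADD and folds it into the free _x2 source modifier:

def fold_x2(code):
    # code is a list of (op, dst, src, ...) tuples; constants are floats.
    out, i = [], 0
    while i < len(code):
        ins = code[i]
        nxt = code[i + 1] if i + 1 < len(code) else None
        # MUL t, x, 2.0 followed by ADD d, t, y  becomes  ADD d, x_x2, y
        if (ins[0] == "MUL" and 2.0 in ins[2:]
                and nxt is not None and nxt[0] == "ADD" and ins[1] in nxt[2:]):
            x = [s for s in ins[2:] if s != 2.0][0]
            y = [s for s in nxt[2:] if s != ins[1]]
            out.append(("ADD", nxt[1], x + "_x2", *y))
            i += 2
        else:
            out.append(ins)
            i += 1
    return out

# a = 2 * b + c, written naively as a MUL followed by an ADD
code = [("MUL", "r0", "r1", 2.0), ("ADD", "r0", "r0", "r2")]
print(fold_x2(code))   # [('ADD', 'r0', 'r1_x2', 'r2')]

Run on a = 2 * b + c, the two-instruction MUL/ADD sequence collapses into the single ADD with the _x2 modifier, which is exactly the instruction the half-unit saves.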

Of course, if I were a gambling man, a good bet would be that ATI "beefed up" this half-unit in the R420, possibly into a full unit. :)

No doubt, if I said anything incorrect, OpenGL guy will jump into the thread with a "there you go making assumptions again."
 