Nvidia and ARB2

Very interesting post, Dave. But I'd like to comment on this statement:

Dave H said:
NV40 will be the first Nvidia product to have any hope of being designed after the broad outlines of the DX9 spec were known. Of course at that time Nvidia may have thought that their strategy of circumventing those DX9 specs through the use of runtime-compiled Cg would be successful, in which case NV40 might not reflect the spec well either.
I'm not sure that's likely. Instead, I find it very likely that the NV40 will have a unified vertex and pixel pipeline. This will require all execution units to support FP32.

Why do I think the NV40 will have a unified pipeline setup? The answer lies in the spec of the NV_fragment_program extension. Specifically, the NV30 has already unified most of the available instructions. Within this architecture, the vertex pipeline and the fragment pipeline are already very, very similar; only a small step remains to make them the same.
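As a rough illustration, here's a quick Python sketch (the op lists below are from memory of the NV_vertex_program and NV_fragment_program specs, so treat them as indicative rather than complete). The arithmetic core the two extensions expose is already almost identical; the fragment side mostly just adds texture fetches and a few per-pixel extras:

    # Indicative (not exhaustive) instruction lists for the two NV extensions.
    vertex_ops   = {"MOV", "MUL", "ADD", "MAD", "DP3", "DP4", "DST",
                    "MIN", "MAX", "SLT", "SGE", "RCP", "RSQ", "LIT"}
    fragment_ops = {"MOV", "MUL", "ADD", "MAD", "DP3", "DP4", "DST",
                    "MIN", "MAX", "SLT", "SGE", "RCP", "RSQ", "LIT",
                    "SIN", "COS", "LRP", "TEX", "TXP", "TXD", "KIL"}

    print(sorted(vertex_ops & fragment_ops))    # the shared arithmetic core
    print(sorted(fragment_ops - vertex_ops))    # texture fetch and per-pixel extras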

One has to wonder what the best scheduling technique would be on such an architecture, but hopefully that's been researched well enough that it won't be a significant issue.

Third, FP24 is a better fit than FP32 for realtime performance with current process nodes and memory performance.
I'm not sure this is the case, either, as I feel the NV35 shows. Specifically, the NV35, apparently, has as many FP32 units as the R300 and R350. This argument might have held up if only the NV30 were available, but the NV35 was put out with hardly any additional transistors.

As you pointed out, the primary transistor count differences between FP32 and FP24 are in the multiplier, not in the registers and other required components. Since the problems with the NV3x pipeline are clearly not related to the structure of the functional units themselves (except perhaps, as you stated, the pipeline depth, but again, that's a separate design decision), I'm not sure this argument is a valid one.

Said another way, I think you've put forward a strong enough argument to explain the register usage penalties of the NV3x architecture without having to attribute them to transistor count constraints from FP32 support.
 
Chalnoth said:
I'm not sure this is the case, either, as I feel the NV35 shows. Specifically, the NV35, apparently, has as many FP32 units as the R300 and R350.

You might want to look at this thread. I don't think that's the case.
 
Dave H said:
Where R3x0 and NV3x come into this is in performance for dependent texture reads. I'd be interested in having more specifics, but as I understand it the R3x0 pipeline is deep enough to do an alright job minimizing the latency from one level of dependent texture reads (particularly if you fetch the texture as much in advance of when you need it as possible), but woe be unto you if you're looking for two or three levels. Meanwhile, NV3x is reputed to do just fine mixing several levels of texture reads into the normal instruction flow--meaning that it must do a gangbusters job at hiding latency. Meaning it has a hell of a deep pipeline.

Meaning every exposed full-speed register is replicated many times, meaning that in order to fit in a given transistor budget, the number of full-speed registers might have to be cut pretty low.

Brilliantly good theory, Dave H! 8)

NV3x's ability to do a lot of dependent texture reads and the ability to do constant branching in PS should be more than enough hint for us as to why nVidia had to make some sacrifice with those registers. It had to be a design choice/limitation rather than a bug IMHO.
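A quick back-of-the-envelope Python sketch of the trade-off Dave H describes (every number below is invented purely for illustration, not a real NV3x or R3x0 figure):

    # All figures are hypothetical, just to show the shape of the trade-off.
    register_file_entries = 2048   # total full-speed vec4 register slots on chip (assumed)
    pixels_per_clock      = 8      # pixels entering the shader pipe each cycle (assumed)
    texture_latency       = 100    # cycles of memory latency to hide (assumed)

    # To keep the ALUs busy across a texture fetch you need roughly this many
    # pixels in flight at once:
    pixels_in_flight = pixels_per_clock * texture_latency    # 800

    # Every in-flight pixel needs its own copy of each live register, so the
    # full-speed register budget per pixel collapses quickly:
    print(register_file_entries // pixels_in_flight)         # -> 2

Make the pipeline deep enough to hide two or three levels of dependent reads and, for the same transistor budget, the number of registers each pixel can use at full speed drops even further.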
 
Dave H said:
So the theory is this: Nvidia tried to address two different markets with a single product, and came up with something that does neither particularly well. Meanwhile, MS and the ARB, being focused primarily on realtime rendering, chose specs better targeted to how that goal can be best achieved in today's timeframe.

This is the crux of the matter. Nvidia made a compromise that pleases no one. It seems they did this because they had a solution and then went looking for a problem to apply it to.

Like the man who fails his exam by not reading the question properly, and so produces a brilliant answer - to a different question - so too Nvidia have completely misread the market and its direction.

Whether Nvidia were arrogant enough to think they could pull the market their way, or whether they simply produced something they wanted to make rather than what the market they were addressing wanted to buy, who knows? I doubt we'll find out the details of what really went on for a while yet.

I am still surprised by how far out of step Nvidia is with what the market wants, which I guess explains things like the NV30 fan. I suppose they have just lost touch with the gaming market, and not realised that we don't want a non-real-time broadcast graphics generator to play our games on.
 
DaveBaumann said:
Chalnoth said:
I'm not sure this is the case, either, as I feel the NV35 shows. Specifically, the NV35, apparently, has as many FP32 units as the R300 and R350.
You might want to look at this thread. I don't think that's the case.
I'm not sure that thread explains anything.

The point is, the NV35 core has similar functional units, at FP32, as the R300/R350 does at FP24. Will this require more transistors? Yes. Does the NV35 core have more transistors? Yes.

But there are too many other differences to directly compare the cores, or narrow the differences down to a simple difference of precision support.
 
I'm not talking about the transistor differences, I'm talking about the supposition that NV35 has more FP units.
 
1) nVIDIA created a very special architecture and thought everyone would accept it as readily as JC did and program especially for it using their Cg language.

The thing is that ATi's DX9 architecture was chosen as the model for the initial DX9 hardware guidelines, and it suits them much better. Game developers will still code directly to DX9 instead of Cg, at least for now.

Maybe if game developers start using Cg in the future, let's say in 2004, we will see a great improvement in NV3x performance, but only in that kind of coding (Cg-related).

2) Also, nVIDIA fell into the same sin as 3dfx. They thought people would never care about quality rendering, only about frame rate. This belief has given birth to a lot of trouble for them:

- nobody cares about frame rates in the 200s anymore, since you can't see more than your monitor's refresh rate shows you; that's a minus for the marketing department;

- new games are especially about increasing IQ and the level of detail shown, and this contrasts sharply with nVIDIA's tendency to chase frame rate;

- before this generation, they fixed everything with better optimised drivers that showed little or no loss of quality - acceptable given the simplicity of the games. Now they can still do that, but people are asking to see the beauty of the graphics, not the frame rates, and if they show the real graphics the performance drops below an acceptable level of playability.
 
LeStoffer said:
Brilliantly good theory, Dave H! 8)

NV3x's ability to do a lot of dependent texture reads and the ability to do constant branching in PS should be more than enough hint for us as to why nVidia had to make some sacrifice with those registers. It had to be a design choice/limitation rather than a bug IMHO.

Well, assuming this theory is the correct one, then it would also appear that computational flexibility was more important than 3d performance in the design, especially as it concerns fp32, and that the chip designers at nVidia started taking the "gpu" concept a little too seriously and were thinking more like general cpu designers...:) But if this was actually a conscious decision I'm having difficulty seeing what they gained from it...?
 
WaltC said:
But if this was actually a conscious decision I'm having difficulty seeing what they gained from it...?

CineFX = Cinematic Effects = CG developers. They already had the gamers by the balls, you see. ;)
 
LeStoffer said:
CineFX = Cinematic Effects = CG developers. They already had the gamers by the balls, you see. ;)

Good point ;) Yes, I guess in the end it still boils down to a failure to anticipate the competition...
 
DaveBaumann said:
I'm not talking about the transistor differences, I'm talking about the supposition that NV35 has more FP units.
More FP units than what?

It apparently does have more FP units than the NV30. It apparently also has a very similar number of FP units when compared to the R3xx. What's your point?
 
Dave H said:
WaltC said:
The DX9 feature set layout began just a bit over three years ago, actually. It began immediately after the first DX8 release. I even recall statements by 3dfx prior to its going out of business about DX9 and M$.

Exactly. DX9 discussions surely began around three years ago, and the final decision to keep the pixel shaders to FP24 with an FP16 option was likely made at least two years ago, thus about 15 months ahead of the release of the API. I just don't think you have an understanding of the timeframes involved in designing, simulating, validating and manufacturing a GPU. Were it not for its process problems with TSMC, NV30 would have been released around a year ago. Serious work on it, then, would have begun around three years prior.

As I'm sure you know, ATI and Nvidia keep two major teams working in parallel on different GPU architectures (and assorted respins); that way they can manage to more-or-less stick to an 18 month release schedule when a part takes over three years from conception to shipping. This would indicate that serious design work on NV3x began around the time GeForce1 shipped, in Q3 1999. (Actually, high-level design of NV3x likely began as soon as high-level design of the GF2 was finished, probably earlier in 1999.) A more-or-less full team would have been assigned to the project from the time GF2 shipped, in Q1 2000. Which is around the point when it would have been too late for a major redesign of the fragment pipeline without potentially missing the entire product generation altogether.

In general, a from-scratch design takes on the order of 3 years from architecture spec to production. However, GFs have never had a from-scratch design that I can see. Evolutionary instead (which is not bad -- don't get me wrong; it has pros & cons). Consequently, the design cycle is probably shorter. Both ATI & NV had plenty of input into DX9, as well as a good (not great) amount of time to incorporate needed changes.

NV40 will be the first Nvidia product to have any hope of being designed after the broad outlines of the DX9 spec were known. Of course at that time Nvidia may have thought that their strategy of circumventing those DX9 specs through the use of runtime-compiled Cg would be successful, in which case NV40 might not reflect the spec well either.

Of course it's not by accident. When choosing the specs for the next version of DX, MS consults a great deal both with the IHVs and the software developers, and constructs a spec based around what the IHVs will have ready for the timeframe in question, what the developers most want, and what MS thinks will best advance the state of 3d Windows apps.

Both MS and the ARB agreed on a spec that is much closer to what ATI had planned for the R3x0 than what Nvidia had planned for NV3x. I don't think that's a coincidence. For one thing, the R3x0 pipeline offers a much more reasonable "lowest common denominator" compromise between the two architectures than something based more on NV3x would. For another, there are plenty of good reasons why mixing precisions in the fragment pipeline is not a great idea; sireric (IIRC) had an excellent post on the subject some months ago, and I wouldn't be surprised if the arguments he gave were exactly the ones that carried the day with MS and the ARB.

Third, FP24 is a better fit than FP32 for realtime performance with current process nodes and memory performance. IIRC, an FP32 multiplier will tend to require ~2.5x as many transistors as an FP24 multiplier designed using the same algorithm. Of course the other silicon costs for supporting FP32 over FP24 tend to be more in line with the 1.33x greater width: larger registers and caches, wider buses, etc. Still, the point is that while it was an impressive feat of engineering for ATI to manage a .15u core with enough calculation resources to reach a very nice balance with the available memory technology of the day (i.e. 8 vec4 ALUs to match a 256-bit bus to similarly clocked DDR), on a .13u transistor budget FP24 would seem the sweet spot for a good calculation/bandwidth ratio. Meanwhile the extra transistors required for FP32 ALUs are presumably the primary reason NV3x parts tend to feature half the pixel pipelines of their R3x0 competitors. (NV34 is a 2x2 in pixel shader situations; AFAICT it's not quite clear what exactly NV31 is doing.) And of course FP16 doesn't have the precision necessary for a great many calculations, texture addressing being a prime example.

Lots of stuff in there. The computation / bandwidth ratio is interesting, but I'm not sure it makes that much sense. The raw pixel BW required is not determined by the format of your internal computation engine, but by the actual surface requirements per pixel. R300 was an 8 pixel pipe before it was FP24 in the pixel shader. For that matter, 90% of the R300 pixel pipe is FP32 or higher. Only when you get down to the ALU does it go to FP24, since at that point FP24 is reasonable (i.e. from a texture addressing standpoint, if nothing else), while FP16 is actually a step back from FX12 in the [0,1] color range. Honestly, the area increase from 24b to 32b would not cause you to go from 8 to 4 pixels -- it would increase your overall die cost, but not *that* much. It's just a question of whether it would justify the cost increase. For today, it doesn't.
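A crude sanity check on the multiplier figure in Python, assuming an array multiplier whose area grows roughly with the square of the significand width and treating FP24 as an s16e7 format (both assumptions mine, not from any spec):

    # Back-of-the-envelope only: array multiplier area ~ (significand bits)^2.
    fp32_significand = 24    # 23 stored mantissa bits + hidden bit
    fp24_significand = 17    # s16e7: 16 stored mantissa bits + hidden bit

    print(round((fp32_significand / fp24_significand) ** 2, 2))  # ~1.99x for the multiplier array
    print(round(32 / 24, 2))                                     # ~1.33x for registers, buses, caches

So the quadratic blow-up is confined to the multiplier array itself; most of the rest of the datapath only grows with the 1.33x width ratio, which is why the jump from 24b to 32b raises die cost without forcing a halving of the pixel pipes.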

While the register usage limitations are not the only flaw in the NV3x fragment pipeline architecture, they are clearly the most significant. (If NV3x chips, like R3x0, could use all the registers provided for by the PS 2.0 spec without suffering a performance penalty, their comparative deficit in calculation resources would still likely leave them ~15-25% behind comparable ATI cards in PS 2.0 performance. But that is nothing like the 40-65% we're seeing now.) The question is why on earth Nvidia allowed these register limitations to exist in the first place. Clearly the answer is not "sheer incompetence". Then what were they thinking?

Actually, the performance difference appears to be 2~5x when running complex shaders. It's not just register usage (though that's a big one).

Where R3x0 and NV3x come into this is in performance for dependent texture reads. I'd be interested in having more specifics, but as I understand it the R3x0 pipeline is deep enough to do an alright job minimizing the latency from one level of dependent texture reads (particularly if you fetch the texture as much in advance of when you need it as possible), but woe be unto you if you're looking for two or three levels. Meanwhile, NV3x is reputed to do just fine mixing several levels of texture reads into the normal instruction flow--meaning that it must do a gangbusters job at hiding latency. Meaning it has a hell of a deep pipeline.

BS. The R300 has no performance hit at 1 or 2 levels of dependency, and sometimes a slight hit when hitting 3 and 4 levels. The performance hit an NV3x takes from just defining 4 temporaries is much larger than what the R3x0 takes from doing 4 levels of dependency. The worst 4-level perf hit I've ever seen is in the 10~15% range. The R3x0 is VERY good at hiding latencies.
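For anyone not following the jargon, here is a toy sketch of what a chain of dependent texture reads looks like (Python standing in for a shader; the function names and lookup tables are made up purely for illustration):

    def sample(texture, coord):
        # Stand-in for a TEX instruction: look the coordinate up in a table.
        return texture[coord]

    def shade(tex0, tex1, tex2, uv):
        a = sample(tex0, uv)   # level 0: coordinates come straight from the rasteriser
        b = sample(tex1, a)    # level 1: depends on the result of the first fetch
        c = sample(tex2, b)    # level 2: depends on the second fetch
        return c

    # Toy usage: each fetch result feeds the next lookup.
    tex = {"uv": "a", "a": "b", "b": (1.0, 0.0, 0.0)}
    print(shade(tex, tex, tex, "uv"))   # -> (1.0, 0.0, 0.0)

Each fetch has to wait for the previous one to return before its coordinates even exist, which is exactly the latency the hardware has to hide.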
 
Dave H said:
IMO the answer can be found in the name. Carmack made a post on Slashdot a bit over a year ago touting how a certain unnamed GPU vendor planned to target its next consumer product at taking away the low-end of the non-realtime rendering market. Actually, going by what Carmack wrote, "CineFX" was something of a slight misnomer; he expected most of the early adopters would be in television, where the time and budget constraints are sufficiently tighter, and the expectations and output quality sufficiently lower, that a consumer-level board capable of rendering a TV resolution scene with fairly complex shaders at perhaps a frame every five seconds could steal a great deal of marketshare from workstations doing the same thing more slowly.


But I don't think one can really accuse Nvidia of incompetence, or stupidity, or laziness, or whatnot.

As far as the second quote relates to the first, I could only agree if the non-game market were large enough to make this a good compromise. Otherwise I would chalk it up to a grave mistake. If the cinema/television market is so important, why not make a second GPU tailored for that and let the programming API be what ties them together?
 
Chalnoth said:
It apparently does have more FP units than the NV30. It apparently also has a very similar number of FP units when compared to the R3xx. What's your point?

Eh? As far as we know so far, NV35 has a full combined ALU & texture address processor and two smaller ALUs for each of its 4 pipes. R300, ignoring the Vec/Scalar co-issue, has an FP texture address processor, a full ALU and a small ALU for each of its 8 pipes. That's 12 FP processors for NV35 and 16 for R300 excluding the texture address processors, or 24 including them.
 