Dave H said: NV40 will be the first Nvidia product to have any hope of being designed after the broad outlines of the DX9 spec were known. Of course at that time Nvidia may have thought that their strategy of circumventing those DX9 specs through the use of runtime-compiled Cg would be successful, in which case NV40 might not reflect the spec well either.

I'm not sure that's likely. Instead, I find it very likely that the NV40 will have a unified vertex and pixel pipeline. This will require all execution units to support FP32.
Dave H said: Third, FP24 is a better fit than FP32 for realtime performance with current process nodes and memory performance.

I'm not sure this is the case, either, as I feel the NV35 shows. Specifically, the NV35, apparently, has as many FP32 units as the R300 and R350. That argument might have held up if only the NV30 were available, but the NV35 came out with hardly any additional transistors.
Chalnoth said: I'm not sure this is the case, either, as I feel the NV35 shows. Specifically, the NV35, apparently, has as many FP32 units as the R300 and R350.
Dio said: I think when counting up these 'transistor budgets' you've missed something very significant.
Dave H said: Where R3x0 and NV3x come into this is in performance for dependent texture reads. I'd be interested in having more specifics, but as I understand it the R3x0 pipeline is deep enough to do an alright job minimizing the latency from one level of dependent texture reads (particularly if you fetch the texture as much in advance of when you need it as possible), but woe be unto you if you're looking for two or three levels. Meanwhile, NV3x is reputed to do just fine mixing several levels of texture reads into the normal instruction flow--meaning that it must do a gangbusters job at hiding latency. Meaning it has a hell of a deep pipeline.
Meaning every exposed full-speed register is replicated many times, meaning that in order to fit in a given transistor budget, the number of full-speed registers might have to be cut pretty low.
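To make that concrete with a back-of-envelope sketch (the register-file size and the fragment counts below are illustrative guesses, not published NV3x numbers):

# Every full-speed temp register must exist once per in-flight fragment,
# so a fixed register-file budget forces a depth-vs-registers tradeoff.
REG_FILE_BITS      = 16 * 1024 * 8   # assume a 16 KB full-speed register file
BITS_PER_VEC4_FP32 = 4 * 32          # one temp register = 4 x FP32

for fragments_in_flight in (128, 256, 512, 1024):
    regs = REG_FILE_BITS // (fragments_in_flight * BITS_PER_VEC4_FP32)
    print(f"{fragments_in_flight:4d} fragments in flight -> "
          f"{regs:2d} full-speed vec4 registers per fragment")

With numbers anywhere in that neighborhood, a very deep pipeline and a very low full-speed register count per fragment fall straight out of the arithmetic.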
Dave H said: So the theory is this: Nvidia tried to address two different markets with a single product, and came up with something that does neither particularly well. Meanwhile, MS and the ARB, being focused primarily on realtime rendering, chose specs better targeted to how that goal can be best achieved in today's timeframe.
Chalnoth said: I'm not sure this is the case, either, as I feel the NV35 shows. Specifically, the NV35, apparently, has as many FP32 units as the R300 and R350.

DaveBaumann said: You might want to look at this thread. I don't think that's the case.

I'm not sure that thread explains anything.
DaveBaumann said: I'm not talking about the transistor differences, I'm talking about the supposition that NV35 has more FP units.
LeStoffer said: Brilliantly good theory, Dave H! 8)
NV3x's ability to do a lot of dependent texture reads, and its ability to do constant branching in PS, should be more than enough of a hint as to why nVidia had to make some sacrifices with those registers. It had to be a design choice/limitation rather than a bug, IMHO.
WaltC said: But if this was actually a conscious decision I'm having difficulty seeing what they gained from it...?
LeStoffer said: CineFX = Cinematic Effects = CG developers. They already had the gamers by the balls, you see.
DaveBaumann said: I'm not talking about the transistor differences, I'm talking about the supposition that NV35 has more FP units.

More FP units than what?
Dave H said:
WaltC said: The DX9 feature set layout began just a bit over three years ago, actually. It began immediately after the first DX8 release. I even recall statements by 3dfx prior to its going out of business about DX9 and M$.
Exactly. DX9 discussions surely began around three years ago, and the final decision to keep the pixel shaders to FP24 with an FP16 option was likely made at least two years ago, thus about 15 months ahead of the release of the API. I just don't think you have an understanding of the timeframes involved in designing, simulating, validating and manufacturing a GPU. Were it not for its process problems with TSMC, NV30 would have been released around a year ago. Serious work on it, then, would have begun around three years prior.
As I'm sure you know, ATI and Nvidia keep two major teams working in parallel on different GPU architectures (and assorted respins); that way they can manage to more-or-less stick to an 18 month release schedule when a part takes over three years from conception to shipping. This would indicate that serious design work on NV3x began around the time GeForce1 shipped, in Q3 1999. (Actually, high-level design of NV3x likely began as soon as high-level design of the GF2 was finished, probably earlier in 1999.) A more-or-less full team would have been assigned to the project from the time GF2 shipped, in Q1 2000. Which is around the point when it would have been too late for a major redesign of the fragment pipeline without potentially missing the entire product generation altogether.
NV40 will be the first Nvidia product to have any hope of being designed after the broad outlines of the DX9 spec were known. Of course at that time Nvidia may have thought that their strategy of circumventing those DX9 specs through the use of runtime-compiled Cg would be successful, in which case NV40 might not reflect the spec well either.
Of course it's not by accident. When choosing the specs for the next version of DX, MS consults a great deal both with the IHVs and the software developers, and constructs a spec based around what the IHVs will have ready for the timeframe in question, what the developers most want, and what MS thinks will best advance the state of 3d Windows apps.
Both MS and the ARB agreed on a spec that is much closer to what ATI had planned for the R3x0 than what Nvidia had planned for NV3x. I don't think that's a coincidence. For one thing, the R3x0 pipeline offers a much more reasonable "lowest common denominator" compromise between the two architectures than something based more on NV3x would. For another, there are plenty of good reasons why mixing precisions in the fragment pipeline is not a great idea; sireric (IIRC) had an excellent post on the subject some months ago, and I wouldn't be surprised if the arguments he gave were exactly the ones that carried the day with MS and the ARB.
Third, FP24 is a better fit than FP32 for realtime performance with current process nodes and memory performance. IIRC, an FP32 multiplier will tend to require ~2.5x as many transistors as an FP24 multiplier designed using the same algorithm. Of course the other silicon costs for supporting FP32 over FP24 tend to be more in line with the 1.33x greater width: larger registers and caches, wider buses, etc. Still, the point is that while it was an impressive feat of engineering for ATI to manage a .15u core with enough calculation resources to reach a very nice balance with the available memory technology of the day (i.e. 8 vec4 ALUs to match a 256-bit bus to similarly clocked DDR), on a .13u transistor budget FP24 would seem the sweet spot for a good calculation/bandwidth ratio. Meanwhile the extra transistors required for FP32 ALUs are presumably the primary reason NV3x parts tend to feature half the pixel pipelines of their R3x0 competitors. (NV34 is a 2x2 in pixel shader situations; AFAICT it's not quite clear what exactly NV31 is doing.) And of course FP16 doesn't have the precision necessary for a great many calculations, texture addressing being a prime example.
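For a rough sanity check on those ratios, treating an array multiplier as scaling with the square of the mantissa width (only an approximation, and assuming DX9's s16e7 layout for FP24):

# Array-multiplier cost grows roughly with the square of mantissa width;
# everything linear (registers, buses, caches) grows with total width.
FP32_MANT = 23 + 1   # IEEE single: 23 stored bits + implicit leading 1
FP24_MANT = 16 + 1   # DX9 FP24 (s16e7): 16 stored bits + implicit leading 1

mult_ratio   = (FP32_MANT / FP24_MANT) ** 2
linear_ratio = 32 / 24
print(f"multiplier cost ratio ~ {mult_ratio:.2f}x")   # ~2.0x in this crude model
print(f"linear cost ratio     ~ {linear_ratio:.2f}x") # ~1.33x, matching the post

The crude quadratic model lands nearer 2x than 2.5x; rounding, normalization and the particular multiplier algorithm presumably account for the rest, but either way the multiplier cost grows much faster than the 1.33x width increase.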
While the register usage limitations are not the only flaw in the NV3x fragment pipeline architecture, they are clearly the most significant. (If NV3x chips, like R3x0, could use all the registers provided for by the PS 2.0 spec without suffering a performance penalty, their comparative deficit in calculation resources would still likely leave them ~15-25% behind comparable ATI cards in PS 2.0 performance. But that is nothing like the 40-65% we're seeing now.) The question is why on earth did Nvidia allow these register limitations to exist in the first place. Clearly the answer is not "sheer incompetence". Then what were they thinking?
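Putting rough numbers on that gap, taking the quoted percentages at face value (midpoints only, purely illustrative):

# If the calculation deficit alone would leave NV3x ~20% behind (midpoint
# of 15-25%), but observed PS 2.0 results are ~50% behind (midpoint of
# 40-65%), the register limitations account for the remaining factor.
calc_only = 1.0 - 0.20   # relative throughput from the ALU deficit alone
observed  = 1.0 - 0.50   # relative throughput actually being seen
register_factor = observed / calc_only
print(f"extra slowdown attributable to register limits: {register_factor:.2f}x")
# ~0.63x, i.e. the register restrictions cost roughly another 35-40%
# on top of the raw ALU deficit.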
Where R3x0 and NV3x come into this is in performance for dependent texture reads. I'd be interested in having more specifics, but as I understand it the R3x0 pipeline is deep enough to do an alright job minimizing the latency from one level of dependent texture reads (particularly if you fetch the texture as much in advance of when you need it as possible), but woe be unto you if you're looking for two or three levels. Meanwhile, NV3x is reputed to do just fine mixing several levels of texture reads into the normal instruction flow--meaning that it must do a gangbusters job at hiding latency. Meaning it has a hell of a deep pipeline.
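A toy model of why several levels of dependent reads demand that kind of depth (the latency and instruction counts below are invented for illustration, not NV3x or R3x0 figures):

# Toy round-robin model: while one fragment waits on a dependent texture
# fetch, the other in-flight fragments issue their own ALU instructions.
# A fetch is fully hidden once the other fragments can supply at least
# TEX_LATENCY cycles of independent work.
TEX_LATENCY   = 150   # assumed cycles for a texture fetch that misses cache
ALU_PER_LEVEL = 10    # assumed ALU instructions between dependent reads

def stalled_fraction(levels, fragments_in_flight):
    """Fraction of cycles a fragment spends stalled in this toy model."""
    busy    = levels * ALU_PER_LEVEL
    cover   = (fragments_in_flight - 1) * ALU_PER_LEVEL  # work from other fragments
    exposed = max(0, TEX_LATENCY - cover)                # latency nobody can hide
    return levels * exposed / (busy + levels * exposed)

for f in (1, 4, 8, 16):
    print(f"{f:2d} fragments in flight -> {stalled_fraction(3, f):.0%} stalled "
          f"(3 dependent levels)")

Even in this crude model it takes on the order of TEX_LATENCY / ALU_PER_LEVEL fragments in flight to hide the fetches--which is exactly the "hell of a deep pipeline" described above, with every in-flight fragment needing its own copy of the live registers.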
Dave H said: IMO the answer can be found in the name. Carmack made a post on Slashdot a bit over a year ago touting how a certain unnamed GPU vendor planned to target its next consumer product at taking away the low-end of the non-realtime rendering market. Actually, going by what Carmack wrote, "CineFX" was something of a slight misnomer; he expected most of the early adopters would be in television, where the time and budget constraints are sufficiently tighter, and the expectations and output quality sufficiently lower, that a consumer-level board capable of rendering a TV resolution scene with fairly complex shaders at perhaps a frame every five seconds could steal a great deal of marketshare from workstations doing the same thing more slowly.
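To put "a frame every five seconds" in rough perspective (the clock rate and pipeline count below are assumed purely for illustration):

# Back-of-envelope shader budget for offline-style TV-resolution rendering.
CLOCK_HZ    = 400e6      # assumed core clock
PIXEL_PIPES = 4          # assumed shader ops issued per clock
SECONDS     = 5          # "a frame every five seconds"
PIXELS      = 720 * 486  # NTSC-resolution frame

ops_per_pixel = CLOCK_HZ * SECONDS * PIXEL_PIPES / PIXELS
print(f"~{ops_per_pixel:,.0f} shader op slots per pixel")   # ~23,000

Tens of thousands of instruction slots per pixel is nothing a realtime game could contemplate, but it is in the ballpark of genuinely complex offline-style shading, which is exactly the market being described here.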
But I don't think one can really accuse Nvidia of incompetence, or stupidity, or laziness, or whatnot.
Chalnoth said: It apparently does have more FP units than the NV30. It apparently also has a very similar number of FP units when compared to the R3xx. What's your point?