Nvidia and ARB2

micron

How come a company like Nvidia, who's spent a lot of time in the OpenGL community, can't make their cards run the ARB2 path very well?
I mean, their cards use a pretty complicated architecture; would it really have been so hard for them to design a card that runs standard code?
They must have spent a lot of cash developing all of those extensions they use. It just seems like they want to do things the hard way, and make things harder on people who want to code for their hardware.
Also, I've noticed that ATi is gathering quite a few proprietary extensions of their own... is this a bad thing?
Do ATi's OpenGL extensions drop precision like Nvidia's do?
 
I think that OpenGL's extensions can be a good thing, since it helps make it possible for hardware developers to try new things without needing a new specification for every new core. In addition, it helps maintain continuity if there is a base spec common between generations, so that new features won't run the risk of breaking basic functionality.

If they had waited for DX or OGL to remake their specifications, it could have taken even longer for us to see things like hardware T&L and bump mapping come into widespread adoption.
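
For what it's worth, checking for a vendor extension and falling back to the standard path is a pretty small amount of work for an app at runtime. A minimal sketch in C; it assumes a GL context is already current, and the two extension names are just examples of a vendor path and the corresponding ARB path.

Code:
/* Minimal sketch of probing for OpenGL extensions at runtime.
   Assumes a GL context is already current. */
#include <GL/gl.h>
#include <string.h>
#include <stdio.h>

static int has_extension(const char *name)
{
    const char *ext = (const char *)glGetString(GL_EXTENSIONS);
    const char *p;
    size_t len = strlen(name);

    if (ext == NULL)
        return 0;
    /* Extension names are space-separated; match whole tokens only. */
    for (p = ext; (p = strstr(p, name)) != NULL; p += len) {
        if ((p == ext || p[-1] == ' ') && (p[len] == ' ' || p[len] == '\0'))
            return 1;
    }
    return 0;
}

/* Example: prefer a vendor fragment path, fall back to the ARB path. */
void choose_fragment_path(void)
{
    if (has_extension("GL_NV_fragment_program"))
        printf("using the NV-specific fragment path\n");
    else if (has_extension("GL_ARB_fragment_program"))
        printf("using the standard ARB fragment path\n");
    else
        printf("no fragment programs available\n");
}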
 
3dilettante said:
In addition, it helps maintain continuity if there is a base spec common between generations, so that new features won't run the risk of breaking basic functionality.
Is that the reason we see similarities between current and past generations of Nvidia products?
 
micron said:
How come a company like Nvidia, who's spent a lot of time in the OpenGL community, can't make their cards run the ARB2 path very well?
I mean, their cards use a pretty complicated architecture; would it really have been so hard for them to design a card that runs standard code?
They must have spent a lot of cash developing all of those extensions they use. It just seems like they want to do things the hard way, and make things harder on people who want to code for their hardware.
Also, I've noticed that ATi is gathering quite a few proprietary extensions of their own... is this a bad thing?
Do ATi's OpenGL extensions drop precision like Nvidia's do?
No, ATI's GPUs can't benefit from dropping precision the way Nvidia's do.
The R3x0 runs all eight pipes at full precision all the time.
They don't need to drop precision.
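
To put rough numbers on what "dropping precision" means: fp16 carries a 10-bit mantissa, ATi's fp24 (s16e7) a 16-bit one, and fp32 the usual 23 bits. A quick back-of-the-envelope program, assuming those mantissa widths:

Code:
/* Back-of-the-envelope comparison of the shader float formats under
   discussion.  Mantissa widths assumed: fp16 = 10 bits, fp24 = 16 bits
   (ATi's s16e7 layout), fp32 = 23 bits. */
#include <stdio.h>
#include <math.h>

int main(void)
{
    const struct { const char *name; int mantissa_bits; } fmt[] = {
        { "fp16", 10 },
        { "fp24", 16 },
        { "fp32", 23 },
    };
    int i;

    for (i = 0; i < 3; i++) {
        /* Machine epsilon: spacing between 1.0 and the next representable value. */
        double eps = ldexp(1.0, -fmt[i].mantissa_bits);
        printf("%s: eps = %.3g  (~%.1f decimal digits)\n",
               fmt[i].name, eps, -log10(eps));
    }
    return 0;
}

That works out to roughly 3, 5 and 7 decimal digits respectively, which is part of why fp16 gets dicey for things like texture addressing into large textures while fp24 generally holds up.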
 
I understand.
Do you think that when Nvidia designed their cards to be fp16/fp32... they knew that fp32 would be pretty much unusable?
Was fp32 merely a selling gimmick?
Didn't they claim at one time that ATi's fp24 wasn't as good, and that fp32 was the be-all and end-all?
 
micron said:
Do you think that when Nvidia designed their cards to be fp16/fp32... they knew that fp32 would be pretty much unusable?
Was fp32 merely a selling gimmick?

Don't you think that's pretty obvious by now? :)
 
micron said:
Is that the reason we see similarities between current and past generations of Nvidia products?

It probably helped make it possible for games written for TNT hardware to run on a GeForce FX (sometimes even faster... ;) )


I also think Nvidia may have supposed that fp32 would be used more by CG professionals, since one of their advertising claims was that fp32 was good enough for Hollywood.

In the case of professional apps, it looks like the QuadroFX fares a lot better in comparison to competing products at its price point.
 
3dilettante said:
In the case of professional apps, it looks like the QuadroFX fares a lot better in comparison to competing products at its price point.
I knew that the Quadros of the past were the best in their category, but I wasn't sure that this was still the case these days ;)
I guess I still just find it hard to understand that a company as big and profitable as Nvidia can't get their desktop solutions to perform reasonably in either API.
I would think that Nvidia is going to come back with a great new product now that they know that their target audience isn't as tolerant of low precision and poor IQ as they thought.
How many people can honestly believe that Nvidia isn't learning a valuable lesson here?
Maybe I'm grasping at straws... I dunno. I hate to think that I'm looking stupid for trying to stay positive about Nvidia's future.
They used to be so great, and it seems like just yesterday....
 
micron said:
I would think that Nvidia is going to come back with a great new product now that they know that their target audience isn't as tolerant of low precision and poor IQ as they thought.
How many people can honestly believe that Nvidia isn't learning a valuable lesson here?
Maybe I'm grasping at straws... I dunno. I hate to think that I'm looking stupid for trying to stay positive about Nvidia's future.
They used to be so great, and it seems like just yesterday....

I certainly hope Nvidia is in a learning mood as well. They may very well earn back some of their good reputation, or some of their mistakes will simply be forgotten, since people's memories tend to be short.

6 months from now, it may be that those who try to point out Nvidia's current bad behavior will look like crazy doomsayers, at least until the cycle starts all over again.
The good-will of hardware enthusiasts is an inconstant thing. :?
 
3dilettante said:
6 months from now, it may be that those who try to point out Nvidia's current bad behavior will look like crazy doomsayers, at least until the cycle starts all over again.
That does seem to be the way it works, doesn't it...
 
The hardware design was fixed long before the API decisions were made. Nvidia ended up losing both with MS and with the ARB in that the PS 2.0 and ARB_fragment_program specs are very similar to each other and to the R3x0 pipeline, but map very poorly to the NV3x pipeline. But at that point it was too late for Nvidia to have a redesigned architecture that would be out by now. We'll have to wait and see how successful they were at designing NV40 to the specs.
 
micron said:
I understand.
Do you think that when Nvidia designed their cards to be fp16/fp32... they knew that fp32 would be pretty much unusable?
Was fp32 merely a selling gimmick?
Actually, according to several recent papers, it is being used by people who want to do scientific computing on the GPU (since the graphics chips offer much better bang-for-the-buck than CPUs)
 
Dave H said:
The hardware design was fixed long before the API decisions were made. Nvidia ended up losing both with MS and with the ARB in that the PS 2.0 and ARB_fragment_program specs are very similar to each other and to the R3x0 pipeline, but map very poorly to the NV3x pipeline. But at that point it was too late for Nvidia to have a redesigned architecture that would be out by now. We'll have to wait and see how successful they were at designing NV40 to the specs.

I don't agree with this. The DX9 feature set layout began just a bit over three years ago, actually. It began immediately after the first DX8 release. I even recall statements by 3dfx prior to its going out of business about DX9 and M$. Secondly, your presumption would also mean that merely by sheer chance ATi hit everything on the head by *accident* instead of design, for DX9. Extremely unlikely--just as unlikely as nVidia getting it all so wrong by accident.

The purpose of the API is to serve developers, who have as much input into the specs as do the IHVs (for obvious reasons). The API does not necessarily have to fit or suit the needs of any particular IHV--its primary function is to support developers. As the IHVs participate in the formation of the API, they become aware of the directions the API is taking. However, it is entirely up to the IHVs as to how closely they can manage to support the upcoming API standards in hardware. I think the idea that it is happenstance that R3x0 managed to offer hardware support for DX9 to the degree it does in comparison with nV3x is absurd... :) APIs like D3D, which don't offer the kind of extensibility that the current OpenGL API offers through vendor-specific extensions, must be well planned in advance if they are to mean anything. Indeed, as OpenGL matures and heads toward 2.0, extension support will fall off as the API begins to natively support a much larger feature set.

nVidia's only problem here is that it simply didn't design a very good gpu for the ARB2/DX9 paths. I mean, it's not like nV3x *won't run* ARB2 or DX9--which would likely be the case if nVidia designed the hardware and worried about the API later--it's just the case that it won't run them *very well* at all in comparison to R3x0. It was no accident, in other words, that both the R3x0 and nV3x support fp pipelines, etc.
 
WaltC, the IHVs obviously have their say in the design of the APIs, as otherwise we would probably end up with APIs that would look just fine on paper, but would be totally unrealistic to implement in silicon... Do you remember when Carmack started asking for floating point color in the pipeline? How many generations have we seen since?
 
Perhaps some remember this from one of Carmack's old .plan files (Feb 22, 2001):

Enshrining the capabilities of this mess in DX8 sucks. Other companies
had potentially better approaches, but they are now forced to dumb them
down to the level of the GF3 for the sake of compatibility. Hopefully
we can still see some of the extra flexibility in OpenGL extensions.
 
I honestly don't think that Nvidia, or anyone else, would have needed the DX9 specs, or any specs at all, to realise that the register count performance issues would be a real problem. IMO, that is the only real flaw in the architecture.
 
CorwinB said:
WaltC, the IHVs obviously have their say in the design of the APIs, as otherwise we would probably end up with APIs that would look just fine on paper, but would be totally unrealistic to implement in silicon... Do you remember when Carmack started asking for floating point color in the pipeline? How many generations have we seen since?

What I mean by the "design for the DX9/ARB2 paths" is circuitry the IHV anticipates will optimally support the features the API is calling for--or beyond. I only mean that the IHVs had a good sense of where they needed to go. This is not the same as implying they had no input, which I don't mean to say. But I also don't think it's fair to leave out the developers, who also had an input into formulation of the API. The simple answer for me is that with all of that in mind, ATi just designed a much more powerful chip, and I think nVidia got lost in .13 microns along the way...
 
micron said:
I understand.
Do you think that when Nvidia designed their cards to be fp16/fp32... they knew that fp32 would be pretty much unusable?
Was fp32 merely a selling gimmick?
Didn't they claim at one time that ATi's fp24 wasn't as good, and that fp32 was the be-all and end-all?

I do not believe nVidia expected FP32 to be that unusable.
There's one thing that makes zero sense if they did:
- The NV30/NV31/NV34, AKA NV3- (NV35/NV36/NV38, AKA NV3+, are slightly different in that respect, though it's basically still the same idea), have full FP32 FPUs. Their units are really fully FP32, and everything is internally done at FP32.
- Register usage in the NV3- is the only difference between FP32 and FP16 performance, aside from one or two highly advanced and rarely used instructions.

Of course, I'm not considering FX12 here, but that's beside the point.
So the NV3- pays the full cost of an FP32 FPU, which, according to an ATI employee, costs twice the transistors of an FP24 FPU (it was said on this forum - I can't remember by whom, though).

So why in the world would nVidia pay twice the transistors of an FP24 unit if they didn't believe FP32 would still be usable? I also believe FX12 support (or maybe another FX format, but again, that's beside the point) was in the original design documents.

So FP32 performance was never going to be stunning - I doubt that even before nVidia ran into the trouble they got into (lots of failed tape-outs, ...), they would have beaten ATI in purely traditional (full FP24 or FP32) DX9 benchmarks. However, they probably could have used FP32 and FX12 - not FP16 and FX12 with barely any FP32 usage.

So I believe FP16 register support was added in the later stages of the design. Another interesting note is that FP16 doesn't have twice the theoretical throughput in the NV3-: it just has lower register usage penalties, which results in performance improvements of anywhere between 0% and 200% AFAIK. Although it's also possible FP16 was always in the design, but its importance wasn't realized until the later stages of testing (as a source noted, good luck finding latency issues on a 2MHz emulation of a 500MHz part!).

In the NV3+, it's a completely different story: the architecture, instead of being 1 FP32 + 2 FX12, is 1 FP32 + 2 FP16 (the latter being 'combinable' to make 2 FP32 - however, they cannot be used for texture addressing).

It's sad that the NV40 will still have register usage penalties, as indicated by the information I recently leaked: it will also support FP16 and FX16 registers. It is questionable, however, whether those formats will have any higher theoretical throughput.
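
To make the register-usage point concrete, here's a toy calculation; the 256-bit per-pixel budget below is only an assumed figure for illustration, not a confirmed NV3x number.

Code:
/* Toy illustration of why FP16 registers help NV3x even though the ALUs
   are FP32: halving the footprint of each vec4 temporary doubles how many
   live temporaries fit in a fixed per-pixel register budget.
   The 256-bit budget is an assumption for illustration only. */
#include <stdio.h>

int main(void)
{
    const int budget_bits    = 256;        /* assumed per-pixel register budget */
    const int vec4_fp32_bits = 4 * 32;     /* one vec4 temp at full precision   */
    const int vec4_fp16_bits = 4 * 16;     /* the same temp at half precision   */

    printf("full-speed vec4 temps at FP32: %d\n", budget_bits / vec4_fp32_bits);
    printf("full-speed vec4 temps at FP16: %d\n", budget_bits / vec4_fp16_bits);
    return 0;
}

Halving the footprint of each temporary doubles how many of them fit before the hardware has to start slowing down, which at least fits the 0%-200% range above.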


Uttar
 
Simon F said:
micron said:
I understand.
Do you think that when Nvidia designed their cards to be fp16/fp32... they knew that fp32 would be pretty much unusable?
Was fp32 merely a selling gimmick?
Actually, according to several recent papers, it is being used by people who want to do scientific computing on the GPU (since the graphics chips offer much better bang-for-the-buck than CPUs)

You betcha ;)
 
WaltC said:
The DX9 feature set layout began just a bit over three years ago, actually. It began immediately after the first DX8 release. I even recall statements by 3dfx prior to its going out of business about DX9 and M$.

Exactly. DX9 discussions surely began around three years ago, and the final decision to keep the pixel shaders to FP24 with an FP16 option was likely made at least two years ago, thus about 15 months ahead of the release of the API. I just don't think you have an understanding of the timeframes involved in designing, simulating, validating and manufacturing a GPU. Were it not for its process problems with TSMC, NV30 would have been released around a year ago. Serious work on it, then, would have begun around three years prior.

As I'm sure you know, ATI and Nvidia keep two major teams working in parallel on different GPU architectures (and assorted respins); that way they can manage to more-or-less stick to an 18 month release schedule when a part takes over three years from conception to shipping. This would indicate that serious design work on NV3x began around the time GeForce1 shipped, in Q3 1999. (Actually, high-level design of NV3x likely began as soon as high-level design of the GF2 was finished, probably earlier in 1999.) A more-or-less full team would have been assigned to the project from the time GF2 shipped, in Q1 2000. Which is around the point when it would have been too late for a major redesign of the fragment pipeline without potentially missing the entire product generation altogether.

NV40 will be the first Nvidia product to have any hope of being designed after the broad outlines of the DX9 spec were known. Of course at that time Nvidia may have thought that their strategy of circumventing those DX9 specs through the use of runtime-compiled Cg would be successful, in which case NV40 might not reflect the spec well either.

Secondly, your presumption would also mean that merely by sheer chance ATi hit everything on the head by *accident* instead of design, for DX9. Extremely unlikely--just as unlikely as nVidia getting it all so wrong by accident.

Of course it's not by accident. When choosing the specs for the next version of DX, MS consults a great deal both with the IHVs and the software developers, and constructs a spec based around what the IHVs will have ready for the timeframe in question, what the developers most want, and what MS thinks will best advance the state of 3d Windows apps.

Both MS and the ARB agreed on a spec that is much closer to what ATI had planned for the R3x0 than what Nvidia had planned for NV3x. I don't think that's a coincidence. For one thing, the R3x0 pipeline offers a much more reasonable "lowest common denominator" compromise between the two architectures than something based more on NV3x would. For another, there are plenty of good reasons why mixing precisions in the fragment pipeline is not a great idea; sireric (IIRC) had an excellent post on the subject some months ago, and I wouldn't be surprised if the arguments he gave were exactly the ones that carried the day with MS and the ARB.

Third, FP24 is a better fit than FP32 for realtime performance with current process nodes and memory performance. IIRC, an FP32 multiplier will tend to require ~2.5x as many transistors as an FP24 multiplier designed using the same algorithm. Of course the other silicon costs for supporting FP32 over FP24 tend to be more in line with the 1.33x greater width: larger registers and caches, wider buses, etc. Still, the point is that while it was an impressive feat of engineering for ATI to manage a .15u core with enough calculation resources to reach a very nice balance with the available memory technology of the day (i.e. 8 vec4 ALUs to match a 256-bit bus to similarly clocked DDR), on a .13u transistor budget FP24 would seem the sweet spot for a good calculation/bandwidth ratio. Meanwhile the extra transistors required for FP32 ALUs are presumably the primary reason NV3x parts tend to feature half the pixel pipelines of their R3x0 competitors. (NV34 is a 2x2 in pixel shader situations; AFAICT it's not quite clear what exactly NV31 is doing.) And of course FP16 doesn't have the precision necessary for a great many calculations, texture addressing being a prime example.
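
As a sanity check on those ratios: the mantissa multiplier is the biggest per-format cost, and a simple array multiplier needs on the order of the square of the mantissa width in partial-product cells. A rough sketch (mantissa widths include the implicit leading 1; real Booth/Wallace-tree designs will differ, so treat the ratios as ballpark only):

Code:
/* Ballpark check of the FP32-vs-FP24 multiplier cost claims.  A simple
   array multiplier needs roughly m*m partial-product cells for an m-bit
   mantissa; real designs differ, so these ratios are only rough.
   Mantissa widths include the implicit leading 1. */
#include <stdio.h>

int main(void)
{
    const int m_fp16 = 11, m_fp24 = 17, m_fp32 = 24;

    printf("fp32/fp24 multiplier cells: ~%.1fx\n",
           (double)(m_fp32 * m_fp32) / (m_fp24 * m_fp24));
    printf("fp32/fp16 multiplier cells: ~%.1fx\n",
           (double)(m_fp32 * m_fp32) / (m_fp16 * m_fp16));
    printf("fp24/fp16 multiplier cells: ~%.1fx\n",
           (double)(m_fp24 * m_fp24) / (m_fp16 * m_fp16));
    return 0;
}

That lands around 2x for fp32 over fp24, in the same neighborhood as the figures quoted in this thread; the exact number depends on the multiplier algorithm and on everything else that has to widen along with the ALU.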

So a good case can be made that the PS 2.0 and ARB_fragment_program specs made the clearly better decision. So what the hell was Nvidia thinking when they designed the CineFX pipeline?

IMO the answer can be found in the name. Carmack made a post on Slashdot a bit over a year ago touting how a certain unnamed GPU vendor planned to target its next consumer product at taking away the low-end of the non-realtime rendering market. Actually, going by what Carmack wrote, "CineFX" was something of a slight misnomer; he expected most of the early adopters would be in television, where the time and budget constraints are sufficiently tighter, and the expectations and output quality sufficiently lower, that a consumer-level board capable of rendering a TV resolution scene with fairly complex shaders at perhaps a frame every five seconds could steal a great deal of marketshare from workstations doing the same thing more slowly.
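
A rough instruction-budget comparison shows how far apart those two targets are. All the numbers below (clock, pipe count, resolutions, framerates) are assumed round figures for illustration, not measured NV3x data:

Code:
/* Rough per-pixel instruction budgets for a realtime game target versus
   the "frame every five seconds at TV resolution" target described above.
   All figures (500MHz, 4 pixel pipes, 1 instruction per pipe per clock,
   no overdraw) are assumed round numbers, not measured NV3x data. */
#include <stdio.h>

int main(void)
{
    const double clock_hz  = 500e6;
    const double pipes     = 4.0;
    const double ops_per_s = clock_hz * pipes;    /* shader ops per second */

    const double game_pixels = 1024.0 * 768.0;    /* realtime target   */
    const double game_fps    = 30.0;
    const double tv_pixels   = 720.0 * 480.0;     /* NTSC-ish frame    */
    const double tv_seconds  = 5.0;               /* one frame per 5 s */

    printf("realtime budget: ~%.0f instructions per pixel per frame\n",
           ops_per_s / (game_pixels * game_fps));
    printf("offline budget:  ~%.0f instructions per pixel per frame\n",
           ops_per_s * tv_seconds / tv_pixels);
    return 0;
}

On those assumptions you get on the order of a hundred instructions per pixel for the realtime case versus tens of thousands for the offline one.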

AFAIK this hasn't yet come to pass in any significant way, but Carmack's post along with much of the thrust of Nvidia's original marketing indicates they really did intend NV3x derivatives to fill this role. Plus it helps explain nearly all of the otherwise idiotic design decisions in the NV3x fragment pipeline. FP32--a bad choice for realtime performance in today's process nodes as discussed above--appears to have been viewed by Nvidia as necessary to play in the non-realtime space. The decision to support shader lengths into the thousands of instructions--bizarre and inexplicable if you think the target of the design is realtime interactive rendering (after all, the damn thing can't even hit 30fps running shaders a couple dozen instructions long)--makes a great deal of sense if the target isn't realtime after all. And then there's this:

Colourless said:
I honestly don't think that Nvidia, or anyone else, would have needed the DX9 specs, or any specs at all, to realise that the register count performance issues would be a real problem. IMO, that is the only real flaw in the architecture.

While the register usage limitations are not the only flaw in the NV3x fragment pipeline architecture, they are clearly the most significant. (If NV3x chips, like R3x0, could use all the registers provided for by the PS 2.0 spec without suffering a performance penalty, their comparative deficit in calculation resources would still likely leave them ~15-25% behind comparable ATI cards in PS 2.0 performance. But that is nothing like the 40-65% we're seeing now.) The question is why on earth did Nvidia allow these register limitations to exist in the first place. Clearly the answer is not "sheer incompetence". Then what were they thinking?

One possibility is that it's a bug--or rather, the result of a workaround. Some other functionality in the fragment pipeline wasn't working properly, and so registers that would otherwise be free to store temps are instead used as part of the workaround. This seemed pretty likely to me at first, but the fact that NV35 has the same limitations as NV30 and the rest does seem to indicate that if this is indeed the result of a bug, it is not one that can be fixed with only a trivial reworking of the architecture. It will be interesting to see if the extra time they've had to work on NV38 has allowed Nvidia to come up with a fix; if not, perhaps the problem is too deeply rooted to really describe it as a bug after all.

And even if a bugfix exacerbated the problem, it seems unlikely it's the main cause of it. I had a discussion a few months ago with someone here (Arjan or Luminescent IIRC) who pointed out that, unlike a CPU pipeline, which only needs to store one state that is shared amongst all instructions in flight (belonging, as they do, to a single thread), a fragment pipeline needs to store a separate set of state data for each pixel in flight. The CPU equivalent is fine-grained multithreading, in which state for N threads is stored in the processor at once, and each thread takes its turn executing for one cycle, with a full rotation after N cycles.

The benefits of this sort of arrangement are control simplicity and latency hiding. The penalty for a long-latency operation--in the context of GPUs, a texture read, and in particular one that misses cache--is effectively cut by a factor of N. Meanwhile, the performance costs and complexity of managing context switches as on a traditional CPU pipeline are avoided.

The drawback is in the transistor cost dedicated to storing all that state. GPU registers need to be very highly ported, considering a typical operation is an arbitrary vec4 MAC, and thus the transistor cost rises very steeply with the depth of the pipeline. Pretty soon you get into a direct tradeoff between the degree of latency hiding and the number of registers you can have.

Where R3x0 and NV3x come into this is in performance for dependent texture reads. I'd be interested in having more specifics, but as I understand it the R3x0 pipeline is deep enough to do an alright job minimizing the latency from one level of dependent texture reads (particularly if you fetch the texture as far in advance of when you need it as possible), but woe be unto you if you're looking for two or three levels. Meanwhile, NV3x is reputed to do just fine mixing several levels of texture reads into the normal instruction flow--meaning that it must do a gangbusters job at hiding latency. Meaning it has a hell of a deep pipeline.

Meaning every exposed full-speed register is replicated many times, meaning that in order to fit in a given transistor budget, the number of full-speed registers might have to be cut pretty low. As low as only 256 bits per pixel in flight? I dunno about that--leaving just 4 FP16 or 2 FP32 registers is so awful that it seems there must be some erratum involved. But significantly lower than on R3x0, yes, definitely.
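
Putting toy numbers on that tradeoff - the miss latency, register-file size and pipeline depths below are all invented for illustration, not real NV3x or R3x0 figures:

Code:
/* Toy model of the latency-hiding vs. register-count tradeoff described
   above: with N pixels in flight, a texture miss costs roughly miss/N
   cycles of effective stall per pixel, but every pixel in flight needs
   its own copy of the register state, so for a fixed register-file
   budget a deeper pipeline leaves fewer full-speed registers per pixel.
   All figures here are invented for illustration. */
#include <stdio.h>

int main(void)
{
    const double miss_cycles  = 200.0;              /* assumed texture-miss latency */
    const double regfile_bits = 128.0 * 1024.0;     /* assumed total register file  */
    const int    depths[]     = { 64, 128, 256, 512 };  /* pixels in flight         */
    int i;

    printf("%8s %16s %22s\n", "depth", "stall/pixel", "FP32 vec4 regs/pixel");
    for (i = 0; i < 4; i++) {
        int n = depths[i];
        printf("%8d %13.1f cy %22.0f\n",
               n, miss_cycles / n, regfile_bits / (n * 128.0));
    }
    return 0;
}

With these made-up numbers the deepest configuration bottoms out at the same 2-FP32-registers-per-pixel figure mentioned above, which is the flavor of tradeoff being suggested.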

So what are the advantages and disadvantages of privileging multiple levels of dependent texture reads? It depends on the complexity of the shaders you want your architecture to target. If you're trying to achieve realtime framerates on the sort of hardware available in the current timeframe, it's doubtful you'll be running shaders complex enough to need more than one level of texture indirection. But if you want to accelerate the sorts of tasks that have previously been handled at the low-end of offline rendering, you might indeed want to privilege flexible texture reads by hiding as much latency as possible, even at a potential cost to general shader performance.

So the theory is this: Nvidia tried to address two different markets with a single product, and came up with something that does neither particularly well. Meanwhile, MS and the ARB, being focused primarily on realtime rendering, chose specs better targeted to how that goal can be best achieved in today's timeframe.

Nvidia can probably be properly accused of hubris in thinking that they could tailor their product to address a new and rather different market segment (low-end production rendering) while still maintaining product superiority in the consumer market. Or of arrogance in assuming they were the only IHV worth paying attention to, and thus could influence future specs to reflect their new architecture instead of one that better targeted realtime performance.

Obviously one can correctly accuse their marketing of all sorts of nasty things.

But I don't think one can really accuse Nvidia of incompetence, or stupidity, or laziness, or whatnot. NV3x is not really a bad design. It's unquestionably a decent design when it comes to performance on DX7 and DX8 workloads. I can't entirely judge, but I would guess it's about as good as could be expected in this timeframe as an attempt to replace offline rendering in the low-end of video production; I just don't think that's quite good enough yet to actually capture any real part of the market.

The only thing it's truly bad at is rendering simple DX9-style workloads (and yes, HL2 is very much on the simple end of the possibilities DX9 represents) at realtime interactive framerates. And--except with the benefit of hindsight--it doesn't seem obvious to me that Nvidia should have expected any serious use of DX9 workloads in the games of the NV3x timeframe. This prediction turns out to have been very, very wrong. (What I mean by "the NV3x timeframe" does not end when NV40 ships, but rather around a year after NV3x derivatives are dropped from the mainstream of Nvidia's product lineup. After all, the average consumer buying a discrete video card expects it to hold up decently for at least a while after his purchase.)

It turns out that DX9 gaming is arriving as a major force quite a bit ahead of DX9 special effects production. And Nvidia will rightly pay for betting the opposite. But, viewed in the context of such a bet, their design decisions don't seem that nonsensical after all.
 