Does ATi follow OPEN 3d API standards better than Nvidia?

g__day

Thinking lately about NVidia relying heavily on the DX9 precision hint to run quickly, and about John Carmack's comments that in Doom 3 under OpenGL the NV3x runs the open-standard ARB2 path at half the speed of the proprietary NV30 path, yet ATi runs ARB2 almost as fast as its R200 path, I started wondering.

Is what we are seeing simply that ATi's top chips follow the OpenGL and DirectX 9 standards really, really closely, whilst NVidia makes sure it can somewhat comply - but goes a far more proprietary route, requiring numerous extensions to do the nearest equivalent workload quickly? Meaning game developers can code to open standards for ATi but must use many proprietary extensions for NVidia to get equivalent performance?

If so, why - why didn't NVidia stick to open standards? I sense it wasn't because it didn't like them, nor that it wanted to invent another competing 3d API. Rather, it seems that they have some hidden agenda that requires proprietary APIs, and I can't guess what it might be. Where is the design and architecture of their top-end cards taking them?
 
My personal take is that nVidia figured it had a monopoly on the graphics market since the demise of 3DFX, and decided that they could keep any competition from gaining any kind of a foothold by setting proprietary standards, hence the FX series & Cg. If they - nVidia - could have gotten developers to buy into Cg, then, no matter what any of the other manufacturers did, they could never truly compete......It's almost an evil, maniacal plan to rule the (graphics) world....just what you'd expect from nVidia! ;)
 
Don't equate performance with compliance. NV3x's partial precision could be very useful, provided full precision wasn't so slow compared to R3xx to begin with. nV is also playing catch-up to ATi in that DX9(.0) was based heavily on the R300. And ATI runs ARB2 (FP) as fast as R200 ("FX") because it runs both on the same hardware, unlike the NV30's separate FX/FP units (now changed in NV35).

As far as using proprietary standards, it's obvious it would help the dominant card maker more than lesser competitors or developers or consumers. But we'll see if Cg is enough to optimize performance for smaller developers who may not have the resources to fully exploit every nuance of every architecture. Sure, a Carmack and a Sweeney will have the talent, time, and money to eke out every last drop of performance; but will the smaller developers or smaller titles have the same luxury? (I'm thinking of Hardware.fr's FX5600U review's racing sim benches here.)
 
martrox said:
My personal take is that nVidia figured it had a monopoly on the graphics market since the demise of 3DFX, and decided that they could keep any competition from gaining any kind of a foothold by setting proprietary standards, hence the FX series & Cg. If they - nVidia - could have gotten developers to buy into Cg, then, no matter what any of the other manufacturers did, they could never truly compete......It's almost an evil, maniacal plan to rule the (graphics) world....just what you'd expect from nVidia! ;)

I think you people need a reality check.
Even with Cg, nVidia wouldn't be able to compete all that well...

My explanation to all this stuff:
1. nVidia decides on the NV30 design.
2. MS doesn't like it much.
3. ATI contacts MS about the R300 design.
4. MS decides to take the R300 design as the standard for DX9 (it's well known they developed it in close cooperation with ATI)
5. nVidia knows about it, but thinks ATI won't be able to deliver, and even if they do, that their own technology will be sufficient to be at least on par with ATI in DX9 and superior in OpenGL - and that developer support thanks to their market share would guarantee that they might in fact win everywhere anyway.
6. There are some discussions about how useful a high level shading language would be.
7. nVidia, in order to potentially get more developer support, begins development of Cg; MS creates HLSL - pretty much at the same time, I suppose (not sure though)
8. nVidia realizes the NV30 is not on too good a track, and decides to hurry Cg out ASAP to hopefully get developer support for it.
9. ATI releases the R300, vastly superior to nVidia's expectations.
10. nVidia abandons Low-K for the production of the NV30, slightly reduces clock speeds.
11. nVidia releases the CineFX design documents to build further developer support; the documents contain some clear errors in how they describe the R300, suggesting that even at this stage they likely underestimated it.
12. nVidia realizes the NV30 is even worse than they had expected, and puts the clock speeds back to the original in the hope of being competitive.
13. nVidia paper launches NV30 - fairly confident they might be roughly on par with driver upgrades and stuff.
14. nVidia, thinking the reviewers will kinda accept the product as-is and knowing they can't wait any further, sends review samples.
15. Reviewers say the NV30 is barely on par with the R300, often worse, and not worth the wait - the FlowFX noise makes it completely horrible in nearly everyone's eyes.
16. Further technical investigation reveals that the PS technologies used are horrible and that it's really a 4x2 design. All of this concludes in a technical consensus that the NV3x technology is vastly inferior to the R3xx's.

A lot of it is guesswork, but it is IMO the most likely chain of events. Of course, it can't all be right. Or wrong. Eh... Anyway, I hope it was interesting.


Uttar
 
Russ, DX9 high colour precision requires at least FP24 precision - NVidia can only achieve this by using FP32 - which causes a 40%-50% performance hit on their architecture compared to being able to mix FX12, FP16 and FP32 - as per the Dawn fps when FP32 is forced.
 
I think we can safely assume that the DirectX 9 spec was finalized after the R300 and NV30 had their feature sets set in stone. And we know for certain that this is true wrt the OpenGL ARB shading extensions.

Therefore, "following the 3D api standards" sounds a bit twisted to me.
 
RussSchultz said:
That's nice, g__day. Dawn is OpenGL.

How is that being non DX9 compliant?

Urg... OpenGL still uses the same defined precisions, and has the same performance hits with those features enabled compared to disabled.

Using 'DirectX levels' for compliance (saying it's a "DX9 card") is more for convenience than anything else.
 
Right, sure. Let's redefine "DirectX 9 standards" to simply mean some arbitrary level of completeness, especially in a thread called "Do ATi follow the 3d API standards more closely than Nvidia?"
 
Well that's one pretty big Ooops in my understanding about Dawn... :oops: but it is peripheral to the question asked honestly in the title of this thread. My question still stands and hasn't been definitively answered by any of the really skilled folk in here. I am not trying to push anything down anyone's throat; I am trying to bootstrap my learning and show how I reached my current way of thinking. I am trying to learn and accepting all stumbles and embarrassments along the way, mate.

My base logic was that fully modern 3d API compliance requires support for high colour precision. Each API has standard rendering paths. OpenGL has several large NVidia specific rendering paths and open standard rendering paths.

For some reason in OpenGL NVidia can't optimise for ARB2 anywhere near as well as for the NV30 code path - I presumed because ARB2 requires higher colour precision the majority of the time (meaning FP32 for NVidia, limiting how much of their parallel rendering pipelines were utilised). So if in OpenGL you encourage mixed precisions, NVidia and ATi are quite competitive. This is not to say that the NV30 path is any less valid than the ARB2 path - just that one is an open industry standard that any game developer could code for, whilst the other is an NVidia proprietary rendering path.

I presumed DirectX 9 high colour precision compliance requires FP24 precision or better. So you get the same outcome - NVidia can do it, but at a cost of up to 50% of performance - allow enough mixed precision formats and the playing field is levelled for NVidia again. The way to allow this is either 1) game developers use the precision hint or 2) NVidia driver developers could attempt to detect and dynamically lower shader quality where they know they can get similar-looking results, and doing so would keep their GPU pipelines far better utilised.

Russ, there aren't the DX9 games around yet today to really answer your question, are there? I am not saying one card is more compliant than another - just that ATi targeted the full spec of both APIs spot on, whilst NVidia went both above and below, so it will take a hit on performance whenever high colour precision is required. Dave's article questioning game developers on the need to cater for variable colour precision seemed to support this view.

Let me ask: how would you estimate what typical percentage of DX9 calls in a typical frame of a typical game will require high colour precision? If you say from experience a DX9 scene may only require 5% high colour precision calculations, I would be interested in your thinking. I am not saying every scene needs 100% high colour precision - someone mentioned that in the Dawn demo only her eyes used FP32. How would you re-frame and respond to NVidia's interesting positioning on the degree of proprietary vs open standards required in 3d APIs for them to be competitive?
 
That post from Uttar is one of the most sensible posts on Cg I've seen in a long time. I would however add Xmas' comment about feature sets being <pretty much> set in stone around (5). The Cg/HLSL start should probably come a bit earlier. And add that a big reason for nVidia hurrying Cg could be that they were afraid of being "cut out" of HLSL. Their rather special optimization rules can severely hurt them if they can't use their own back-end, and some features can't be accessed at all in HLSL.
 
g__day: DX9 has the partial precision hint for exactly such things. It is part of the DX9 2.0 pixel shader spec.

Of course, if my understanding is correct, the NV35 doesn't really get any benefit from the partial precision hint. Its primary benefit is from writing shaders that use less than a particular number of registers.

The other GF-FX parts apparently do get some sort of benefit from using the partial precision hint.
 
Well, the use of the partial precision hint could also potentially aid the NV3x in getting around the register limitations. You may recall that the performance hit comes every four registers if FP16 is used.
 
If so, why - why didn't NVidia stick to open standards? I sense it wasn't because it didn't like them, nor that it wanted to invent another competing 3d API.

The way I look at it - NVIDIA weighs its interests in the retail gamer market equally with the professional/workstation market.

NVIDIA has *always* strived to make a single chip generation that can cover a wide (read: VERY wide) range of products, from a slightly stripped-down budget line, to mid/high-end gamer, to extreme high-end workstation/professional usage (a la Quadro boards).

In the case of the NV3x series chips, the basis of putting higher precision and more extreme instructions/flow control into the core design of the chip is aimed at making the same series very useful for their high-end products, which may chomp through far-from-realtime Renderman shaders, even though such precision may be a performance handicap for the chip series in realtime games and retail/consumer products based on the same.

For the gamer, it becomes a decision of what you pay for and what you get, which seems to be a changing equation. In years past, you could market a 32-bit framebuffer on the TNT series chipset and have strong proponents of having something that the technology openly couldn't support (i.e. too slow for the advance). Today, people see higher precision and lengthier shader support, well beyond Direct3D, and the impacts from the slower technology, and are a bit bent by it. It seems the emphasis is shifting more towards "value" than "it has it, but it's too slow to use it" like years past.

This may also be due to the extremely short product shelf life, with fierce competition on offset schedules, putting better/faster hardware on the shelves every 3-6 months.
 
Russ - are you saying NV35 can run FP32 as fast as FP16 - that the pipeline bottleneck when running a pure FP32 stream of instructions was only an NV30 limitation? So game developers no longer need to worry about whether they are using too many FP32 colour precision effects per frame vs simpler FP16 ones?

If NV3x performance starts to drop only when a scene requires, say, more than 10%-20% FP32 calculations, is this a reasonable limit, or are NVidia squeezing too tight a budget on game developers? And where does the line sit?

Is NVidia's FPS performance (relative to a mixed FX12 and FP16 stream of commands per frame) = 100% * (percentage of the screen calculated at FP16 or less) + 50% * (percentage of the screen calculated at FP32 precision), regardless of the rendering path selected - because it's driven by the GPU's pipeline parallelism, not by the 3d API implementation model?
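
To make that concrete, here's a toy Python sketch of the model I have in mind (purely my own guesswork, nothing measured - the 50% FP32 penalty is just the assumption from above):

[code]
# Toy model of blended-precision throughput (my own assumption, not measured data).
def relative_fps(fp32_fraction, fp32_penalty=0.5):
    """fp32_fraction: share of per-frame shading done at FP32 (0.0 - 1.0).
    fp32_penalty: assumed FP32 throughput relative to FX12/FP16 (0.5 = half speed)."""
    return (1.0 - fp32_fraction) * 1.0 + fp32_fraction * fp32_penalty

# A scene that is 20% FP32 would run at ~90% of mixed-precision speed,
# while an all-FP32 (ARB2-style) scene drops to 50%.
for share in (0.0, 0.05, 0.2, 0.5, 1.0):
    print(f"{share:5.0%} FP32 -> {relative_fps(share):4.0%} of mixed-precision FPS")
[/code]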

Sharkfood,

I thought about that, but surely the professional market really wants high performance at their targeted colour precision. The NV3x design seems to straddle the fence rather than specialise either for gamers or for cinematic rendering (other than for shader length). I feel we are getting closer but haven't yet hit the nail on the head.


PS Note the change in thread title to absorb Russ's observations!
 
g__day said:
Russ - are you saying NV35 can run FP32 as fast as FP16 - that the pipeline bottleneck when running a pure FP32 stream of instructions was only an NV30 limitation?

According to a pretty convincing body of evidence, all NV30-34 cores execute FP16 and FP32 pixel shader arithmetic ops at the same rate, namely one per pipeline (4 for NV30/31, 2 for NV34) per clock. The problem is that this ideal execution rate can only be sustained when fewer than 256 bits worth of temporary registers are in use. That translates to 4 FP16 temps, but only 2 FP32 temps. You can do a lot more useful things with 4 registers than you can with 2, so in actual practice a shader using FP16 is likely to be faster than one using FP32.

There are various other performance penalties as register usage increases from there; again, FP16 and FP32 perform the same at any given level of register usage (in bits), but FP32 takes twice as much register memory so the performance limits are reached twice as quickly.
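
Just to spell that arithmetic out, here's a quick sketch (assuming nothing beyond the 256-bit budget described above):

[code]
# Register-budget arithmetic from the post above: 256 bits of temp storage at full speed.
TEMP_BUDGET_BITS = 256

def max_full_speed_temps(bits_per_component):
    """Number of vec4 temp registers that fit in the full-speed budget."""
    register_bits = bits_per_component * 4   # 4 components per register
    return TEMP_BUDGET_BITS // register_bits

print("FP16 temps at full speed:", max_full_speed_temps(16))  # 4 x 64-bit regs
print("FP32 temps at full speed:", max_full_speed_temps(32))  # 2 x 128-bit regs
[/code]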

To my knowledge, it hasn't been completely determined what the story is with NV35. It seems clear that the 2 FX12 register combiners per pipe from NV30 have now been upgraded to FP units--FP16 appears to have identical performance to FX12 on NV35. It's not clear (AFAIK) whether these units are also capable of FP32 operations, or just FP16. (Or even if they are capable of all PS 2.0 arithmetic operations.) Of course the point is almost moot because there's almost no way you're going to usefully feed three parallel arithmetic ops with only 2 registers.

As for the question of why NV3x runs into performance issues at such a ridiculously low level of register usage, that seems to be an open question. My favorite theory is that more temp registers are designed into the pipeline, but they are used in permanent workarounds for various bugs in the pipeline. The problem with this theory is that one would think Nvidia could have used the NV30->NV35 transition to fix at least some of those bugs and free up some desperately needed registers--but NV35 appears to have the same register problems as NV30.

Of course, even if there's some other cause for the register scarcity, you'd still think it could have been improved for NV35. Another possibility is that the seemingly stupid design has something to do with supporting dynamic flow control in PS 2.0x, in contrast to just static flow control like PS 2.0 and ATI's R3x0.
 
One possible reason for the register scarcity may be that pixel shader register files are very expensive in hardware: if you have 4 rendering pipelines, each with 10 pipeline steps for pixel shader operation, you need 40 physical instances of the entire pixel shader register file for correct operation. Also, with 3 arithmetic units per pipeline, you need perhaps 10 or so ports into the register file or else you will starve the arithmetic units. So you end up with 40 register files, each with 32 128-bit registers and 10 ports each. That's something like 25-30 million transistors in total, and a routing nightmare. On register files alone. Now, it is possible to save transistors by splitting the register files, so that only a small part has the full 10 ports, with the remaining part having only 2 or 3 or so ports - this might reduce the transistor count by a factor of 2 to 3, but will hit performance hard once you start using any of the registers outside the 10-port part. Which is a good tradeoff for PS1.1 operation, given the very low number of temp registers that PS1.1 supports, but quite bad for PS2.0 performance.
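
To show roughly where a number like that comes from, here is a back-of-the-envelope sketch; the per-bit transistor costs are my own assumptions, picked only to land in the same ballpark as the estimate above:

[code]
# Rough reconstruction of the register-file estimate above. The transistor
# costs per bit are assumed values, chosen to reproduce the ~25-30M ballpark.
pipelines = 4
pipeline_steps = 10                          # pixels in flight per pipeline
register_files = pipelines * pipeline_steps  # 40 physical register file copies

registers_per_file = 32
bits_per_register = 128                      # FP32 vec4
bits_per_file = registers_per_file * bits_per_register

ports = 10
storage_transistors_per_bit = 6              # assumed storage cell cost
port_transistors_per_bit = 15                # assumed access + decode/mux/routing cost per port

transistors_per_bit = storage_transistors_per_bit + ports * port_transistors_per_bit
total = register_files * bits_per_file * transistors_per_bit
print(f"~{total / 1e6:.0f} million transistors on register files alone")  # ~26 million
[/code]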
 
arjan de lumens said:
One possible reason for the register scarcity may be that pixel shader register files are very expensive in hardware: if you have 4 rendering pipelines, each with 10 pipeline steps for pixel shader operation, you need 40 physical instances of the entire pixel shader register file for correct operation. Also, with 3 arithmetic units per pipeline, you need perhaps 10 or so ports into the register file or else you will starve the arithmetic units. So you end up with 40 register files, each with 32 128-bit registers and 10 ports each. That's something like 25-30 million transistors in total, and a routing nightmare. On register files alone. Now, it is possible to save transistors by splitting the register files, so that only a small part has the full 10 ports, with the remaining part having only 2 or 3 or so ports - this might reduce the transistor count by a factor of 2 to 3, but will hit performance hard once you start using any of the registers outside the 10-port part. Which is a good tradeoff for PS1.1 operation, given the very low number of temp registers that PS1.1 supports, but quite bad for PS2.0 performance.

I appreciate some of the points you're making, but I don't see why the register requirements need to be nearly what you're proposing. Why have the full random-access register file at every pipeline stage? A standard datapath would keep register access confined to one well-defined pipeline stage (or perhaps spread it across two or three on a deeply pipelined design). Intermediate values are of course propagated down the pipeline as needed, but in latches, not 10-ported registers; it's really a separate issue that has nothing to do with the size of the visible register set. The way you describe it, it seems like each of your 10 pipeline stages is an EX phase.

And 10 ports!! A modern MPU datapath is just as superscalar as a single pixel shader pipe, but typically uses just a dual read-port and single write-port design, IIRC. I suppose a pixel shader might tend to have denser operand locality, but 10 ports! Even if it's theoretically possible that you be asked to co-issue three FMACs using the same register for all three source operands and also as one of the destinations... do you really think the hardware is designed to accommodate this ridiculous instruction bundle? No way. 3 or 4 ports should be more than enough to keep register structural hazards to a minimum.
 