Nvidia and ARB2

R300:
r300die.gif
 
That image has a little bit of.... "Marketing" thrown into it. The real die and real area would never be given out. The image is more conceptual in nature than reflecting the real area of things. Sorry.
 
Surely the "256-bit interface" would be distributed around the top and right (and possibly a little more) of the chip.

Anyway - there was a report some time back on transistors and processes in which they talked about ATI getting to 100M transistors - didn't that have an image of the die?
 
Yep, and it's guarded by Lord Lucan* and Shergar.

* Substitute Jimmy Hoffa for American readers
 
Dave H said:
....Nvidia can probably be properly accused of hubris in thinking that they could tailor their product to address a new and rather different market segment (low-end production rendering) while still maintaining product superiority in the consumer market. Or of arrogance in assuming they were the only IHV worth paying attention to, and thus could influence future specs to reflect their new architecture instead of one that better targeted realtime performance.

Obviously one can correctly accuse their marketing of all sorts of nasty things.

But I don't think one can really accuse Nvidia of incompetence, or stupidity, or laziness, or whatnot. NV3x is not really a bad design. It's unquestionably a decent design when it comes to performance on DX7 and DX8 workloads. I can't entirely judge, but I would guess it's about as good as could be expected in this timeframe as an attempt to replace offline rendering in the low-end of video production; I just don't think that's quite good enough yet to actually capture any real part of the market.

The only thing it's truly bad at is rendering simple DX9-style workloads (and yes, HL2 is very much on the simple end of the possibilities DX9 represents) at realtime interactive framerates. And--except with the benefit of hindsight--it doesn't seem obvious to me that Nvidia should have expected any serious use of DX9 workloads in the games of the NV3x timeframe. This prediction turns out to have been very, very wrong. (What I mean by "the NV3x timeframe" does not end when NV40 ships, but rather around a year after NV3x derivatives are dropped from the mainstream of Nvidia's product lineup. After all, the average consumer buying a discrete video card expects it to hold up decently for at least a while after his purchase.)

It turns out that DX9 gaming is arriving as a major force quite a bit ahead of DX9 special effects production. And Nvidia will rightly pay for betting the opposite. But, viewed in the context of such a bet, their design decisions don't seem that nonsensical after all.

Sireric responded to most of your post here, so I'll just address this part of it...

I would never accuse anything nVidia did of being nonsensical. Narrow-minded, shortsighted, and presumptive...certainly. But I'm sure it all made perfect sense to somebody there...:)

I think your idea of "DX9 special effects production" is pretty funny--I mean, that you'd think that...:) Perhaps what you mean is that their hardware is so slow at handling "DX9 special effects" that "DX9 special effects production" work is all it's good for? I find this a pretty nonsensical idea....I *hope* this is not what nVidia was thinking...:D
 
sireric said:
In general, a from-scratch design takes on the order of 3 years from architecture spec to production. However, GFs have never had a from-scratch design that I can see. Evolutionary instead (which is not bad -- don't get me wrong; it has pros & cons). Consequently, the design cycle is probably shorter. Both ATI & NV had plenty of input into DX9, as well as a good (not great) amount of time to incorporate needed changes.
Very interesting point about from-scratch vs. evolutionary designs. Getting back to the original issue: do you think the decision to base DX9 around FP24 was sealed (or was at least evident) early enough for Nvidia to have redesigned the NV3x fragment pipeline accordingly without taking a hit to their release schedule? (And of course the NV30 was realistically planned for fall '02 before TSMC's process problems.) Obviously a great deal of a GPU design has to wait on the details of the API specs, but isn't the pipeline precision too fundamental to the overall design? Or is it?

Or, since you might not be able to speak to Nvidia's design process: was R3x0 already an FP24 design at the point MS made the decision? If they'd gone another way--requiring FP32 as the default precision, say--do you think it would have caused a significant hit to R3x0's release schedule? Or if they'd done something like included a fully fledged int datatype, would it have been worth ATI's while to redesign to incorporate it?

Lots of stuff in there. The computation / bandwidth ratio is interesting, but I'm not sure it makes that much sense. The raw pixel BW required is not determined by the format of your internal computation engine, but more by the actual surface requirements per pixel.
Right. To be clear, I wasn't claiming that computation precision affected the required per-pixel bandwidth. Rather that your available bandwidth requires a certain level of fillrate to balance it, and that your transistor budget would then put limitations on the precision at which you achieve that fillrate. The question is how large a factor the ALUs really are in the overall transistor budget--whether a ~2.5x increase in ALU size (from FP24->FP32) is enough to warrant cutting down on the number of pixel pipes or the computational resources in each pipe.
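A rough back-of-the-envelope for where a figure like that comes from (assuming, as a common rule of thumb, that the multiplier dominates ALU area and grows roughly with the square of the significand width; FP24 here being s16e7 and FP32 being s23e8):

\[
\frac{A_{\mathrm{FP32}}}{A_{\mathrm{FP24}}} \approx \left(\frac{23+1}{16+1}\right)^{2} \approx 2.0
\]

Adders, normalization logic and register width scale closer to linearly, so the overall per-ALU growth plausibly lands somewhere in the ~1.5-2.5x range.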

Honestly, the area increase from 24b to 32b would not cause you to go from 8 to 4 pixels -- it would increase your overall die cost, but not *that* much. It's just a question of whether it would justify the cost increase. For today, it doesn't.
...and apparently the answer is no. Which brings to mind the question of why Nvidia stuck with a 4x2 for NV30 and NV35, if not because they didn't have the transistor budget to do an 8x1. Two ideas spring to mind. First, that they were so enamored of the fact that they could share functionality between an FP32 PS 2.0 ALU and two texture-coordinate calculators that they went with an nx2 architecture. Second, that they planned to stick with a 128-bit wide DRAM bus after all; that NV35 is not "what NV30 was supposed to have been", but rather the quickest way to retrofit improved performance (particularly for MSAA) onto the NV30 core; and that if NV35's design seems a little starved for computational resources compared to its impressive bandwidth capabilities (particularly w.r.t. PS 2.0 workloads), that's because it was just the best they could do on short notice.
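To make the 8x1 vs. 4x2 trade-off concrete, here is a toy per-clock throughput model -- purely illustrative, with made-up texture counts, and not a claim about either chip's actual scheduler:

Code:
# Toy per-clock throughput model for "pipes x TMUs-per-pipe" organizations.
# Assumes (simplistically) that a pipe needs ceil(textures / tmus_per_pipe)
# clocks of texture issue per pixel and that texturing is the bottleneck.
import math

def pixels_per_clock(pipes, tmus_per_pipe, textures_per_pixel):
    if textures_per_pixel == 0:
        return pipes                      # pure color/Z fill
    clocks_per_pixel = math.ceil(textures_per_pixel / tmus_per_pipe)
    return pipes / clocks_per_pixel

for textures in (0, 1, 2, 4):
    eight_by_one = pixels_per_clock(8, 1, textures)
    four_by_two  = pixels_per_clock(4, 2, textures)
    print(f"{textures} tex/pixel: 8x1 -> {eight_by_one:.0f} px/clk, "
          f"4x2 -> {four_by_two:.0f} px/clk")

The even-texture-count case (classic DX7/DX8 multitexture) is exactly where an nx2 layout looks attractive -- both organizations hit the same texel rate there; single-textured, arithmetic-heavy PS 2.0 passes are where 8x1 pulls ahead.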

(If NV3x chips, like R3x0, could use all the registers provided for by the PS 2.0 spec without suffering a performance penalty, their comparative deficit in calculation resources would still likely leave them ~15-25% behind comparable ATI cards in PS 2.0 performance. But that is nothing like the 40-65% we're seeing now.)
Actually, the performance difference appears to be 2~5x when running complex shaders. It's not just register usage (though that's a big one).
Hmm. I don't recall seeing too many real-world examples over a factor of ~3x (hence 65% behind). FWIW I'm talking about benchmarks of real games that might ship in the next, say, year or less: something like HL2 (or even a *bit* more shader-intensive) at 1600*1200, running an in-game workload. Perhaps if you're talking performance (maybe clock-normalized?) on straight synthetic shaders you might get to ~5x. Or are you suggesting there's even worse to come? :oops:

Where R3x0 and NV3x come into this is in performance for dependent texture reads. I'd be interested in having more specifics, but as I understand it the R3x0 pipeline is deep enough to do an alright job minimizing the latency from one level of dependent texture reads (particularly if you fetch the texture as much in advance of when you need it as possible), but woe be unto you if you're looking for two or three levels. Meanwhile, NV3x is reputed to do just fine mixing several levels of texture reads in to the normal instruction flow--meaning that it must do a gangbusters job at hiding latency. Meaning it has a hell of a deep pipeline.
BS. The R300 has no performance hit at 1 or 2 levels, and sometimes a slight hit when going to 3 and 4 levels of dependency. The performance hit an NV3x gets in just defining 4 temporaries is much larger than the R3x0 gets in doing 4 levels of dependencies. The worst 4-level perf hit I've ever seen is in the 10~15% range. The R3x0 is VERY good at hiding latencies.
Well there goes that theory. I do wonder where I've gotten that impression about R3x0 and multiple levels of texture dependency (it was second-hand, of course), but obviously you're the authority on this one.

Not sure I'll get an answer on this but...so what do you think? I mean you must have some good ideas about what's causing the severe register pressure on NV3x. Surely Nvidia would love to offer more than 2 full-speed FP32 registers if they could. Is it a result of some other feature of the NV3x pipeline (static branching, say, although I'm not sure why this would be the case)? Or perhaps some likely piece of broken functionality that wouldn't have had time to be fixed for NV35?

Any thoughts would be appreciated...
 
sireric said:
That image has a little bit of.... "Marketing" thrown into it. The real die and real area would never be given out. The image is more conceptual in nature than reflecting the real area of things. Sorry.

Why not? It's not like Nvidia (or whoever) can't just buy a Radeon, slice open the package and make their own micrograph. Or stick the damn thing under the electron microscope for that matter.

Given that both Nvidia and ATI pipeline their design process, any "trade secrets" revealed by publishing even the most basic design details (like, what percent of the die is dedicated to the pixel pipeline and what percent to geometry) likely couldn't be "stolen" and incorporated into competitive products until two product generations down the road.

The only downside to the IHVs releasing more architectural and design details that I can see is that it would probably prevent the sort of obfuscation (ok, lies) we've seen from the likes of Nvidia and SiS regarding basic facts like the number of pixel pipes in their chips. Meanwhile, one would think that an IHV that didn't have anything to hide would only gain credibility among enthusiasts and respect among engineers.

It's always been a bit of a pet peeve of mine that the GPU companies can't treat their business a bit more like it's doing science and a bit less like it's hawking snake oil. CPU vendors give away hugely more technical data on their products, and they don't seem to have suffered as a result.

Anyways. I know you don't get to decide the policy (and both you ATI guys and the PVR boys have always been helpful and modestly forthcoming about current/past products in this forum), but you're the one who gets to hear me complain about it...
 
1. IMO, even if nvidia started designing NV30 before DX9, they had enough time to change the design.
2. Even if they didn't have enough time to change it, it's still totally their fault for implementing the design so poorly.
The NV3x series can't even run FP16 fast. There can be no excuse for that.
 
I think Dave H's points are well articulated. Unlike the GF 1/2/3/4, the GFFX looks like a workstation card turned into a gaming card, instead of vice versa. It'd make sense that it was geared towards DX9 content creation rather than actual DX9 gaming. I mean, the pick-up rate for DX8 was pretty pathetic.

However, I'm sure NVIDIA has been aware of Valve's HL2 plans for some time (as well as every other dev house out there). One would think that they would have seen the rapid adoption of DX9 coming and done something about it.
 
Dave H said:
WaltC said:
I think your idea of "DX9 special effects production" is pretty funny--I mean, that you'd think that...:)
Yeah, gee, Walt, only a big fucking idiot would think that. :rolleyes:

Dave, this trend is the same trend that's been going on for years and years. What it might have to do with nV3x and DX9, in particular, and apart from any other 3D chip and windows API ever made, I can't imagine.
 
zurich said:
I think Dave H's points are well articulated. Unlike the GF 1/2/3/4, the GFFX looks like a workstation card turned into a gaming card, instead of vice versa. It'd make sense that it was geared towards DX9 content creation rather than actual DX9 gaming. I mean, the pick-up rate for DX8 was pretty pathetic.

I think these arguments can be disposed of rather easily by examining the fact that the market for 95%+ of these chips is for use as a "gaming card" versus a "workstation" card, and by the fact that nVidia's "workstation" cards have always been exact duplicates of their "gaming cards," only with different driver and software packages included along with major differences in price.

To make this argument it must be assumed that nVidia decided to design a "specially slow" workstation-oriented chip and then decided to point it at the 3d-gaming market (where the vast bulk of its mid-high range chips are sold.) This means, if true, that someone at nVidia is smoking something hallucinogenic....:)

The problem is that it doesn't pay nVidia to design a slow chip for either market--and it doesn't hurt a "workstation" graphics chip to render faster, btw. Examine nVidia's marketing efforts for the past year for nV3x and you'll see they are almost all about 3d gaming, which should be understood as nVidia's target market for nV3x. The fact is that there's nothing more intrinsically "workstation-ish" about nV3x than there is about R3x0--and in both situations R3x0 renders the same quality faster. The fact is that as opposed to doing something as stupid as deliberately designing a slow "workstation" chip to sell into the 3d-gaming market (where it is also slow), nVidia simply didn't design as powerful a chip as R3x0, and that's why it looks as comparatively poor as it does.
 
Perhaps nV thought they could compensate for their "slow" IPC with 130nm and high clock speeds.
 
WaltC said:
Dave H said:
WaltC said:
I think your idea of "DX9 special effects production" is pretty funny--I mean, that you'd think that...:)
Yeah, gee, Walt, only a big fucking idiot would think that. :rolleyes:

Dave, this trend is the same trend that's been going on for years and years. What it might have to do with nV3x and DX9, in particular, and apart from any other 3D chip and windows API ever made, I can't imagine.

Maybe you should try reading the linked post, then.

(And incidentally, obviously such work would be done in OpenGL and not DX9, but it's the featureset of FP calculations and reasonably flexible vertex and fragment shaders that DX9, ARB_fragment_program, NV_fragment_program etc. expose that's being discussed here.)
 
Dave H said:
Very interesting point about from-scratch vs. evolutionary designs. Getting back to the original issue: do you think the decision to base DX9 around FP24 was sealed (or was at least evident) early enough for Nvidia to have redesigned the NV3x fragment pipeline accordingly without taking a hit to their release schedule? (And of course the NV30 was realistically planned for fall '02 before TSMC's process problems.) Obviously a great deal of a GPU design has to wait on the details of the API specs, but isn't the pipeline precision too fundamental to the overall design? Or is it?

The FP24 availability was at least 1.5 years before NV30 "hit" the market. I'm sure MS would have been very reasonable to inform them earlier or even work with them on some compromise. I have no idea what actually "happened". Not sure how that fits in with their schedule, but I think it's safe to say that they had time, if they wanted. I don't believe in the TSMC problems. I agree that LowK was not available, but the 130nm process was clean by late spring 02, as far as I know. The thing that people fail to realize is that MS doesn't come up with an API "out of the blue". It's an iterative process with the IHVs, and all of us can contribute to it, though it's controlled, at the end, by MS.

Or, since you might not be able to speak to Nvidia's design process: was R3x0 already an FP24 design at the point MS made the decision? If they'd gone another way--requiring FP32 as the default precision, say--do you think it would have caused a significant hit to R3x0's release schedule? Or if they'd done something like included a fully fledged int datatype, would it have been worth ATI's while to redesign to incorporate it?
No, it wasn't. I don't think FP32 would have made things much harder, but it would have cost us more in die cost.

...and apparently the answer is no. Which brings to mind the question of why Nvidia stuck with a 4x2 for NV30 and NV35, if not because they didn't have the transistor budget to do an 8x1. Two ideas spring to mind. First, that they were so enamored of the fact that they could share functionality between an FP32 PS 2.0 ALU and two texture-coordinate calculators that they went with an nx2 architecture. Second, that they planned to stick with a 128-bit wide DRAM bus after all; that NV35 is not "what NV30 was supposed to have been", but rather the quickest way to retrofit improved performance (particularly for MSAA) onto the NV30 core; and that if NV35's design seems a little starved for computational resources compared to its impressive bandwidth capabilities (particularly w.r.t. PS 2.0 workloads), that's because it was just the best they could do on short notice.

I can't speculate too much, but I agree with some of your posts. At the end, the GF4 was a 4x? arch (was it x1 or x2 -- I don't remember). A natural evolution of that architecture would be a 4x2 still. A radical change there might be more than their architecture can handle.

Hmm. I don't recall seeing too many real-world examples over a factor of ~3x (hence 65% behind). FWIW I'm talking about benchmarks of real games that might ship in the next, say, year or less: something like HL2 (or even a *bit* more shader-intensive) at 1600*1200, running an in-game workload. Perhaps if you're talking performance (maybe clock-normalized?) on straight synthetic shaders you might get to ~5x. Or are you suggesting there's even worse to come? :oops:
Well, 3x would be 33% in my book. I don't remember the shadermark results, but I thought some were more than 3x. Our renderman conversion examples (using Ashli to generate DX9 assembly) showed up to 5x using some sort of 4x driver set from NV.
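For what it's worth, the two figures describe the same gap from opposite ends. If card A runs a shader at $3\times$ card B's speed, then

\[
\text{B as a fraction of A} = \tfrac{1}{3} \approx 33\%,
\qquad
\text{B's deficit relative to A} = 1 - \tfrac{1}{3} \approx 67\%.
\]

So "~65% behind" and "33% in my book" are both quoting roughly the same ~3x ratio.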


Well there goes that theory. I do wonder where I've gotten that impression about R3x0 and multiple levels of texture dependency (it was second-hand, of course), but obviously you're the authority on this one.

Not sure I'll get an answer on this but...so what do you think? I mean you must have some good ideas about what's causing the severe register pressure on NV3x. Surely Nvidia would love to offer more than 2 full-speed FP32 registers if they could. Is it a result of some other feature of the NV3x pipeline (static branching, say, although I'm not sure why this would be the case)? Or perhaps some likely piece of broken functionality that wouldn't have had time to be fixed for NV35?

Any thoughts would be appreciated...

Not sure exactly. I'm guessing that they share the ALUs between texture and ALU ops, so they can only issue 1 instruction per cycle (having 1 instruction memory implies that too). As well, I'm guessing that they use some sort of per-pixel storage for registers, which is also used to hide latency. It could be that that memory is a small cache and that they store the overflow in local video memory; that would not be a good design, since DX9 has a very fixed maximum bound on pixel sizes. At the end, we may never know.
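A toy model of why per-pixel register count and latency hiding pull against each other, assuming -- as guessed above -- a fixed on-chip register store shared by all pixels in flight. The sizes and latency below are invented purely for illustration, not NV3x's actual numbers:

Code:
# Toy model: a fixed on-chip register store is shared by all pixels in flight.
# More live FP32 registers per pixel -> fewer pixels in flight -> less
# texture-fetch latency the pipeline can cover before it stalls.
REGISTER_STORE_BYTES = 4 * 1024   # invented size, for illustration only
BYTES_PER_FP32_REG   = 16         # 4 components x 4 bytes each
FETCH_LATENCY_CLOCKS = 100        # also invented

def pixels_in_flight(live_regs_per_pixel):
    return REGISTER_STORE_BYTES // (live_regs_per_pixel * BYTES_PER_FP32_REG)

for regs in (2, 4, 8):
    in_flight = pixels_in_flight(regs)
    # With one pixel issuing per clock while the rest wait on fetches,
    # latency stays hidden only while enough pixels are in flight.
    status = "hidden" if in_flight >= FETCH_LATENCY_CLOCKS else "exposed -> stalls"
    print(f"{regs} FP32 regs/pixel: {in_flight} pixels in flight, latency {status}")

With a fixed store, every extra live register roughly halves the number of pixels in flight, which would at least be consistent with performance falling off a cliff past some small register count -- though, as said above, we may never know the real mechanism.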

I speak for myself.
 
Dave H said:
Maybe you should try reading the linked post, then.

(And incidentally, obviously such work would be done in OpenGL and not DX9, but it's the featureset of FP calculations and reasonably flexible vertex and fragment shaders that DX9, ARB_fragment_program, NV_fragment_program etc. expose that's being discussed here.)

Right, it's not just DX9 that's slow at full precision on nV3x, it's ARB2, as well.

The point for me is that the idea that nV3x was designed for "workstations" and not "3d gaming" is simply void, as it suffers from the same problems in workstation usage--it's slow at full precision there, too.

OK, I went back a second time and reread it, and still didn't see one word in it about nV3x and DX9 (or the same functionality exposed in OpenGL). In fact, this sentence:

John Carmack said:
...The current generation of cards do not have the necessary flexibility, but cards released before the end of the year will be able to do floating point calculations, which is the last gating factor....

...leads me to believe that this was written on June 27, 2002, since as of that date this year FP cards--the R9700P--had been shipping for nearly a year. In fact, Carmack simply seems to be discussing, in general, the trend of 3d chips overtaking software renderers, which has been in progress ever since the V1 rolled out. So re-reading Carmack's very general statement here doesn't show me how you reached your ideas about nV3x being "special" in this regard, in comparison with R3x0--which is headed in the same direction.
 
DaveBaumann said:
Eh? As far as we know so far, NV35 has a full combined ALU & texture address processor and two smaller ALUs for each of its 4 pipes. R300, ignoring the Vec/Scalar co-issue, has an FP texture address processor, a full ALU and a small ALU for each of its 8 pipes. That's 12 FP processors for NV35 and 16 for R300, excluding the texture address processors, 24 including them.
I'm really not sure that's an accurate depiction. As far as I can tell, from the evidence posted in the thread you linked earlier, the R300 need only have a multiplier and an adder in serial to have the performance characteristics supplied (making for one MAD per clock), not one MAD and one unit capable of doing a few other things.
 
WaltC said:
Right, it's not just DX9 that's slow at full precision on nV3x, it's ARB2, as well.

The point for me is that the idea that nV3x was designed for "workstations" and not "3d gaming" is simply void, as it suffers from the same problems in workstation useage--it's slow at full precision there, too.
The point is that it's capable of full precision. While I'm not sure how it stacks up against 3DLabs' own solution right now, it is a definite step ahead of the R3xx in precision, which may make it the only current viable solution for low-end offline rendering.

And as far as speed is concerned, remember that we're talking about pitting the NV3x vs. software rendering.
 
Chalnoth said:
I'm really not sure that's an accurate depiction. As far as I can tell, from the evidence posted in the thread you linked earlier, the R300 need only have a multiplier and an adder in serial to have the performance characteristics supplied (making for one MAD per clock), not one MAD and one unit capable of doing a few other things.

It is accurate. The diagram is a text depiction of the diagram Eric used to describe the R300 pixel shader. We don't know what other functions are supported by the second ALU yet either.
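Spelling out the counting in that depiction, as described above:

\[
\text{NV35: } 4 \times (1_{\text{full/combined}} + 2_{\text{mini}}) = 12
\qquad
\text{R300: } 8 \times (1_{\text{full}} + 1_{\text{mini}}) = 16
\]
\[
\text{R300 including the FP texture address processors: } 8 \times (1 + 1 + 1) = 24
\]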
 