Carmack's comments on NV30 vs R300, DOOM developments

Joe DeFuria said:
Despite Rev's Advice... ;)

Nobody, including Carmack, knows if and when it will actually be fixed... and to what degree. That "nVidia told him" is not particularly convincing. I can't personally recall any driver update, over any length of time, increasing performance in a game situation (not just a synthetic one) by 100%.

Well, this is what I think: Carmack likes Nvidia's driver team because they don't bullshit him; if they start, he won't like them much. So if he was talking to real people instead of PR people, then it is probably true to a large extent at least; otherwise Nvidia will lose much of the esteem Carmack has for them.
 
Thanks for the info, Lumiscent.

So, let me retry... I'm going to assume the R300's 24+8 (24 vector, 8 scalar) architecture is equal to the GFFX's 32-component architecture. That's probably not correct; it might be better sometimes and worse other times, so it might really be impossible to say which is best...

R9700P, FP24: 325*32 = 10400
GFFX, FP16: 500*32 = 16000
GFFX, FP32: 500*32/2 = 8000

So GFFX being 50% of the R9700P performance in ARB makes no sense at all if it were really hardware-related.
Even if you suppose ATI's architecture is better, 70% really seems like a worst-case scenario. So, unless nVidia is hiding something from us, it's definitively driver-related.
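
Just to put rough numbers on that, here's a quick Python sketch of the back-of-the-envelope math above (the 32-components-per-clock figure and the "FP32 at half the FP16 rate" assumption come from this post, not from confirmed specs):

    # Rough shader throughput estimate, in Mcomponents/s.
    # Assumes both chips issue 32 FP components per clock at their native
    # precision, and that NV30's FP32 runs at half its FP16 rate.
    COMPONENTS_PER_CLOCK = 32

    r9700p_fp24 = 325 * COMPONENTS_PER_CLOCK       # 325 MHz core -> 10400
    gffx_fp16   = 500 * COMPONENTS_PER_CLOCK       # 500 MHz core -> 16000
    gffx_fp32   = 500 * COMPONENTS_PER_CLOCK // 2  # FP32 at half rate -> 8000

    print(f"GFFX FP32 vs R9700P FP24: {gffx_fp32 / r9700p_fp24:.0%}")  # ~77%

Even with those generous assumptions the FP32 path only drops to about 77% of the R9700P, nowhere near 50%.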

As for the dedicated integer hardware... could it be the same hardware that is used to merge subsamples for AA? Or is nVidia using the 3DFX method for subsample merging?


Uttar
 
So, unless nVidia is hiding something from us, it's definitively driver-related

Ughhh not another Chalnoth spouting 'definitley' & 'absolutley' & '100% confident' BS, your assumptions are based on speculation and not hard fact. :!:
 
The 8 scalar processors of the R300 fragment pipeline can be used alongside the vector units for simultaneous vector and scalar calculations. So, not only do the fragment pipes operate similarly to NV30's (4 vector component processors), but they can also execute a complex scalar op on the side.
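
Purely as a toy illustration of why that co-issue matters (made-up instruction counts, not real shader code):

    # Toy model: with vector/scalar co-issue, a scalar op can ride along
    # with a vector op in the same cycle instead of taking its own slot.
    vector_ops = 10   # hypothetical vec3 MULs/ADDs in some shader
    scalar_ops = 6    # hypothetical scalar work (RCP, RSQ, etc.)

    cycles_without_coissue = vector_ops + scalar_ops      # 16 cycles
    cycles_with_coissue    = max(vector_ops, scalar_ops)  # 10 cycles
    print(cycles_without_coissue, cycles_with_coissue)

Real scheduling is obviously nowhere near that clean, but it shows where the "free" scalar op can pay off.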
 
Doomtrooper said:
Ughhh not another Chalnoth spouting 'definitley' & 'absolutley' & '100% confident' BS, your assumptions are based on speculation and not hard fact. :!:

Oopsy, my mistake.
I know it's not based on hard fact. I meant to say it was definitively driver-related IF my assumptions were correct, which is not certain at all.
Sorry for the confusion; I'd hate to look like an nVidia fanboy.

Lumiscent: So you'd say that, based on the little info we got, it's likely the R300 architecture is more efficient than the NV30 architecture?


Uttar
 
Dave H said:
Anyway, I think that nVidia went this route because they needed the FP32 for professional use, and right there they probably lost the option of going with 'one' pipeline.

Why?

Good question and I definitely don’t know the answer. :idea:

My reasoning is that they realized they needed FP32 for professional quality, while speed wasn't important there. Speed was important with FP16, however, for all the realtime rendering, so they made what they regarded as a clever split of the ALU units in the fragment processor for double speed with FP16. But now they were architecturally bound by the compromises they had made in the flexible FP part, and decided that int ops were better handled by the old register combiners (which you can still use on top of, i.e. after, the fragment processor).
 
I wouldn't be able to say for sure, Uttar; only time will tell, with possible driver optimizations on the horizon, but it seems the R300 has a more robust implementation (from a fragment shader, architectural standpoint).
 
:idea:

I think I've got it. Nvidia decided they wanted to support FP32 to target the professional market. FP32 is obviously overkill for most situations and will severely hurt performance, so they also support packed FP16 using the same pipeline, giving you twice the throughput. The question is, why not use the same pipeline to do Int8? It's no problem to do one Int8 calculation using the same hardware it takes to do one FP16 calc. (In fact, you can do Int12.) You just can't do 2 Int8's per 1 FP16, because even though there are 16 bits worth of functional units, the exponent bits do different math from the mantissa bits and can't be applied to doing integer arithmetic.
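
A rough way to see the bit-budget argument in code (assuming the FX's FP16 is the usual 1 sign / 5 exponent / 10 mantissa layout, which I haven't verified):

    # Why one FP16 unit can handle a single Int8 (or even Int12) but not
    # two packed Int8s: only the sign + mantissa datapath (plus the implied
    # leading bit) is usable for plain integer math; the 5 exponent bits do
    # shifting/compare work, not integer adds and multiplies.
    SIGN_BITS, EXP_BITS, MANTISSA_BITS = 1, 5, 10

    int_usable_bits = SIGN_BITS + MANTISSA_BITS + 1   # roughly 12 bits
    print(int_usable_bits >= 12)      # True: one Int12 (and so one Int8) fits
    print(int_usable_bits >= 2 * 8)   # False: two packed Int8s don't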

Now let's look at the R300 pipeline. It supports FP24, and it also uses the same execution units to provide Int8 functionality. Now, previously I was assuming Int8 ran at the same speed as FP24...but now that I think about it, there should be enough mantissa bits in the FP24 format to support packing 2 Int8 ops into the space of one FP24 op. In other words, just as FP16 has twice the throughput of FP32 on NV30, I think Int8 has twice the throughput of FP24 on R300!

So, in order to match R300's Int8 shading speed, NV30 needs a dedicated Int8 pipeline with roughly twice the throughput of its FP16 pipeline. (Assuming NV30's FP16 throughput ~= R300's FP24 throughput.) Mystery solved.

Now, is my guess about the R300 doing Int8 at double the throughput of FP24 correct? I don't know, but I'm quite sure many people on this board do...
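
For what it's worth, if R300's FP24 really is a 1 sign / 7 exponent / 16 mantissa format (that's the figure I've seen quoted, but treat it as an assumption), the packing idea is trivial at the bit level:

    # Sketch: two 8-bit ints fit exactly in the 16 mantissa bits of an
    # assumed s1e7m16 FP24 format. Purely illustrative; it says nothing
    # about whether R300's ALUs actually split that datapath in two.
    MANTISSA_BITS = 16

    def pack2(a, b):
        assert 0 <= a < 256 and 0 <= b < 256
        return (a << 8) | b

    def unpack2(packed):
        return (packed >> 8) & 0xFF, packed & 0xFF

    packed = pack2(200, 55)
    print(packed.bit_length() <= MANTISSA_BITS)   # True
    print(unpack2(packed))                        # (200, 55)

Whether the control logic to actually issue two Int8 ops down that datapath exists is another question entirely.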
 
DaveBaumann said:
nutball said:
I'm wondering if some of the lower-end NV3x cores will dump the fixed-function integer pipeline and do everything in the programmable FP (like ATi).

Unfortunately, this would go against NVIDIA’s prior methods for making lower end parts.

If recent history is any indication, then the future NV3x low-end part would inherit the fixed-function integer pipeline of the NV30 while not having a float pipeline at all. Additionally, scrap half the pixel pipelines and some of the vertex processors, and the low-end part is ready...
However, didn't nVidia say it's all going to be CineFX architecture? The part outlined above would merely be a GF4 with only one texture unit per pipe, except for the more flexible (but probably much slower) vertex shader.
 
DaveBaumann said:
I'd say that R300 is doing all formats at exactly the same rate.
But then how do we reconcile all the following pieces of information?
  1. FP16 has twice the theoretical throughput of FP32 on NV30 (confirmed? and common sense in any case)
  2. Int8 has a considerably higher theoretical throughput than FP16 on NV30 (confirmed and obvious from the fact that they don't use the FP16 pipeline to handle Int8)
  3. FP16 on GFfx has similar real performance to FP24 on 9700 (based on Carmack's .plan, assuming Doom 3's NV30 fragment path uses primarily FP16 format)
  4. Int8 shaders on GFfx do not appear faster than on 9700 (based on 3dmark01, Shadermark, etc.)
Of course this is a mix of theory and actual results, cobbled together from different tests. Plus there is the question of the current state of the GFfx's drivers with respect to shader performance. Still, the only other coherent explanations I can think of are that either D3's NV30 path uses primarily Int8 rather than FP16, or that GFfx's current shader performance problems are primarily concentrated in the integer register-combiner pipeline rather than the FP pipeline. Neither of these alternatives is very credible, IMO. (A third possibility: there is some functionality in the integer shader that cannot be easily replicated in the FP16 pipeline... but I can't imagine what that would be.)

Plus, it seems (from my limited knowledge) that it would be relatively easy to run Int8 shaders at double throughput with a native FP24 pipeline.

But maybe I'm missing something. (Or, more likely, several things...)
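
To make the chain of inference explicit, here's a trivial sanity check (the relative numbers are arbitrary placeholders; only the orderings from points 1-4 matter):

    # Encode observations 1-4 with placeholder relative throughputs.
    nv30_fp32 = 1.0
    nv30_fp16 = 2.0 * nv30_fp32   # (1) FP16 twice FP32 on NV30
    nv30_int8 = 1.5 * nv30_fp16   # (2) Int8 considerably faster than FP16 on NV30
    r300_fp24 = nv30_fp16         # (3) GFfx FP16 ~ 9700 FP24 (Doom 3)
    r300_int8 = nv30_int8         # (4) Int8 shaders roughly a wash between the two

    # If all four hold, R300's Int8 rate must exceed its FP24 rate, which
    # contradicts "all formats at exactly the same rate".
    print(r300_int8 > r300_fp24)  # True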
 
Dave H, I am not sure if you would need extra control logic in the pipeline when packing/unpacking bits like that. Maybe ATI included it, maybe not. I guess you/we could ask OpenGL guy or someone else from ATI. OpenGL guy, are you there? :D
 
mczak said:
DaveBaumann said:
nutball said:
I'm wondering if some of the lower-end NV3x cores will dump the fixed-function integer pipeline and do everything in the programmable FP (like ATi).

Unfortunately, this would go against NVIDIA’s prior methods for making lower end parts.

If recent history is any indication, then the future NV3x low-end part would inherit the fixed-function integer pipeline of the NV30 while not having a float pipeline at all. Additionally, scrap half the pixel pipelines and some of the vertex processors, and the low-end part is ready...
However, didn't nVidia say it's all going to be CineFX architecture? The part outlined above would merely be a GF4 with only one texture unit per pipe, except for the more flexible (but probably much slower) vertex shader.


Yes, this is what motivated my comment. NV have said that all NV3x cores will be CineFX-capable (whatever that means in reality). I took it to mean they'd have programmable vertex and pixel shaders. So to cut down on transistors you have to start losing something else. How much slower would the FP-only path be than the old fixed-function? Maybe they'd drop support for FP32 as well, just have FP16?
 
nutball said:
How much slower would the FP-only path be than the old fixed-function? Maybe they'd drop support for FP32 as well, just have FP16?
Well, if the FX indeed has a completely separate 8-bit integer unit (I think it still has some purely 8-bit int units, possibly related to fixed-function processing, but probably uses the 16-bit float unit for 8-bit int ops...), placed in there just for speed, then yes, that could easily be taken out to reduce die size.

However, I see no reason that nVidia would need to drop 32-bit support completely. I doubt it takes that many transistors to make two 16-bit pipes run as a single 32-bit pipe. I'm betting on just fewer FP/VP pipes.

Edit: Major typo at the end of the first paragraph... ("improve speed" changed to "reduce die size")
 
DaveBaumann said:
I'd say that R300 is doing all formats at exactly the same rate.

I'd guess the same thing, based on ATI's MO more than anything else; they don't seem to micro-optimize much.
 
OpenGL guy said:
Himself said:
DaveBaumann said:
I'd say that R300 is doing all formats at exactly the same rate.

I'd guess the same thing, based on ATI's MO more than anything else; they don't seem to micro-optimize much.
What do you want to optimize? You can't have it all.

I think he meant it as a compliment. I like the fact that ATI focused their efforts on getting the standard path (ARB2, whatever) to work the fastest. I don't think I'm the only person who's not particularly keen on writing an NV30-specific path. Kudos to ATI.
 
OpenGL guy said:
What do you want to optimize? You can't have it all.

I am not suggesting ATI do anything differently, just giving my opinion on the different approaches. NVIDIA is obsessed with speed at all costs in everything; they sometimes don't see the forest for the trees. ATI tends to navel-gaze at the expense of pampering their customers.

16-bit vs 32-bit game modes: ATI runs them at the same speed, while 16-bit is generally much faster on NVIDIA cards relative to its 32-bit. Sure, it's historical, but it indicates how they each think. NVIDIA has tons of FSAA modes, ATI only a few, but I prefer that. NVIDIA has Coolbits to enable overclocking, ATI nada. NVIDIA has 3D glasses drivers, ATI nada. Boo! Hiss! NVIDIA has tons of demos; ATI has some, and they are nice, but not as many, although on the other hand we have Humus. Erm... ATI seems to decide what would be best for all and design around that; NVIDIA tends to give people a little bit of everything, all the time, as fast as they can make it go, regardless of what it looks like.

Just my observations. :)
 
fresh said:
I think he meant it as a compliment. I like the fact that ATI focused their efforts on getting the standard path (ARB2, whatever) to work the fastest. I don't think I'm the only person who's not particularly keen on writing an NV30-specific path. Kudos to ATI.
ATI was the initiator of the standard path. They can't help but have their hardware run it well.
 
NV30 path = proprietary
ARB2 path = open

It's also obvious the NV30 path would be optimal for the NV30; the problem is that no other IHV can use the NV30 path... that can't be said about the ARB2 path.
 