I think I'll jump into the discussion.
Chalnoth, I posted this in another forum but you didn't get around to replying. It fits this discussion even better, though. I keep hearing NVidia fans, most notably you, making excuses about DX9 shaders not allowing fixed point rendering, but the problem with that argument is that NVidia is not very good at fixed point either. Look at PS 1.1 shader benchmarks like ChameleonMark. Look at PS 1.4 shader benchmarks like ShaderMark (either before NVidia hand tuned it, or after instruction shuffling to avoid detection). The "Advanced Pixel Shader" test is another PS 1.4 example. Even HL2 shows this in its DX8.1 path.
The only time NVidia seems to have an edge on ATI is with register-limited, fixed-point, mathematical shaders (i.e. low texture usage, and definitely no dependent texture lookups), and even then the speed advantage is at most proportional to the clock speed difference. How many DX9 shaders fit this profile? Almost none in the near term, except for Doom3, but can you really call that DX7 technology a DX9 shader?
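To make that profile concrete, here's roughly the kind of shader I mean (a made-up HLSL sketch, not taken from any actual benchmark): one plain texture fetch, a little per-pixel math, and only a couple of live temporaries.

[code]
sampler baseMap : register(s0);

float4 main(float2 uv : TEXCOORD0, float3 lightVec : TEXCOORD1) : COLOR
{
    float4 base  = tex2D(baseMap, uv);   // single ordinary fetch, no dependent reads
    float  ndotl = saturate(dot(normalize(lightVec), float3(0.0f, 0.0f, 1.0f)));
    return base * ndotl;                 // pure arithmetic, only two temporaries alive
}
[/code]

Shaders like this fit comfortably in low precision and under the register limit, which is exactly why they're the one case where NV3x looks respectable.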
NV3x has a LOT of problems regarding DX9 class shader performance. It's only when you add up the register limitation, the FP32 performance, the dependent texture inefficiency (see below), etc. that you get the horrible DX9 performance of NV3x.
---------------------------------------
WaltC said:
Trust me...many offline rendering "farms" today do not need 128-bit color precision, nor 96-bit color precision--many have been operating for years at essentially 32-bit integer precision. This is why most of the rendering software out there does not support 96-bit/128-bit rendering yet--just like 32-bit 3d games don't magically render at 96/128-bits *unless* the software engines support it.
WaltC said:
Laa-Yosh said:
32 bit integer precision in an offline renderer?? You guys must be kidding... or else name this renderer
I seriously doubt that any of the big 4 (Max, Maya, XSI, LW) would be using less than 64 bits per color - in fact, AFAIK LW uses 128 bits per color... Mental Ray and PRMan should be at least as good. Dammit, MR can be 100% physically accurate, which doesn't sound like integer precision to me.
Also please note that apart from movie VFX studios, PRMan is quite rare in the industry because of its very high price (USD 5000 / CPU AFAIK). Most of the 3D you see in game FMVs, commercials, documentaries etc. is made using the built-in renderers of the "big 4".
But if you want to use Lightwave to calculate to 128-bit color accuracy in a ray-traced frame--how's 128-bit fp in a 3d chip going to help you do that? (It might be OK in a preview window--if you wanted to rotate an object while you create it--but why not just use flat shading or wire frame, since it's much faster? I think most scene creators would use wire-frame or flat-shading when creating objects and doing pathing in a scene. I see zero advantage to nV3x over R3x0 in this regard.)
More to the point, a distinction needs to be made between 3d and 2d. I don't think that's being done here...
WaltC, I don't think you understand Laa-Yosh's point. He is saying that commercial raytracers use 128-bit floating point for a single value, i.e. FP128, not 4 x FP32 (32 bits per channel) as NV30 is capable of. You might think this is overkill, but imagine some of the crazy space scenes we see on TV, with interplanetary flybys and zooming through the atmosphere into a city. Scenes like that can really stress the precision limits of FP, especially the mantissa. I think you are quite wrong in saying many renderfarms don't need 128-bit precision, especially if you mean 4 x FP32 like on the GPU.
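To put a rough number on the mantissa issue (my own back-of-the-envelope figure, not anything Laa-Yosh claimed): FP32 carries a 24-bit significand, so the spacing between representable values near a coordinate $x$ is about

$$\mathrm{ulp}(x) = 2^{\lfloor \log_2 x \rfloor - 23}, \qquad \mathrm{ulp}(10^7\,\mathrm{m}) = 2^{23-23} = 1\,\mathrm{m}.$$

In other words, at roughly Earth-radius scales a single FP32 coordinate can't resolve anything much finer than a metre, which is why those planet-to-street flybys are so brutal on the format.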
As for using the 3D chip, we are talking about using pixel shaders on a GPU as a fast, parallel processor, and then reading data back from the GPU memory when needed. Basically, this is using a GPU in a way it wasn't primarily intended to be used.
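If it helps to picture it, the usual trick (just a hypothetical HLSL sketch of the pattern, not anyone's actual renderer code) is to pack the data into a texture, let each pixel of an offscreen float render target compute one result, and read the target back on the CPU afterwards:

[code]
sampler inputData : register(s0);        // work items packed into a float texture

float4 main(float2 uv : TEXCOORD0) : COLOR
{
    float4 x = tex2D(inputData, uv);     // each pixel addresses one data element
    return x * x + 0.5f * x;             // whatever math "kernel" you want run in parallel
}
[/code]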
Lightwave3D is used in many TV shows and movies, and doesn't cost much either. For an offline rendering system to use GPUs, there must be a significant performance boost, no drawbacks (which likely means being able to emulate larger precisions, like coders did on the x86 platform), and non-prohibitive development costs. For these reasons, I doubt we'll see offline rendering (beyond experimentation) on this generation of GPUs or even the next.
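By "emulate larger precisions" I mean something like double-single arithmetic, where a value is carried as an unevaluated hi+lo pair of floats. A minimal sketch (assuming the hardware rounds to nearest like IEEE, which is itself a big "if" on current GPUs):

[code]
// Knuth's TwoSum: returns (s, e) with s + e == a + b exactly,
// so the rounding error e can be carried along for extra precision.
float2 twoSum(float a, float b)
{
    float s  = a + b;
    float bv = s - a;
    float e  = (a - (s - bv)) + (b - bv);
    return float2(s, e);
}
[/code]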
-------------------------------------
Dave H said:
Where R3x0 and NV3x come into this is in performance for dependent texture reads. I'd be interested in having more specifics, but as I understand it the R3x0 pipeline is deep enough to do an alright job minimizing the latency from one level of dependent texture reads (particularly if you fetch the texture as much in advance of when you need it as possible), but woe be unto you if you're looking for two or three levels. Meanwhile, NV3x is reputed to do just fine mixing several levels of texture reads in to the normal instruction flow--meaning that it must do a gangbusters job at hiding latency. Meaning it has a hell of a deep pipeline.
Meaning every exposed full-speed register is replicated many times, meaning that in order to fit in a given transistor budget, the number of full-speed registers might have to be cut pretty low.
I was surprised at this statement of yours, Dave H, as well as the praise you received for it. While it is a plausible explanation for the register usage problem of NV30, the problem with your statement is that a lot of data shows NV30 as having very poor dependent texture performance. Besides, ATI would need a similar FIFO buffer (or "replicated registers", as you call them) during dependent texturing operations, and they have a significantly smaller transistor budget.
Remember Ilfirin's benchmark? NV3x was about 1/8 of R300's performance. The register limitation will probably come into play here, but looking at the original version of MDolenc's fillrate tester, before he made the shader more complex, NV3x did quite well with ordinary shaders.
(Aside: Ilfirin's benchmark also happens to be a good counter to your statement "Hmm. I don't recall seeing too many real-world examples over a factor of ~3x". Ashli is another example. This will happen quite often for anyone developing shaders, and remember that games like HL2 don't use PS 2.0 on all surfaces, so those particular sections must be very slow to make the overall speed 50% of R300's.)
The most convincing evidence of NV3x's poor dependent texturing is mentioned at the top of this post -- the PS 1.4 benchmarks. It seems like NV3x still has NV2x's register combiners, and does PS 1.1 effects with them to keep performance high (although who knows what's happening in ChameleonMark). However, PS 1.4 effects, which generally involve arbitrary dependent texture reads (or else could be done as PS 1.1), must be run through the regular PS pipeline, and slow down a lot on NV30.
Sure, NV30 has no limit on dependent texture reads, but how often will you need more than 4 levels of dependency? I find it's quite rare to even need 2 levels, which runs well on R300 according to ATI's optimization guide. 0 levels is by far the most common, and 1 level seems to be popping up in many new games for water surfaces. Besides, there is also multipass.
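For reference, one level of dependency is just the classic perturbed-lookup pattern, something like this hypothetical water-style HLSL sketch (not taken from any shipping game):

[code]
sampler normalMap  : register(s0);
sampler reflectMap : register(s1);

float4 main(float2 uv : TEXCOORD0) : COLOR
{
    float2 bump = tex2D(normalMap, uv).rg * 2.0f - 1.0f;    // first read: per-pixel perturbation
    return tex2D(reflectMap, uv + 0.05f * bump);             // second read uses the first -> dependent
}
[/code]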
---------------------------
All things considered, there are very few advantages to the NV3x architecture. You can argue that it will be a better base for future architectures, but that's quite a far-reaching statement, considering how much it has to be improved just to catch up to R300's performance.