Nvidia and ARB2

OpenGL guy said:
When did I say anything about performance? But since you've brought it up, I've seen samples from Ashli that do several passes on the R300, yet are still much faster than single pass on NV3x parts. I believe it would take a long time before bandwidth considerations would offset that (large) performance difference.
Well, then, I have to ask: What other difference is there between multipass and the F-buffer besides performance?

For example, when one does multipass, there may be some intermediate state that cannot be carried over between passes. That state has to be calculated again in the next pass, which wastes cycles and makes multipass less efficient than the F-buffer (everything else being the same).

What other reason is there?
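
To make the point concrete, here is a toy sketch (invented names, plain Python standing in for shader code) of the recompute-versus-spill difference:

```python
# Toy sketch (invented names): why splitting a shader into passes can
# waste cycles that an F-buffer avoids.

def expensive_state(x):
    # Stand-in for an intermediate value computed partway through a long shader.
    return (x * x + 1.0) ** 0.5

# Multipass: pass 2 cannot see pass 1's registers, so the intermediate
# state has to be computed twice.
def pass1(x):
    return expensive_state(x) * 2.0

def pass2(x, pass1_result):
    s = expensive_state(x)      # recomputed: wasted cycles
    return pass1_result + s

# F-buffer: intermediates are written out per fragment and read back,
# so nothing is recomputed.
def pass1_fb(x, fbuffer):
    s = expensive_state(x)
    fbuffer.append(s)           # spill the intermediate
    return s * 2.0

def pass2_fb(pass1_result, fbuffer):
    s = fbuffer.pop(0)          # restore instead of recompute
    return pass1_result + s

fb = []
assert pass2(3.0, pass1(3.0)) == pass2_fb(pass1_fb(3.0, fb), fb)
```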
 
Chalnoth said:
OpenGL guy said:
When did I say anything about performance? But since you've brought it up, I've seen samples from Ashli that do several passes on the R300, yet are still much faster than single pass on NV3x parts. I believe it would take a long time before bandwidth considerations would offset that (large) performance difference.
Well, then, I have to ask: What other difference is there between multipass and the F-buffer besides performance?
For one, the F-buffer supports alpha blending; float buffers do not.
For example, when one does multipass, there may be some intermediate state that cannot be carried over between passes. That state has to be calculated again in the next pass, which wastes cycles and makes multipass less efficient than the F-buffer (everything else being the same).

What other reason is there?
What does any of this have to do with your earlier comment:
Whenever using F-buffer or multipass, the relative performance hit will always be smaller if the hardware supports more instructions (or whatever the limit is, such as texture reads, which the NV3x can also do more of). Since there undoubtedly will be some performance hit associated with using the F-buffer (as opposed to just being able to run a longer program, everything else the same), I don't think you can state unequivocally that the R300 using the F-buffer with its 96 instruction limit will automatically have less of a performance hit than the NV3x with its 1024 instruction limit using multipass.
How do you know using the F-buffer is slower than longer shaders on NV3x? Have you evaluated long shader performance on NV3x? Have you evaluated multipass performance on R300? R350 has the F-buffer, which, as you said earlier, should improve performance over straight multipass because you don't always need to recompute intermediate results.

You're the one who brought up performance, so please show me some performance comparisons.
 
OpenGL guy said:
For one, the F-buffer supports alpha blending; float buffers do not.
Why does that matter? One can read from a float buffer. One can write to another float buffer. The same calculations can be done. If we're leaving performance out, this is of no consequence.
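
Sketched with plain Python lists standing in for float buffers (no real graphics API calls), the blend-emulation idea looks like this:

```python
# Emulating alpha blending on float buffers by ping-ponging between two
# buffers: the "blend" is done in the shader instead of the blend unit.

def blend_pass(src_color, src_alpha, dst_buffer, out_buffer):
    for i, dst in enumerate(dst_buffer):
        # Standard "over" blend, computed as ordinary shader math:
        out_buffer[i] = src_color[i] * src_alpha[i] + dst * (1.0 - src_alpha[i])

buf_a = [0.2, 0.4, 0.6]          # results of the previous pass (buffer A)
buf_b = [0.0, 0.0, 0.0]          # this pass writes to buffer B
blend_pass([1.0, 1.0, 1.0], [0.5, 0.5, 0.5], buf_a, buf_b)
buf_a, buf_b = buf_b, buf_a      # swap (ping-pong) for the next pass
print(buf_a)                     # [0.6, 0.7, 0.8]
```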

How do you know using the F-buffer is slower than longer shaders on NV3x?
I didn't say that. I said the relative performance hit would be smaller. The NV35+ is already going to be slower if more than 2-4 FP registers are used (which seems very likely for a long shader). Avoiding multipass entirely unless the shader is very long and/or complex (uses more than 16 textures, for example) helps the NV3x catch up.

As for performance comparisons, I don't know which ones have been done on this subject, and I furthermore seriously doubt that they would explain the exact situation that I have laid out. I am, nevertheless, very confident that it is true.
 
I've got to chime in here to give my 2 cents on this:

Dave H said:
Walt-

Most production frames are scanline-rendered. Not ray-traced. Even for feature film work, ray-tracing is typically only used to achieve a particular effect--not for most frames, and not necessarily for the entire frame when it is used. Ray-tracing for TV work is rare.

Untrue. I know 2 large companies that use Mental Ray's ray-tracing for ENTIRE scenes. Look at the Matrix sequels and Ice Age. In fact, those companies will continue to use ray-tracing.

The primary technical reason NV3x is not suitable for final renders of many production frames is that its shader pipeline doesn't support shaders of arbitrary length or dynamic branching. Lack of FP64 or ray-tracing ability are less important concerns, particularly for the portion of the market likely to first start doing production renders on a consumer card. Compared to this, R3x0 is substantially less suitable because it lacks FP32 precision, has substantially more restrictive limits on shader length, and doesn't even do static branching.

Not entirely true. You are forgetting geometry shaders. In this regard a LOT of bandwidth and memory are required, which the 3d accelerators won't have anytime soon.

-M
 
Chalnoth said:
How do you know using the F-buffer is slower than longer shaders on NV3x?
I didn't say that. I said the relative performance hit would be smaller. The NV35+ is already going to be slower if more than 2-4 FP registers are used (which seems very likely for a long shader). Avoiding multipass entirely unless the shader is very long and/or complex (uses more than 16 textures, for example) helps the NV3x catch up.
Relative performance hit? So if one chip is 3x faster and takes a 25% hit using an F-buffer, doesn't that still mean it's a heck of a lot faster? I am just making up numbers, but I don't understand your comment about "relative performance hit". Also, as you mentioned, using lots of registers is detrimental to performance on the NV3x. So what were you saying about "relative performance hits"?
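
Spelling out those made-up numbers:

```python
# Made-up numbers from above: a chip that starts 3x faster and takes a
# 25% hit from the F-buffer is still well ahead of the 1x baseline.
with_fbuffer = 3.0 * (1.0 - 0.25)   # 2.25x
baseline     = 1.0                  # running the long shader natively
print(with_fbuffer / baseline)      # 2.25, i.e. still more than twice as fast
```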
As for performance comparisons, I don't know which ones have been done on this subject, and I furthermore seriously doubt that they would explain the exact situation that I have laid out. I am, nevertheless, very confident that it is true.
Why am I not surprised by your confidence? You haven't done the research, you haven't seen results, yet you are convinced you are correct.

You own a 9700 Pro, right? Why don't you download the Ashli stuff from ATI's website and experiment with long shaders. You can report your findings to us about how much of a performance hit the R300 takes doing multipass. This should provide a baseline for F-buffer performance on the R350. Why guess you are correct when you can actually get some data?
 
I think I'll jump into the discussion.

Chalnoth, I posted this in another forum but you didn't get around to replying. It fits this discussion even better, though. I keep hearing NVidia fans, most notably you, making excuses about the lack of fixed point rendering in DX9 shaders, but the problem with that argument is that NVidia is not very good at fixed point either. Look at PS 1.1 shader benchmarks like ChameleonMark. Look at PS 1.4 shader benchmarks like ShaderMark (either before NVidia hand tuned it, or after instruction shuffling to avoid detection). The "Advanced Pixel Shader" is another PS 1.4 example. Even HL2 shows this in their DX8.1 version.

The only time NVidia seem to have an edge on ATI is with register-limited, fixed-point, mathematical shaders (i.e. low texture usage, and definitely no dependent texture lookups), and even then the speed difference is at most proportional to the clock speed difference. How many DX9 shaders fit this profile? Almost none in the near term, except for Doom3, but can you really call that DX7 technology a DX9 shader?

NV3x has a LOT of problems regarding DX9 class shader performance. It's only when you add the register limitation, the FP32 performance, the dependent texture inefficiency (see below), etc. that you get the horrible DX9 performance from NV3x.

---------------------------------------

WaltC said:
Trust me...many offline rendering "farms" today do not need 128-bit color precision, nor 96-bit color precision--many have been operating for years at essentially 32-bit integer precision. This is why most of the rendering software out there does not support 96-bit/128-bit rendering yet--just like 32-bit 3d games don't magically render at 96/128-bits *unless* the software engines support it.

WaltC said:
Laa-Yosh said:
32 bit integer precision in an offline renderer?? You guys must be kidding... or else name this renderer ;)

I seriously doubt that any of the big 4 (Max, Maya, XSI, LW) would be using less than 64 bits per color - in fact, AFAIK LW uses 128 bits per color... Mental Ray and PRMan should be at least as good. Dammit, MR can be 100% physically accurate, which doesn't sound like integer precision to me.

Also please note that apart from movie VFX studios, PRMan is quite rare in the industry because of its very high price (USD 5000 / CPU AFAIK). Most of the 3D you see in game FMVs, commercials, documentaries etc. is made using the built-in renderers of the "big 4".

But if you want to use Lightwave to calculate to 128-bit color accuracy in a ray-traced frame--how's 128-bit FP in a 3d chip going to help you do that? (It might be OK in a preview window--if you wanted to rotate an object while you create it--but why not just use flat shading or wireframe, which is much faster? I think most scene creators would use wireframe or flat shading when creating objects and doing pathing in a scene. I see zero advantage to nV3x over R3x0 in this regard.)

More to the point, a distinction needs to be made between 3d and 2d. I don't think that's being done here...:)

WaltC, I don't think you understand Laa-Yosh's point. He is saying that commercial raytracers use 128-bit floating point, i.e. FP128, not 4 x FP32 as NV30 is capable of. You might think this is overkill, but imagine some of the crazy space scenes we see on TV with interplanetary flybys and zooming through the atmosphere into a city. This can really stress the precision limits of FP, especially the mantissa. I think you are quite wrong in saying many renderfarms don't need 128-bit precision, especially if you mean 4 x FP32 like on the GPU.
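
A quick illustration of the mantissa problem, with numpy's float32 standing in for the GPU's FP32 (the scene scale is invented):

```python
import numpy as np

# FP32 carries a 24-bit significand, so at planetary scales small
# offsets simply vanish.
planet_distance = np.float32(1.0e8)   # ~100,000 km expressed in meters
camera_offset   = np.float32(1.0)     # move the camera by 1 meter
print(planet_distance + camera_offset == planet_distance)  # True: the meter is lost
```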

As for using the 3D chip, we are talking about using pixel shaders on a GPU as a fast, parallel processor, and then reading data back from the GPU memory when needed. Basically, this is using a GPU in a way it wasn't primarily intended to be used.

Lightwave3D is used in many TV shows and movies, and doesn't cost much either. For an offline rendering system to use GPUs, there must be a significant performance boost, no drawbacks (this likely means being able to emulate larger precisions, like the coders did for the x86 platform), and non-prohibitive development costs. For these reasons, I doubt we'll see offline rendering (beyond experimentation) in this generation of GPUs or even the next.
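
One classic building block for that kind of precision emulation (not necessarily what any given renderer would use) is an error-free "two-sum", which keeps the rounding error in a second float:

```python
import numpy as np

def two_sum(a, b):
    # Knuth's error-free transformation: s + err equals a + b exactly,
    # where s is the rounded float32 sum and err is the rounding error.
    s = np.float32(a + b)
    bb = np.float32(s - a)
    err = np.float32(b - bb) + np.float32(a - np.float32(s - bb))
    return s, err

hi, lo = two_sum(np.float32(1.0e8), np.float32(1.0))
print(hi, lo)   # 1e+08 1.0: the "lost" unit survives in the low word
```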

-------------------------------------

Dave H said:
Where R3x0 and NV3x come into this is in performance for dependent texture reads. I'd be interested in having more specifics, but as I understand it the R3x0 pipeline is deep enough to do an alright job minimizing the latency from one level of dependent texture reads (particularly if you fetch the texture as much in advance of when you need it as possible), but woe be unto you if you're looking for two or three levels. Meanwhile, NV3x is reputed to do just fine mixing several levels of texture reads into the normal instruction flow--meaning that it must do a gangbusters job at hiding latency. Meaning it has a hell of a deep pipeline.

Meaning every exposed full-speed register is replicated many times, meaning that in order to fit in a given transistor budget, the number of full-speed registers might have to be cut pretty low.

I was surprised at this statement of yours, Dave H, as well as the praise you received for it. While it is a plausible explanation for the register usage problem of NV30, the problem with your statement is that a lot of data shows NV30 as having very poor dependent texture performance. Besides, ATI would need a similar FIFO buffer (or "duplicate registers" as you call them) during dependent texturing operations, and they have a significantly smaller transistor budget.
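
To put rough numbers on the trade-off Dave H described (every figure below is invented purely for illustration):

```python
# Hiding texture latency means keeping many fragments in flight, and
# each in-flight fragment needs its own copy of the live registers.
register_file_bytes = 32 * 1024       # invented per-pipe register file size
latency_to_hide     = 200             # invented dependent-fetch latency, in cycles
fragments_in_flight = latency_to_hide # one new fragment issued per cycle
bytes_per_register  = 16              # one FP32 vec4

regs_per_fragment = register_file_bytes // (fragments_in_flight * bytes_per_register)
print(regs_per_fragment)  # 10; use more registers and fewer fragments fit in flight
```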

Remember Ilfirin's benchmark? NV3x was about 1/8 of R300's performance. The register limitation will probably come into play here, but looking at the original version of MDolenc's fillrate tester, before he made the shader more complex, NV3x did quite well with ordinary shaders.

(Aside: Ilfirin's benchmark also happens to be a good counter to your statement "Hmm. I don't recall seeing too many real-world examples over a factor of ~3x". Ashli is another example. This will happen quite often for anyone developing shaders, and remember that games like HL2 don't use PS 2.0 on all surfaces, so those particular sections must be very slow to make the overall speed 50% of R300.)

The most convincing evidence of NV3x's poor dependent texturing is mentioned at the top of this post -- PS 1.4 benchmarks. It seems like NV3x still has NV2x's register combiners, and does PS 1.1 effects with them to keep performance high (although who knows what's happening in ChameleonMark). However, PS 1.4 effects, which generally involve arbitrary dependent texture reads (or else can be made into PS 1.1), must be run through the regular PS pipeline, and slow down a lot on NV30.

Sure, NV30 has no limit on dependent texture reads, but how often will you need more than 4 levels of dependency? I find it's quite rare to even need 2 levels, which runs well on R300 according to ATI's optimization guide. 0 levels is most common by far, and 1 level seems to be popping up in many new games for water surfaces. Besides, there is also multipass.
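
For reference, this is all one level of dependency means, with plain arrays standing in for textures:

```python
# A dependent texture read: the result of one fetch becomes the
# coordinate of the next (e.g. bump-mapped water).
bump_map = [2, 0, 1, 3]        # fetch 1: per-pixel coordinate perturbations
env_map  = [10, 20, 30, 40]    # fetch 2: sampled at the perturbed coordinate

def shade(pixel_x):
    offset = bump_map[pixel_x]              # independent fetch
    return env_map[(pixel_x + offset) % 4]  # depends on the first fetch's result

print([shade(x) for x in range(4)])   # [30, 20, 40, 30]
```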

---------------------------

All things considered, there are very few advantages with the NV3x architecture. You can argue that it will be a better base for future architectures, but that's quite a far-reaching statement, considering how much it has to be improved just to catch up to R300's performance.
 
Mr. Blue said:
I've got to chime in here to give my 2 cents on this:

Dave H said:
Walt-

Most production frames are scanline-rendered. Not ray-traced. Even for feature film work, ray-tracing is typically only used to achieve a particular effect--not for most frames, and not necessarily for the entire frame when it is used. Ray-tracing for TV work is rare.

Untrue. I know 2 large companies that use Mental Ray's ray-tracing for ENTIRE scenes. Look at the Matrix sequels and Ice Age. In fact, those companies will continue to use ray-tracing.

Ok, that's interesting; and no doubt as computational power increases, the benefits of ray-tracing will be deemed worth the performance costs in more and more situations. The issue, though, is that Walt seemed to suffer from the misconception that offline rendering was synonymous with ray-tracing. He claimed, over and over again, that the fact that NV3x couldn't do ray-tracing disqualified it from any serious use rendering production frames, and that's just not correct.

In fact, as I said, most production frames are scanline-rendered rather than ray-traced. And I gather it's still correct to say that where ray-tracing is used it is typically only for a particular effect and not necessarily applied to the entire scene.

The primary technical reason NV3x is not suitable for final renders of many production frames is that its shader pipeline doesn't support shaders of arbitrary length or dynamic branching. Lack of FP64 or ray-tracing ability are less important concerns, particularly for the portion of the market likely to first start doing production renders on a consumer card. Compared to this, R3x0 is substantially less suitable because it lacks FP32 precision, has substantially more restrictive limits on shader length, and doesn't even do static branching.

Not entirely true. You are forgetting geometry shaders. In this regard a LOT of bandwidth and memory are required, which the 3d accelerators won't have anytime soon.

Good point--I was forgetting geometry shaders. (Am I correct in thinking that once VS 3.0 brings texture sampling to the vertex shader pipeline, consumer 3d cards will be theoretically capable of running geometry shaders, to about the same degree that they are theoretically capable of running production-quality fragment shaders now?) However, a very sensible division of labor would be to calculate the geometry shaders in software as normal, but handle the pixel side of the rendering process in hardware as is being discussed. If a GPU could handle the pixel shaders despite the limitations I mentioned in my post, then it could offer a very substantial boost in rendering production frames even if it didn't handle the entire process by itself.

As for memory and bandwidth...OGL Guy or Dio pointed out in another thread that local bandwidth consumption is likely to be much lower (per clock; obviously much much higher per frame) when running complex shaders than in normal fixed-function use. Memory size, on the other hand, I could see being a huge problem; I would guess that the 256MB on a high-end consumer card would fall short of what's required for a complex scene by a pretty big factor, and without P10-style virtual memory, spilling to main memory is likely to be hugely inefficient. (Is this the problem you meant by "bandwidth and memory"? Because in the context of overfilling local memory and having to shuttle everything across the AGP bus without virtual memory to manage those flows, you would indeed be extremely bandwidth-bound.)
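
To put an invented but plausible number on the memory side:

```python
# Rough, invented figures: why 256 MB could fall short for a complex scene.
triangles        = 50_000_000   # finely tessellated production geometry
vertices_per_tri = 3            # worst case, ignoring vertex sharing
bytes_per_vertex = 32           # position + normal + UV at FP32

geometry_bytes = triangles * vertices_per_tri * bytes_per_vertex
print(geometry_bytes / 2**30)   # ~4.5 GiB, a big factor beyond a 256 MB card
```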
 
Minor correction: 'bandwidth' is a 'per second' number, so there's no such thing as 'bandwidth per clock'.

I don't see bandwidth as any big problem when long shaders are in use. Just because it's a geometry shader doesn't really matter too much: as the shader length goes up, the data fetch becomes far less important. (There will of course be exceptions to this).

Spilling across AGP or successor technologies is likely the best solution. In the long term the card should have access to the entire memory of the host, if so required. Good design eliminates latency issues; bandwidth is, as I've pointed out, less of a problem than one might think.
 
Dave H said:
Ok, that's interesting; and no doubt as computational power increases, the benefits of ray-tracing will be deemed worth the performance costs in more and more situations. The issue, though, is that Walt seemed to suffer from the misconception that offline rendering was synonymous with ray-tracing. He claimed, over and over again, that the fact that NV3x couldn't do ray-tracing disqualified it from any serious use rendering production frames, and that's just not correct.

I see your point. If he was using it in this way, then yes, he would be inaccurate there.

In fact, as I said, most production frames are scanline-rendered rather than ray-traced. And I gather it's still correct to say that where ray-tracing is used it is typically only for a particular effect and not necessarily applied to the entire scene.

On average, I would have to say yes.

As for memory and bandwidth...OGL Guy or Dio pointed out in another thread that local bandwidth consumption is likely to be much lower (per clock; obviously much much higher per frame) when running complex shaders than in normal fixed-function use. Memory size, on the other hand, I could see being a huge problem; I would guess that the 256MB on a high-end consumer card would fall short of what's required for a complex scene by a pretty big factor, and without P10-style virtual memory, spilling to main memory is likely to be hugely inefficient. (Is this the problem you meant by "bandwidth and memory"? Because in the context of overfilling local memory and having to shuttle everything across the AGP bus without virtual memory to manage those flows, you would indeed be extremely bandwidth-bound.)

Yeap. You are correct. Geometry shaders would take up a lot of memory since they require such huge amounts of polys.

Whew, someone is seeing my points..:)

-M
 
To acknowledge Mr. Blue’s contribution to this thread, could we please rephrase "Pixar quality rendering" to "Blue Sky Studio quality rendering" ;)? BTW Ice Age was great.
 
Dave H said:
Ok, that's interesting; and no doubt as computational power increases, the benefits of ray-tracing will be deemed worth the performance costs in more and more situations. The issue, though, is that Walt seemed to suffer from the misconception that offline rendering was synonymous with ray-tracing. He claimed, over and over again, that the fact that NV3x couldn't do ray-tracing disqualified it from any serious use rendering production frames, and that's just not correct.
I thought it had been demonstrated that ray tracing could indeed be performed on hardware that supports PS 2.0?

Oh, and btw, I don't expect raytracing to ever gain wide acceptance. The way I see it, raytracing is a relatively small step up in required computing power when compared to scanline rendering. Other rendering techniques, such as radiosity, global illumination, and photon mapping, should instead take over. That is, I feel that raytracing is a stopgap measure.
 