Can F-buffer mask the importance of single-pass abilities?

Reverend · Mar 7, 2003

Luminescent said:
I came to this information by reading a few informative threads in the Beyond3D forum and consulting a certain reliable individual for clarification.

Okay, come on, out with it - who's this guy? I have this little "Beyond3D Reliable Smarties" black book that I want to fill up.

Goragoth · Mar 7, 2003

Wow wow wow. Sounds like you can compile RenderMan shaders for the R350 and run them just like that. I'm blown away. I can see this is irrelevant for games or anything realtime, but damn! Someone clear this up for me: is it possible (at least theoretically) to write a plugin for, say, Maya or 3dsMax that presents itself as a renderer and renders the scene, shaders and everything, on the R350 many times faster than a fast (3.06GHz) P4 or any other CPU would? I can see that quality (esp. with regard to AA) may not be quite at the software-rendered level, so maybe it's not useful for movies, but this would be GREAT for hobbyists, small CGI firms and so on. Basically a cheap renderfarm on a card (there is already a product specifically for this, but this has the potential to be much bigger and cheaper because it is a consumer board). I hope I'm not getting all excited about nothing here, but this just sounds too good to be true.

DemoCoder · Mar 7, 2003

Not quite. There is still the issue of handling per-pixel data-dependent branching. The biggest problem, of course, is loops with data-dependent conditional exits. Many branches can be handled with predicates, but dynamic loop conditionals can't be 'unrolled' as easily.

Handling this with the F-Buffer requires a little more sophistication, since each pixel needs a different number of "passes", which can't be determined ahead of time but only as a result of the last pass.
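To picture the bookkeeping described above, here is a toy simulation in Python. Everything in it is an assumption made for illustration: the `Fragment` record, `run_pass`, `shade`, the 160-instruction pass budget and the instruction counts are hypothetical and say nothing about how the R350 actually implements its F-buffer. The `remaining` field merely stands in for a data-dependent loop exit that real hardware would only discover while running the pass.

```python
from collections import deque
from dataclasses import dataclass

PASS_BUDGET = 160  # assumed number of instructions per pass (hypothetical)


@dataclass
class Fragment:
    pixel: int
    remaining: int   # stand-in for a data-dependent loop; hardware would not know this up front
    passes: int = 0


def run_pass(fbuffer):
    """Run one pass over a FIFO of fragments and return the FIFO for the next pass."""
    next_fbuffer = deque()
    while fbuffer:
        frag = fbuffer.popleft()
        frag.remaining -= min(PASS_BUDGET, frag.remaining)
        frag.passes += 1
        if frag.remaining > 0:
            # Fragment is not finished: spill its state so the next pass can resume it.
            next_fbuffer.append(frag)
    return next_fbuffer


def shade(instruction_counts):
    """Return how many passes each fragment needed."""
    frags = [Fragment(i, n) for i, n in enumerate(instruction_counts)]
    fbuffer = deque(frags)
    while fbuffer:  # the number of passes is only known once the FIFO drains
        fbuffer = run_pass(fbuffer)
    return [f.passes for f in frags]


print(shade([100, 160, 161, 500]))  # [1, 1, 2, 4]
```

Note how the FIFO shrinks from pass to pass: only the fragments whose loops have not exited are carried forward, which is why the total pass count depends on the data.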

F-Buffer is a good way to remove register and instruction length limits for PS3.0 support, but we still need more native HW support for data dependent branching. Yes, it will kill performance, but, like the F-Buffer, hopefully 90% of shaders won't use any data dependent branching.

The goal here is to handle the special cases so that programmers don't get annoyed. Better that the shader runs (albeit non-optimally) on some HW than not run at all and generate a compile error.

Think of it as a "catch all" that steps in to allow degenerate shaders to run on all HW, regardless of how many instructions or registers the HW has. This would truly be great if all HW had it.

For Nvidia, it's unlikely they will implement it, since once you get into the thousands of instructions, the limits are effectively "infinite".

arjan de lumens · Mar 7, 2003

The way I understand it, the main advantage of the F-buffer over an approach where you stream in shader instructions as needed (as NV30 seems to do, and which allows arbitrary-length shaders at least as easily as the F-buffer) is that you get better pixel shader instruction cache behavior: for each pass and each pixel block that fits in the F-buffer you need to reload the instruction cache only once.

With PS3.0 data-dependent branching, especially when the branching is used to create while-loops, you risk losing this advantage if you don't come up with some really nifty tricks (not so bad if the loop fits wholly within the N instructions allotted for each pass, but once you need to support while-loops larger than that, hardware design starts to get difficult)

dominikbehr · Mar 7, 2003

Goragoth said:
Wow wow wow. Sounds like you can compile RenderMan shaders for the R350 and run them just like that. I'm blown away. I can see this is irrelevant for games or anything realtime, but damn! Someone clear this up for me: is it possible (at least theoretically) to write a plugin for, say, Maya or 3dsMax that presents itself as a renderer and renders the scene, shaders and everything, on the R350 many times faster than a fast (3.06GHz) P4 or any other CPU would? I can see that quality (esp. with regard to AA) may not be quite at the software-rendered level, so maybe it's not useful for movies, but this would be GREAT for hobbyists, small CGI firms and so on. Basically a cheap renderfarm on a card (there is already a product specifically for this, but this has the potential to be much bigger and cheaper because it is a consumer board). I hope I'm not getting all excited about nothing here, but this just sounds too good to be true.

now, you really want to come to my session
http://www.cmpevents.com/GDx/a.asp?option=C&V=11&SessID=906

As for people who want to run renderfarms on R3xx/GFFX, that's a slightly backward idea, considering you already have a CPU in the render box, and 24/32-bit precision and rasterization rendering don't really cut it for high-end rendering. I can see where this class of hardware can be used for TV-quality rendering and previews.

DemoCoder · Mar 7, 2003

arjan de lumens said:
The way I understand it, the main advantage of the F-buffer over an approach where you stream in shader instructions as needed (as NV30 seems to do, and which allows arbitrary-length shaders at least as easily as the F-buffer) is that you get better pixel shader instruction cache behavior: for each pass and each pixel block that fits in the F-buffer you need to reload the instruction cache only once.

True, but you're just trading off one type of cache coherency and bandwidth for another. When your shader exceeds the cache size, you have to page the next block of instructions in, chewing up some bandwidth. With the F-Buffer, when you exceed your limits, you end up loading and storing pipeline state per fragment.

Once you start talking about very long shaders, I'm not even sure the differences in architecture will be the dominant issue in performance. Both the F-Buffer and "page in" bandwidth and latency will be under the radar.

With spare transistors, and assuming a 128-bit VLIW instruction size, it seems to me you could build a cache big enough to hold anywhere from 1024 to 4096 instructions, and it would be extremely rare for shaders to exceed these limits, especially if loops are introduced.

Likewise, you could also put the F-Buffer on the chip core itself instead of in vid-ram. As I said, both approaches have their pros and cons.

arjan de lumens · Mar 7, 2003

DemoCoder said:
True, but you're just trading off one type of cache coherency and bandwidth for another. When your shader exceeds the cache size, you have to page the next block of instructions in, chewing up some bandwidth. With the F-Buffer, when you exceed your limits, you end up loading and storing pipeline state per fragment.

Once you start talking about very long shaders, I'm not even sure the differences in architecture will be the dominant issue in performance. Both the F-Buffer and "page in" bandwidth and latency will be under the radar.

For very long shaders, I suspect that both the F-buffer method and the instruction streaming method will eat a rather sizable chunk of the available bandwidth. If you have, say, 20 active temp registers at 96 bits each, you have 240 bytes of per-pixel data to swap in and out per pass for the F-buffering - for 160 instructions per pass, that's (240+240)/160 = 3 bytes of memory accesses per shader instruction per pixel. For the instruction stream method, you might expect each instruction to be applied to perhaps 8 pixels each, which, at 128 bits per instruction, amounts to 128 bits/8 = 2 bytes of memory accesses per shader instruction per pixel. You will need to have a rather large proportion of texture instructions (1 32-bit filtered texture lookup =~4-5 bytes of memory accesses, taking into account texture caching effects) to outweigh the bandwidth consumed by either the F-buffer or the instruction streamer.
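The arithmetic in the paragraph above can be checked with a few lines of Python. This is only a restatement of the figures quoted in the post (20 temp registers of 96 bits, 160 instructions per pass, 128-bit instructions shared by 8 pixels); the function names are made up for this sketch.

```python
def fbuffer_bytes_per_instruction(temp_regs, bits_per_reg, instrs_per_pass):
    """Memory traffic per shader instruction per pixel for F-buffering."""
    state_bytes = temp_regs * bits_per_reg // 8   # per-pixel state: 20 * 96 / 8 = 240 bytes
    # The state is written out at the end of a pass and read back at the start of the next.
    return (state_bytes + state_bytes) / instrs_per_pass


def streaming_bytes_per_instruction(instr_bits, pixels_per_fetch):
    """Memory traffic per shader instruction per pixel for instruction streaming."""
    return instr_bits / 8 / pixels_per_fetch


print(fbuffer_bytes_per_instruction(20, 96, 160))  # 3.0
print(streaming_bytes_per_instruction(128, 8))     # 2.0
```

Changing the register count or the per-pass instruction budget shows how quickly the F-buffer figure moves relative to the streaming one.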

With spare transistors, and assuming a 128-bit VLIW instruction size, it seems to me you could build a cache big enough to hold anywhere from 1024 to 4096 instructions, and it would be extremely rare for shaders to exceed these limits, especially if loops are introduced.

Likewise, you could also put the F-Buffer on the chip core itself instead of in vid-ram. As I said, both approaches have their pros and cons.

A 64 KiB cache to hold either 4096 instructions or a ~200-500 fragments F-buffer would cost about 3 million transistors, which should be a rather small part of a next-generation GPU design, and help long-shader performance a great deal in either case.
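As a sanity check on those numbers, here is a back-of-the-envelope calculation in Python. Two inputs are assumptions that are not stated in the post: the per-fragment state is taken to be the 240 bytes from the earlier example (20 registers of 96 bits), and the transistor estimate assumes a plain six-transistor SRAM cell per bit with no overhead for tags or decoders.

```python
CACHE_BYTES = 64 * 1024           # 64 KiB

INSTR_BYTES = 128 // 8            # 128-bit instructions
FRAGMENT_BYTES = 20 * 96 // 8     # assumed: 20 temp registers at 96 bits = 240 bytes
TRANSISTORS_PER_BIT = 6           # assumed: classic 6T SRAM cell, overhead ignored

instructions = CACHE_BYTES // INSTR_BYTES
fragments = CACHE_BYTES // FRAGMENT_BYTES
transistors = CACHE_BYTES * 8 * TRANSISTORS_PER_BIT

print(instructions)  # 4096
print(fragments)     # 273
print(transistors)   # 3145728
```

With fewer or more live registers per fragment the fragment count moves across the quoted 200-500 range, and the roughly 3 million transistor figure falls out of the 6T assumption.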

For data-dependent branching, however, you get the problem of serving multiple instruction streams from a single cache, which may make it quite a bit more expensive.

nutball · Mar 7, 2003

dominikbehr said:
As for people who want to run renderfarms on R3xx/GFFX, that's a slightly backward idea, considering you already have a CPU in the render box, and 24/32-bit precision and rasterization rendering don't really cut it for high-end rendering. I can see where this class of hardware can be used for TV-quality rendering and previews.

I'm intrigued! You're saying that 24/32-bit-per-component floating-point isn't good enough for high-end rendering?

Secondly, can a general-purpose processor really keep up with where GPUs are going?

Xmas · Mar 7, 2003

nutball said:
I'm intrigued! You're saying that 24/32-bit-per-component floating-point isn't good enough for high-end rendering?

'High end rendering' of Mandelbrot set zooms needs more than 32 bit precision

demalion · Mar 7, 2003

prods Reverend to continue the discussion where it left off in PMs so he doesn't have to test all the math himself.

Goragoth · Mar 7, 2003

I can see that it will be a while (if ever) before film-quality rendering can be done on a GPU but I can see the excitement for hobbyists/small companies using this consumer hardware to render out 3d content extremely quickly and obviously for any artist working on any 3d content to be able to see very accurate previews of scenes being worked on. One good example would be the person (can't remember the name) who created the KillerBean2 3d short all by himself. Took him something like a month to render it out. With hardware rendering he might have been able to do it in days or maybe even hours and since it was only for web distribution where it gets compressed lots anyways the quality/precision wouldn't be an issue. I'm guessing any DX9 card can probably do this with multipassing anyway but it sounds like it may be much simpler to implement something like this with the R350 and its f-buffer because you don't have to worry about special multipass code, right? I just hope ATI do enable it in drivers for the regular R350 boards and not just the professional (and pricey) FireGL line cards.

DemoCoder · Mar 7, 2003

It's not the bandwidth I'm worried about. Let's say an instruction is 128 bits long and you can fetch 256 bits per clock. Let's say you can execute 1 op per clock. And an instruction only has to be fetched once for all 8 pipelines (it's shared).

What this means is that unless you are fetching multiple textures every other clock cycle, or writing to the frame buffer every other clock cycle (not the case for very long shaders), on average, half the time your bandwidth is wasted.

On each clock cycle, I can fetch one 128-bit instruction and have another 128 bits of bandwidth left over to do with as I please. On a long shader (say, 500 instructions) with very few texture fetches (say, 80% color ops), 40% of my bandwidth is just wasted, since the color ops don't use bandwidth and the instruction streaming doesn't saturate the bus.

That's why I said, even if we assume an on-chip cache big enough to hold the entire shader, the performance is dominated by the dispatch rate. Even streaming from video memory, the dispatch rate is 50% slower than the stream-in rate, so there is no chance you're going to be bottlenecked by fetching instructions.
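The figures in this argument can be reproduced with a very small model, sketched here in Python. It assumes exactly what the post assumes (a 256-bit-per-clock bus, 128-bit instructions, one op dispatched per clock, color ops using no memory bandwidth beyond their own fetch) plus one simplification of my own: the non-color ops are treated as using all of the leftover bandwidth. The function names are hypothetical.

```python
def stream_in_rate(bus_bits_per_clock, instr_bits):
    """Instructions per clock the bus could deliver if it fetched nothing else."""
    return bus_bits_per_clock / instr_bits


def wasted_fraction(bus_bits_per_clock, instr_bits, color_op_fraction):
    """Share of total bus bandwidth left idle when one instruction is fetched
    per clock and only the non-color ops use the leftover bandwidth."""
    spare = (bus_bits_per_clock - instr_bits) / bus_bits_per_clock
    return spare * color_op_fraction


dispatch_rate = 1.0                    # ops per clock, as assumed in the post
rate = stream_in_rate(256, 128)

print(rate)                            # 2.0 instructions per clock
print(dispatch_rate / rate)            # 0.5, i.e. dispatch runs at half the stream-in rate
print(wasted_fraction(256, 128, 0.8))  # 0.4
```

Raising the dispatch rate or widening the instructions in this model shrinks the slack, which is the point where instruction fetch could start to matter.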

Color-op bound shaders (e.g. those with a majority of the shader being non-memory oriented instructions) waste bandwidth. Hopefully, GPUs can use the extra bandwidth to intelligently prefetch stuff when the shader is idling the memory bus.

nutball · Mar 7, 2003

Xmas said:
nutball said:

I'm intrigued! You're saying that 24/32-bit-per-component floating-point isn't good enough for high-end rendering?

'High end rendering' of Mandelbrot set zooms needs more than 32 bit precision

Ahhh, OK. I'll appreciate that more next time I see one at the movies

arjan de lumens · Mar 7, 2003

DemoCoder said:
It's not the bandwidth I'm worried about. Let's say an instruction is 128 bits long and you can fetch 256 bits per clock. Let's say you can execute 1 op per clock. And an instruction only has to be fetched once for all 8 pipelines (it's shared).

It's shared as long as you don't start doing data dependent jumps.

What this means is that unless you are fetching multiple textures every other clock cycle, or writing to the frame buffer every other clock cycle (not the case for very long shaders), on average, half the time your bandwidth is wasted.

On each clock cycle, I can fetch one 128-bit instruction and have another 128 bits of bandwidth left over to do with as I please. On a long shader (say, 500 instructions) with very few texture fetches (say, 80% color ops), 40% of my bandwidth is just wasted, since the color ops don't use bandwidth and the instruction streaming doesn't saturate the bus.

This argument holds (except for vertex/ramdac traffic, which shouldn't be a lot) as long as you limit yourself to 1 op per clock. R300/350 can, AFAIK, do 3 (just what NV30 is capable of hasn't been quite established yet ...). Also, instructions get wider if constant values start appearing in them.

That's why I said, even if we assume an on-chip cache big enough to hold the entire shader, the performance is dominated by the dispatch rate. Even streaming from video memory, the dispatch rate is 50% slower than the stream-in rate, so there is no chance you're going to be bottlenecked by fetching instructions.

Color-op bound shaders (e.g. those with a majority of the shader being non-memory oriented instructions) waste bandwidth. Hopefully, GPUs can use the extra bandwidth to intelligently prefetch stuff when the shader is idling the memory bus.

Prefetching an instruction stream with no dynamic flow control should be dead easy. Prefetching texture maps is a bit harder, but not impossible to do.

Reverend · Mar 7, 2003

demalion said:
prods Reverend to continue the discussion where it left off in PMs so he doesn't have to test all the math himself.

I'd prefer to have this thread stay on track (it's an interesting question Lumi asked in his original post)... we can save FP precision discussions/arguments for another thread/time.

dominikbehr · Mar 7, 2003

nutball said:
I'm intrigued! You're saying that 24/32-bit-per-component floating-point isn't good enough for high-end rendering?

Secondly, can a general-purpose processor really keep up with where GPUs are going?

I think it is good for per-pixel color/lighting calculations when rasterizing triangles (it was designed for that). I wouldn't use it for production-quality raytracing. I am very excited by the new class of problems you can solve on a modern GPU, but let's keep it sane. CPUs are here to stay. GPUs are going to be used for realtime rendering, previews, maybe TV-quality stuff, especially in realtime applications.

There is not that much disparity between CPU and GPU; a 3GHz P4 running SSE code can do pretty amazing things too.

We just experienced a big feature set change on GPUs. We will probably spend the next few years refining the speed, instruction sets and capabilities of the current generation.

I just finished a presentation on how you can translate 3dsmax materials to hardware shaders and display a realtime preview of a 3dsmax scene in a viewport. It's a very cool thing, but I wouldn't call it high-end rendering.

DemoCoder · Mar 8, 2003

arjan de lumens said:
This argument holds (except for vertex/ramdac traffic, which shouldn't be a lot) as long as you limit yourself to 1 op per clock. R300/350 can, AFAIK, do 3 (just what NV30 is capable of hasn't been quite established yet ...).

I already covered this by suggesting the scenario where you have 80% color ops. The R300 is only going to reach 3 ops per clock if 33% of the shader is texture instructions. If you have a very long shader (>160 instructions), say 1024 instructions, this means you'd be sampling textures over 300 times, which I find unlikely unless you are doing some kind of custom texture filtering on a large number of textures. In my experience, even the most complicated RM shaders fit within a few hundred instructions and do not consist of 33% texture lookups.

Reverend · Mar 8, 2003

Reverend said:
demalion said:

prods Reverend to continue the discussion where it left off in PMs so he doesn't have to test all the math himself.

I'd prefer to have this thread stay on track (it's an interesting question Lumi asked in his original post)... we can save FP precision discussions/arguments for another thread/time.

Wow... responding to myself...

Anyway, since Dominik appears to find this interesting enough to comment on in this thread, and even though it veers off-course from the original thread topic, I'm thinking... why not?

The following is what I wrote to Dominik via email (and almost identical to what I wrote to demalion via PM prior to that) about FP precision when it comes to R300/350 vs NV30:

Rev, to Dominik via email, prior to Dominik's post above :

What about positional computations?

For example, say you're using a general lighting algorithm that handles local lightsources (lightsources close enough to an object that the direction vector to them from each point on the object changes significantly over the course of the object). The most natural way to represent this is by interpolating the lightsource position in one of the pixel shader registers, so you can do light attenuation per-pixel.

With 8-bit precision, this approach is going to have unavoidably bad results in many cases for obvious reasons.

With 16-bit floating point (10.5.1), far-away lights are likely to be out of range, causing floating point overflows. For large objects, you could get into serious floating point precision loss (the classic A-B problem where A and B are large floating point numbers that are nearly equal) -- with a 10-bit mantissa, if you lose 5 bits to such things, there are few enough bits left that significant banding will be visible. With 16-bit, you could compensate for these issues by scaling lightsource positions and special-casing the setup code to work around such problems.
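Both effects are easy to reproduce on a CPU. The sketch below uses Python's standard `struct` module, whose `'e'` format code packs IEEE half precision (10-bit mantissa, 5-bit exponent, 1 sign bit); it assumes that this is the same layout as the 16-bit format meant above. The helper names and the sample coordinates are made up for the illustration.

```python
import struct


def to_half(x):
    """Round a Python float to the nearest 16-bit (half precision) value."""
    return struct.unpack('e', struct.pack('e', x))[0]


def to_single(x):
    """Round a Python float to the nearest 32-bit (single precision) value."""
    return struct.unpack('f', struct.pack('f', x))[0]


# The A-B problem: two nearby positions that are far from the origin.
a, b = 1000.3, 1000.0
half_diff = to_half(to_half(a) - to_half(b))
single_diff = to_single(to_single(a) - to_single(b))

print(half_diff)    # 0.5 (the true difference is 0.3)
print(single_diff)  # very close to 0.3

# Overflow: a far-away light position does not fit in half precision at all.
try:
    to_half(100000.0)
    overflowed = False
except (OverflowError, struct.error):
    overflowed = True

print(overflowed)   # True
```

Near 1000 the spacing between adjacent half-precision values is 0.5, so a 0.3 offset cannot be represented, which is the kind of error that would show up as banding in a per-pixel attenuation term.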

With 32-bit floating point, the situation will be artefact-free in all conceivable realistic cases, and no setup trickery is needed.

Now, you may ask what is this positional data if it isn't an input to the vertex shader (which isn't supported yet by any hardware currently available)? Well, it could be in vertex or pixel shader constants, vertex stream values, textures (either manually generated or coming from render-to-texture).

So what really is the difference between 32-bit per channel render-targets/textures and the R3xx's 24-bit per channel pixel pipeline, you ask?

8-bit was enough for color if you're just doing a couple adds/multiplies and don't care about overbrightening and hack your lightsource brightnesses to be in a non-physical range (say 16-255).

16-bit floating point is all you need for color with overbrightening and realistic lightsources. But it's not enough for worldspace or screenspace positional math.

32-bit floating point is enough for per-pixel worldspace or screenspace positional math. It sometimes falls down in high-level matrix transformations in very large-scale scenes. This can be worked around at a high level, or you can use 64-bit floating point in your high-level engine code (while still using 32-bit in the pixel and vertex shaders).

Here's a reference : http://www.math.colostate.edu/~hulpke/lectures/gs510/sli3.pdf

This comes into play in 3D math because you might want to compute a per-pixel vector from your current position in the world to a point on another object or a lightsource, which entails subtracting them, which loses lots of precision if the points are close to each other but far from the origin, which is a big deal if you only have 10 bits (instead of 24) to start with.

The other way to look at this is to ask: why don't CPUs support 16-bit floating point? ANSWER: Because it's useless for nearly all general-purpose math, including 3D math. The one case I can see where it's useful is color and normal vector computations that produce colors as results, where your eyes are low enough precision that 10 bits of mantissa is enough. But for anything besides color, 16-bit floating point is highly dubious. If it were generally useful, CPUs would have adopted it long before 32- and 64-bit floating point (which all Pentiums supported in silicon in 1994), since 16-bit is significantly less expensive to implement.

There are other peeves I have about the R300's 24-bit internal FP calculations when there is an IEEE 32-bit standard that the NV30 (apparently, because I don't have an NV30) follows... but the above should suffice until I see fresh urgency to post additional examples.

LeStoffer · Mar 8, 2003

dominikbehr said:
We just experienced a big feature set change on GPUs. We will probably spend the next few years refining the speed, instruction sets and capabilities of the current generation.

Brilliant!

I was hoping for exactly this: that we would reach a decent plateau with DX9-level features and stay there with a focus on higher performance. I'm sure some developers want things beyond VS 3.0 and PS 3.0, but let's make complex/long shaders with those plenty fast first (which is already a rather big task IMHO).

MfA · Mar 8, 2003

I've seen it said (http://groups.google.com/groups?q=renderman+single-precision&hl=en&lr=&ie=UTF-8&oe=UTF-8&selm=87a9n5%2482b%241%40sherman.pixar.com&rnum=7) that prman uses single precision floating point almost exclusively.
