Regarding hardware drawing efficiency


SA
15-Oct-2002, 02:22
What frame rate would you achieve at a resolution of 1600x1200, a clock frequency of 325 MHz and just one pixel pipeline if you could actually draw one visible pixel per clock?

Bigus Dickus
15-Oct-2002, 03:11
1600 x 1200 = 1,920,000 pixels on screen = 1,920,000 pixels per frame.

325,000,000 cycles per second x 1 pixel per cycle = 325,000,000 pixels per second.

(325,000,000 pix/sec) / (1,920,000 pix/frame) = 169.27 frames/sec
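The same arithmetic as a small Python sketch, in case anyone wants to plug in other resolutions or clocks. The variable names are just illustrative, and it assumes the idealised case from the question: exactly one visible pixel written per clock, no overdraw and no multipass.

```python
# Theoretical peak frame rate for one pixel pipeline that writes
# exactly one visible pixel per clock (idealised: no overdraw).
width, height = 1600, 1200
clock_hz = 325_000_000
pixels_per_clock = 1

pixels_per_frame = width * height                 # pixels on screen
pixels_per_second = clock_hz * pixels_per_clock   # fill rate
fps = pixels_per_second / pixels_per_frame

print(f"{pixels_per_frame:,} pixels per frame")
print(f"{fps:.2f} frames per second")
```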

Chalnoth
15-Oct-2002, 03:30
Just don't forget that many modern scenes will apply many textures per pixel, will compute the final color for a pixel in multiple passes, or make use of transparent surfaces.

What all of this means is that if you took a real game scene from, say, Unreal Tournament 2003, and put it through hardware that was capable of outputting each pixel only once, it might still need to use many clocks per pixel just to get the processing done.

KnightBreed
15-Oct-2002, 03:35
Ok, point made. What do you suggest? You've been an open proponent of deferred rendering solutions.

SA
15-Oct-2002, 04:50
The point is that there is a great deal of inefficiency yet in today's hardware. Improving the rendering efficiency provides a route to improving performance that does not necessarily require costly new processes, large numbers of pipelines, etc. Not that these aren't great to have, they are. Just that there is also still plenty of low hanging fruit that can come from improving rendering efficiency.

As an example, you might simply add 8k of frame/depth buffer cache to a standard IMR (about a 32x32 pixel tile's worth), then recommend that developers sort their render in roughly tile order and roughly front to back within a tile region. Older titles that did not do this would still see some benefit from the cache, while developers that took full advantage of it would get tiler-like performance with a standard IMR. Those developers that wanted to use application-driven deferred rendering could still render the scene twice, once without shading (to set the depth buffer) and then again with shading.
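A rough Python sketch of what that application-side ordering could look like. Everything here is hypothetical: the `Draw` record, its fields and the sample objects are made up for illustration, and the 8 bytes per pixel figure (32-bit colour plus 32-bit depth/stencil) is only an assumption that happens to make a 32x32 tile come out at 8 KB.

```python
from dataclasses import dataclass

TILE = 32              # tile edge in pixels, as in the post
BYTES_PER_PIXEL = 8    # assumption: 32-bit colour + 32-bit depth/stencil

# 32 x 32 pixels at 8 bytes each fits the suggested cache size.
cache_bytes = TILE * TILE * BYTES_PER_PIXEL

@dataclass
class Draw:            # hypothetical stand-in for an object / draw call
    name: str
    x: int             # screen-space centre in pixels
    y: int
    depth: float       # smaller means closer to the viewer

def tile_order_key(d: Draw):
    # Rough tile order first (row-major), then front to back in the tile.
    return (d.y // TILE, d.x // TILE, d.depth)

draws = [
    Draw("far_wall", 40, 10, 0.9),
    Draw("crate", 10, 10, 0.5),
    Draw("player_gun", 45, 20, 0.1),
    Draw("floor", 12, 50, 0.7),
]
ordered = sorted(draws, key=tile_order_key)
print(cache_bytes, [d.name for d in ordered])
```

Binning by object centre is the crude part: anything larger than a tile, or straddling a tile boundary, only lands in roughly the right place, which is all "roughly tile order" asks for.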

Hierarchical z buffering would add even more benefit, especially if the upper levels were cached on the chip. I would recommend up to 5 levels (for quick elimination of large stencil polys, bounding volume occlusion checks, etc.).

Another step is to provide for z occlusion culling using bounding volumes, to eliminate unnecessary hidden vertex and pixel processing. This becomes an ever-increasing issue as triangle rates and scene complexity increase. I think it is important to provide this capability as a standard feature across all 3d hardware vendors and APIs. Z occlusion culling works particularly well with 5 or more levels of hierarchical z to quickly determine the visibility of the bounding volumes.

Using more efficient multisampling AA techniques, such as Z3 or another coverage mask approach with sparse grid sampling, could provide 16x or even 32x near-stochastic AA with little performance impact. It would correctly handle implicit edges and order-independent transparency sorting to boot.

There are still some improvements both in performance and quality that can be made in anisotropic filtering as well. Some of the ideas in the Feline approach would be useful.

There are, of course, many other possibilities. Improving rendering efficiency has just begun to be tapped and offers all the vendors the opportunity for a great deal of performance improvement in the near term.

LeStoffer
15-Oct-2002, 07:01
As an example, you might simply add 8k of frame/depth buffer cache to a standard IMR (about a 32x32 pixel tile's worth), then recommend that developers sort their render in roughly tile order and roughly front to back within a tile region.

Nice and fairly simple, but NV/ATI still have to convince game developers to sort [roughly] front to back and take advantage of LMA and HyperZ.

Anyway, I had this stupid idea recently about doing the sorting between the vertex and pixel level in a big on-chip Z-check buffer before any texels are applied to the pixel (i.e. before any pixels are actually rendered). My lame idea was that you only had to keep each "pre-pixel's" Z-value, and thus could build up the pre-pixel data in the buffer and remove all the hidden ones based on their Z-values. When every pre-pixel has been either rejected or accepted in the buffer, you would go on to actually render those pixels.

But then I realized it doesn't make any bloody sense because you have to store a lot of data to go with each and every pixel that is about to be drawn. :oops:

Hellbinder
15-Oct-2002, 08:29
Improving the rendering efficiency provides a route to improving performance that does not necessarily require costly new processes, large numbers of pipelines


Remember I said (publicly) that the Nv30 was a 4x4 architecture that employs several new features instead of more pipelines to gain large amounts of speed... oh about a week ago.. ;)

LeStoffer
15-Oct-2002, 08:44
Remember I said (publicly) that the Nv30 was a 4x4 architecture that employs several new features instead of more pipelines to gain large amounts of speed... oh about a week ago.. ;)

We remember. :wink: The question, however, is what this "employs several new features" is really about. So what is it gonna be, Hell? :P

Randell
15-Oct-2002, 08:49
hmm another one of SA's famous hints?

Z3 AA (which I still don't understand fully even after having looked at the white paper) sounds like a great implementation.

Kristof
15-Oct-2002, 09:00
As an example, you might simply add 8k of frame/depth buffer cache to a standard IMR (about a 32x32 pixel tile's worth), then recommend that developers sort their render in roughly tile order and roughly front to back within a tile region.

:o

Err... I think that render order is one of the things that the developer should not have to care about... plenty of other things to worry about. We don't want developers to worry about low-level things like optimising per pixel HSR... this is one of the most basic features of 3D hardware and it should just work efficiently.

I am sure that NVIDIA and ATI would prefer that developers start by following the absolute basic optimisation rules. Just to give some examples: do a flip rather than a blit from back buffer to front buffer (one is a few pointer changes, the other is a full memory copy)... submit more than 2 polygons per draw primitive call... This all sounds trivial, but if there are developers out there that cannot even get this right, god only knows what will happen if you expect them to do the kind of sorting you suggested.

Also I believe that ATI already has some kind of back-end tile-like buffer, IIRC this was promoted a bit by Marketing for 8500 ?

K-

Simon F
15-Oct-2002, 09:05
... this all sounds trivial but if there are developers out there that can not even get this right, god only knows what will happen if you expect them to do the kind of sorting you suggested.
Bubble sort, perhaps?

GetStuff
15-Oct-2002, 09:44
Improving the rendering efficiency provides a route to improving performance that does not necessarily require costly new processes, large numbers of pipelines


Remember I said (publicly) that the Nv30 was a 4x4 architecture that employs several new features instead of more pipelines to gain large amounts of speed... oh about a week ago.. ;)


As if it's really hard to come to a conclusion based on all the bits and pieces floating around the internet... :roll: :lol: :lol:

arjan de lumens
15-Oct-2002, 09:51
I seem to remember an old block diagram of ATI's Rage128 chip with an 8 Kbyte framebuffer cache - if that old chip had it, I would find it likely that newer chips also have it, probably more than 8 KBytes as well. Also, AFAIK, most IMRs today already use tiled framebuffers, typically with 8x8 pixel tiles, presumably caching multiple such tiles.

Using bounding boxes on 3d objects to do optimizations on them is entirely possible, but requires extensive support at both API and application level. Rejecting bounding boxes based on hierarchical Z seems to be doable on IMR architectures only - and you need to sort the objects in front-to-back order (which probably precludes them from being sorted in tile order) to see this kind of benefit.

Z3/coverage mask AA methods? Just wondering what the memory usage, performance hit and image quality on these methods are compared to e.g. ATI's multisampling implementation (compressed multisample buffer => fairly small performance hit).

I am not really convinced that there are any really low-hanging fruit left to collect (other than perhaps better texture compression methods) - for now, it looks like compressing the multisample buffer was the last one that didn't require extensive API support.

LeStoffer
15-Oct-2002, 10:07
I am not really convinced that there are any really low-hanging fruit left to collect (other than perhaps better texture compression methods) - for now, it looks like compressing the multisample buffer was the last one that didn't require extensive API support.

I would think the same, considering that ATI is on their third-generation HyperZ and nVidia on their second-generation LMA.

If you're going for big benefits it would seem that you have to do some kind of sorting of either polygons or pixels into a list, instead of just removing some hidden pixels based on a Z-check along the way. And thus the question is: is there any method where you don't need a full-scale sort?

Dave Baumann
15-Oct-2002, 10:34
I think SA may be making a point here. If we remember back to when the remains of 3dfx were purchased by NVIDIA, there were a number of interviews at the time with NV's CEO, and others, stating that they doubted they would fully adopt the Gigapixel deferred rendering approach, but that there may be ways of marrying some of the benefits of the tiling approach with IMRs. Now, what SA is talking about sounds like one of the possibilities that they were talking about at the time.

arjan de lumens
15-Oct-2002, 11:25
I don't quite see how. Either you do immediate-mode rendering, drawing polygons as you receive them, or you do deferred rendering, collecting polygon data for an entire scene before drawing any of it. To me, it would seem that anything between would inherit the disadvantages of both and the advantages of neither.

Sorting objects in near-tile-order gets difficult with objects that are larger than a tile or straddle tile boundaries - it seems to me that at best you get a rather small increase in the framebuffer cache hit rate (this would, in any case, not require changes to modern IMRs)

Ailuros
15-Oct-2002, 15:48
Z3/coverage mask AA methods? Just wondering what the memory usage, performance hit and image quality on these methods are compared to e.g. ATI's multisampling implementation (compressed multisample buffer => fairly small performance hit).

Theoretically (or essentially) for "free" if the framebuffer is on chip with far more than just 4 samples and across resolutions. That's at least what I understood last time it was analyzed.

Sorting objects in near-tile-order gets difficult with objects that are larger than a tile or straddle tile boundaries - it seems to me that at best you get a rather small increase in the framebuffer cache hit rate.

What if you use varying sizes of tiles, e.g. split up the scene into 2 or 3 parts and then resplit it afterwards? My knowledge of stuff like that is very basic to be honest, but from the little I understood trying to decode the latest PVR patent into layman's terms, there doesn't seem to be a necessity to complete a frame before moving to the next one, on occasions like those described in it.

(Simon correct me please if I'm wrong).

On a sidenote can someone please add some more simple input on possible advantages of Feline algorithms? Last time a patent was posted I got lost even trying to read it *ahem*.

Gollum
15-Oct-2002, 16:32
arjan de lumens, SA has been carefully hinting that, despite some people believing otherwise, there is still headroom left for performance improvement in current and future hardware accelerators by increasing the rendering pipeline efficiency, which doesn't necessarily mean the way polygons are being fed to the pipeline by IMRs or TBRs IMHO. So why not talk about how this could be achieved and go into where these tweaks might be possible? As an old tech lurker here I was hoping some of the more technically versed people could make some interesting comments to learn from... :)

arjan de lumens
15-Oct-2002, 17:23
I just do not see that there is all that much efficiency headroom left, at least not in IMR architectures running legacy applications. The tiled framebuffer cache seems to have been around for some time, at least since Radeon8500 and almost certainly much longer (voodoo?); Z3 may look better than 4xRGMS, but requires more per-pixel data for non-edge pixels (=problem in IMR, should work fine in TBR); bounding box optimizations are nice, but require application support; how well does the Feline algorithm perform compared to whatever method it is that ATI uses for anisotropic mapping (assuming that it isn't the very same algorithm)?

There seems to be an idea floating around here about an immediate-mode tiler architecture. Such a beast will require applications/games to be written such that they supply data in tile order. OK so far - here is the difficult part: it needs an efficient method for handling objects that straddle tile boundaries.

Hyp-X
15-Oct-2002, 17:54
My guess: Z-only first pass...

With proper hw support it could be a killer feature. The question is not the number of pixel pipes, but the number of Z-operations possible per cycle, when the pixel pipelines are not used...

RoOoBo
15-Oct-2002, 18:08
My guess: Z-only first pass...

With proper hw support it could be a killer feature. The question is not the number of pixel pipes, but the number of Z-operations possible per cycle, when the pixel pipelines are not used...

For that you would need full vertex shader or T&L for all the scene (and two times) which I hardly can see as efficient.

arjan de lumens
15-Oct-2002, 18:14
My guess: Z-only first pass...

With proper hw support it could be a killer feature. The question is not the number of pixel pipes, but the number of Z-operations possible per cycle, when the pixel pipelines are not used...

Makes sense for scenarios with complex multitexturing/pixel shaders and/or high overdraw, when the memory traffic saved for overdrawn pixels (modern renderers are generally smart enough not to texture a pixel that fails Z test) outweighs the additional Z traffic produced in the Z-only pass and the fact that you need to pass geometry twice. Doesn't Doom3 do something like this already? Having dedicated hardware for this task may or may not make sense, depending on whether the standard pixel pipes are already able to saturate the available memory bandwidth with Z-only traffic.

Hyp-X
15-Oct-2002, 18:47
For that you would need full vertex shader or T&L for all the scene (and two times) which I hardly can see as efficient.

Transform yes, lighting no.
No environment mapping computation, per-pixel lighting precalc, etc.
It's quite a big saving.

Also, games are still not vertex limited (not even UT2003).

They could also increase the vertex processing power to make it possible (note, I said proper hw support.)

Humus
15-Oct-2002, 22:36
I don't quite see how. Either you do immediate-mode rendering, drawing polygons as you receive them, or you do deferred rendering, collecting polygon data for an entire scene before drawing any of it. To me, it would seem that anything between would inherit the disadvantages of both and the advantages of neither.

You could collect a small number of polygons, but not necessarily the whole scene. If the hardware would batch up say 1000 polygons or so and sort them before drawing, you could increase efficiency quite a lot.

Nagorak
15-Oct-2002, 22:53
For that you would need full vertex shader or T&L for all the scene (and two times) which I hardly can see as efficient.

Transform yes, lighting no.
No environment mapping computation, per-pixel lighting precalc, etc.
It's quite a big saving.

Also, games are still not vertex limited (not even UT2003).

They could also increase the vertex processing power to make it possible (note, I said proper hw support.)

Games may not be vertex limited, but isn't that just because newer hardware contains a ridiculous number of vertex shaders (4 in R300, etc)? Maybe I misunderstand the use of the vertex shaders, but why would both ATi and Nvidia keep adding more if they had no effect on performance?

arjan de lumens
15-Oct-2002, 23:07
You could collect a small number of polygons, but not necessarily the whole scene. If the hardware would batch up say 1000 polygons or so and sort them before drawing, you could increase efficiency quite a lot.

I can see how polygon batching and sorting could give the benefits of front-to-back rendering within a 3d object - which would give a moderate efficiency increase in an object that partially covers itself (how common is this?) - at the cost of additional memory traffic for sorting and writing and re-reading of T&Led vertices. It's a tradeoff, as far as I can see. For data sets larger than an 'object', further batching & sorting should give results similar to or slightly weaker than what happens when you sort the objects yourself.

Hyp-X
15-Oct-2002, 23:43
Games may not be vertex limited, but isn't that just because newer hardware contains a ridiculous number of vertex shaders (4 in R300, etc)? Maybe I misunderstand the use of the vertex shaders, but why would both ATi and Nvidia keep adding more if they had no effect on performance?

They always improve VS performance along with fillrate increase so 4x VS was a logical move for the R9700.
Only budget cards go in the opposite direction (Gf4MX / Xabre), where they excluded hw VS support because no games really need it NOW.
High-end cards are meant for the future.
Whether the IHVs are guessing the future right is another question...

One of the reasons to increase VS processing power is to make longer shader programs feasible.

Hyp-X
15-Oct-2002, 23:50
Let's see:
256bits DDR interface, 32bit Z-buffer, 16 values can be transferred per clock.
Assuming 1:4 compression means 64 values per clock.
That would allow 64 pixels on Z-fail, or 32 on Z-pass.
Actually more if mem clock > core clock.
Compare that to the 8 pipelines.
It's quite far from full utilization...
That Z-pass could be made really fast!

Then the normal passes could skip 64 pixels per cycle when occluded...
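The back-of-envelope numbers above, written out as a Python sketch. The variable names are illustrative, and it assumes memory clock equals core clock and that the 1:4 Z compression ratio always holds, which is the optimistic case. A Z-fail is counted as one read, and a Z-pass as a read plus a write.

```python
# Peak Z values per clock on a 256-bit DDR bus (idealised figures).
bus_bits = 256
transfers_per_clock = 2    # DDR: two transfers per memory clock
z_bits = 32
compression = 4            # assumed 1:4 Z compression
pipelines = 8

raw_z_per_clock = bus_bits * transfers_per_clock // z_bits
compressed_z_per_clock = raw_z_per_clock * compression

z_fail_pixels = compressed_z_per_clock        # read only
z_pass_pixels = compressed_z_per_clock // 2   # read + write

print(raw_z_per_clock, compressed_z_per_clock)
print(z_fail_pixels, z_pass_pixels, "vs", pipelines, "pixel pipelines")
```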

Hellbinder
16-Oct-2002, 00:56
Sorry to interrupt your very cool and interesting discussion again.. but..


As if it's really hard to come to a conclusion based on all the bits and pieces floating around the internet


To my knowledge no one anywhere other than myself has stated anything even close to the Nv30 being a 4x4. Everyone and their brother claims it is an 8x2. That is pretty common knowledge. I did not throw that out there just because I *thought it would be cool*... The wind whispered it to me in a dream... ;)

the question is.. is the wind right? :lol:

Nagorak
16-Oct-2002, 01:58
If NV30 is really 4*4 then I have to question why (maybe one of their engineers is into pick-up trucks?). All bandwidth saving, etc excluded it doesn't seem like having 4 TMUs is really going to pay off. On a game like Doom3 with 6 texture layers you'll still need to multipass anyway whether you have 4 TMUs or 1 TMU.

I think it would be very interesting if Nvidia followed a totally different path than ATi, but it just seems so out of character. In the past, ATi has been the one to try competing with more technically advanced/efficient parts, while Nvidia concentrated more on brute force. Maybe they've traded places now...but just on paper 8*1 seems a lot better than 4*4, at least unless you like hauling heavy loads. ;)

3dcgi
16-Oct-2002, 03:41
My guess: Z-only first pass...

With proper hw support it could be a killer feature. The question is not the number of pixel pipes, but the number of Z-operations possible per cycle, when the pixel pipelines are not used...

I would think these extra Z-operations are being used for sub-pixels to increase the quality of antialiasing.

LittlePenny
16-Oct-2002, 04:33
I have a question about the sorting developers would need to do for the cache thing. Please keep in mind I haven't developed any games myself.

Would it be possible to develop a generic data structure for primitives? I imagine if an IHV did this, and developers used inheritance to add their own ideas to the mix this would make things more doable.

Tagrineth
18-Oct-2002, 15:21
If NV30 is really 4*4 then I have to question why (maybe one of their engineers is into pick-up trucks?). All bandwidth saving, etc excluded it doesn't seem like having 4 TMUs is really going to pay off. On a game like Doom3 with 6 texture layers you'll still need to multipass anyway whether you have 4 TMUs or 1 TMU.

I think it would be very interesting if Nvidia followed a totally different path than ATi, but it just seems so out of character. In the past, ATi has been the one to try competing with more technically advanced/efficient parts, while Nvidia concentrated more on brute force. Maybe they've traded places now...but just on paper 8*1 seems a lot better than 4*4, at least unless you like hauling heavy loads. ;)

4*4 results in 16 TMU's, 8*1 results in 8.

In theory, a quad-textured game (Serious Sam) would run around twice as fast on the 4*4 at an equal clock speed... (full four pixels with four texels per cycle, versus two pixels with four texels on the 8*1)

Chalnoth
18-Oct-2002, 16:26
The only thing is, with a 4x4 pipeline, the performance wouldn't be as high as with an 8x1 pipeline with a random number of textures applied.

That is, consider this (pixels per clock):

0 textures: 4 vs. 8
1 texture : 4 vs. 8
2 textures: 4 vs. 4
3 textures: 4 vs. 2.66
4 textures: 4 vs. 2
5 textures: 2 vs. 1.6
6 textures: 2 vs. 1.33
7 textures: 2 vs. 1.14
8 textures: 2 vs. 1

If you assume that each of these is equally-likely, then you get the following:

4x4 pipeline: 3.1 pixels per clock on average
8x1 pipeline: 3.3 pixels per clock on average

The crucial difference here is the inclusion of the "0 textures per pixel" portion, which will be important for DOOM3. Without the inclusion of that portion of the chart, the average over any possible number of textures would result in a tie here. I think that ever since JC announced how he was going to do shadows in that game, it has changed how 3D chip designers have thought about high performance. Starting with DOOM3, it will be quite a bit better to have more pixel pipelines with fewer textures per pipeline than few pipelines with still more textures possible per clock. With a game like DOOM3, it will be even better to have a 16x0/8x1 pipeline configuration, where 16 pixels per clock are possible if no textures are applied.
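For anyone who wants to reproduce the chart, here is a small Python sketch. The model is an assumption made for illustration: each TMU applies one texture per clock, and a pixel needing more textures than a pipeline has TMUs loops back for whole extra clocks. The function name is made up.

```python
from math import ceil

def pixels_per_clock(pipes, tmus, textures):
    # Clocks a pipeline spends on one pixel: at least one, plus a
    # loop-back for every further group of `tmus` textures.
    clocks = max(1, ceil(textures / tmus))
    return pipes / clocks

counts = range(9)  # 0 to 8 textures, each taken as equally likely
rate_4x4 = [pixels_per_clock(4, 4, n) for n in counts]
rate_8x1 = [pixels_per_clock(8, 1, n) for n in counts]

for n, a, b in zip(counts, rate_4x4, rate_8x1):
    print(f"{n} textures: {a:.2f} vs. {b:.2f}")

avg_4x4 = sum(rate_4x4) / len(rate_4x4)
avg_8x1 = sum(rate_8x1) / len(rate_8x1)
print(f"4x4: {avg_4x4:.1f}  8x1: {avg_8x1:.1f} pixels per clock")
```

Note that this is a plain arithmetic mean of the rates, so it weights each texture count by time spent rather than by number of pixels drawn.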

Randell
18-Oct-2002, 17:00
Isn't it also to do with the fact that the 9000 & 9700 can do more with 1 TMU than older architectures, so it's not an apples-to-apples comparison? I expect the same from all future hardware.

arjan de lumens
18-Oct-2002, 17:06
The only thing is, with a 4x4 pipeline, the performance wouldn't be as high as with an 8x1 pipeline with a random number of textures applied.

That is, consider this (pixels per clock):

0 textures: 4 vs. 8
1 texture : 4 vs. 8
2 textures: 4 vs. 4
3 textures: 4 vs. 2.66
4 textures: 4 vs. 2
5 textures: 2 vs. 1.6
6 textures: 2 vs. 1.33
7 textures: 2 vs. 1.14
8 textures: 2 vs. 1

If you assume that each of these is equally-likely, then you get the following:

4x4 pipeline: 3.1 pixels per clock on average
8x1 pipeline: 3.3 pixels per clock on average

??? ?? ? ???

This only makes sense if the pipelines spend the same amount of drawing time on each texture count. If you instead assume that the number of pixels with a given texture count is the same for each texture count, this computation is bogus and it should be redone with a clocks per pixel metric instead.

To draw an analogy: Suppose you drive a car at 15 mph half of the time and 45 mph the other half of the time. In that case, your average speed is 30 mph. If you instead drive half the distance at 15 mph and the other half at 45 mph, your average speed is only 22.5 mph - because you spend much more time running at 15 mph than at 45 mph. Similarly, you end up spending a lot more time rendering high-texture-count pixels than low-texture-count ones.

Clocks per pixel: (4x4 vs 8x1)
0 tex: 1/4 vs 1/8
1 tex: 1/4 vs 1/8
2 tex: 1/4 vs 2/8
3 tex: 1/4 vs 3/8
4 tex: 1/4 vs 4/8
5 tex: 1/2 vs 5/8
6 tex: 1/2 vs 6/8
7 tex: 1/2 vs 7/8
8 tex: 1/2 vs 1

Average 4x4: 0.361 clocks per pixel (2.77 pixels per clock)
Average 8x1: 0.514 clocks per pixel (1.94 pixels per clock)

Although I suspect that in real life, the numbers are severely skewed towards the lower texture counts.
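A short Python sketch of the same calculation, under the assumption that each TMU applies one texture per clock and extra textures cost whole loop-back clocks (the function name is illustrative). Averaging clocks per pixel and then inverting is a harmonic mean of the per-count rates, which is also what the car analogy works out to.

```python
from math import ceil
from statistics import harmonic_mean, mean

def clocks_per_pixel(pipes, tmus, textures):
    # One clock minimum, plus a loop-back per extra group of textures,
    # divided across the parallel pipelines.
    passes = max(1, ceil(textures / tmus))
    return passes / pipes

counts = list(range(9))  # equal number of pixels at each texture count
cpp_4x4 = mean([clocks_per_pixel(4, 4, n) for n in counts])
cpp_8x1 = mean([clocks_per_pixel(8, 1, n) for n in counts])

print(f"4x4: {cpp_4x4:.3f} clocks/pixel, {1 / cpp_4x4:.3f} pixels/clock")
print(f"8x1: {cpp_8x1:.3f} clocks/pixel, {1 / cpp_8x1:.3f} pixels/clock")

# Car analogy: equal distances at 15 mph and 45 mph.
print(harmonic_mean([15, 45]), "mph average over equal distances")
```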

edit: fleshed out the car analogy a little.

Hellbinder
19-Oct-2002, 03:17
chalnoth..

For a while now the TBR rumors surrounding the Nv30 have been cropping up...gigapixel technology getting thrown around etc. Now none of us actually thinks that the Nv30 is a TBR...well....What if it just borrows ideas born in TBR etc... As SA points out there are several methods that Could be used to dramatically increase the efficiency of the pipeline.

Thus a 4x4 design may not be hindered as much as it seems.

Ailuros
19-Oct-2002, 04:13
As SA points out there are several methods that Could be used to dramatically increase the efficiency of the pipeline.

Thus a 4x4 design may not be hindered as much as it seems.

Sounds more like an oxymoron to me. There's even more reason to increase the number of pipelines when those are more efficient, than to increase the number of TMU's.

PS: Ever wondered why Tilers (since you brought it up) do not need at any price more than one TMU per pipe up until now?

Nagorak
19-Oct-2002, 05:37
4*4 results in 16 TMU's, 8*1 results in 8.

In theory, a quad-textured game (Serious Sam) would run around twice as fast on the 4*4 at an equal clock speed... (full four pixels with four texels per cycle, versus two pixels with four texels on the 8*1)

That's great, but name a single quad textured game out there or that's going to be out there. Doom 3 is going to have 6 textures per pass and force a loop back even on a 4*4, so it loses its major advantage very quickly while sacrificing additional pipes which do much more for performance.

And don't forget those extra TMUs can easily just go to waste in a lot of situations. See original Radeon for details.

LeStoffer
20-Oct-2002, 09:42
Average 4x4: 0.361 clocks per pixel (2.77 pixels per clock)
Average 8x1: 0.514 clocks per pixel (1.94 pixels per clock)

Although I suspect that in real life, the numbers are severely skewed towards the lower texture counts.

I should just add that John Carmack made a note of this during his interview here at beyond3d:

Several hardware vendors have poorly targeted their control logic and memory interfaces under the assumption that high texture counts will be used on the bulk of the pixels. While stencil shadow volumes with zero textures are an extreme case, almost every game of note does a lot of single texture passes for blended effects.

This is partly why I think that the 8 x 1 architecture on the R 9700 is a very nice choice. I would guess that it's easier to tune your memory logic this way rather than for, let's say, 4 pipelines with multiple TMUs, where you don't know how many of those TMUs will be idle some of the time.
My point is that with 8 x 1 your performance hit from having to apply and fetch more texels should be fairly linear.