Old 15-Oct-2002, 02:22   #1
SA
Member
 
Join Date: Feb 2002
Posts: 100
Default Regarding hardware drawing efficiency

What frame rate would you achieve at a resolution of 1600x1200, a clock frequency of 325 MHz and just one pixel pipeline if you could actually draw one visible pixel per clock?
SA is offline  
Old 15-Oct-2002, 03:11   #2
Bigus Dickus
Member
 
Join Date: Feb 2002
Posts: 772
Default

1600 x 1200 = 1,920,000 pixels on screen = 1,920,000 pixels per frame.

325,000,000 cycles per second x 1 pixel per cycle = 325,000,000 pixels per second.

(325,000,000 pix/sec) / (1,920,000 pix/frame) = 169.27 frames/sec
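
The same arithmetic as a small Python sketch, for anyone who wants to try other resolutions or clocks. The function and parameter names are made up for illustration; the numbers are the ones from the question.

```python
def peak_frame_rate(width, height, clock_hz, pixels_per_clock=1):
    """Upper bound on frames/sec if every clock yields visible pixels."""
    pixels_per_frame = width * height
    pixels_per_second = clock_hz * pixels_per_clock
    return pixels_per_second / pixels_per_frame

# 1600x1200 at 325 MHz with a single pipeline
print(round(peak_frame_rate(1600, 1200, 325_000_000), 2))  # 169.27
```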
__________________
Looks like it was option "B." Sigh.
Bigus Dickus is offline  
Old 15-Oct-2002, 03:30   #3
Chalnoth
 
Join Date: May 2002
Location: New York, NY
Posts: 12,678
Default

Just don't forget that many modern scenes will apply many textures per pixel, will compute the final color for a pixel in multiple passes, or make use of transparent surfaces.

What all of this means is that if you took a real game scene from, say, Unreal Tournament 2003, and put it through hardware that was capable of outputting each pixel only once, it might still need to use many clocks per pixel just to get the processing done.
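
To put rough numbers on that, here is a minimal sketch that divides the ideal figure by an assumed number of clocks spent on each visible pixel. The clocks-per-pixel values are hypothetical and not measured from any game.

```python
def frame_rate(width, height, clock_hz, clocks_per_pixel):
    """Frames/sec when each visible pixel costs several clocks to process."""
    return clock_hz / (width * height * clocks_per_pixel)

# Hypothetical per-pixel costs (extra textures, passes, blending)
for clocks in (1, 2, 4, 8):
    fps = frame_rate(1600, 1200, 325_000_000, clocks)
    print(f"{clocks} clocks/pixel: {fps:.1f} fps")
```

Even with zero overdraw, four clocks per pixel would bring the 169 fps ceiling down to about 42 fps.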
Chalnoth is offline  
Old 15-Oct-2002, 03:35   #4
KnightBreed
Member
 
Join Date: Feb 2002
Posts: 203
Default

Ok, point made. What do you suggest? You've been an open proponent of deferred rendering solutions.
__________________
Like my Dad always says, "The day I can't do my job drunk, is the day I hand in my badge and gun."
KnightBreed is offline  
Old 15-Oct-2002, 04:50   #5
SA
Member
 
Join Date: Feb 2002
Posts: 100
Default

The point is that there is a great deal of inefficiency yet in today's hardware. Improving the rendering efficiency provides a route to improving performance that does not necessarily require costly new processes, large numbers of pipelines, etc. Not that these aren't great to have, they are. Just that there is also still plenty of low hanging fruit that can come from improving rendering efficiency.

As an example, you might simply add 8k of frame/depth buffer cache to a standard IMR (about a 32x32 pixel tile's worth), then recommend that developers sort their render in roughly tile order and roughly front to back within a tile region. Older titles that did not do this would still see some benefit from the cache, while developers that took full advantage of it would get tiler-like performance with a standard IMR. Those developers that wanted to use application-driven deferred rendering could still render the scene twice: once without shading (to set the depth buffer) and then again with shading.
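
A rough sketch of both halves of that suggestion. It assumes 32-bit colour and 32-bit depth per pixel (which is what makes a 32x32 tile come out at 8 KB) and represents each object by a screen-space centre plus nearest depth. `cache_bytes` and `tile_order_key` are hypothetical names, not part of any API.

```python
TILE = 32  # tile edge in pixels, from the 32x32 figure above

def cache_bytes(tile=TILE, color_bytes=4, depth_bytes=4):
    """Bytes needed to hold one tile of colour plus depth (assumed 32-bit each)."""
    return tile * tile * (color_bytes + depth_bytes)

def tile_order_key(obj, screen_width=1600):
    """Sort key: tile index first, then nearest depth (front to back)."""
    x, y, depth = obj  # screen-space centre and nearest depth of the object
    tiles_per_row = (screen_width + TILE - 1) // TILE
    tile_index = (y // TILE) * tiles_per_row + (x // TILE)
    return (tile_index, depth)

objects = [(40, 10, 0.9), (5, 5, 0.7), (5, 6, 0.2), (33, 0, 0.1)]
print(cache_bytes())                        # 8192
print(sorted(objects, key=tile_order_key))  # grouped by tile, nearest first
```

The sort only needs to be approximate for the cache to help; objects spanning several tiles are ignored here.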

Hierarchical z buffering would add even more benefit, especially if the upper levels were cached on the chip. I would recommend up to 5 levels (for quick elimination of large stencil polys, bounding volume occlusion checks, etc.).

Another improvement would be providing for z occlusion culling using bounding volumes, to eliminate unnecessary hidden vertex and pixel processing. This becomes an ever-increasing issue as triangle rates and scene complexity increase. I think it is important to provide this capability as a standard feature across all 3d hardware vendors and APIs. Z occlusion culling works particularly well with 5 or more levels of hierarchical z to quickly determine the visibility of the bounding volumes.
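
As an illustration of the last two points, here is a minimal software sketch of a hierarchical z pyramid and a conservative bounding-rectangle test against it. It assumes a square power-of-two depth buffer where larger z means farther away; real hardware would differ in many details, and the function names are invented for this example.

```python
def build_hiz(depth, levels=5):
    """Build a max-depth pyramid: each coarser texel stores the farthest
    depth of the 2x2 block beneath it. depth is a square 2D list whose
    side is a power of two."""
    pyramid = [depth]
    while len(pyramid) < levels and len(pyramid[-1]) > 1:
        fine = pyramid[-1]
        half = len(fine) // 2
        coarse = [[max(fine[2 * y][2 * x], fine[2 * y][2 * x + 1],
                       fine[2 * y + 1][2 * x], fine[2 * y + 1][2 * x + 1])
                   for x in range(half)] for y in range(half)]
        pyramid.append(coarse)
    return pyramid

def box_occluded(pyramid, x0, y0, x1, y1, nearest_z):
    """Return True only if the inclusive pixel rectangle is certainly hidden,
    given the nearest depth of the bounding volume."""
    # Climb to the coarsest level where the rectangle spans at most 2x2 texels.
    level = 0
    while (level + 1 < len(pyramid)
           and ((x1 >> level) - (x0 >> level) > 1
                or (y1 >> level) - (y0 >> level) > 1)):
        level += 1
    grid = pyramid[level]
    for y in range(y0 >> level, (y1 >> level) + 1):
        for x in range(x0 >> level, (x1 >> level) + 1):
            if nearest_z <= grid[y][x]:  # might be in front of something
                return False
    return True

depth = [[0.5] * 8 for _ in range(8)]   # a wall at z = 0.5 fills the screen
hiz = build_hiz(depth)
print(box_occluded(hiz, 0, 0, 7, 7, 0.9))  # behind the wall
print(box_occluded(hiz, 0, 0, 7, 7, 0.3))  # in front of the wall
```

A large bounding volume is rejected after reading a handful of coarse texels instead of every pixel it covers, which is why keeping the upper levels on chip is attractive.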

Using more efficient multisampling AA techniques, such as Z3 or another coverage mask approach with sparse grid sampling, could provide 16x or even 32x near-stochastic AA with little performance impact. It would correctly handle implicit edges and order-independent transparency sorting to boot.
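
To show what "coverage mask" means in practice, here is a generic resolve sketch. It is not the Z3 algorithm itself: it assumes a short list of opaque fragments per pixel, each with a single depth and a 16-bit mask saying which samples it covers. All names are hypothetical.

```python
SAMPLES = 16  # samples per pixel; the post suggests 16x or 32x

def resolve_pixel(fragments, background=(0.0, 0.0, 0.0)):
    """fragments: list of (depth, coverage_mask, rgb) for one pixel.
    Each sample takes the colour of the nearest fragment covering it,
    and the pixel colour is the average over all samples."""
    total = [0.0, 0.0, 0.0]
    by_depth = sorted(fragments, key=lambda f: f[0])  # nearest first
    for s in range(SAMPLES):
        colour = background
        for _depth, mask, rgb in by_depth:
            if (mask >> s) & 1:
                colour = rgb
                break
        for c in range(3):
            total[c] += colour[c]
    return tuple(t / SAMPLES for t in total)

# A red edge covering half the samples in front of a full-coverage blue surface
print(resolve_pixel([(0.5, 0x00FF, (1, 0, 0)), (0.8, 0xFFFF, (0, 0, 1))]))
```

The storage per pixel is a few fragments plus masks rather than 16 full colour and depth values, which is where the claimed efficiency would come from.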

There are still some improvements both in performance and quality that can be made in anisotropic filtering as well. Some of the ideas in the Feline approach would be useful.

There are, of course, many other possibilities. Improving rendering efficiency has just begun to be tapped and offers all the vendors the opportunity for a great deal of performance improvement in the near term.
SA is offline  
Old 15-Oct-2002, 07:01   #6
LeStoffer
Senior Member
 
Join Date: Feb 2002
Location: Somewhere not *that* rotten in Denmark
Posts: 1,197
Default

Quote:
Originally Posted by SA
As an example, you might simply add 8k of frame/depth buffer cache to a standard IMR (about a 32x32 pixel tile's worth), then recommend that developers sort their render in roughly tile order and roughly front to back within a tile region.
Nice and fairly simple, but NV/ATI still have to convince game developers to sort [roughly] front to back and take advantage of LMA and HyperZ.

Anyway, I had this stupid idea recently about doing the sorting between the vertex and pixel level in a big on-chip Z-check buffer before any texels are applied to the pixel (i.e. before any pixels are actually rendered). My lame idea was that you only had to keep each "pre-pixel's" Z-value and thus could build up this pre-pixel data in the buffer and remove all the hidden ones based on their Z-values. When every pre-pixel is either rejected or accepted in the buffer, you would go on to actually render those pixels.

But then I realized it doesn't make any bloody sense because you have to store a lot of data to go with each and every pixel that is about to be drawn.
__________________
Best regards, LeStoffer
LeStoffer is offline  
Old 15-Oct-2002, 08:29   #7
Hellbinder
Naughty Boy!
 
Join Date: Feb 2002
Posts: 1,444
Default

Quote:
Improving the rendering efficiency provides a route to improving performance that does not necessarily require costly new processes, large numbers of pipelines
Remember I said (publicly) that the Nv30 was a 4x4 architecture that employs several new features instead of more pipelines to gain large amounts of speed... oh about a week ago..
Hellbinder is offline  
Old 15-Oct-2002, 08:44   #8
LeStoffer
Senior Member
 
Join Date: Feb 2002
Location: Somewhere not *that* rotten in Denmark
Posts: 1,197
Default

Quote:
Originally Posted by Hellbinder
Remember I said (publicly) that the Nv30 was a 4x4 architecture that employs several new features instead of more pipelines to gain large amounts of speed... oh about a week ago..
We remember. The question, however, is what this "employs several new features" is really about. So what is it gonna be, Hell? :P
__________________
Best regards, LeStoffer
LeStoffer is offline  
Old 15-Oct-2002, 08:49   #9
Randell
Senior Daddy
 
Join Date: Feb 2002
Location: London
Posts: 1,869
Default

hmm another one of SA's famous hints?

Z3 AA (which I still don't understand fully even after having looked at the white paper) sounds like a great implementation.
Randell is offline  
Old 15-Oct-2002, 09:00   #10
Kristof
Senior Member
 
Join Date: Jan 2002
Location: Abbots Langley
Posts: 732
Default

Quote:
Originally Posted by SA
As an example, you might simply add 8k of frame/depth buffer cache to a standard IMR (about a 32x32 pixel tile's worth), then recommend that developers sort their render in roughly tile order and roughly front to back within a tile region.


Err... I think that render order is one of the things that the developer should not have to care about... plenty of other things to worry about. We don't want developers to worry about low-level things like optimising per pixel HSR... this is one of the most basic features of 3D hardware and it should just work efficiently.

I am sure that NVIDIA and ATI would prefer that developers start following the absolute basic optimisation rules. Just to give some examples: do a flip rather than a blit from back buffer to front buffer (one is some pointer changes and the other is a full memory copy)... submit more than 2 polygons per draw primitive call... this all sounds trivial but if there are developers out there that can not even get this right, god only knows what will happen if you expect them to do the kind of sorting you suggested.
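
For anyone unsure why the flip matters, here is a toy model of the difference, counting the bytes each approach has to move. The `SwapChain` class is purely illustrative and not any real graphics API; 32-bit colour is an assumption.

```python
class SwapChain:
    """Toy double buffer: two byte arrays standing in for video memory."""

    def __init__(self, width, height, bytes_per_pixel=4):
        size = width * height * bytes_per_pixel
        self.front = bytearray(size)
        self.back = bytearray(size)

    def flip(self):
        """Exchange the two buffer references; no pixel data moves."""
        self.front, self.back = self.back, self.front
        return 0  # bytes copied

    def blit(self):
        """Copy every byte of the back buffer into the front buffer."""
        self.front[:] = self.back
        return len(self.back)  # bytes copied

chain = SwapChain(1600, 1200)
print(chain.flip())  # 0
print(chain.blit())  # 7680000
```

At 1600x1200 the blit moves about 7.7 MB every frame, all of it bandwidth that does no rendering work.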

Also I believe that ATI already has some kind of back-end tile-like buffer, IIRC this was promoted a bit by Marketing for 8500 ?

K-
Kristof is offline  
Old 15-Oct-2002, 09:05   #11
Simon F
Tea maker
 
Join Date: Feb 2002
Location: In the Island of Sodor, where the steam trains lie
Posts: 4,382
Default

Quote:
Originally Posted by Kristof
... this all sounds trivial but if there are developers out there that can not even get this right, god only knows what will happen if you expect them to do the kind of sorting you suggested.
Bubble sort, perhaps?
__________________
"Your work is both good and original. Unfortunately the part that is good is not original and the part that is original is not good." -(attributed to) Samuel Johnson

"I invented the term Object-Oriented, and I can tell you I did not have C++ in mind." Alan Kay
Simon F is offline  
Old 15-Oct-2002, 09:44   #12
GetStuff
Junior Member
 
Join Date: Jul 2002
Posts: 67
Default

Quote:
Originally Posted by Hellbinder
Quote:
Improving the rendering efficiency provides a route to improving performance that does not necessarily require costly new processes, large numbers of pipelines
Remember I said (publicly) that the Nv30 was a 4x4 architecture that employs several new features instead of more pipelines to gain large amounts of speed... oh about a week ago..

As if it's really hard to come to a conclusion based on all the bits and pieces floating around the internet...
GetStuff is offline  
Old 15-Oct-2002, 09:51   #13
arjan de lumens
Senior Member
 
Join Date: Feb 2002
Location: gjethus, Norway
Posts: 1,256
Default

I seem to remember an old block diagram of ATI's Rage128 chip with an 8 Kbyte framebuffer cache - if that old chip had it, I would find it likely that newer chips also have it, probably more than 8 KBytes as well. Also, AFAIK, most IMRs today already use tiled framebuffers, typically with 8x8 pixel tiles, presumably caching multiple such tiles.
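
For reference, a tiled framebuffer just means the address calculation changes so that each 8x8 block is contiguous in memory. A minimal sketch, assuming 32-bit pixels, row-major tiles and a width that is a multiple of the tile size (actual hardware layouts are not public and may well differ):

```python
def tiled_offset(x, y, width, tile=8, bytes_per_pixel=4):
    """Byte offset of pixel (x, y) when the framebuffer is stored tile by tile.
    width must be a multiple of tile."""
    tiles_per_row = width // tile
    tile_index = (y // tile) * tiles_per_row + (x // tile)
    within_tile = (y % tile) * tile + (x % tile)
    return (tile_index * tile * tile + within_tile) * bytes_per_pixel

print(tiled_offset(7, 7, 1600))  # 252: last pixel of the first tile
print(tiled_offset(8, 0, 1600))  # 256: first pixel of the next tile
```

With this layout one 8x8 tile is a single 256-byte burst, so a small triangle touches a few contiguous blocks instead of many scattered scanline fragments.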

Using bounding boxes on 3d objects to do optimizations on them is entirely possible, but requires extensive support at both API and application level. Rejecting bounding boxes based on hierarchical Z seems to be doable on IMR architectures only - and you need to sort the objects in front-to-back order (which probably precludes them from being sorted in tile order) to see this kind of benefit.

Z3/coverage mask AA methods? Just wondering what the memory usage, performance hit and image quality on these methods are compared to e.g. ATI's multisampling implementation (compressed multisample buffer => fairly small performance hit).

I am not really convinced that there are any really low-hanging fruit left to collect (other than perhaps better texture compression methods) - for now, it looks like compressing the multisample buffer was the last one that didn't require extensive API support.
arjan de lumens is offline  
Old 15-Oct-2002, 10:07   #14
LeStoffer
Senior Member
 
Join Date: Feb 2002
Location: Somewhere not *that* rotten in Denmark
Posts: 1,197
Default

Quote:
Originally Posted by arjan de lumens
I am not really convinced that there are any really low-hanging fruit left to collect (other than perhaps better texture compression methods) - for now, it looks like compressing the multisample buffer was the last one that didn't require extensive API support.
I would think the same, considering that ATI is on its third-generation HyperZ and nVidia on its second-generation LMA.

If you're going for big benefits it would seem that you have to do some kind of sorting of either polygons or pixels into a list instead of just removing some hidden pixels based on a Z-check along the way. And thus the question is: Is there any method where you don't need a full-scale sort?
__________________
Best regards, LeStoffer
LeStoffer is offline  
Old 15-Oct-2002, 10:34   #15
Dave Baumann
Gamerscore Wh...
 
Join Date: Jan 2002
Posts: 12,950
Default

I think SA may be making a point here. If we remember back to when the remains of 3dfx were purchased by NVIDIA, you may remember a number of interviews at the time with NV's CEO, and others, stating that they doubted they would fully adopt the Gigapixel deferred rendering approach, but that there may be ways of marrying some of the benefits of the tiling approach with IMRs. Now, what SA is talking about sounds like one of the possibilities that they were talking about at the time.
__________________
Expand. Accelerate. Dominate.
Tweet Tweet!
Dave Baumann is offline  
Old 15-Oct-2002, 11:25   #16
arjan de lumens
Senior Member
 
Join Date: Feb 2002
Location: gjethus, Norway
Posts: 1,256
Default

I don't quite see how. Either you do immediate-mode rendering, drawing polygons as you receive them, or you do deferred rendering, collecting polygon data for an entire scene before drawing any of it. To me, it would seem that anything between would inherit the disadvantages of both and the advantages of neither.

Sorting objects in near-tile-order gets difficult with objects that are larger than a tile or straddle tile boundaries - it seems to me that at best you get a rather small increase in the framebuffer cache hit rate (this would, in any case, not require changes to modern IMRs)
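
To make the straddling problem concrete, here is a small sketch that lists the tiles a screen-space bounding rectangle overlaps, assuming 32x32 tiles as in SA's example. The function name is made up.

```python
def tiles_touched(x0, y0, x1, y1, tile=32):
    """Tile coordinates overlapped by the inclusive pixel rectangle."""
    return [(tx, ty)
            for ty in range(y0 // tile, y1 // tile + 1)
            for tx in range(x0 // tile, x1 // tile + 1)]

print(len(tiles_touched(4, 4, 20, 20)))    # 1: small object inside one tile
print(len(tiles_touched(28, 28, 36, 36)))  # 4: small object on a tile corner
print(len(tiles_touched(0, 0, 199, 99)))   # 28: a 200x100 object
```

Even a 9x9 pixel object can land in four tiles, and anything of moderate size has no single tile to sort by.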
arjan de lumens is offline  
Old 15-Oct-2002, 15:48   #17
Ailuros
Epsilon plus three
 
Join Date: Feb 2002
Location: Chania
Posts: 7,767
Default

Quote:
Z3/coverage mask AA methods? Just wondering what the memory usage, performance hit and image quality on these methods are compared to e.g. ATI's multisampling implementation (compressed multisample buffer => fairly small performance hit).
Theoretically (or essentially) for "free" if the framebuffer is on chip with far more than just 4 samples and across resolutions. That's at least what I understood last time it was analyzed.

Quote:
Sorting objects in near-tile-order gets difficult with objects that are larger than a tile or straddle tile boundaries - it seems to me that at best you get a rather small increase in the framebuffer cache hit rate.
What if you use varying sizes of tiles, e.g. split up the scene into 2 or 3 parts and then resplit it afterwards? My knowledge of stuff like that is very basic to be honest, but from the little I understood trying to decode the latest PVR patent into layman's terms, there doesn't seem to be a necessity to complete a frame before moving to the next one, on occasions like those described in it.

(Simon correct me please if I'm wrong).

On a sidenote can someone please add some more simple input on possible advantages of Feline algorithms? Last time a patent was posted I got lost even trying to read it *ahem*.
Ailuros is offline  
Old 15-Oct-2002, 16:32   #18
Gollum
Senior Member
 
Join Date: May 2002
Location: germany
Posts: 1,217
Default

arjan de lumens, SA has been carefully hinting that despite some people believing otherwise, there is still headroom left for performance improvement in current and future hardware accelerators, by increasing the rendering pipeline efficiency, which doesn't necessarily mean the way polygons are being fed to the pipeline by IMRs or TBRs IMHO. So why not talk about how this could be achieved and go into where these tweaks might be possible? As an old tech lurker here I was hoping some of the more technically versed people could make some interesting comments to learn from...
Gollum is offline  
Old 15-Oct-2002, 17:23   #19
arjan de lumens
Senior Member
 
Join Date: Feb 2002
Location: gjethus, Norway
Posts: 1,256
Default

I just do not see that there is all that much efficiency headroom left, at least not in IMR architectures running legacy applications. The tiled framebuffer cache seems to have been around for some time, at least since Radeon8500 and almost certainly much longer (voodoo?); Z3 may look better than 4xRGMS, but requires more per-pixel data for non-edge pixels (=problem in IMR, should work fine in TBR); bounding box optimizations are nice, but require application support; how well does the Feline algorithm perform compared to whatever method it is that ATI uses for anisotropic mapping (assuming that it isn't the very same algorithm)?

There seems to be an idea floating around here about an immediate-mode tiler architecture. Such a beast will require applications/games to be written such that they supply data in tile order. OK so far - here is the difficult part: it needs an efficient method for handling objects that straddle tile boundaries.
arjan de lumens is offline  
Old 15-Oct-2002, 17:54   #20
Hyp-X
Irregular
 
Join Date: Feb 2002
Posts: 1,170
Default

My guess: Z-only first pass...

With proper hw support it could be a killer feature. The question is not the number of pixel pipes, but the number of Z-operations possible per cycle, when the pixel pipelines are not used...
Hyp-X is offline  
Old 15-Oct-2002, 18:08   #21
RoOoBo
Member
 
Join Date: Jun 2002
Posts: 305
Default

Quote:
Originally Posted by Hyp-X
My guess: Z-only first pass...

With proper hw support it could be a killer feature. The question is not the number of pixel pipes, but the number of Z-operations possible per cycle, when the pixel pipelines are not used...
For that you would need full vertex shader or T&L for all the scene (and two times) which I hardly can see as efficient.
RoOoBo is offline  
Old 15-Oct-2002, 18:14   #22
arjan de lumens
Senior Member
 
Join Date: Feb 2002
Location: gjethus, Norway
Posts: 1,256
Default

Quote:
Originally Posted by Hyp-X
My guess: Z-only first pass...

With proper hw support it could be a killer feature. The question is not the number of pixel pipes, but the number of Z-operations possible per cycle, when the pixel pipelines are not used...
Makes sense for scenarios with complex multitexturing/pixel shaders and/or high overdraw, when the memory traffic saved for overdrawn pixels (modern renderers are generally smart enough not to texture a pixel that fails Z test) outweighs the additional Z traffic produced in the Z-only pass and the fact that you need to pass geometry twice. Doesn't Doom3 do something like this already? Having dedicated hardware for this task may or may not make sense, depending on whether the standard pixel pipes are already able to saturate the available memory bandwidth with Z-only traffic.
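
A toy model of that trade-off: count how many fragments reach the expensive shading stage with and without a Z-only first pass, for a given submission order. It assumes early Z rejection before shading and ignores bandwidth and geometry cost entirely; the function name is hypothetical.

```python
def shaded_fragments(fragments, z_prepass=False):
    """fragments: list of (pixel_id, depth) in submission order.
    Returns how many fragments run the expensive shading stage."""
    zbuf = {}
    if z_prepass:
        # Pass 1: depth only, nothing is shaded.
        for pixel, depth in fragments:
            if depth < zbuf.get(pixel, float("inf")):
                zbuf[pixel] = depth
        # Pass 2: shade only fragments that match the stored depth.
        return sum(1 for pixel, depth in fragments if depth == zbuf[pixel])
    shaded = 0
    for pixel, depth in fragments:
        if depth < zbuf.get(pixel, float("inf")):  # otherwise rejected early
            zbuf[pixel] = depth
            shaded += 1
    return shaded

back_to_front = [(0, 0.9), (0, 0.5), (0, 0.1)]
print(shaded_fragments(back_to_front))                  # 3
print(shaded_fragments(back_to_front, z_prepass=True))  # 1
```

With a perfectly front-to-back submission both variants shade one fragment per pixel, so the prepass only pays off when the ordering is poor and the shading is expensive.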
arjan de lumens is offline  
Old 15-Oct-2002, 18:47   #23
Hyp-X
Irregular
 
Join Date: Feb 2002
Posts: 1,170
Default

Quote:
Originally Posted by RoOoBo
For that you would need full vertex shader or T&L for all the scene (and two times) which I hardly can see as efficient.
Transform yes, lighting no.
No environment mapping computation, per-pixel lighting precalc, etc.
It's quite a big saving.

Also, games are still not vertex limited (not even UT2003).

They could also increase the vertex processing power to make it possible (note, I said proper hw support.)
Hyp-X is offline  
Old 15-Oct-2002, 22:36   #24
Humus
Crazy coder
 
Join Date: Feb 2002
Location: Stockholm, Sweden
Posts: 3,216
Default

Quote:
Originally Posted by arjan de lumens
I don't quite see how. Either you do immediate-mode rendering, drawing polygons as you receive them, or you do deferred rendering, collecting polygon data for an entire scene before drawing any of it. To me, it would seem that anything between would inherit the disadvantages of both and the advantages of neither.
You could collect a small number of polygons, but not necessarily the whole scene. If the hardware would batch up say 1000 polygons or so and sort them before drawing, you could increase efficiency quite a lot.
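
Roughly what I have in mind, as a software sketch. It assumes opaque polygons tagged with a single depth value; blended polygons would have to keep their submission order. `sorted_batches` is a made-up name.

```python
from itertools import islice

def sorted_batches(polygons, batch_size=1000):
    """Yield polygons in fixed-size batches, each sorted front to back.
    polygons: iterable of (depth, payload) in submission order."""
    it = iter(polygons)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        batch.sort(key=lambda p: p[0])  # nearest first within the batch
        yield batch

polys = [(0.9, "a"), (0.1, "b"), (0.5, "c"), (0.3, "d"), (0.2, "e")]
for batch in sorted_batches(polys, batch_size=2):
    print(batch)
```

Ordering is only fixed up inside each batch, so the buffering stays bounded and the chip never has to hold the whole scene.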
__________________
[ Visit my site ]
I speak for myself and only myself.
Humus is offline  
Old 15-Oct-2002, 22:53   #25
Nagorak
Member
 
Join Date: Jun 2002
Posts: 854
Default

Quote:
Originally Posted by Hyp-X
Quote:
Originally Posted by RoOoBo
For that you would need full vertex shader or T&L for all the scene (and two times) which I hardly can see as efficient.
Transform yes, lighting no.
No environment mapping computation, per-pixel lighting precalc, etc.
It's quite a big saving.

Also, games are still not vertex limited (not even UT2003).

They could also increase the vertex processing power to make it possible (note, I said proper hw support.)
Games may not be vertex limited, but isn't that just because newer hardware contains a ridiculous number of vertex shaders (4 in R300, etc)? Maybe I misunderstand the use of the vertex shaders, but why would both ATi and Nvidia keep adding more if they had no effect on performance?
Nagorak is offline  

 
