Regarding hardware drawing efficiency

Discussion in 'General 3D Technology' started by SA, Oct 15, 2002.

  1. SA

    SA
    Newcomer

    Joined:
    Feb 9, 2002
    Messages:
    100
    Likes Received:
    2
    What frame rate would you achieve at a resolution of 1600x1200, a clock frequency of 325 MHz, and just one pixel pipeline, if you could actually draw one visible pixel per clock?
     
  2. Bigus Dickus

    Regular

    Joined:
    Feb 26, 2002
    Messages:
    943
    Likes Received:
    16
    1600 x 1200 = 1,920,000 pixels on screen = 1,920,000 pixels per frame.

    325,000,000 cycles per second x 1 pixel per cycle = 325,000,000 pixels per second.

    (325,000,000 pix/sec) / (1,920,000 pix/frame) = 169.27 frames/sec
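
    For the record, the same back-of-the-envelope arithmetic as a tiny C program (nothing here is hardware-specific, it just restates the numbers above):

    ```c
    /* Ideal fill-rate-limited frame rate: one visible pixel per clock. */
    #include <stdio.h>

    int main(void)
    {
        const double clock_hz = 325e6;          /* 325 MHz core clock */
        const double pixels_per_clock = 1.0;    /* one pixel pipeline */
        const double pixels_per_frame = 1600.0 * 1200.0;

        double fps = clock_hz * pixels_per_clock / pixels_per_frame;
        printf("theoretical fps: %.2f\n", fps); /* prints ~169.27 */
        return 0;
    }
    ```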
     
  3. KimB

    Legend

    Joined:
    May 28, 2002
    Messages:
    12,902
    Likes Received:
    218
    Location:
    Seattle, WA
    Just don't forget that many modern scenes will apply many textures per pixel, will compute the final color for a pixel in multiple passes, or make use of transparent surfaces.

    What all of this means is that if you took a real game scene from, say, Unreal Tournament 2003, and put it through hardware that was capable of outputting each pixel only once, it might still need many clocks per pixel just to get the processing done.
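
    To put a rough number on that, here is the same calculation with hypothetical per-pixel costs (the layer and pass counts below are made-up illustration figures, not measured UT2003 data):

    ```c
    /* Illustrative only: how per-pixel work eats the ideal fill rate. */
    #include <stdio.h>

    int main(void)
    {
        const double clock_hz = 325e6;
        const double pixels_per_frame = 1600.0 * 1200.0;

        /* Hypothetical scene: 4 texture layers through a single texture
         * unit (4 clocks) plus one extra blended pass (1 more clock). */
        const double clocks_per_pixel = 4.0 + 1.0;

        double fps = clock_hz / (pixels_per_frame * clocks_per_pixel);
        printf("fps at %.0f clocks/pixel: %.2f\n", clocks_per_pixel, fps);
        return 0;                              /* ~33.85 instead of ~169 */
    }
    ```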
     
  4. KnightBreed

    Newcomer

    Joined:
    Feb 7, 2002
    Messages:
    203
    Likes Received:
    0
    OK, point made. What do you suggest? You've been an open proponent of deferred rendering solutions.
     
  5. SA

    SA
    Newcomer

    Joined:
    Feb 9, 2002
    Messages:
    100
    Likes Received:
    2
    The point is that there is still a great deal of inefficiency in today's hardware. Improving the rendering efficiency provides a route to improving performance that does not necessarily require costly new processes, large numbers of pipelines, etc. Not that these aren't great to have, they are. Just that there is also still plenty of low-hanging fruit that can come from improving rendering efficiency.

    As an example, you might simply add 8 KB of frame/depth buffer cache to a standard IMR (about a 32x32 pixel tile's worth), then recommend that developers sort their rendering in roughly tile order, and roughly front to back within a tile region. Older titles that did not do this would still see some benefit from the cache, while developers that took full advantage of it would get tiler-like performance from a standard IMR. Developers that wanted application-driven deferred rendering could still render the scene twice, once without shading (to set the depth buffer) and then again with shading.
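
    A minimal software sketch of how such a cache could behave (my reading of SA's suggestion, not any vendor's actual design; the arithmetic is 32 x 32 pixels x (4-byte color + 4-byte depth) = 8 KB):

    ```c
    /* One resident 32x32 tile of color+depth (8 KB total).  Pixels that
     * hit the resident tile are depth-tested without touching external
     * memory; a tile switch costs a writeback plus a reload, which is
     * why submitting work roughly in tile order pays off. */
    #include <stdint.h>
    #include <string.h>

    enum { W = 1600, H = 1200, TILE = 32 };

    static uint32_t fb_color[W * H];   /* stand-in for external memory */
    static float    fb_depth[W * H];

    typedef struct {
        int      tx, ty, valid;
        uint32_t color[TILE * TILE];   /* 4 KB */
        float    depth[TILE * TILE];   /* 4 KB */
    } TileCache;

    static void tile_writeback(const TileCache *tc)
    {
        for (int r = 0; r < TILE; r++) {
            int fo = (tc->ty * TILE + r) * W + tc->tx * TILE;
            memcpy(&fb_color[fo], &tc->color[r * TILE], TILE * sizeof(uint32_t));
            memcpy(&fb_depth[fo], &tc->depth[r * TILE], TILE * sizeof(float));
        }
    }

    static void tile_load(TileCache *tc, int tx, int ty)
    {
        for (int r = 0; r < TILE; r++) {
            int fo = (ty * TILE + r) * W + tx * TILE;
            memcpy(&tc->color[r * TILE], &fb_color[fo], TILE * sizeof(uint32_t));
            memcpy(&tc->depth[r * TILE], &fb_depth[fo], TILE * sizeof(float));
        }
        tc->tx = tx; tc->ty = ty; tc->valid = 1;
    }

    /* Depth-test and write one fragment through the cache. */
    void shade_pixel(TileCache *tc, int x, int y, float z, uint32_t rgba)
    {
        int tx = x / TILE, ty = y / TILE;
        if (!tc->valid || tc->tx != tx || tc->ty != ty) {
            if (tc->valid) tile_writeback(tc);   /* tile miss */
            tile_load(tc, tx, ty);
        }
        int i = (y % TILE) * TILE + (x % TILE);
        if (z < tc->depth[i]) {                  /* "less" depth test */
            tc->depth[i] = z;
            tc->color[i] = rgba;
        }
    }
    ```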

    Hierarchical z buffering would add even more benefit, especially if the upper levels were cached on the chip. I would recommend up to 5 levels (for quick elimination of large stencil polys, bounding volume occlusion checks, etc.).
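
    A sketch of what testing against such a pyramid could look like, assuming a "less" depth test so that each level stores the maximum (farthest) depth of the screen region it covers:

    ```c
    /* Five-level hierarchical Z: level 0 is the finest, each higher
     * level halves the resolution.  A primitive whose nearest depth is
     * behind the stored farthest depth everywhere is provably hidden. */
    #define LEVELS 5

    typedef struct {
        const float *level[LEVELS];  /* per-level max-depth grids */
        int w0, h0;                  /* dimensions of level 0 */
    } HiZ;

    /* Conservatively test a screen-space rect (level-0 coordinates):
     * returns 1 only if the rect is certainly occluded at depth zmin. */
    int hiz_rect_hidden(const HiZ *hz, int lvl,
                        int x0, int y0, int x1, int y1, float zmin)
    {
        int w = hz->w0 >> lvl, h = hz->h0 >> lvl;
        x0 >>= lvl; y0 >>= lvl; x1 >>= lvl; y1 >>= lvl;
        if (x1 >= w) x1 = w - 1;
        if (y1 >= h) y1 = h - 1;
        for (int y = y0; y <= y1; y++)
            for (int x = x0; x <= x1; x++)
                if (zmin < hz->level[lvl][y * w + x])
                    return 0;        /* may be in front somewhere */
        return 1;                    /* behind the farthest occluder */
    }
    ```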

    Providing for the use of z occlusion culling with bounding volumes would eliminate unnecessary hidden vertex and pixel processing. This becomes an ever increasing issue as triangle rates and scene complexity increase. I think it is important to provide this capability as a standard feature across all 3d hardware vendors and APIs. Z occlusion culling works particularly well with 5 or more levels of hierarchical z to quickly determine the visibility of the bounding volumes.
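
    Building on the hierarchical-z sketch above, a bounding-volume test could then reject a whole object before any of its vertices are submitted. project_aabb() is a hypothetical helper that conservatively projects a world-space box to a screen rect plus its nearest depth:

    ```c
    typedef struct { float min[3], max[3]; } AABB;

    /* Hypothetical helper: fills screen-space bounds for the box and its
     * nearest depth; returns 0 if the box crosses the near plane. */
    int project_aabb(const AABB *box, const float viewproj[16],
                     int *x0, int *y0, int *x1, int *y1, float *zmin);

    int object_occluded(const HiZ *hz, const AABB *box,
                        const float viewproj[16])
    {
        int x0, y0, x1, y1;
        float zmin;
        if (!project_aabb(box, viewproj, &x0, &y0, &x1, &y1, &zmin))
            return 0;                    /* cannot decide: draw it */
        /* Coarsest level first: a handful of reads per object. */
        return hiz_rect_hidden(hz, LEVELS - 1, x0, y0, x1, y1, zmin);
    }
    ```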

    Using more efficient multisampling AA techniques, such as Z3 or another coverage-mask approach with sparse-grid sampling, could provide 16x or even 32x near-stochastic AA with little performance impact. It would correctly handle implicit edges and order-independent transparency sorting to boot.
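
    A very rough sketch of the coverage-mask idea (my reading of the Z3 paper, not an actual hardware design): each pixel keeps a small, fixed budget of fragments, each carrying a color, a depth, and a mask of the sparse sample positions it covers.

    ```c
    #include <stdint.h>

    #define MAX_FRAGS 3      /* Z3's trick: a small fixed fragment budget */

    typedef struct {
        uint32_t color;      /* 0x00RRGGBB */
        float    z;
        uint16_t mask;       /* which of 16 sparse samples are covered */
    } Fragment;

    typedef struct {
        Fragment frag[MAX_FRAGS];
        int      count;
    } AAPixel;

    static int popcount16(uint16_t m)
    {
        int n = 0;
        while (m) { n += m & 1; m >>= 1; }
        return n;
    }

    /* Resolve: weight each fragment's color by the number of sample bits
     * it owns (assumes the masks together cover all 16 samples, and that
     * insertion already settled per-sample visibility by depth). */
    uint32_t resolve(const AAPixel *p)
    {
        unsigned r = 0, g = 0, b = 0;
        for (int i = 0; i < p->count; i++) {
            int w = popcount16(p->frag[i].mask);
            r += w * ((p->frag[i].color >> 16) & 0xff);
            g += w * ((p->frag[i].color >>  8) & 0xff);
            b += w * ( p->frag[i].color        & 0xff);
        }
        return ((r / 16) << 16) | ((g / 16) << 8) | (b / 16);
    }
    ```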

    There are still some improvements both in performance and quality that can be made in anisotropic filtering as well. Some of the ideas in the Feline approach would be useful.
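
    For those who have not read the Feline paper: my loose understanding is that it approximates the elliptical texture footprint with several round (trilinear) probes spaced along the ellipse's major axis. A sketch, with trilinear_sample() as a hypothetical stand-in for an ordinary trilinear fetch:

    ```c
    #include <math.h>

    typedef struct { float r, g, b; } Color;

    Color trilinear_sample(float u, float v, float lod);  /* hypothetical */

    Color feline_sample(float u, float v,
                        float major_du, float major_dv,  /* major-axis span */
                        float minor_len, int probes)
    {
        /* The LOD follows the MINOR axis, so each probe stays sharp. */
        float lod = log2f(minor_len);
        Color acc = {0, 0, 0};
        for (int i = 0; i < probes; i++) {
            float t = (probes > 1) ? (float)i / (probes - 1) - 0.5f : 0.0f;
            Color c = trilinear_sample(u + t * major_du,
                                       v + t * major_dv, lod);
            acc.r += c.r; acc.g += c.g; acc.b += c.b;
        }
        /* Box weighting here for brevity; the paper uses Gaussian weights. */
        acc.r /= probes; acc.g /= probes; acc.b /= probes;
        return acc;
    }
    ```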

    There are, of course, many other possibilities. Improving rendering efficiency has just begun to be tapped and offers all the vendors the opportunity for a great deal of performance improvement in the near term.
     
  6. LeStoffer

    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    1,253
    Likes Received:
    13
    Location:
    Land of the 25% VAT
    Nice and fairly simple, but NV/ATI still have to convince game developers to sort [roughly] front to back and take advantage of LMA and HyperZ.

    Anyway, I had this stupid idea recently about doing the sorting between the vertex and pixel level in a big on-chip Z-check buffer, before any texels are applied to the pixel (i.e. before any pixels are actually rendered). My lame idea was that you only had to keep each "pre-pixel's" Z-value, and thus could build up these pre-pixel data in the buffer and remove all the hidden ones based on their Z-values. When every pre-pixel had either been rejected or accepted in the buffer, you would go on to actually render those pixels.

    But then I realized it doesn't make any bloody sense because you have to store a lot of data to go with each and every pixel that is about to be drawn. :oops:
     
  7. Hellbinder

    Banned

    Joined:
    Feb 8, 2002
    Messages:
    1,444
    Likes Received:
    12
    Remember I said (publicly) that the NV30 was a 4x4 architecture that employs several new features, instead of more pipelines, to gain large amounts of speed... oh, about a week ago. ;)
     
  8. LeStoffer

    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    1,253
    Likes Received:
    13
    Location:
    Land of the 25% VAT
    We remember. :wink: The question, however, is what this "employs several new features" is really about. So what is it gonna be, Hell? :p
     
  9. Randell

    Randell Senior Daddy
    Veteran

    Joined:
    Feb 14, 2002
    Messages:
    1,869
    Likes Received:
    3
    Location:
    London
    hmm, another one of SA's famous hints?

    Z3 AA (which I still don't understand fully, even after having looked at the white paper) sounds like a great implementation.
     
  10. Kristof

    Regular Alpha

    Joined:
    Jan 30, 2002
    Messages:
    733
    Likes Received:
    1
    Location:
    Abbots Langley
    :eek:

    Err... I think that render order is one of the things that the developer should not have to care about... plenty of other things to worry about. We don't want developers worrying about low-level things like optimising per-pixel HSR... this is one of the most basic features of 3D hardware and it should just work efficiently.

    I am sure that NVIDIA and ATI would prefer that developers start following the absolute basic optimisation rules first. Just to give some examples: do a flip rather than a blit from back buffer to front buffer (one is some pointer changes and the other is a full memory copy)... submit more than 2 polygons per draw primitive call... this all sounds trivial, but if there are developers out there that cannot even get this right, god only knows what will happen if you expect them to do the kind of sorting you suggested.
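
    A toy illustration of the flip-versus-blit point (not a real driver path), just to show the asymmetry in work:

    ```c
    #include <stdint.h>
    #include <string.h>

    enum { W = 1600, H = 1200 };

    typedef struct { uint32_t *front, *back; } SwapChain;

    void present_flip(SwapChain *sc)    /* O(1): exchange two pointers */
    {
        uint32_t *t = sc->front;
        sc->front = sc->back;
        sc->back  = t;
    }

    void present_blit(SwapChain *sc)    /* O(W*H): ~7.3 MB copied/frame */
    {
        memcpy(sc->front, sc->back, (size_t)W * H * sizeof(uint32_t));
    }
    ```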

    Also, I believe that ATI already has some kind of back-end tile-like buffer; IIRC this was promoted a bit by marketing for the 8500?

    K-
     
  11. Simon F

    Simon F Tea maker
    Moderator Veteran

    Joined:
    Feb 8, 2002
    Messages:
    4,560
    Likes Received:
    157
    Location:
    In the Island of Sodor, where the steam trains lie
    Bubble sort, perhaps?
     
  12. GetStuff

    Newcomer

    Joined:
    Jul 9, 2002
    Messages:
    67
    Likes Received:
    0

    As if it's really hard to come to a conclusion based on all the bits and pieces floating around the internet... :roll: :lol: :lol:
     
  13. arjan de lumens

    Veteran

    Joined:
    Feb 10, 2002
    Messages:
    1,274
    Likes Received:
    50
    Location:
    gjethus, Norway
    I seem to remember an old block diagram of ATI's Rage128 chip with an 8 Kbyte framebuffer cache - if that old chip had it, I would find it likely that newer chips also have it, probably more than 8 KBytes as well. Also, AFAIK, most IMRs today already use tiled framebuffers, typically with 8x8 pixel tiles, presumably caching multiple such tiles.

    Using bounding boxes on 3d objects to do optimizations on them is entirely possible, but requires extensive support at both API and application level. Rejecting bounding boxes based on hierarchical Z seems to be doable on IMR architectures only - and you need to sort the objects in front-to-back order (which probably precludes them from being sorted in tile order) to see this kind of benefit.

    Z3/coverage mask AA methods? Just wondering what the memory usage, performance hit and image quality on these methods are compared to e.g. ATI's multisampling implementation (compressed multisample buffer => fairly small performance hit).

    I am not really convinced that there is any really low-hanging fruit left to collect (other than perhaps better texture compression methods) - for now, it looks like compressing the multisample buffer was the last one that didn't require extensive API support.
     
  14. LeStoffer

    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    1,253
    Likes Received:
    13
    Location:
    Land of the 25% VAT
    I would think the same, considering that ATI is at their third-generation HyperZ and nVidia at their second LMA.

    If you're going for big benefits, it would seem that you have to do some kind of sorting of either polygons or pixels into a list, instead of just removing some hidden pixels based on Z-checks along the way. And thus the question is: is there any method where you don't need a full-scale sort?
     
  15. Dave Baumann

    Dave Baumann Gamerscore Wh...
    Moderator Legend

    Joined:
    Jan 29, 2002
    Messages:
    14,079
    Likes Received:
    648
    Location:
    O Canada!
    I think SA may be making a point here. If we remember back to when the remains of 3dfx were purchased by NVIDIA, you may recall a number of interviews at the time with NV's CEO, and others, stating that they doubted they would fully adopt the GigaPixel deferred rendering approach, but that there might be ways of marrying some of the benefits of the tiling approach with IMRs. Now, what SA is talking about sounds like one of the possibilities they were referring to at the time.
     
  16. arjan de lumens

    Veteran

    Joined:
    Feb 10, 2002
    Messages:
    1,274
    Likes Received:
    50
    Location:
    gjethus, Norway
    I don't quite see how. Either you do immediate-mode rendering, drawing polygons as you receive them, or you do deferred rendering, collecting polygon data for an entire scene before drawing any of it. To me, it would seem that anything in between would inherit the disadvantages of both and the advantages of neither.

    Sorting objects in near-tile-order gets difficult with objects that are larger than a tile or straddle tile boundaries - it seems to me that at best you get a rather small increase in the framebuffer cache hit rate (and this would, in any case, not require changes to modern IMRs).
     
  17. Ailuros

    Ailuros Epsilon plus three
    Legend Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    9,420
    Likes Received:
    179
    Location:
    Chania
    Theoretically (or essentially) for "free", if the framebuffer is on chip, with far more than just 4 samples, and across resolutions. That's at least what I understood the last time it was analyzed.

    What if you use varying sizes of tiles, e.g. split up the scene into 2 or 3 parts and then re-split it afterwards? My knowledge of stuff like that is very basic, to be honest, but from the little I understood trying to decode the latest PVR patent into layman's terms, there doesn't seem to be a necessity to complete a frame before moving on to the next one, in cases like those described in it.

    (Simon correct me please if I'm wrong).

    On a side note, can someone please add some more simple input on the possible advantages of the Feline algorithm? Last time a patent was posted I got lost even trying to read it. *ahem*
     
  18. Gollum

    Veteran

    Joined:
    May 14, 2002
    Messages:
    1,217
    Likes Received:
    8
    Location:
    germany
    arjan de lumens, SA has been carefully hinting that, despite some people believing otherwise, there is still headroom left for performance improvement in current and future hardware accelerators by increasing rendering pipeline efficiency - which doesn't necessarily mean changing the way polygons are fed to the pipeline by IMRs or TBRs, IMHO. So why not talk about how this could be achieved and go into where these tweaks might be possible? As an old tech lurker here, I was hoping some of the more technically versed people could make some interesting comments to learn from... :)
     
  19. arjan de lumens

    Veteran

    Joined:
    Feb 10, 2002
    Messages:
    1,274
    Likes Received:
    50
    Location:
    gjethus, Norway
    I just do not see that there is all that much efficiency headroom left, at least not in IMR architectures running legacy applications. The tiled framebuffer cache seems to have been around for some time, at least since Radeon8500 and almost certainly much longer (voodoo?); Z3 may look better than 4xRGMS, but requires more per-pixel data for non-edge pixels (=problem in IMR, should work fine in TBR); bounding box optimizations are nice, but require application support; how well does the Feline algorithm perform compared to whatever method it is that ATI uses for anisotropic mapping (assuming that it isn't the very same algorithm)?

    There seems to be an idea floating around here about an immediate-mode tiler architecture. Such a beast will require applications/games to be written such that they supply data in tile order. OK so far - here is the difficult part: it needs an efficient method for handling objects that straddle tile boundaries.
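
    The usual software answer to the straddling problem, sketched below: bin each triangle into every tile its screen-space bounding box overlaps, so nothing is lost at tile borders - at the cost of storing straddlers more than once, which is exactly the overhead in question (error handling omitted):

    ```c
    #include <stdlib.h>

    #define TILE 32

    typedef struct { float x[3], y[3]; } Tri;   /* screen-space vertices */

    typedef struct {
        int *tri_index;     /* triangles touching this tile */
        int  count, cap;
    } TileBin;

    static void bin_push(TileBin *b, int idx)
    {
        if (b->count == b->cap) {
            b->cap = b->cap ? b->cap * 2 : 16;
            b->tri_index = realloc(b->tri_index, b->cap * sizeof(int));
        }
        b->tri_index[b->count++] = idx;
    }

    void bin_triangle(TileBin *bins, int tiles_x, int tiles_y,
                      const Tri *t, int idx)
    {
        float minx = t->x[0], maxx = t->x[0];
        float miny = t->y[0], maxy = t->y[0];
        for (int i = 1; i < 3; i++) {
            if (t->x[i] < minx) minx = t->x[i];
            if (t->x[i] > maxx) maxx = t->x[i];
            if (t->y[i] < miny) miny = t->y[i];
            if (t->y[i] > maxy) maxy = t->y[i];
        }
        int tx0 = (int)minx / TILE, tx1 = (int)maxx / TILE;
        int ty0 = (int)miny / TILE, ty1 = (int)maxy / TILE;
        if (tx0 < 0) tx0 = 0;
        if (ty0 < 0) ty0 = 0;
        if (tx1 >= tiles_x) tx1 = tiles_x - 1;
        if (ty1 >= tiles_y) ty1 = tiles_y - 1;
        for (int ty = ty0; ty <= ty1; ty++)      /* straddlers get one  */
            for (int tx = tx0; tx <= tx1; tx++)  /* entry per tile hit  */
                bin_push(&bins[ty * tiles_x + tx], idx);
    }
    ```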
     
  20. Hyp-X

    Hyp-X Irregular
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    1,170
    Likes Received:
    5
    My guess: Z-only first pass...

    With proper hw support it could be a killer feature. The question is not the number of pixel pipes, but the number of Z-operations possible per cycle, when the pixel pipelines are not used...
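
    On the application side that pass is already expressible today; a minimal sketch in plain OpenGL, with draw_scene() as a hypothetical callback that submits the geometry:

    ```c
    #include <GL/gl.h>

    void draw_scene(void);   /* hypothetical: issues all scene geometry */

    void render_with_z_prepass(void)
    {
        glEnable(GL_DEPTH_TEST);

        /* Pass 1: depth only - no color writes, no textures needed. */
        glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
        glDepthMask(GL_TRUE);
        glDepthFunc(GL_LESS);
        draw_scene();

        /* Pass 2: full shading, only where depth already matches, so
         * each visible pixel runs its expensive shading exactly once. */
        glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
        glDepthMask(GL_FALSE);
        glDepthFunc(GL_EQUAL);
        draw_scene();
    }
    ```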
     