Regarding hardware drawing efficiency

Discussion in 'General 3D Technology' started by SA, Oct 15, 2002.

  1. RoOoBo

    Regular

    Joined:
    Jun 12, 2002
    Messages:
    308
    Likes Received:
    31
    For that you would need to run the full vertex shader or T&L over the whole scene, and do it twice, which I can hardly see as efficient.
     
  2. arjan de lumens

    Veteran

    Joined:
    Feb 10, 2002
    Messages:
    1,274
    Likes Received:
    50
    Location:
    gjethus, Norway
    Makes sense for scenarios with complex multitexturing/pixel shaders and/or high overdraw, when the memory traffic saved for overdrawn pixels (modern renderers are generally smart enough not to texture a pixel that fails Z test) outweighs the additional Z traffic produced in the Z-only pass and the fact that you need to pass geometry twice. Doesn't Doom3 do something like this already? Having dedicated hardware for this task may or may not make sense, depending on whether the standard pixel pipes are already able to saturate the available memory bandwidth with Z-only traffic.
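
    As a rough sketch of that tradeoff (a toy model, not a measurement of any real chip): assume every fragment costs one Z access, that only fragments passing the Z test get shaded, and that without sorting the fragments covering a pixel arrive in random depth order, so the expected number that pass is the harmonic number H(d) for depth complexity d. The function names and cost units are made up for illustration.

```python
def harmonic(depth):
    """H(d) = 1 + 1/2 + ... + 1/d, the expected number of fragments that
    pass the Z test when d fragments arrive in random depth order."""
    return sum(1.0 / k for k in range(1, depth + 1))


def single_pass_cost(depth, shade_cost, z_cost=1.0):
    # Every fragment is Z tested; only those that pass are shaded.
    return depth * z_cost + harmonic(depth) * shade_cost


def z_prepass_cost(depth, shade_cost, z_cost=1.0, geometry_cost=0.0):
    # Pass 1: Z only for every fragment.
    # Pass 2: Z test every fragment again, shade exactly one per pixel.
    # geometry_cost stands in for sending the geometry a second time.
    return 2 * depth * z_cost + shade_cost + geometry_cost


if __name__ == "__main__":
    for depth in (1, 2, 4, 8):
        print(depth, single_pass_cost(depth, 10.0), z_prepass_cost(depth, 10.0))
```

    Under these assumptions, with shading ten times as expensive as a Z access, the Z-only pass loses slightly at depth complexity 1 (12 vs. 11) and wins at depth complexity 4 (18 vs. about 24.8).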
     
  3. Hyp-X

    Hyp-X Irregular
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    1,170
    Likes Received:
    5
    Transform yes, lighting no.
    No environment mapping computation, per-pixel lighting precalc, etc.
    It's quite a big saving.

    Also, games are still not vertex limited (not even UT2003).

    They could also increase the vertex processing power to make it possible (note, I said proper hw support.)
     
  4. Humus

    Humus Crazy coder
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    3,217
    Likes Received:
    77
    Location:
    Stockholm, Sweden
    You could collect a small number of polygons, but not necessarily the whole scene. If the hardware batched up, say, 1000 polygons or so and sorted them before drawing, you could increase efficiency quite a lot.
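
    A minimal software sketch of the idea, assuming triangles are tuples of (x, y, z) vertices already in screen space and that smaller z means closer to the viewer. The function name, the data layout, and sorting by nearest vertex are illustrative choices, not a description of any actual hardware.

```python
from itertools import islice


def batched_front_to_back(triangles, batch_size=1000):
    """Yield triangles in submission order, except that each consecutive
    batch of batch_size triangles is sorted front to back."""
    stream = iter(triangles)
    while True:
        batch = list(islice(stream, batch_size))
        if not batch:
            return
        # Sort key: depth of the nearest vertex (smaller z = closer).
        batch.sort(key=lambda tri: min(vertex[2] for vertex in tri))
        yield from batch


def flat_triangle(z):
    # Helper for the example: a small triangle at constant depth z.
    return ((0.0, 0.0, z), (1.0, 0.0, z), (0.0, 1.0, z))


if __name__ == "__main__":
    scene = [flat_triangle(z) for z in (9.0, 1.0, 5.0, 2.0)]
    print([tri[0][2] for tri in batched_front_to_back(scene, batch_size=2)])
```

    Only the order inside each batch changes, so occlusion between different batches is not improved.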
     
  5. Nagorak

    Regular

    Joined:
    Jun 20, 2002
    Messages:
    854
    Likes Received:
    0
    Games may not be vertex limited, but isn't that just because newer hardware contains a ridiculous number of vertex shaders (4 in R300, etc)? Maybe I misunderstand the use of the vertex shaders, but why would both ATi and Nvidia keep adding more if they had no effect on performance?
     
  6. arjan de lumens

    Veteran

    Joined:
    Feb 10, 2002
    Messages:
    1,274
    Likes Received:
    50
    Location:
    gjethus, Norway
    I can see how polygon batching and sorting could give the benefits of front-to-back rendering within a 3d object - which would give a moderate efficiency increase in an object that partially covers itself (how common is this?) - at the cost of additional memory traffic for sorting and writing and re-reading of T&Led vertices. It's a tradeoff, as far as I can see. For data sets larger than an 'object', further batching & sorting should give results similar to or slightly weaker than what happens when you sort the objects yourself.
     
  7. Hyp-X

    Hyp-X Irregular
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    1,170
    Likes Received:
    5
    They always improve VS performance along with fillrate increase so 4x VS was a logical move for the R9700.
    Only budget cards go in the opposite direction (Gf4MX / Xabre), where they excluded hw VS support because no games really need it NOW.
    High-end cards are meant for the future.
    Whether the IHVs are guessing the future right is another question...

    One of the reasons to increase VS processing power is to make longer shader programs feasible.
     
  8. Hyp-X

    Hyp-X Irregular
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    1,170
    Likes Received:
    5
    Let's see:
    256-bit DDR interface, 32-bit Z-buffer: 16 values can be transferred per clock.
    Assuming 1:4 compression means 64 values per clock.
    That would allow 64 pixels on Z-fail, or 32 on Z-pass.
    Actually more if mem clock > core clock.
    Compare that to the 8 pipelines.
    It's quite far from full utilization...
    That Z-pass could be made really fast!

    Then the normal passes could skip 64 pixels per cycle when occluded...
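
    The arithmetic above can be written out as a small sketch. It assumes a DDR bus moves two transfers per clock, that a Z-fail costs one Z read, and that a Z-pass costs a read plus a write; the function names are made up for illustration.

```python
def z_values_per_clock(bus_bits=256, ddr=True, z_bits=32, compression=4):
    """Z values the memory interface can move per memory clock."""
    bits_per_clock = bus_bits * (2 if ddr else 1)  # DDR: two transfers per clock
    raw_values = bits_per_clock // z_bits          # uncompressed Z values
    return raw_values * compression                # assumed compression ratio


def z_pixels_per_clock(values_per_clock, z_pass):
    """Pixels per clock if Z traffic is the only limit."""
    accesses = 2 if z_pass else 1  # Z-pass: read + write, Z-fail: read only
    return values_per_clock // accesses


if __name__ == "__main__":
    values = z_values_per_clock()
    print(values)                                    # Z values per clock
    print(z_pixels_per_clock(values, z_pass=False))  # Z-fail rate
    print(z_pixels_per_clock(values, z_pass=True))   # Z-pass rate
```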
     
  9. Hellbinder

    Banned

    Joined:
    Feb 8, 2002
    Messages:
    1,444
    Likes Received:
    12
    Sorry to interrupt your very cool and interesting discussion again.. but..

    To my knowledge no one anywhere other than myself has stated anything even close to the Nv30 being a 4x4. Everyone and their brother claims it is an 8x2; that is pretty common knowledge. I did not throw that out there just because I *thought it would be cool*... The wind whispered it to me in a dream... ;)

    The question is.. is the wind right? :lol:
     
  10. Nagorak

    Regular

    Joined:
    Jun 20, 2002
    Messages:
    854
    Likes Received:
    0
    If NV30 is really 4*4 then I have to question why (maybe one of their engineers is into pick-up trucks?). All bandwidth saving, etc excluded it doesn't seem like having 4 TMUs is really going to pay off. On a game like Doom3 with 6 texture layers you'll still need to multipass anyway whether you have 4 TMUs or 1 TMU.

    I think it would be very interesting if Nvidia followed a totally different path than ATi, but it just seems so out of character. In the past, ATi has been the one to try competing with more technically advanced/efficient parts, while Nvidia concentrated more on brute force. Maybe they've traded places now...but just on paper 8*1 seems a lot better than 4*4, at least unless you like hauling heavy loads. ;)
     
  11. 3dcgi

    Veteran Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    2,435
    Likes Received:
    263
    I would think these extra Z-operations are being used for sub-pixels to increase the quality of antialiasing.
     
  12. LittlePenny

    Regular

    Joined:
    Feb 10, 2002
    Messages:
    276
    Likes Received:
    0
    Location:
    Rolla, Missouri, USA
    I have a question about the sorting developers would need to do for the cache thing. Please keep in mind I haven't developed any games myself.

    Would it be possible to develop a generic data structure for primitives? I imagine if an IHV did this, and developers used inheritance to add their own ideas to the mix this would make things more doable.
     
  13. Tagrineth

    Tagrineth SNAKES... ON A PLANE
    Veteran

    Joined:
    Feb 14, 2002
    Messages:
    2,512
    Likes Received:
    9
    Location:
    Sunny (boring) Florida
    4*4 results in 16 TMU's, 8*1 results in 8.

    In theory, a quad-textured game (Serious Sam) would run around twice as fast on the 4*4 at an equal clock speed... (full four pixels with four texels per cycle, versus two pixels with four texels on the 8*1)
     
  14. KimB

    Legend

    Joined:
    May 28, 2002
    Messages:
    12,902
    Likes Received:
    218
    Location:
    Seattle, WA
    The only thing is, with a 4x4 pipeline, the performance wouldn't be as high as with an 8x1 pipeline with a random number of textures applied.

    That is, consider this (pixels per clock):

    0 textures: 4 vs. 8
    1 texture : 4 vs. 8
    2 textures: 4 vs. 4
    3 textures: 4 vs. 2.66
    4 textures: 4 vs. 2
    5 textures: 2 vs. 1.6
    6 textures: 2 vs. 1.33
    7 textures: 2 vs. 1.14
    8 textures: 2 vs. 1

    If you assume that each of these is equally likely, then you get the following:

    4x4 pipeline: 3.1 pixels per clock on average
    8x1 pipeline: 3.3 pixels per clock on average

    The crucial difference here is the inclusion of the "0 textures per pixel" row, which will be important for DOOM3. Without that row, the average over 1 to 8 textures would favor the 4x4 instead (3.0 vs. about 2.7 pixels per clock). I think that ever since JC announced how he was going to do shadows in that game, it has changed how 3D chip designers think about high performance. Starting with DOOM3, it will be quite a bit better to have more pixel pipelines with fewer textures per pipeline than fewer pipelines with more textures possible per clock. With a game like DOOM3, it will be even better to have a 16x0/8x1 pipeline configuration, where 16 pixels per clock are possible if no textures are applied.
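
    The table can be reproduced with a short sketch. It assumes a pipe needs one loop per group of up to tmus_per_pipe textures, that an untextured pixel still takes one loop, and that nothing else (bandwidth, shader length) limits throughput. The function names are made up for illustration.

```python
from math import ceil


def pixels_per_clock(pipes, tmus_per_pipe, textures):
    """Peak pixel rate for a pipes x tmus_per_pipe design."""
    loops = max(1, ceil(textures / tmus_per_pipe))  # 0 textures still costs one loop
    return pipes / loops


def average_rate(pipes, tmus_per_pipe, texture_counts):
    """Plain average of the pixel rate over the given texture counts."""
    rates = [pixels_per_clock(pipes, tmus_per_pipe, t) for t in texture_counts]
    return sum(rates) / len(rates)


if __name__ == "__main__":
    for textures in range(0, 9):
        print(textures, pixels_per_clock(4, 4, textures), pixels_per_clock(8, 1, textures))
    print(average_rate(4, 4, range(0, 9)), average_rate(8, 1, range(0, 9)))
```

    In this model the averages over 0 to 8 textures come out at about 3.1 and 3.3, and leaving out the 0-texture row gives 3.0 for the 4x4 against roughly 2.72 for the 8x1.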
     
  15. Randell

    Randell Senior Daddy
    Veteran

    Joined:
    Feb 14, 2002
    Messages:
    1,869
    Likes Received:
    3
    Location:
    London
    Isn't it also to do with the fact that the 9000 & 9700 can do more with 1 TMU than older architectures, so it's not an apples-to-apples comparison? I expect the same from all future hardware.
     
  16. arjan de lumens

    Veteran

    Joined:
    Feb 10, 2002
    Messages:
    1,274
    Likes Received:
    50
    Location:
    gjethus, Norway
    ??? ?? ? ???

    This only makes sense if the pipelines spend the same amount of drawing time on each texture count. If you instead assume that the number of pixels with a given texture count is the same for each texture count, this computation is bogus and it should be redone with a clocks per pixel metric instead.

    To draw an analogy: Suppose you drive a car at 15 mph half of the time and 45 mph the other half of the time. In that case, your average speed is 30 mph. If you instead drive half the distance at 15 mph and the other half at 45 mph, your average speed is only 22.5 mph - because you spend much more time running at 15 mph than at 45 mph. Similarly, you end up spending a lot more time rendering high-texture-count pixels than low-texture-count ones.

    Clocks per pixel: (4x4 vs 8x1)
    0 tex: 1/4 vs 1/8
    1 tex: 1/4 vs 1/8
    2 tex: 1/4 vs 2/8
    3 tex: 1/4 vs 3/8
    4 tex: 1/4 vs 4/8
    5 tex: 1/2 vs 5/8
    6 tex: 1/2 vs 6/8
    7 tex: 1/2 vs 7/8
    8 tex: 1/2 vs 1

    Average 4x4: 0.361 clocks per pixel (2.77 pixels per clock)
    Average 8x1: 0.514 clocks per pixel (1.94 pixels per clock)

    Although I suspect that in real life, the numbers are severely skewed towards the lower texture counts.
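
    The clocks-per-pixel averages can be checked with a short sketch that also takes a weighting, so a skewed mix can be tried. It assumes one loop per group of up to tmus_per_pipe textures and at least one loop per pixel. The function names and the skewed mix in the example are hypothetical.

```python
from math import ceil


def clocks_per_pixel(pipes, tmus_per_pipe, textures):
    """Clocks spent per pixel on a pipes x tmus_per_pipe design."""
    loops = max(1, ceil(textures / tmus_per_pipe))
    return loops / pipes


def average_clocks_per_pixel(pipes, tmus_per_pipe, pixel_share):
    """Average weighted by how many pixels use each texture count.

    pixel_share maps a texture count to its relative share of pixels.
    """
    total = sum(pixel_share.values())
    weighted = sum(
        share * clocks_per_pixel(pipes, tmus_per_pipe, textures)
        for textures, share in pixel_share.items()
    )
    return weighted / total


if __name__ == "__main__":
    uniform = {textures: 1 for textures in range(0, 9)}
    skewed = {0: 4, 1: 3, 2: 2, 4: 1}  # made-up mix favoring low counts
    for name, mix in (("uniform", uniform), ("skewed", skewed)):
        print(name, average_clocks_per_pixel(4, 4, mix), average_clocks_per_pixel(8, 1, mix))
```

    With the uniform mix this gives the 0.361 and 0.514 figures; with the made-up skewed mix the 8x1 comes out ahead (0.1875 vs. 0.25 clocks per pixel).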

    edit: fleshed out the car analogy a little.
     
  17. Hellbinder

    Banned

    Joined:
    Feb 8, 2002
    Messages:
    1,444
    Likes Received:
    12
    chalnoth..

    For a while now the TBR rumors surrounding the Nv30 have been cropping up, with Gigapixel technology getting thrown around, etc. Now, none of us actually thinks that the Nv30 is a TBR... well... what if it just borrows ideas born in TBR? As SA points out, there are several methods that could be used to dramatically increase the efficiency of the pipeline.

    Thus a 4x4 design may not be hindered as much as it seems.
     
  18. Ailuros

    Ailuros Epsilon plus three
    Legend Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    9,420
    Likes Received:
    179
    Location:
    Chania
    Sounds more like an oxymoron to me. There's even more reason to increase the number of pipelines when those are more efficient, than to increase the number of TMU's.

    PS: Ever wondered why Tilers (since you brought it up) have never needed more than one TMU per pipe up until now?
     
  19. Nagorak

    Regular

    Joined:
    Jun 20, 2002
    Messages:
    854
    Likes Received:
    0
    That's great, but name a single quad textured game out there or that's going to be out there. Doom 3 is going to have 6 textures per pass and force a loop back even on a 4*4, so it loses its major advantage very quickly while sacrificing additional pipes which do much more for performance.

    And don't forget those extra TMUs can easily just go to waste in a lot of situations. See original Radeon for details.
     
  20. LeStoffer

    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    1,253
    Likes Received:
    13
    Location:
    Land of the 25% VAT
    I should just add that John Carmack made a note of this during his interview here at beyond3d:

    This is part of why I think that the 8 x 1 architecture on the R 9700 is a very nice choice. I would guess that it's easier to tune your memory logic this way than for, let's say, 4 pipelines with multiple TMUs, where you don't know how many of those TMUs will be idle some of the time.
    My point is that with 8 x 1 your performance hit from having to apply and fetch more texels should be fairly linear.
     