The New and Improved "G80 Rumours Thread" *DailyTech specs at #802*

Discussion in 'Pre-release GPU Speculation' started by Geo, Sep 11, 2006.

Thread Status:
Not open for further replies.
  1. Dio

    Dio
    Veteran

    Joined:
    Jul 1, 2002
    Messages:
    1,758
    Likes Received:
    8
    Location:
    UK
    Considering the ALU:texture ratio, the cost of a trilinear aniso fetch averages under 2 clock cycles over a typical scene. That's not to say it's not a bottleneck in some areas (as there is locality of reference as to when you need trilinear and aniso) but the optimum is nothing like as high as 9:1 on 580.

    Latency is a separate issue. Either there is enough latency hiding or there isn't, and that calculation is much more complex than just an ALU:texture ratio and it's resistant to being predictable in real-world situations. Adding trilinear and aniso doesn't actually affect it much; it could even ease latency pressure, because reducing throughput increases effective latency hiding.
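Dio's averaging argument can be sanity-checked with a toy weighted-cost model. The fetch mix and per-type cycle counts below are invented purely for illustration; only the "averages under 2 clock cycles" conclusion comes from the post:

```python
# Toy model: average texture fetch cost over a frame, weighted by how
# often each filter type is actually used. All numbers are illustrative.
def average_fetch_cost(mix):
    """mix: list of (fraction_of_fetches, cycles_per_fetch) tuples."""
    assert abs(sum(f for f, _ in mix) - 1.0) < 1e-9
    return sum(f * c for f, c in mix)

# Hypothetical scene: most fetches are cheap bilinear; only a minority
# need full trilinear aniso at a high per-fetch cycle cost.
scene = [(0.70, 1), (0.20, 2), (0.10, 8)]  # bilinear, trilinear, tri-aniso
print(average_fetch_cost(scene))  # 1.9 — under 2 cycles on average
```

The point of the model is that locality of reference concentrates the expensive fetches in a minority of the frame, so the frame-wide average stays low even when some areas are texture-bound.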
     
  2. Razor1

    Veteran

    Joined:
    Jul 24, 2004
    Messages:
    4,232
    Likes Received:
    749
    Location:
    NY, NY
  3. MulciberXP

    Regular

    Joined:
    Oct 7, 2005
    Messages:
    331
    Likes Received:
    7
  4. INKster

    Veteran

    Joined:
    Apr 30, 2006
    Messages:
    2,110
    Likes Received:
    30
    Location:
    Io, lava pit number 12
The last page of that review seems to imply that Nvidia is currently using a non-optimized driver for G80, perhaps with a fixed shader ratio, in order to guarantee compatibility at launch with a broad set of games, and that in the future, once the focus shifts to load balancing at the driver level, we can expect significant performance gains in many games.

     
    #2604 INKster, Nov 7, 2006
    Last edited by a moderator: Nov 7, 2006
  5. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,400
    Likes Received:
    440
    Location:
    San Francisco
Given the leaked texturing rate, I'd say the answer is no:
    8 (SP clusters) x 8 TF x 0.575 GHz = 36.8 GSamples/s
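nAo's arithmetic is straightforward to reproduce; the cluster count, TF units per cluster, and clock are the leaked figures quoted in the post:

```python
# Reproducing nAo's bilinear texturing-rate arithmetic from the leaked specs.
clusters = 8          # SP clusters
tf_per_cluster = 8    # texture filtering units per cluster
core_clock_ghz = 0.575

gsamples_per_s = clusters * tf_per_cluster * core_clock_ghz
print(gsamples_per_s)  # 36.8 GSamples/s
```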
     
  6. demonic

    Regular

    Joined:
    Nov 1, 2002
    Messages:
    321
    Likes Received:
    5
    Location:
    London
Seriously, when will this actually be doable in games? For example, HL2 is quite good, but I'm dying to see realism like this for an entire game, not a tech demo.

DX11, DX12? An equivalent engine of Doom 4, UE5, Source 2, CryEngine 2? I'd mention the team who did Serious Sam lol, but I wasn't impressed with their toy engine...

When, dammit! :razz:
     
  7. allnighter

    Newcomer

    Joined:
    Aug 2, 2003
    Messages:
    14
    Likes Received:
    0
'Scuse my ignorance and laziness, but when is the official NDA lift-off again?
     
  8. Geo

    Geo Mostly Harmless
    Legend

    Joined:
    Apr 22, 2002
    Messages:
    9,116
    Likes Received:
    215
    Location:
    Uffda-land
To "guarantee compatibility"? Umm... what? How would that guarantee compatibility? What would break in older games with dynamic allocation in a hypothetical USA arch?
     
  9. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,400
    Likes Received:
    440
    Location:
    San Francisco
    It would make games too fast, so they found a way to slow them down a bit :)
     
  10. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,058
    Likes Received:
    3,116
    Location:
    New York
    "Guaranteeing performance since our dynamic balancing sucks" just doesn't sound as good :grin:
     
  11. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Likes Received:
    87
Don't you think R580's pixel shaders have something to do with its victory over G71? I seriously doubt R520 will also beat an HQ-mode G71 in those games (but you're welcome to prove me wrong with some data). While I recognize that G71 skimps on texture quality, I'm not convinced that HQ on NVidia is fair either in terms of evaluating the texturing system. I think it completely disables brilinear, whereas ATI is still using it (though to a lesser degree than NVidia's default).

Rendering time is lost during stenciling, but in the actual lighting passes you are no more bandwidth limited in D3 than before. You still have 76 bytes per clock. In fact, you may be less BW limited, because a shadow might give the MC a break while the rasterizer skips pixels.
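The "76 bytes per clock" figure can be checked from published numbers. The specs below are an assumption on my part (X1900XTX-class: 256-bit bus, 1.55 GHz effective GDDR3, 650 MHz core), chosen because they reproduce the figure in the post:

```python
# Back-of-envelope check of the "76 bytes per clock" figure, assuming
# X1900XTX-class specs: 256-bit memory bus at 1.55 GHz effective data
# rate, 650 MHz core clock. These specs are assumed, not from the post.
bus_bits = 256
mem_clock_eff_hz = 1.55e9
core_clock_hz = 650e6

bandwidth_bytes_per_s = bus_bits / 8 * mem_clock_eff_hz  # 49.6 GB/s
bytes_per_core_clock = bandwidth_bytes_per_s / core_clock_hz
print(round(bytes_per_core_clock, 1))  # ~76.3
```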

Yes, but also because their architecture increased cost by only 20% to improve ALU perf 3x. There is no such thing as a fixed ALU:TEX ratio. Some parts of the screen will have higher ratios than others, so the R520->R580 improvement changes for different parts of the screen. Even if a game averages 1.5:1, hardware optimized for a 2:1 ratio could easily perform worse than 3:1.
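That per-region variance argument can be sketched with a toy model. The workload numbers are made up, and TEX rate is held equal across the two configurations (extra ALUs being cheap, per the R520->R580 cost point above); only the 1.5:1 / 2:1 / 3:1 framing comes from the post:

```python
# Toy model: per-region ALU:TEX ratios vary across the screen, so hardware
# with more ALU than the scene-average ratio suggests can still win.
def frame_time(regions, alu_rate, tex_rate):
    # Each region is bound by whichever unit type it saturates.
    return sum(max(alu / alu_rate, tex / tex_rate) for alu, tex in regions)

# Two screen regions: one TEX-heavy (0.5:1), one ALU-heavy (2.5:1).
regions = [(10, 20), (50, 20)]  # (ALU ops, TEX ops) per region
avg_ratio = sum(a for a, _ in regions) / sum(t for _, t in regions)
print(avg_ratio)  # 1.5 — the scene "averages" 1.5:1

print(frame_time(regions, alu_rate=2, tex_rate=1))  # 2:1 hardware: 45.0
print(frame_time(regions, alu_rate=3, tex_rate=1))  # 3:1 hardware: 40.0
```

Despite the 1.5:1 scene average, the 2:1 configuration goes ALU-bound in the ALU-heavy region and loses to the 3:1 configuration.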

    This is wrong, and this is what I have a beef with.

    If there was a R590 that had 32 texture units and 48 pixel shaders but the same bandwidth as R580, it would undoubtedly perform a lot faster. Not twice as fast, but faster for sure. A 7800GT has more texturing perf than NV40, even though it has less bandwidth, and the same is true (by a huge margin) for 7600GS vs. X1600XT.

    Not really. R300-R480 can reach their peak texturing rate in any multitexturing test, where you have a 1:1 ratio. But you're right that under gaming conditions sometimes the latency can't be hidden completely, and AF/trilinear allow more math ops to be used for free. There's also scalar co-issue.

I've noticed you say this before, but I think you're making too big a deal out of out-of-order threading. Many of the gains from R480->R520 were simply due to the new memory controller. In lots of shader tests and games it was merely helping ATI catch up to NV4x in perf per pipe per clock, not really surpass it. The other thing that helped ATI's quality was decoupling the texture units.

I think what hurts G7x right now is that it can't hide multi-cycle texture accesses with math, because the TMUs are effectively inline with the ALUs. At least that's what I'm assuming from the GPUBench results. That's one of the reasons NVidia is so aggressive with trilinear and AF optimizations, as these multi-cycle operations stall the pipeline.
    EDIT: Yup, this confirms it: http://www.pconline.com.cn/images/h...ic/D00018NN.JPG&namecode=diy&subnamecode=home
     
    #2611 Mintmaster, Nov 7, 2006
    Last edited by a moderator: Nov 7, 2006
  12. Geo

    Geo Mostly Harmless
    Legend

    Joined:
    Apr 22, 2002
    Messages:
    9,116
    Likes Received:
    215
    Location:
    Uffda-land
Btw, just as a forum management announcement, this thread will be locked at 11am PST/2pm ET/7pm GMT tomorrow.
     
  13. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,058
    Likes Received:
    3,116
    Location:
    New York
    But isn't that just half the picture? Even with decoupled texture units would G7x have had enough math available to keep its ALU's busy in the absence of out-of-order threading?
     
  14. LeStoffer

    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    1,262
    Likes Received:
    22
    Location:
    Land of the 25% VAT
    :yep2:
     
  15. INKster

    Veteran

    Joined:
    Apr 30, 2006
    Messages:
    2,110
    Likes Received:
    30
    Location:
    Io, lava pit number 12
Then why are many Xbox 360 games also doing the same right now?
Because they were probably developed when USA hardware was not yet widespread enough on the market, and most are just conversions of existing engines.
We don't know how USA will work in DX10, because Vista isn't due until Jan/Feb 2007 and there are no DX10 games out either, now do we?

    I really don't care about USA right now.
    The performance and quality already demonstrated in these tests is proof enough that it works, and it works well. Everything else about the future is speculation at this point. :smile:
     
  16. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,714
    Likes Received:
    2,135
    Location:
    London
    Hmm, seems to have a distinct lack of subsurface scattering.

This still looks computer generated - apart from the detail, which is fantastic (though the hair isn't), this is far short of what I thought NVidia was shooting for. Oh well.

    Jawed
     
  17. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,714
    Likes Received:
    2,135
    Location:
    London
  18. silent_guy

    Veteran Subscriber

    Joined:
    Mar 7, 2006
    Messages:
    3,754
    Likes Received:
    1,382
Sorry for cherry-picking here, but why is that so? I've never understood the technical reasons behind this. Different SDRAM types require increasingly complex MCs, but what was so bad about previous MCs that was improved in R520? (And how does it relate to the MC quality of its competitor?)

    Are there benchmarks that show this to be the case? (Not contesting your argument, just wondering.)
     
  19. Bouncing Zabaglione Bros.

    Legend

    Joined:
    Jun 24, 2003
    Messages:
    6,363
    Likes Received:
    83

Exactly my first thought - no subsurface scattering. Although she looks great, her skin looks flat, lacking the luminescence you expect from beautiful skin. Oh well, it's probably an artistic decision or something.

It's a pity we don't see much of her hair, as that has always traditionally been very hard to get looking realistic.
     
    #2619 Bouncing Zabaglione Bros., Nov 7, 2006
    Last edited by a moderator: Nov 7, 2006
  20. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,714
    Likes Received:
    2,135
    Location:
    London
    You should look at the review's 4xAA/16xAF results. Particularly the Far Cry results, which scale perfectly with bandwidth, based upon X1950XTX (60GB/s versus 86.4GB/s). Despite the fact G80 has 3.5x the TMU capability. Bandwidth rules here.
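"Scale perfectly with bandwidth" implies the Far Cry frame rates should differ by the bandwidth ratio alone. The figures below are the ones quoted in the post, not official spec-sheet numbers:

```python
# Bandwidth ratio behind the "scales perfectly" observation, using the
# post's quoted figures (not official spec-sheet numbers).
x1950xtx_bw_gbps = 60.0
g80_bw_gbps = 86.4

expected_scaling = g80_bw_gbps / x1950xtx_bw_gbps
print(round(expected_scaling, 2))  # 1.44
```

So if bandwidth really is the limiter at 4xAA/16xAF, G80's lead should sit around 44% regardless of its 3.5x TMU advantage.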

    Now, in future algorithms where (unfiltered?) fp16 or fp32 textures are used, e.g. for intermediate results (deferred shading), things may be somewhat different. I dunno.

    Insanely high performance with AA and AF turned off doesn't tell us much, though. We don't want to play games like that.

Sadly, the one thing that diagram doesn't show is what happens when there isn't enough ALU code to hide the TEX latency, i.e. what happens when the "A" needs to be done while the TEX filtering pipe is busy with the previous TEX operation.

So far as I can see, G80's "excess" TMU capability is running aground on standard game texturing, being entirely bandwidth-bound at high IQ. Some of that "excess" is arguably there to support constant buffers, for example (a guess). So it's hard to judge the degree of excess, since that stuff only gets stretched under D3D10. But as pure Int8 TMU pipelines, there are wildly too many.

    Which is why I think G80 is designed for full-speed fp16 texturing and presumably half-speed fp32 texturing.

    Jawed
     