The New and Improved "G80 Rumours Thread" *DailyTech specs at #802*

Discussion in 'Pre-release GPU Speculation' started by Geo, Sep 11, 2006.

Thread Status:
Not open for further replies.
  1. Anteru

    Newcomer

    Joined:
    Jul 4, 2004
    Messages:
    114
    Likes Received:
    3
    True, but for such a radical change twice the performance of the previous generation is an even greater leap forward. Most of the time, they've been trying to squeeze the most out of a given architecture just to stay competitive, now they do something completely new.

    I wonder how long they've been working on the G80, this one seems so dramatically different to the N4x line that it must have been in development since years ... anyway, if AMD/ATI is "just" focusing on a R580 on steroids, they'll have a hard time. And I assume AMD won't be willing to sink huge amounts of money into the high-end 3d products to catch up with nVidia but rather invest that money into "Fusion".

    @physics: I also think that nVidia doesn't push marketing $$$ into physics if they don't have some very good project to showcase it.
     
  2. dizietsma

    Banned

    Joined:
    Mar 1, 2004
    Messages:
    1,172
    Likes Received:
    13
    yeah, that's such a drag .....
     
  3. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Likes Received:
    87
    I think you're overestimating the bandwidth needed for texturing. Anywhere you have magnification (which is prevalent at high resolutions), AF, and/or texture compression, bandwidth usage by texturing is very low. Just because R580 keeps the texture units saturated doesn't mean that the texture units are saturating the memory bus. In fact, it's quite the opposite. 16 TMUs at 95% efficiency aren't as good as 24 at 75%. That's why G71 beats R580 substantially in a few ShaderMark and RightMark tests. Moreover, even if having more texture units decreased their efficiency, it would still increase bandwidth utilization.
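
    A back-of-the-envelope sketch of the efficiency argument above. The unit counts and efficiency figures come from the post itself; the 650 MHz clocks are my approximations for R580 and G71, and "one bilinear texel per TMU per clock" is an assumption:

```python
# Effective texel throughput = units * clock * efficiency.
# Numbers are illustrative, not vendor data.
def effective_texel_rate(tmus, clock_mhz, efficiency):
    """Mtexels/s, assuming one bilinear texel per TMU per clock."""
    return tmus * clock_mhz * efficiency

r580 = effective_texel_rate(16, 650, 0.95)  # 9880.0 Mtexels/s
g71 = effective_texel_rate(24, 650, 0.75)   # 11700.0 Mtexels/s

# Fewer units at higher efficiency still lose on raw throughput.
assert g71 > r580
```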

    Stencil shadowing consumes next to zero bandwidth, especially if you're clever about how you compress the Z-buffer. Moreover, it doesn't consume any bandwidth while you're actually doing texturing.

    ATI didn't increase the shader:texture ratio because they were bandwidth limited, they did it because the workload demanded it, especially in more recent titles.
     
  4. Titanio

    Legend

    Joined:
    Dec 1, 2004
    Messages:
    5,670
    Likes Received:
    51
    I have a question that's probably a little OT, but if I have some shader code that's using an R-only render target (that is, it computes just one colour component, not RGBA), is there any value in making it SIMD, so to speak, if I'm going to be running it on this kind of architecture? By not computing a full RGBA value I figured I was losing a lot of performance, but is that relevant here with scalar processors?
     
  5. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Likes Received:
    87
    Or it could be 64 @ 575MHz. The way they drew it suggests maybe 8 bilinear samplers per TCP, but grouped in pairs? This would jibe with the free-trilinear theory, but it seems a bit odd to me. Why add all the logic and buses for filtering and memory access but skimp on the address calc? Both ATI and NV went to great lengths to minimize the number of samples needed for trilinear and anisotropic filtering.
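
    For context on why paired bilinear samplers would make trilinear "free": trilinear filtering is just a linear blend of bilinear fetches from two adjacent mip levels, so hardware with two bilinear units behind one address could produce it in a single pass. A minimal sketch; `bilinear_sample` is a hypothetical stand-in, not real hardware behaviour:

```python
import math

def lerp(a, b, t):
    return a + (b - a) * t

def trilinear(bilinear_sample, u, v, lod):
    """Blend two bilinear fetches from neighbouring mip levels."""
    lo = math.floor(lod)
    frac = lod - lo
    # One bilinear fetch per mip level, blended by the fractional LOD.
    return lerp(bilinear_sample(u, v, lo),
                bilinear_sample(u, v, lo + 1),
                frac)
```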
     
  6. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,400
    Likes Received:
    440
    Location:
    San Francisco
    Because for the foreseeable future we don't need an improved texturing ratio over the previous generation; what we need is more quality, not quantity, imho.
     
  7. dnavas

    Regular

    Joined:
    Apr 12, 2004
    Messages:
    375
    Likes Received:
    7
    I think you skimp on the address calc because the texture units are aimed at better IQ; or, if you prefer, because they would otherwise be underused/wasted. If you wanted just plain more texturing, that would have implications for L2 cache size (and L1? -- depending on decompression, I suppose) and bandwidth. As a guess, anyway....

    edit: yeah, what nAo said :)
     
  8. Bob

    Bob
    Regular

    Joined:
    Apr 22, 2004
    Messages:
    424
    Likes Received:
    47
    If you did that, you'd oversample the footprint (massively). You really want that sqrt() in there.
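
    A hedged reconstruction of the sqrt() point: the footprint axes come from the texture-coordinate derivatives, and their lengths need a square root. Working with squared lengths squares the anisotropy ratio and so massively over-estimates the number of samples. This is my sketch of the idea, not any IHV's actual algorithm:

```python
import math

def aniso_samples(dudx, dvdx, dudy, dvdy, max_aniso=16):
    """Estimate aniso sample count from the footprint axis lengths."""
    major = math.sqrt(dudx**2 + dvdx**2)  # the sqrt() in question
    minor = math.sqrt(dudy**2 + dvdy**2)
    if minor > major:
        major, minor = minor, major
    ratio = major / max(minor, 1e-6)
    return min(math.ceil(ratio), max_aniso)

# A footprint 8x as long as it is wide needs ~8 samples; the squared
# ratio would have asked for 64 -- massive oversampling.
```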
     
    Geo likes this.
  9. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    It's why ATI designed R580 to run at max image quality in real games, not synthetics :roll: G71's 24 TMUs are roundly embarrassed in games when it's set to produce nearly as good texture filtering.

    No, but texturing bandwidth is lost, forever, while stencil shadows are being created. If you turned off stencil shadowing in D3 then texturing performance would look different. I've never seen a comparison of R580 and R520 with stencil shadowing off, though.

    Barely. FEAR is the most ALU-limited of any game out there and the performance delta between R580 and R520 belies the ALU:TEX ratio.

    In other words, ATI did that in expectation of the way games are going - and their long-standing recommendations on the ALU:TEX ratio. And they did that because with a given bandwidth there's no point in increasing TMU capacity. Particularly when increasing the ALU capacity also increases the utilisation of the TMUs, hence overall TMU performance.

    R5xx's texturing architecture is dependent on out of order threading. I'm sure there are corner cases where R520's texturing is as fast as R580's - but game tests show that R580 gets more texturing performance.

    R300-R480 is designed for 3:1 code - ATI has been pushing that concept for a few years now, before R520 appeared. R300's asynchronous texturing (but not out of order threading) means that bilinear texturing latency can only be fully hidden if there's enough non-dependent ALU instructions.

    The effective ratio is more like 9:1 with R580.
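
    The nominal ratios being discussed can be tallied from the public unit counts (approximate figures, my own summary; the "effective" 9:1 claim above is the poster's estimate and isn't derived here):

```python
# Public-spec unit counts: R520 pairs 16 pixel-shader ALUs with 16 TMUs,
# R580 triples the ALUs to 48 while keeping 16 TMUs.
configs = {
    "R520": {"alu": 16, "tmu": 16},
    "R580": {"alu": 48, "tmu": 16},
}

for name, c in configs.items():
    print(name, "nominal ALU:TEX =", c["alu"] / c["tmu"])
# R520 nominal ALU:TEX = 1.0
# R580 nominal ALU:TEX = 3.0
```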

    We should be seeing similar gains in TMU pipe utilisation with G80 due to out-of-order threading. What I wonder, though, is how much of that gain is hampered by running pixel shader threads in order, which happens when the PIOR flag (which needs to be set to allow out-of-order threading) is reset. What proportion of pixel shading will be done with PIOR reset?

    My gut feeling is that NVidia designed G80 for single cycle fp16 texturing. What use is there in that? I think this subject deserves its own thread...

    Jawed
     
  10. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Likes Received:
    87
    Doing two bilinear samples requires more bandwidth and L2 cache size regardless of whether they're used for any two samples or specifically for free trilinear / volume textures / 64-bit textures. That's my point. They already doubled almost everything: buses, filtering, cache, read ports, parts of the memory controller, and who knows what else. All that's left is the address calc and getting the results to the stream processor, and though that may not be trivial, it seems like a fraction of the work that all the other stuff involves.

    I don't think it would improve the ratio. If the Archmark tests are correct, bilinear perf is about 10% faster, but other tests show ALU perf is 2-2.5x better. And often quantity does in fact give better quality. Imagine free detail textures, double the PRT coefficients, double the shadow map samples, etc.

    Anyway, I'm still not entirely convinced that bilinear perf didn't double. Like I said before, in shading tasks where R580 failed to improve much on R520, G80 has no trouble improving on G71, and often those shaders seem to be using ordinary tex2D() instructions.
     
  11. HAL

    HAL
    Newcomer

    Joined:
    Nov 12, 2005
    Messages:
    103
    Likes Received:
    2
    [image attachments removed]
    :wink:
     
    #2591 HAL, Nov 7, 2006
    Last edited by a moderator: Nov 7, 2006
  12. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Likes Received:
    87
    I know that, I was just talking strictly about yes/no comparisons (which you'd notice if you didn't truncate my sentence :wink: )

    Actually, when I typed that initially I didn't have the math worked out so explicitly, so I wasn't sure if those quantities were actually needed. I'm still not sure, because the hardware may not even explicitly calculate the number of samples. Maybe it just calculates the starting point and the stride vector, then terminates when it marches far enough.

    Anyway, given that you know a lot more about actual texturing hardware than I do, was that post mostly correct? I was sort of reverse engineering it, and haven't seen any IHV algorithms or code first hand.
     
  13. ants

    Newcomer

    Joined:
    Feb 10, 2006
    Messages:
    44
    Likes Received:
    3
    Jawed likes this.
  14. Anteru

    Newcomer

    Joined:
    Jul 4, 2004
    Messages:
    114
    Likes Received:
    3
    :shock:

    Yeah baby! Seems my prediction that they know how to do skin wasn't way off, thanks ;) But the eyes seem a bit strange still ...
     
  15. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,400
    Likes Received:
    440
    Location:
    San Francisco
    All these operations in most cases involve computing or fetching some sampling coordinates and accumulating or composing some sampled texture. Let's say you need a couple of instructions (a rough approximation, because now we have scalar units)... et voilà, you already get this (more or less) with the current ALU:TMU ratio. What I'm trying to say is that the ratio between the number of clock cycles needed to issue a sampling instruction and accumulate a sample value, and the number of clock cycles needed to perform the sampling process, is already circa 1:1.
    Things could be improved a bit, but certainly not in a dramatic way.
    So let them throw in more processors (at the same ratio) to get more detail textures, longer PRT vectors, more shadow map samples, etc., but let them also give us higher filtering quality ;)
    On programmable hardware something like free detail maps can't exist.
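
    A toy model of that 1:1 point (the per-sample cycle counts are my assumptions, not measurements): if issuing a sample already costs about as many ALU cycles as the TMU spends producing it, adding TMUs alone would just leave them idle.

```python
def bottleneck(alu_cycles_per_sample, tmu_cycles_per_sample):
    """Name the limiting unit for a sample-heavy shader loop."""
    if alu_cycles_per_sample > tmu_cycles_per_sample:
        return "ALU-bound"
    if tmu_cycles_per_sample > alu_cycles_per_sample:
        return "TMU-bound"
    return "balanced"

# Assumed costs: ~2 ALU cycles to issue the sample and accumulate the
# result, ~2 TMU cycles to produce the filtered sample.
print(bottleneck(2, 2))  # balanced
```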

    Correct me if I'm wrong, but R580 and R520 have the same number of TMUs, right? So it's natural to see G80 doing better wrt G70.
     
  16. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    No doubt, now, that NVidia is waving the Unified flag, and that it's 90nm.

    Jawed
     
  17. fellix

    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,552
    Likes Received:
    514
    Location:
    Varna, Bulgaria
    Wow! The NDA lift-off is already crossing through the timezones, isn't it? :grin:
     
    #2597 fellix, Nov 7, 2006
    Last edited by a moderator: Nov 7, 2006
  18. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,059
    Likes Received:
    3,119
    Location:
    New York
    At the rate those pages are loading it'll probably be tomorrow by the time I read it anyway :)
     
  19. Cuthalu

    Newcomer

    Joined:
    Oct 28, 2006
    Messages:
    118
    Likes Received:
    3
  20. fellix

    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,552
    Likes Received:
    514
    Location:
    Varna, Bulgaria
    [image attachment removed]

    Are those TEX address units ticking at the same clock rate as the SPs?
     

  • About Us

    Beyond3D has been around for over a decade and prides itself on being the best place on the web for in-depth, technically-driven discussion and analysis of 3D graphics hardware. If you love pixels and transistors, you've come to the right place!

    Beyond3D is proudly published by GPU Tools Ltd.