The New and Improved "G80 Rumours Thread" *DailyTech specs at #802*

Discussion in 'Pre-release GPU Speculation' started by Geo, Sep 11, 2006.

Thread Status:
Not open for further replies.
  1. Rys

    Rys PowerVR
    Moderator Veteran Alpha

    Joined:
    Oct 9, 2003
    Messages:
    4,163
    Likes Received:
    1,452
    Location:
    Beyond3D HQ
    Enough Inq bashing, if that's alright; these threads get big enough v. quickly without noise pooping on the signal. Deleting pointless posts soon...
     
  2. Rys

    Rys PowerVR
    Moderator Veteran Alpha

    Joined:
    Oct 9, 2003
    Messages:
    4,163
    Likes Received:
    1,452
    Location:
    Beyond3D HQ
    I'm not convinced that's the case for R6.
     
  3. DegustatoR

    Veteran

    Joined:
    Mar 12, 2002
    Messages:
    1,388
    Likes Received:
    114
    Location:
    msk.ru/spb.ru
    You're starting to look desperate guys :)
     
  4. chavvdarrr

    Veteran

    Joined:
    Feb 25, 2003
    Messages:
    1,165
    Likes Received:
    34
    Location:
    Sofia, BG
    But what about transistor count?
    Which eats more transistors: 48 unified pipes, or 16 vertex + 48 pixel ones?
    And which will be faster on the common workloads of this year and next: 48U or 16V+48P?
     
  5. jb

    jb
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    1,636
    Likes Received:
    7

    Traditionally, NV has always had advanced features, but they ran with piss-poor performance: 32-bit color at a mega perf hit on the TNT cards, 32-bit pixel shader precision on the FX, branching on last-gen cards, etc. So following history it's a safe bet that the G80 will be very, very fast at current tech (i.e. DX9) but probably slow at next-gen (DX10).
     
  6. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    Generally I mean (and Geo means, I guess) specific features of D3D10. The new stuff in D3D10 is a seriously big deal. Arguably a lot of this can be simulated on an SM3 GPU (e.g. using R2VB), but the implementation details in D3D10 make it work much more smoothly. There are also new data formats that provide a big improvement in bandwidth usage for a given level of texture quality.
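
    To make the formats point concrete, here's a minimal C++ sketch (a scalar stand-in for shader code; the function and its names are mine, not from any API) of the classic trade: a two-channel normal map along the lines of ATI's 3Dc stores only X and Y, and a couple of extra ALU instructions rebuild Z - bandwidth saved, ALU spent.

    Code:
    #include <cmath>
    #include <cstdio>

    struct Vec3 { float x, y, z; };

    // What a pixel shader does after fetching a two-channel texel:
    // remap [0,1] to [-1,1], then rebuild Z from the unit-length
    // constraint - extra ALU instead of a third stored component.
    Vec3 DecodeTwoChannelNormal(float texX, float texY)
    {
        Vec3 n;
        n.x = texX * 2.0f - 1.0f;
        n.y = texY * 2.0f - 1.0f;
        float zsq = 1.0f - n.x * n.x - n.y * n.y;
        n.z = std::sqrt(zsq > 0.0f ? zsq : 0.0f);  // clamp for safety
        return n;
    }

    int main()
    {
        Vec3 n = DecodeTwoChannelNormal(0.9f, 0.6f);  // example texel
        std::printf("normal = (%.3f, %.3f, %.3f)\n", n.x, n.y, n.z);
        return 0;
    }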

    A new API provides scope for alternative methods to implement visual effects. So the actual code is different, depending on whether you look at the DX9 code or the D3D10 code. So the DX9 algorithm may well be TMU-intensive, while the D3D10 version is ALU-intensive.

    If you're comparing two different GPU designs, they could have different trade-offs between support for TMU-intensive and ALU-intensive code. Clearly we don't know; this is all hypothesis about the mechanism for performance disparities across the API gap.

    But NVidia chose different trade-offs for TEX and ALU than ATI, which is merely an example of how GPUs that are functionally similar can perform differently, across the API gap (i.e. SM1.x to SM2).

    We're not talking absolutes here ("NV30 is so useless at SM2 forget it"), merely that SM1.x performance didn't indicate SM2 performance.

    And, there are genuinely hard/costly to implement features in D3D10. Post-GS cache has been the subject of surprise round here recently - it just wants to gobble-up transistors and asks awkward questions about parallelism...

    Actually, what I was hinting at was that ATI's texturing architecture (out-of-order, cache design, ring-bus, programmable memory controller) might be capable of responding to large amounts of extra bandwidth without needing to be sized-up. I don't know that this is the case. The GDDR4 attached to the X1950 XTX doesn't seem to back me up on this - I dunno, could be a driver issue. Have to wait and see :cry:

    Take a heavily ROP-dependent algorithm: if one GPU design is DX9++ and the other is DX9/D3D10-balanced then the latter may not dedicate as much transistor budget to eke out the last iota of ROP performance - hence a difference in performance for nominally the same visuals.

    It's just a curve: x-axis is transistors spent, y-axis is efficiency per byte of bandwidth per clock per ROP. GPU A might cut-off at 90% up the curve, while GPU B cuts-off at 99%... If you draw a curve for streamout you might find that GPU A is at 90% while GPU B is at 80%. It's all about transistors spent (or die yield) versus R&D-effort.
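
    As toy arithmetic only (the percentages are the made-up ones above; equal peak rates are my assumption), the trade works out like this:

    Code:
    #include <cstdio>

    struct Gpu { const char* name; double ropEff, soEff; };

    int main()
    {
        // Hypothetical cut-off points on the two efficiency curves.
        const Gpu gpus[] = { {"GPU A", 0.90, 0.90},
                             {"GPU B", 0.99, 0.80} };
        const double peak = 100.0;  // arbitrary peak units per clock
        for (const Gpu& g : gpus)
            std::printf("%s: ROP-bound %.0f, streamout-bound %.0f\n",
                        g.name, peak * g.ropEff, peak * g.soEff);
        return 0;
    }

    So B wins the ROP-bound case and A wins the streamout-bound one, for the same nominal feature set.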

    And, transistor cost for ROPs or streamout doesn't just lie within the ROP and streamout portions of the die. Those functions are heavily dependent upon the avoidance of bottlenecks elsewhere. Which obviously also costs transistors.

    Going forwards there are two key concepts I can think of that define new, viable, algorithms in D3D10:
    1. take graphics work away from the CPU and make it run entirely on the GPU - the GPU can now write to local memory and re-circulate data in so many new ways as to go far beyond what the CPU could achieve (constrained by FLOPs, CPU<->GPU bandwidth, API-overheads)
    2. provide enough programmability, resource usage models and brute force that complex, ALU-intensive, shaders can be used to reduce dependencies upon texturing thus making more effective use of available bandwidth and texturing-throughput - e.g. by using more efficient texture formats (that require post-processing in the form of extra ALU instructions) or dynamic branching to minimise texture fetches
    Guess which gets easier to implement with D3D10.
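
    For the first of those, here's a hedged C++ sketch against the real D3D10 stream-output interfaces (the device, buffers and shaders are assumed to be created elsewhere; error handling omitted). The GS writes vertices to a buffer and DrawAuto() replays them next pass, so data re-circulates on the GPU without the CPU ever reading back the vertex count:

    Code:
    #include <d3d10.h>

    // Ping-pong one pass of GPU-side recirculation: read last pass's
    // stream-out buffer as vertex input, capture this pass's GS output.
    void RecirculateOnGpu(ID3D10Device* dev,
                          ID3D10Buffer* src,   // last pass's SO result
                          ID3D10Buffer* dst,   // this pass's SO target
                          UINT stride)
    {
        UINT offset = 0;
        dev->IASetVertexBuffers(0, 1, &src, &stride, &offset);
        dev->SOSetTargets(1, &dst, &offset);
        // DrawAuto() uses the vertex count the GPU itself wrote last
        // pass - no CPU round-trip, which is exactly the point.
        dev->DrawAuto();
        ID3D10Buffer* nullBuf = nullptr;
        dev->SOSetTargets(1, &nullBuf, &offset);  // unbind for next pass
    }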


    Jawed
     
  7. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    You'd think so. In fact I do. I made a thread about this a few months ago, and the conclusions were very muted.

    http://www.beyond3d.com/forum/showthread.php?t=30276

    So the effect could be anything from "barely noticeable" to "ZOMG! now we've started using it we can't go back!"

    The likelihood is that point-sampling TMUs will be heavily loaded by the vertex pipes, so it's hard to know how much is left over.

    Where's Jack?...

    Jawed
     
  8. pjbliverpool

    pjbliverpool B3D Scallywag
    Legend

    Joined:
    May 8, 2005
    Messages:
    7,583
    Likes Received:
    704
    Location:
    Guess...
    Only if you assume that the shaders are identical, and that the separate shaders of a discrete architecture are no more efficient than those of a unified architecture despite their higher level of specialisation.

    And what if you can afford to add more shaders in a discrete design because of the saved die space (if any)? Is unified still better? I.e. if 48 PS + 8 VS take up the same number of transistors as 48 US + the scheduler, which is better then?

    I don't think it's as clear-cut as the quote suggests.
     
  9. Razor1

    Veteran

    Joined:
    Jul 24, 2004
    Messages:
    4,232
    Likes Received:
    749
    Location:
    NY, NY
    Hmm, there isn't much difference when it comes to pixel shaders from DX9 to DX10 - just new syntax, plus geometry shaders. Yes, it isn't going to be an ultra-fast DX10 chip, but it probably won't be a slow one either, and it really doesn't need to be super fast. Keep in mind DX10 has less overhead than DX9, so piss-poor geometry shader speed is about the only way the G80 could fail in DX10.
     
  10. Razor1

    Veteran

    Joined:
    Jul 24, 2004
    Messages:
    4,232
    Likes Received:
    749
    Location:
    NY, NY

    That we will see with the G80 and R600 ;). I think the transistor count will end up close, with the edge to non-unified. But the G80 seems very ambitious and quite different from a traditional design, so it might not really give us a hint at all. Sounds to me like a unified pool of TMUs + ROPs, a unified pool of geometry shaders + vertex shaders, and detached pixel pipelines.
     
  11. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    I'd love to see an argument for why.

    Jawed
     
  12. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    10,481
    Likes Received:
    500
    Location:
    New York
    Gotcha, thanks for the clarifications.

    Oh, I read "similar capabilities" as similar performance, not a similar number of units. I completely agree - if R600 retains only 16 texture units they will have to achieve higher throughput through higher efficiency / bandwidth.

    I see what you're getting at. I was looking at things under the unified-vs-discrete umbrella for expected performance in DX9 vs DX10 at a macro level. But if we look at it in terms of transistor budget allocated to making new features zippy then I see your point. Guess I need to redefine what I consider to be "DX10 performance" in order to keep up with you guys.

    Having said that, DX10 performance is going to be defined a lot more by pixel/vertex/TMU/ROP/AA performance than by geometry shading / streamout etc., so how do you decide what defines a "fast" DX10 GPU? ATI had better be blazing fast in the former categories before they devote too much budget to the new whiz-bang stuff.
     
    #172 trinibwoy, Sep 13, 2006
    Last edited by a moderator: Sep 13, 2006
  13. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    As I described earlier with those imaginary curves, is it better to go 90/90% or 99/80%?

    In other words, as we were once so fond of asking, do we really need to be able to run Quake 3 at >300fps at 1600x1200?

    To be fair, it's in the nature of SM3 that it's far harder to generate the extreme-FPS Q3 case. SM3 is still shiny and new.

    I'm curious to see what will happen when games straddle this API gap. Will games end up being:
    • SM2/SM4 - SM2 games such as FEAR and CoD2 show that visual quality can go an awfully long way within the constraints of SM2. Devs then go to town on the SM4 code, because SM2 code provides well-defined limits
    • SM3/SM4 - devs aim to make them practically indistinguishable, whilst cutting off a lot of gamers with SM2 hardware
    • some mixture of these?
    Jawed
     
  14. Razor1

    Veteran

    Joined:
    Jul 24, 2004
    Messages:
    4,232
    Likes Received:
    749
    Location:
    NY, NY

    Well, there really is no way to make SM4 games look and feel the same as SM3 games; things like tessellation and displacement just won't be possible with SM3. Granted, I don't think this new hardware will be able to do these things at any reasonable speed in game situations, but that's just my opinion.
     
  15. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    The implication was that the SM4 version of the code goes faster (just like SM3 Far Cry goes faster in some places than the SM2 version), or that extreme effects, such as "super voluminous smoke", take the place of crappy billboard smoke.

    Question of degrees...

    Jawed
     
  16. Geo

    Geo Mostly Harmless
    Legend

    Joined:
    Apr 22, 2002
    Messages:
    9,116
    Likes Received:
    213
    Location:
    Uffda-land
    I'm trying to figure that out myself. :smile:
     
  17. ERK

    ERK
    Regular

    Joined:
    Mar 31, 2004
    Messages:
    287
    Likes Received:
    10
    Location:
    SoCal
    Just brainstorming here, but wouldn't a unified architecture have better potential geometry shader performance?

    ERK

    EDIT: I guess that doesn't really address the point--the reason it may be better (load balancing and lots of geometry power) is just the reason it would also be good for DX9. :(
     
    #177 ERK, Sep 14, 2006
    Last edited by a moderator: Sep 14, 2006
  18. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    During geometry-only passes a unified architecture should be utterly incredible - subject to bandwidth/fetch though.

    But code that uses dynamic branching could really suffer in comparison with a discrete architecture. In a discrete architecture it's possible to have each invocation (each primitive) run in its own "thread". In a unified GPU, the SIMD architecture of the pixel shader pipes gets applied to the GS, and so 16 (or 64, whatever) primitives all find themselves lumped together in one thread.

    That means an IF THEN ELSE statement in the code will run both clauses for ALL primitives, even if 15 primitives want to execute THEN and 1 primitive wants to execute ELSE. So dynamic branching is handicapped somewhat...
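
    Here's a minimal C++ sketch of that cost model (the batch width and clause costs are made-up numbers): lanes that fail the condition are masked off rather than skipped, so one odd primitive out of 16 makes the whole batch pay for both clauses.

    Code:
    #include <array>
    #include <cstdio>

    constexpr int kBatch    = 16;  // primitives lumped into one "thread"
    constexpr int kThenCost = 20;  // assumed cycles for the THEN clause
    constexpr int kElseCost = 50;  // assumed cycles for the ELSE clause

    // SIMD-style execution: the batch runs every clause any lane needs.
    int BatchCost(const std::array<bool, kBatch>& takesThen)
    {
        bool anyThen = false, anyElse = false;
        for (bool t : takesThen) { if (t) anyThen = true; else anyElse = true; }
        return (anyThen ? kThenCost : 0) + (anyElse ? kElseCost : 0);
    }

    int main()
    {
        std::array<bool, kBatch> uniform;
        uniform.fill(true);                    // all take THEN: 20 cycles
        std::array<bool, kBatch> divergent = uniform;
        divergent[15] = false;                 // one ELSE lane: 70 cycles
        std::printf("uniform: %d cycles, divergent: %d cycles\n",
                    BatchCost(uniform), BatchCost(divergent));
        return 0;
    }

    A discrete architecture running each primitive in its own thread would pay at most the dearer clause per primitive, never the sum.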

    Jawed
     
  19. Razor1

    Veteran

    Joined:
    Jul 24, 2004
    Messages:
    4,232
    Likes Received:
    749
    Location:
    NY, NY

    True, but I think that speed increase has a lot more to do with the current Windows overhead, so it should come across to DX9L to some degree. But this is uncharted territory for me, so if someone else can chime in here it would be great ;)!
     
  20. Brimstone

    Brimstone B3D Shockwave Rider
    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    1,835
    Likes Received:
    11
    I'm also guessing that the PS3 RSX will have an XDR bus, and nVidia might use XDR on their upcoming GPUs as well.


    Rambus has clearly stated they're targeting both ATI and nVidia with Rambus technology. XDR makes a lot of sense for bandwidth-hungry GPUs.


    From a Rambus analyst meeting 6/01/2006
    http://rambus.org/cc/2006-06-01_Transcript.txt


    Sony signed a second contract with nVidia. From what I understand, the revenue from this contract is greater than from the first Sony/nVidia contract (this needs to be double-checked). My guess is this is another GPU for the PS3 platform.


    Sony needs to drive volume of XDR. Why use GDDR, when you could use XDR for the GPU and double the volume of XDR you consume? Elpida is going to be the main source of XDR, so Sony will have to utilize the PS3 as the major driver of XDR volume.
     
    #180 Brimstone, Sep 14, 2006
    Last edited by a moderator: Sep 14, 2006