The New and Improved "G80 Rumours Thread" *DailyTech specs at #802*

Discussion in 'Pre-release GPU Speculation' started by Geo, Sep 11, 2006.

Thread Status:
Not open for further replies.
  1. KimB

    Legend

    Joined:
    May 28, 2002
    Messages:
    12,928
    Likes Received:
    230
    Location:
    Seattle, WA
I still don't think it makes much sense to have a dedicated bus. Yes, it is simpler, but GPUs have had unified buses for many years now. I doubt they'd take a step backwards like this.

After all, don't forget that it's not just the memory bandwidth that would be dedicated, but also the memory space. The sizes of the individual regions of memory space are highly variable in today's GPU designs.
     
  2. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    It's my interpretation of patents, etc. that NVidia wants to merge TMU and ROP functionality into one programmable "unit".

    Whether that unit is a decoupled pipeline that runs alongside the ALU pipeline, or is integrated as macros into the ALU pipeline, who knows... I expect the former initially.

    So the end result is one point of access to memory.

    ---

There's an interesting, minor, corollary with streamout in my view:

    Streamout writes data to memory that then needs to be read back (sometime soon!) for rendering to continue. Streamout is a geometry (vertex) specific technique.

    A lot of pixel shading techniques would benefit from writing a pixel value and then (sometime soon!) reading it for rendering to continue.

As it happens, in both cases "sometime soon!" is blocked - the dev is forced to flush things out and the whole thing is fairly clunky. Blocking makes the GPU's parallelism much easier to implement, but programmers have apparently been screaming for "immediate read after write" for donkey's years.

    So, in my view, both streamout and ROP-output make natural targets for "more timely" writing/reading.

    Apart from what we might see in G80 (prolly only exposed in OGL 3.0? or as an NVidia extension in OGL?) I'm doubtful that this "fully programmable ROP" (and streamout?) will come any time soon, i.e. to DX.

    I'm still unclear on the mechanics of read-after-write in a pixel shader. How restrictive would it end up, and would those restrictions nullify most of the benefit devs have been dreaming about?

    Jawed
     
  3. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,400
    Likes Received:
    440
    Location:
    San Francisco
    Maybe we shouldn't think about 12 memory chips around one GPU.. what about 6 mem chips x 2 GPUs? :idea: ok ok..I shut up :wink:
    The original patents I was referring to are these ones:

    Pixel load instruction for a programmable graphics processor

    Position conflict detection and avoidance in a programmable graphics processor

    Position conflict detection and avoidance in a programmable graphics processor using tile coverage data

    BTW..while I was checking those patents I found a new interesting one (LOL): what's the difference between a constant value held in a texture and one held in a constant register, in the end? Well, the latter must reside closer to your 'heart', so here we go:

    Shader cache using a coherency protocol
     
  4. fellix

    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,552
    Likes Received:
    514
    Location:
    Varna, Bulgaria
    It's clear that in a DX10 GPU there is little or no place left for fixed-function parts, so either the ROPs must go fully programmable or their functions will fall back to the fragment pipes, with all legacy blending/sampling ops emulated at the driver/API level (as was done for T'n'L).
    I'd honestly bet on the second option, as it will save some complexity (in favour of extra VS/PS units) and will bring the memory interface "closer" to the fragment core, if it now has to deal with the burden of framebuffer ops in sampling/blending etc. The other question is the support for virtual addressing in the GPU - will there be an extra (mini-)AGU for each fragment pipe/quad, or will this function also be consumed by the new "multipurpose" ALUs?
     
  5. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,400
    Likes Received:
    440
    Location:
    San Francisco
    I second this option; IMHO NVidia will continue to use PS ALUs as AGUs for texturing ops and even for general-purpose read/write memory ops (it makes even more sense now that they are going to support integer ops as well, sharing the same computational units with their floating-point counterparts).
    At the same time I believe they will decouple TMUs from PS units, since now they have to massively use them to serve multiple clients (VS/GS/PS).
    I also wonder if they are going to have a single big L2 (texture) cache which will serve all texturing requests from all possible clients, or whether they will have multiple dedicated L2s.
    Wouldn't it be nice having your pixel shader slow down because a mad vertex shader is thrashing all your texture cache, lol :)

    Marco
     
  6. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    That's what associativity is for...

    Jawed
     
  7. Geo

    Geo Mostly Harmless
    Legend

    Joined:
    Apr 22, 2002
    Messages:
    9,116
    Likes Received:
    215
    Location:
    Uffda-land
    Do I hear a second for any of these? :smile: Trying to keep it consensus-y.

    I chose the wording of > 256-bit carefully, to allow for either 384 or 512. :wink: One second-hand reported "spotting" does not a spring make. Or something like that. Show me a piccie, and I'm in. :)
     
  8. JF_Aidan_Pryde

    Regular

    Joined:
    Feb 6, 2002
    Messages:
    601
    Likes Received:
    3
    Location:
    New York
    What clock rate do you guys figure?
     
  9. INKster

    Veteran

    Joined:
    Apr 30, 2006
    Messages:
    2,110
    Likes Received:
    30
    Location:
    Io, lava pit number 12
    No more than 550MHz, that's my take.
     
  10. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,400
    Likes Received:
    440
    Location:
    San Francisco
    no degree of associativity will automagically solve texture cache thrashing
     
  11. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,400
    Likes Received:
    440
    Location:
    San Francisco
    My guess is 500 MHz or even below at 90 nm..
     
  12. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    Depends on how you parameterise the associativity and then frobnicate the tiling.

    Judging by the incredibly efficient texturing in R580 (which also runs texturing out of order from multiple batches, just like a unified shader), you're going to need to come up with some evidence for your assertion.

    Jawed
     
  13. trumphsiao

    Regular

    Joined:
    Jan 31, 2006
    Messages:
    285
    Likes Received:
    11

    700MHz for sure.

    people who want to upgrade from G71/R580 to G80/R600 should bear in mind that an advanced architecture will give you less of a performance pickup in games that are less ALU-bound.
     
  14. trumphsiao

    Regular

    Joined:
    Jan 31, 2006
    Messages:
    285
    Likes Received:
    11
    I believe both G80 and R600 are 4:1-concept architectures.
     
  15. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,400
    Likes Received:
    440
    Location:
    San Francisco
    You can improve things, you can't fix them.
    If one thread decides to go berserk and randomly fetch all the memory in your system as quickly as it can, there's NOTHING you can do to stop it.
    You missed my point: last time I checked, R580's very efficient TMUs were serving one single shader type (they don't really support vertex texturing at all), which is expected (though not in every case) to exhibit some degree of coherency (fetching the same textures, using similar patterns and so on..)
    I'm talking about 2 completely different entities that must compete for the same resource in 2 completely different ways.
    Even with SM3.0 shaders, if you fill a texture with random values and use dependent texture reads to perturb the texture coordinates, you can easily thrash your texture cache despite any degree of associativity. It's just going to be much easier to do this with SM4 shaders and the D3D10 rendering pipeline.

    p.s. I don't even want to think about trying to randomly index texture arrays, lol :)
     
    #35 nAo, Sep 12, 2006
    Last edited: Sep 12, 2006
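    The cache-thrashing argument above can be illustrated with a toy model: a small set-associative cache with LRU replacement stays hot under a coherent streaming pattern, but random dependent fetches over a texture much larger than the cache defeat it no matter the associativity. All sizes and access patterns here are illustrative assumptions, not real GPU parameters.

    ```python
    import random

    class SetAssociativeCache:
        """Tiny set-associative cache model with LRU replacement per set."""
        def __init__(self, num_sets, ways):
            self.num_sets = num_sets
            self.ways = ways
            self.sets = [[] for _ in range(num_sets)]  # tags per set, LRU first
            self.hits = 0
            self.accesses = 0

        def access(self, line_addr):
            self.accesses += 1
            s = self.sets[line_addr % self.num_sets]
            if line_addr in s:
                self.hits += 1
                s.remove(line_addr)      # will re-append as most recently used
            elif len(s) >= self.ways:
                s.pop(0)                 # evict the LRU line
            s.append(line_addr)

        @property
        def hit_rate(self):
            return self.hits / self.accesses

    random.seed(0)
    TEXTURE_LINES = 1 << 16              # texture far larger than the cache

    # Coherent pattern: neighbouring fetches share cache lines (16 hits per line).
    coherent = SetAssociativeCache(num_sets=64, ways=8)
    for i in range(100_000):
        coherent.access((i // 16) % TEXTURE_LINES)

    # Thrashing pattern: dependent reads with effectively random coordinates.
    thrashed = SetAssociativeCache(num_sets=64, ways=8)
    for i in range(100_000):
        thrashed.access(random.randrange(TEXTURE_LINES))

    print(f"coherent access hit rate: {coherent.hit_rate:.2f}")
    print(f"random access hit rate:   {thrashed.hit_rate:.2f}")
    ```

    With these numbers the coherent stream hits well over 90% of the time, while the random stream's hit rate collapses toward the ratio of cache capacity to texture size - and raising the `ways` parameter barely moves it.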
  16. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,400
    Likes Received:
    440
    Location:
    San Francisco
    Yeah..but they're supposed to have more pipelines and more memory bandwidth, which should make up for the lack of MHz.
    Can you elaborate on this? :)
    If G80 (according to rumours) has 16 VS/GS and 32 PS with 32 TMUs, and one ALU per VS pipe and 2 ALUs per PS pipe, we end up with a 2.5 ratio.
    To approach 4:1 we'd have to assume something like 3 ALUs per PS pipe, or lower the number of TMUs to 24 or so, but I don't like the latter at all :)
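    The ratio arithmetic above is easy to check; the unit counts below are just the rumoured figures quoted in the thread, not confirmed specs:

    ```python
    # Rumoured G80 unit counts from the thread (not confirmed specs).
    vs_pipes, alus_per_vs = 16, 1
    ps_pipes, tmus = 32, 32

    # ALU:TMU ratio for 2 vs 3 ALUs per PS pipe.
    for alus_per_ps in (2, 3):
        total_alus = vs_pipes * alus_per_vs + ps_pipes * alus_per_ps
        print(f"{alus_per_ps} ALUs per PS pipe -> ALU:TMU = {total_alus / tmus:.2f}")

    # Alternative route toward 4:1: keep 2 ALUs per PS pipe but cut TMUs to 24.
    print(f"24 TMUs -> ALU:TMU = {(vs_pipes * alus_per_vs + ps_pipes * 2) / 24:.2f}")
    ```

    That gives 2.50 with 2 ALUs per PS pipe, 3.50 with 3, and about 3.33 with 24 TMUs - so neither tweak alone quite reaches a true 4:1.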
     
  17. INKster

    Veteran

    Joined:
    Apr 30, 2006
    Messages:
    2,110
    Likes Received:
    30
    Location:
    Io, lava pit number 12
    Unless each of those TMUs is completely different in its abilities from NV40/NV45/G70/G71... ;)
     
  18. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,400
    Likes Received:
    440
    Location:
    San Francisco
    What do you mean? Are you hinting at slower texturing rates per TMU? (like 2 clock cycles per bilinear sample???)
     
  19. trumphsiao

    Regular

    Joined:
    Jan 31, 2006
    Messages:
    285
    Likes Received:
    11
    From what I heard

    (Rumours)
    G80: 32x2 ALUs : 16 TMUs

    Sample benchmarks indicate that G80 could be slower than or equal to the previous generation in numerous older games.
    This round, better architecture and good clock scaling are equally important.
    R600 will be quite a bit faster than G80 in the next 3DMark.

    R600
    1. The architecture is better than G80's, but clock scaling is still stalled right now.

    G80
    1. A good architecture for the right time, but it needs better clock scaling to make up for its lack of advancement (it would have to reach 1GHz against a faster-clocked R600).
     
  20. trumphsiao

    Regular

    Joined:
    Jan 31, 2006
    Messages:
    285
    Likes Received:
    11
    Nvidia is a slothful giant.:wink:
     

  • About Us

    Beyond3D has been around for over a decade and prides itself on being the best place on the web for in-depth, technically-driven discussion and analysis of 3D graphics hardware. If you love pixels and transistors, you've come to the right place!

    Beyond3D is proudly published by GPU Tools Ltd.