Nvidia GT300 core: Speculation

Discussion in 'Architecture and Products' started by Shtal, Jul 20, 2008.

Thread Status:
Not open for further replies.
  1. TimothyFarrar

    Regular

    Joined:
    Nov 7, 2007
    Messages:
    427
    Likes Received:
    0
    Location:
    Santa Clara, CA
    I'd like to assume that shared-memory atomics are implemented with dedicated hardware instructions. If that is the case, it would seem you are right that nothing special happens at the shared memory level. Otherwise things would get rather messy: you'd have to serialize groups of instructions on address "collisions". Not sure about the complexity trade-off in hardware between these two options.

    Yeah, that paper presents nearly the worst case I can see for atomic operations: a huge number of global atomic operations. Each global atomic should be 64 bytes of global memory traffic (GT200, 32-byte minimum transfer size: load, atomic op, store) for a single 32-bit integer.
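
    As a back-of-envelope sketch of that estimate: the 32-byte minimum transfer and the "one read plus one write per atomic" model are taken from the paragraph above, while the function and constant names are made up for illustration and do not reflect any measured hardware behaviour.

```python
# Rough traffic model for global atomics, assuming every atomic costs one
# full minimum-size read and one full minimum-size write (an assumption).
MIN_TRANSFER_BYTES = 32   # GT200 minimum global memory transfer, per the post
PAYLOAD_BYTES = 4         # one 32-bit integer actually being updated

def atomic_traffic_bytes(num_atomics, min_transfer=MIN_TRANSFER_BYTES):
    """Bytes moved over the memory bus for `num_atomics` global atomics."""
    return num_atomics * 2 * min_transfer

def amplification(min_transfer=MIN_TRANSFER_BYTES, payload=PAYLOAD_BYTES):
    """Ratio of bytes moved to bytes actually updated."""
    return (2 * min_transfer) / payload

print(atomic_traffic_bytes(1))          # 64
print(amplification())                  # 16.0
print(atomic_traffic_bytes(1_000_000))  # 64000000
```

    Under this model each 4-byte update moves 16 times its own size, which is why a kernel dominated by global atomics looks so expensive.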

    The more I look at this, the more I think that fast atomic operations are in fact more important than any type of dynamic warp formation (DWF), so I'm changing my prediction about DWF. DWF for bank conflict avoidance no longer seems worth it when you consider that you can just load data into shared memory at a bank offset based on thread index (which can completely avoid bank conflicts). So about the only thing DWF gets you is better branch performance, but divergent branching messes up everything required for data locality and tightly ordered synchronization. Which leads me to wonder just what that "cGPU" buzzword actually means. Perhaps it is just Multiple Kernel - SIMD (MK-SIMD), better cross-core load balancing, combined with some better, more widely shared caching for atomics/ROP?
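
    The bank-offset trick can be sketched with a toy model. This assumes 16 banks of 32-bit words serviced per half-warp of 16 threads, with bank = word address mod 16; the helper names are hypothetical and this is an illustration, not a description of the real memory pipeline.

```python
from collections import Counter

NUM_BANKS = 16   # assumption: 16 banks of 32-bit words
HALF_WARP = 16   # assumption: shared memory serviced per half-warp

def conflict_degree(word_addresses, num_banks=NUM_BANKS):
    """Largest number of threads hitting one bank (1 means conflict-free)."""
    banks = Counter(addr % num_banks for addr in word_addresses)
    return max(banks.values())

def column_read(col, skew):
    """Word addresses when thread t reads element `col` of its own 16-word row.

    With skew=True, thread t stored its row rotated by t words, i.e. at a
    bank offset based on thread index.
    """
    addrs = []
    for t in range(HALF_WARP):
        offset = (col + t) % HALF_WARP if skew else col
        addrs.append(t * HALF_WARP + offset)
    return addrs

print(conflict_degree(column_read(3, skew=False)))  # 16
print(conflict_degree(column_read(3, skew=True)))   # 1
```

    Without the skew every thread lands in the same bank and the access serializes 16 ways; with the thread-index offset each thread hits a different bank.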
     
  2. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    Ah, the curse of the histogram, particularly one with an unknown bin count :lol:

    I think the most grievous issue is simply that a kernel's domain (grid) cannot entirely fit on a GPU at any one time, i.e. a grid by definition runs out of global memory.

    My hope for DWF is that it solves all waterfall scenarios, whenever any kind of random fetching/thread divergence crops up.

    In the end, maybe dynamic branching, like atomic operations, comes under the heading of "use sparingly, if at all". Still too-early days to tell.

    It seems that data-parallel-specific techniques such as scan can be used to work around these gotchas - but these techniques are, themselves, pretty expensive. The paper you linked seems to be an exemplar - the concept of the oracle to parse the input and build a tree is something that I expect we'll be seeing much more of as people grapple with "non-square" parallelism.

    It's interesting that ATI has a variety of speed-ups for some of these kinds of things buried in the architecture, like transposing reads from shared memory and lane-shared registers. Maybe this is where NVidia will be putting in a lot of effort, to make data-parallel primitives function with less off-die/divergence/serialisation.

    Well, NVidia's had a few years to think about this, so it could be as "radical" as shared memory was. Certainly something that's more fluid, less "square", has a naive attractiveness about it. Hard to know how long the shine would last, though. GPUs make awful ray tracers when you consider the raw computational throughput that's left on the table.

    The definition of cGPU appears to be Larrabee.

    Jawed
     
  3. MfA

    MfA
    Legend

    Joined:
    Feb 6, 2002
    Messages:
    7,610
    Likes Received:
    825
    Still better than using 32 SIMD units for 1 ray in raytracing.
     
  4. KonKort

    Newcomer

    Joined:
    Dec 29, 2008
    Messages:
    89
    Likes Received:
    0
    Location:
    Germany, Ennepetal
    Again in cooperation with BSON I can present more GT300 details.

    The chip would have 512 SPs and a 512-bit memory interface.

    Hardware-Infos

    BSON
     
  5. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,062
    Likes Received:
    3,119
    Location:
    New York
    And what do they plan to do with all that bandwidth, if true? 512 GT200-class SPs won't be enough.
     
  6. pjbliverpool

    pjbliverpool B3D Scallywag
    Legend

    Joined:
    May 8, 2005
    Messages:
    9,237
    Likes Received:
    4,260
    Location:
    Guess...
    It's more than twice the shader power at less than twice the bandwidth, so it seems less wasteful in terms of bandwidth than GT200.

    Besides, if Xenos can use that much bandwidth I'm sure a GT300 could ;)
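
    A rough way to check the ratio argument, as a sketch only: the GT200 baseline below (GTX 280 with 240 SPs and a 512-bit bus at 2214 MT/s effective GDDR3) is an assumption brought in for illustration, the 512 SP / 512-bit figures are the rumour quoted in this thread, and no GT300 memory speed is assumed at all.

```python
def bandwidth_gb_s(bus_bits, mega_transfers_per_s):
    """Peak memory bandwidth in GB/s for a given bus width and data rate."""
    return bus_bits / 8 * mega_transfers_per_s / 1000

# Assumed baseline: GTX 280 (GT200).
GT200_SPS = 240
GT200_BUS_BITS = 512
GT200_MT_S = 2214

# Rumoured GT300 figures from this thread.
GT300_SPS = 512
GT300_BUS_BITS = 512

sp_ratio = GT300_SPS / GT200_SPS
gt200_bw = bandwidth_gb_s(GT200_BUS_BITS, GT200_MT_S)

# Per-pin data rate at which GT300 bandwidth would grow exactly as much as
# its SP count; anything slower means less bandwidth per SP than GT200.
break_even_mt_s = GT200_MT_S * sp_ratio * GT200_BUS_BITS / GT300_BUS_BITS

print(round(sp_ratio, 2))        # 2.13
print(round(gt200_bw, 1))        # 141.7
print(round(break_even_mt_s))    # 4723
```

    So on these assumptions, any GDDR5 running below roughly 4.7 GT/s per pin would leave the rumoured chip with less bandwidth per SP than GT200 had.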
     
  7. Scali

    Regular

    Joined:
    Nov 19, 2003
    Messages:
    2,127
    Likes Received:
    0
    Depends on how fast they can clock those SPs... but yeah, I somehow find 512-bit AND GDDR5 to be a bit too much as well.
     
  8. bowman

    Newcomer

    Joined:
    Apr 24, 2008
    Messages:
    141
    Likes Received:
    0
    Hey, it's about time. This thing should be a true 8800GTX successor.

    On the other hand 'bright side of news' with Theo Valich and that other site isn't exactly the most reliable source of info in the world.
     
  9. CarstenS

    Legend Subscriber

    Joined:
    May 31, 2002
    Messages:
    5,800
    Likes Received:
    3,920
    Location:
    Germany
    If 512-bit and especially GDDR5 are true at all, then I wonder if and how Nvidia's attempt at making a compelling idle mode for this card will work out. :)

    [edit: Yippie - 1k postings completed. Advance to next level]
     
  10. Scali

    Regular

    Joined:
    Nov 19, 2003
    Messages:
    2,127
    Likes Received:
    0
    Well, one thing I could think of... why not disable half the memory altogether in 2d mode?
    I mean, 1 GB or more memory is nice for 3d, but for a standard OS desktop it's way overkill. In fact, I guess even just 128 MB would be more than enough.
    Did any manufacturer ever try something like that?
     
  11. DegustatoR

    Veteran

    Joined:
    Mar 12, 2002
    Messages:
    3,249
    Likes Received:
    3,419
    Why less? What GDDR5 speeds are we expecting to be available at the end of the year?
    And the SP count probably doesn't tell us anything by itself, beyond hinting that they'll remain serial scalar units.

    Edit:
    GT300 delayed till 2010. So, what kind of GDDR5 memory are we expecting in the beginning of 2010? 8)
     
    #951 DegustatoR, May 5, 2009
    Last edited by a moderator: May 5, 2009
  12. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    Maybe NVidia's doing it already?

    But Aero is a 3D app, so you'd also be turning off ROPs on current hardware to make this work, I guess.

    ATI's IGPs seem to recognise a frozen image in the framebuffer and use that as a signal to turn off stuff. Something like that, IGP stuff is so interesting.

    Jawed
     
  13. Scali

    Regular

    Joined:
    Nov 19, 2003
    Messages:
    2,127
    Likes Received:
    0
    I'm not sure if there's any difference at all between 2d and 3d, as far as hardware is concerned.
    That is, why would there be any specific 2d hardware when 3d texturing/shading hardware can perform the same operations?
    So I don't think Aero or 'classic' Windows interface makes a difference.
     
  14. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    Yes, you're right.

    I think it's just a question of how flexible the tiling of screen space would end up being, then. i.e. can you run aero solely on one MC? By definition the architecture can do this - the question is really about the dynamic switching...

    Jawed
     
  15. Scali

    Regular

    Joined:
    Nov 19, 2003
    Messages:
    2,127
    Likes Received:
    0
    Yea, if I were developing a GPU, I'd want to look into that.
    Aero itself is really light, even on my Intel X3100 it runs very well. So you can castrate a modern GPU down to X3100 levels (well let's see, I have 667 MHz dual-channel DDR2, 8 'stream processors' and a maximum of 384 MB), and people probably won't even notice. But the power saving would be huge. What does an X3100 use anyway? Less than 10W I suppose. A videocard idling at < 10W, that would be something :)
     
  16. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,062
    Likes Received:
    3,119
    Location:
    New York
    Perhaps, but don't we all expect the required math:bandwidth ratio to increase rapidly going forward? Or will DX11 apps remain texture/bandwidth bound?
     
  17. CarstenS

    Legend Subscriber

    Joined:
    May 31, 2002
    Messages:
    5,800
    Likes Received:
    3,920
    Location:
    Germany
    You should be getting yourself an HD 4670 then ;)

    Seriously, Aero is a no-go on my GMA500. I mean, it works, but veeeery slooooowly. So, sweet spot would be somewhere in between yours (X3100) and mine (GMA500).
     
  18. Scali

    Regular

    Joined:
    Nov 19, 2003
    Messages:
    2,127
    Likes Received:
    0
    No, I want something that's efficient when idle, but is actually fast when running at maximum speed.
    Heck, would an HD4670 even be faster than the 8800GTS I have currently? It would not support PhysX, and it won't have the same level of texture filtering anyway.
     
  19. neliz

    neliz GIGABYTE Man
    Veteran

    Joined:
    Mar 30, 2005
    Messages:
    4,904
    Likes Received:
    23
    Location:
    In the know
    Does your PhysX reduce idle power? Does angle-independent AF have a 16X power-saving mode? Oh wait, nV marketing does not condone the use of the Power of 3 in a discussion about power savings.

    What about switching graphics, just like on the notebook side? You'll pay $20 more for your videocard but it would (very very very very very very eventually) earn that back by switching GPUs under load.



    GT300, 512-bit, super duper memory controller. Delay after delay. NV30 and R600 finally can have a threesome.
     
  20. Sxotty

    Legend

    Joined:
    Dec 11, 2002
    Messages:
    5,497
    Likes Received:
    867
    Location:
    PA USA
    That post was full of crazy incoherent language.

    Still, we had that: it was called Hybrid SLI, for power savings. It was only available on AMD chips (with an Nv board) and died an early death. Maybe in the future it will rear its head again, preferably with OS support so that it could deal with AMD<-->Intel<-->Nvidia without it mattering.
     
  • About Us

    Beyond3D has been around for over a decade and prides itself on being the best place on the web for in-depth, technically-driven discussion and analysis of 3D graphics hardware. If you love pixels and transistors, you've come to the right place!

    Beyond3D is proudly published by GPU Tools Ltd.