NVIDIA confirms Next-Gen close to 1TFlop in 4Q07

Discussion in 'Architecture and Products' started by B3D News, May 23, 2007.

  1. Shtal

    Veteran

    Joined:
    Jun 3, 2005
    Messages:
    1,344
    Likes Received:
    4
  2. Domell

    Newcomer

    Joined:
    Oct 17, 2004
    Messages:
    247
    Likes Received:
    0
    Hmmm but as you see below Fudzilla said NVIDIA G100 is scheduled for 1Q 2008....
    http://www.fudzilla.com/index.php?option=com_content&task=view&id=355&Itemid=1

    I don't think ATI will release 3 high-end GPUs in 10 months.... R600 in May 2007, R650 in September/October 2007 and R700 around March 2008.... IMO we will see R700 in 2H 2008 together with G100....
     
  3. Shtal

    Veteran

    Joined:
    Jun 3, 2005
    Messages:
    1,344
    Likes Received:
    4
    G92 and G100 have a very small time gap between them?? Well, possibly ATI may have some kind of Extreme edition of R700 as well.
     
  4. Domell

    Newcomer

    Joined:
    Oct 17, 2004
    Messages:
    247
    Likes Received:
    0
    But I think G100, and R700 as well, won't be released in Q1 2008.... At the earliest in Q2 2008, but more probably Q2/Q3 2008....
     
  5. INKster

    Veteran

    Joined:
    Apr 30, 2006
    Messages:
    2,110
    Likes Received:
    30
    Location:
    Io, lava pit number 12
    Michael Hara said recently that Nvidia will stick to the "high-end launch in the Fall, mainstream launch in the Spring" business model for the foreseeable future, so no G100 in Q1'08, or even Q2'08.
    G92 is what you get, and it's not that bad either.
     
  6. Shtal

    Veteran

    Joined:
    Jun 3, 2005
    Messages:
    1,344
    Likes Received:
    4
    Probably ATI-R670 will fight Nvidia G92 then!

    The latest batch of roadmaps tells of details about several new parts, for example the RV670 and R670, http://www.theinquirer.net/default.aspx?article=40068
     
  7. AnarchX

    Veteran

    Joined:
    Apr 19, 2007
    Messages:
    1,559
    Likes Received:
    34
    I don't think that R670 will fight against one G92. :wink:
     
  8. Domell

    Newcomer

    Joined:
    Oct 17, 2004
    Messages:
    247
    Likes Received:
    0
    Why not??
     
  9. INKster

    Veteran

    Joined:
    Apr 30, 2006
    Messages:
    2,110
    Likes Received:
    30
    Location:
    Io, lava pit number 12
    Probably because it's a dual-GPU card, with maybe two R650's working in Crossfire mode.
    However, I wouldn't rule out a similar move by Nvidia, even though we know G92 is not another GX2-type refresh product (due to known process changes, added FP64 support, etc).
     
  10. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,055
    Likes Received:
    3,110
    Location:
    New York
    65^2 is 52% of 90^2. How'd you get a 28% decrease?
     
  11. _xxx_

    Banned

    Joined:
    Aug 3, 2004
    Messages:
    5,008
    Likes Received:
    86
    Location:
    Stuttgart, Germany
    90 - 28% = 65. Obviously miscalculated ;)
     
  12. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Likes Received:
    87
    True, but IMO that's not a particularly useful capability. There should be enough threads on each cluster that this situation is statistically rare. If you really wanted to take care of this corner case, it would make more sense to make the sequencer a bit more intelligent in choosing which batch to go after next.

    Probably. 90nm --> 65nm theoretically means 48% less area per transistor (though usually the gains aren't quite that high between processes). A 25% decrease in area and 30% increase in tranny count only needs a 42% decrease in transistor area, so it's very realistic.
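The shrink arithmetic above can be checked with a few lines of Python. This is only a minimal sketch: it assumes ideal scaling (area goes with the square of the feature size) and takes the 25% die-area and 30% transistor-count figures from the post as given. The function names are made up for illustration.

```python
def area_scale(old_nm: float, new_nm: float) -> float:
    """Ideal per-transistor area ratio after a shrink (linear size squared)."""
    return (new_nm / old_nm) ** 2


def required_transistor_area_cut(die_area_change: float,
                                 transistor_count_change: float) -> float:
    """Fractional cut in per-transistor area needed to hit the given
    die-area change and transistor-count change (both as fractions)."""
    return 1.0 - (1.0 + die_area_change) / (1.0 + transistor_count_change)


# 90nm -> 65nm under ideal scaling
ideal = area_scale(90, 65)

# 25% smaller die with 30% more transistors (figures quoted in the thread)
needed = required_transistor_area_cut(-0.25, 0.30)

print(f"65nm area per transistor: {ideal:.1%} of 90nm ({1 - ideal:.1%} less)")
print(f"per-transistor area cut needed: {needed:.1%}")
```

The needed cut comes out below the ideal one, which is why the shrink looks realistic even if the real process falls a little short of ideal scaling.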

    G92 looks pretty nuts to me. I thought ATI might have an advantage in clocking up its shaders when AMD came aboard, but now that NVidia beat them to that with G80 and will likely go even further with G92, I don't see ATI having much success against the latter.
     
  13. Geeforcer

    Geeforcer Harmlessly Evil
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    2,320
    Likes Received:
    525
    Remind me to stay away from calculations at 3 am. :oops:
     
  14. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,708
    Likes Received:
    2,132
    Location:
    London
    Strangely enough, what Davros suggests seems to be how R600 works. But it works like that all the time.

    Hmm...

    Jawed
     
  15. dnavas

    Regular

    Joined:
    Apr 12, 2004
    Messages:
    375
    Likes Received:
    7
    "Davros" :)

    Well, I'm not sure I'm in love with R600's approach, either. Should nearby pixels really be handled by disjoint TMUs? Does it make sense to *always* ship work across the chip? One could come up with a trivial predication case that would effectively underutilize R600's TMUs as well.

    From a high-level perspective, I guess I think of this problem in a couple of different ways. Either worktypes are determined, and processing units (TMUs, SFUs, ALUs) assigned dynamically, or a kernel forks off requests to unit processing farms, which report back results (the individual 'farms' manage prioritization of incoming requests, etc.).

    MintMaster is probably right, that multiple threads can almost certainly hide underutilization, but the above seems somewhat more flexible when it comes to handling DB. As long as #units/#sequencers(?) <= average(kernel_data_width), you don't have a DB problem. I'm sure there are much larger problems to deal with, though -- like shipping data all over a chip.... Something I wouldn't expect a higher-clocked chip to try to do. [And it is looking like the G92 is a MUL-enabled, 192proc, higher speed chip, if "2x theoretical" and "2.5-3x real" and "30% smaller die" are to be believed]

    -Dave
     
  16. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,708
    Likes Received:
    2,132
    Location:
    London
    Ah, sorry, Dave - hasty posting during an advert break syndrome :oops:

    For the rest of this posting, just assume I've got one eye somewhere else :grin: ...

    I think one would need to do some serious simulation to understand this.

    I can only think that once you've built latency tolerance, the two approaches (private TUs versus shared-distributed TUs) end up moving the same amount of data around the ring.

    Hmm, except that texels in compressed form (which I presume they are, while they're in L2) would consume less ring bandwidth. When a TU produces a quad of texel results (or, perhaps, 4 quads of texel results as a burst in response to one batch) that are fully filtered and are destined for registers, surely they consume more bandwidth on the ring? Then again, texel-overhead relating to anisotropic filtering is saved, since those extra texels tend to stay in their "home" L2. Gah.

    We don't know the rasterisation pattern in R600. Considering a batch of 64 pixels, for example, is it:

    1111222233334444
    1111222233334444
    1111222233334444
    1111222233334444

    or:

    1111111133333333
    1111111133333333
    2222222244444444
    2222222244444444

    etc.
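For anyone who wants to play with other layouts, the two candidate patterns above can be generated with a small helper. This is just a sketch for drawing the diagrams: `tile_map` and its parameters are hypothetical, and nothing here says which pattern (if either) R600 really uses.

```python
def tile_map(width, height, tile_w, tile_h, column_major=False):
    """Label every pixel of a width x height block with a 1-based tile number.

    Tiles are tile_w x tile_h pixels. With column_major=False tiles are
    numbered left to right, then top to bottom; with column_major=True they
    are numbered top to bottom, then left to right.
    """
    tiles_x = width // tile_w
    tiles_y = height // tile_h
    rows = []
    for y in range(height):
        row = ""
        for x in range(width):
            tx, ty = x // tile_w, y // tile_h
            idx = tx * tiles_y + ty if column_major else ty * tiles_x + tx
            row += str(idx + 1)
        rows.append(row)
    return rows


# First pattern: four 4x4 tiles side by side
pattern_a = tile_map(16, 4, 4, 4)

# Second pattern: four 8x2 tiles in a 2x2 arrangement, numbered down then across
pattern_b = tile_map(16, 4, 8, 2, column_major=True)

for name, pattern in (("A", pattern_a), ("B", pattern_b)):
    print(f"pattern {name}:")
    print("\n".join(pattern))
```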

    I remember a rasterisation patent document that implied rasterisation along the long axis of a triangle, so either width-wise or height-wise rasterisation is possible. What's the effect of that on texel locality? How big are the screen-space tiles within which rasterisation is constrained? What about that texture caching patent application I keep linking, the prefetching one?

    I can't think what kind of trivial predication you're referring to that would waste R600's TUs. The "home" arbiter for the texture requests (for a batch) is forced to treat the 16 quads of texel results that it's waiting for as asynchronous events. Predication would de-select texture-fetches at the quad level, I guess, so the arbiter would only send out quad-fetches to "foreign" TUs as needed.

    Brainfade...

    Jawed
     
  17. dnavas

    Regular

    Joined:
    Apr 12, 2004
    Messages:
    375
    Likes Received:
    7
    Fair enough. You should assume that I'm asleep while posting this, then :)

    If quad X always goes to TMUx, then a predication mask that always masks (say) quad 2, will leave TMU2 without any work to do.
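A toy model of that masking case, as a rough sketch only. It assumes 16 quads per batch, 4 TMUs, and a fixed "quad q goes to TMU q mod 4" binding. The `dynamic_assignment` comparison is purely hypothetical and only there to show the contrast; it is not a claim about how any real chip schedules texture work.

```python
QUADS_PER_BATCH = 16  # assumption: 64 pixels per batch, 4 pixels per quad
NUM_TMUS = 4          # assumption for illustration


def static_assignment(active_quads, num_tmus):
    """Each quad is bound to a fixed TMU by its index within the batch."""
    load = [0] * num_tmus
    for q in active_quads:
        load[q % num_tmus] += 1
    return load


def dynamic_assignment(active_quads, num_tmus):
    """Hypothetical alternative: hand out surviving quads round-robin."""
    load = [0] * num_tmus
    for i, _ in enumerate(active_quads):
        load[i % num_tmus] += 1
    return load


# Predication mask that always kills the quads bound to TMU 2
active = [q for q in range(QUADS_PER_BATCH) if q % NUM_TMUS != 2]

static_load = static_assignment(active, NUM_TMUS)
dynamic_load = dynamic_assignment(active, NUM_TMUS)

print("static :", static_load)
print("dynamic:", dynamic_load)
```

With the fixed binding one TMU gets nothing to do, while the round-robin version spreads the 12 surviving quads evenly.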

    I'm not sure how a local TMU uses the ring at all -- local ALUs talk to local TMUs, I wouldn't expect that to be over the ring. As it is, ALUs are always talking to remote TMUs (how remote depends on which quad). Have I misunderstood something? [that's a stupid question ;)] What have I misunderstood?

    -Dave [->sleep]
     
  18. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,708
    Likes Received:
    2,132
    Location:
    London
    Ah, OK, that's the kind of thing synthetics are for. Actually, that'd prolly make a really good synthetic for testing the performance of R600 texturing. Similar to dynamic branching tests that only use rectangular areas of coherence.

    Which reminds me of a similar possibility with the way textures are defined and then fetched. It's possible to use a stride that will hit only one memory channel.
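The single-channel stride case is easy to show with a toy address mapping. This is a sketch under loud assumptions: 8 channels, a 256-byte interleave, and a plain linear "address / interleave mod channels" mapping. Real hardware address swizzling is usually more involved, so treat the numbers as illustrative only.

```python
NUM_CHANNELS = 8        # assumption for illustration
INTERLEAVE_BYTES = 256  # assumption for illustration


def channel_of(address, num_channels=NUM_CHANNELS,
               interleave_bytes=INTERLEAVE_BYTES):
    """Memory channel for a byte address under simple linear interleaving."""
    return (address // interleave_bytes) % num_channels


def channels_touched(stride, count):
    """Set of channels hit by `count` fetches spaced `stride` bytes apart."""
    return {channel_of(i * stride) for i in range(count)}


# Stride equal to channels * interleave: every fetch lands on one channel
bad = channels_touched(NUM_CHANNELS * INTERLEAVE_BYTES, 64)

# Stride equal to the interleave: fetches rotate through all channels
good = channels_touched(INTERLEAVE_BYTES, 64)

print("pathological stride hits channels:", sorted(bad))
print("friendly stride hits channels:    ", sorted(good))
```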

    No, but some of the texels could be in a foreign TMU's L2 already. Presuming that L2 is distributed - which I'm assuming is the case for the time being...

    No, I don't think you misunderstood anything. I might draw a diagram of how I think it all hangs together at some point...

    Ooh, hang on, there's this from Watch Impress

    [Image: slide from Watch Impress, not preserved]

    I wish AMD would just post the complete set of slides.

    Anyway, that doesn't show the ring bus at all, so I prolly should still have a go at a more detailed diagram.

    Jawed
     
  19. mhouston

    mhouston A little of this and that
    Regular

    Joined:
    Oct 7, 2005
    Messages:
    344
    Likes Received:
    38
    Location:
    Cupertino
    Eric Demers gave a talk about the R6XX processors at Stanford's CS448 and AMD actually let us post the slides. http://graphics.stanford.edu/cs448-07-spring/. The talk was not a completely deep technical dive as it was in some ways designed to inspire students aiming to become architects and talk about why some things were done.
     
  20. Geo

    Geo Mostly Harmless
    Legend

    Joined:
    Apr 22, 2002
    Messages:
    9,116
    Likes Received:
    215
    Location:
    Uffda-land
    We have Eric's architecture deep-dive from Tunis. We also have a long list of interview questions into Eric. Hopefully these things get published together. . .
     