AMD: R9xx Speculation

Discussion in 'Architecture and Products' started by Lukfi, Oct 5, 2009.

  1. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
Conceptually, isn't distributed setup/raster another way of doing TBR? Or at least the beginning of a migration towards TBR?
     
  2. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
If you are not expecting changes in geometry processing, what other changes are you expecting from SI/NI/whichever thing comes this year?
     
  3. CarstenS

    Legend Subscriber

    Joined:
    May 31, 2002
    Messages:
    5,800
    Likes Received:
    3,920
    Location:
    Germany
I wouldn't necessarily say so. It just lessens a serial bottleneck, which would also greatly benefit any TB(D)R.
     
  4. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
And it does that by performing spatial binning of primitives, right? So maybe not a full-blown TBR, but IMHO Fermi is laying the foundations for TBR to come back to the desktop.
     
  5. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    The setup-rasteriser architecture is explicitly large-triangle friendly, not small-triangle friendly. It needs a complete overhaul for future scaling. While it appears adequate for games - that's mostly because it's early days, I reckon. And lack of analysis.

    I'm waiting for a decent analysis of the behaviour of Fermi architecture here.

    That difference didn't really cost NVidia anything. I think that's partly because ATI rasterisation performance in games is poor these days and partly because NVidia had a Z rate advantage - though not necessarily being used very well.

    More analysis of frame-rate minima is needed.

    How big is the setup-rasteriser in GT200? Also, how much of that growth was caused by improved early-Z culling, screen-space tiling, etc.? In other words, 10% isn't very meaningful :???:

    Evergreen had features axed in order to launch on time. Sure, any chip gets features axed, in theory, to launch on time.

    I think there are 4 key areas of change in R700->Evergreen:
    1. ALU utilisation - a real DOT3, pairs of DOT2 and various friends (instructions with PREV in ISA name), improved precision and general flexibility
    2. thread-generation/interpolation/LDS - interpolation is dependent upon LDS and thread generation used to be a sub-function of interpolation (or interpolation used to be a sub-function of thread generation - doesn't really matter)
    3. tessellation - support for HS and DS, a dedicated TS stage for D3D11 (distinct from Xenos/R600 style TS)
    4. scatter/gather/atomics - memory operation performance freed from TUs and ROPs
    To put it bluntly, all of these are failures:
    1. there is no need to have kept 5-way SIMD in my view, 4-way is clearly preferable (>80% utilisation is rare)
    2. thread generation is still bound by rasterisation rate it seems - multiple DirectCompute kernels can execute concurrently which helps, but thread generation for small triangles is fucked I reckon
    3. it's not going to scale, it's designed for a slow rasteriser (see 2.)
    4. scatter/gather is still a second-class citizen
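Point 1 can be illustrated with a toy Python model (all bundle widths below are invented, not measured): if the compiler rarely extracts 5 independent ops per bundle, a 4-wide machine wastes fewer slots than a 5-wide one for the same trace.

```python
# Toy model: ALU slot utilisation of a VLIW SIMD given a trace of
# per-bundle independent-op counts. The trace is a hypothetical
# shader, not real compiler output.

def utilisation(trace, width):
    """Fraction of ALU slots doing useful work at a given SIMD width.

    A bundle wider than the machine is split over several issue
    cycles; the last cycle of a split bundle is partially empty.
    """
    issued = sum(trace)                          # useful ops
    cycles = sum(-(-n // width) for n in trace)  # ceil-divide per bundle
    return issued / (cycles * width)

# Hypothetical trace: most bundles only extract 2-4 independent ops.
trace = [3, 4, 2, 4, 3, 5, 4, 3, 2, 4]

print(utilisation(trace, 5))   # 0.68  - many idle slots at 5-wide
print(utilisation(trace, 4))   # ~0.77 - better fit at 4-wide
```

With this (made-up) op mix the 5-wide machine sits well under the >80% mark, which is the shape of the argument for dropping the fifth lane.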
    Options:
1. There are a few choices for VLIW reorganisation. Ultimately it's intricately tied to register-file and TU organisation/operation. One issue is that as the GPU architecture moves closer to being re-used in Fusion, double-precision cannot remain optional, i.e. has to be optimal for all ALU implementations. But that could be 5 years away. How much is Fusion supposed to trail GPU? 18 months? Also, what clock rates are required for Fusion, i.e. when tighter integration of GPU-ALU processing comes, is 1GHz suitable when the CPU is running at 3GHz?
    2. and 3. go together. LDS seems fine, though the compiler guys are really struggling to make it work well (similar story for interpolation).
      Rasterisation is inherently parallelisable at the granularity of a quad. Hierarchical-Z/early-Z makes hierarchical-rasterisation preferable, but that process is the slave of one triangle at a time per quad, due to fundamental ordering-constraints. So hierarchical-Z either needs to be shallower (with smaller screen-space tiles) or a delayed-commit early-Z system needs to be implemented.
      There's an argument that if rasterisation of small triangles is done quickly (so that a hardware thread can be populated with <=16 triangles' fragments in <=4 cycles) then early-Z becomes redundant at least some of the time. This is on the basis that it's faster to rasterise, shade and late-Z cull than it is to delay a hardware thread until it's fully populated with 16 triangles by a slow rasteriser/hierarchical-Z unit (or to run a 16-quad hardware thread with only 1 quad of fragments active). Best for short shaders/Z-prepass/shadow-buffer-rendering.
      A joker in the pack is sample-frequency shading (a feature of D3D10.1), i.e. shading per MSAA sample, not per fragment - this naturally makes all triangles typically 4 times bigger :razz:
    3. once the hardware can cope with lots of small triangles then it'll be worth making a fast TS.
      GDS in Evergreen (an enhanced version of GDS from R700 and presumably R600) is a bottleneck for TS operations.
      Also, triangles always "exit" a core, so there's a wodge of data moving large distances. NVidia's design minimises the movement of triangle data (though doesn't eliminate it - depends on the screen-space tiles that a triangle ends-up being rasterised over), i.e. L2 is a bottleneck for some triangles. TS is trivially parallelisable per patch, though a patch can generate a vast output - cache ahoy.
I still wonder if setup could be implemented as a GS kernel (not much different to the fixed function interpolators that are now run as a kernel). Since GS in ATI is, sometimes, dependent upon buffers in global memory, caching for global memory would be able to underlie GS and keep setup's data on-die.
    4. memory controllers in Evergreen still seem to be ROP/TU centric. There appears to be no meaningful coalescing, as it appears to be available only when a scatter/gather has no incoherency whatsoever: black/white - detail is scant though. Cache ahoy?
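The trade-off in 2. and 3. above - launch a partially filled wavefront and let late-Z cull, versus stalling while a slow rasteriser fills all 16 quads - can be sketched as a deliberately crude cost model. Every number here (raster cycles per triangle, shader length) is invented purely for illustration:

```python
# Crude cost model: latency to get one wavefront's work done when
# triangles are tiny. All rates below are hypothetical.

WAVE_QUADS = 16   # quads needed to fully populate a hardware thread

def launch_when_full(n_tris, quads_per_tri, raster_cyc_per_tri, shader_len):
    """Stall until the wavefront holds 16 quads, then shade once."""
    tris_needed = -(-WAVE_QUADS // quads_per_tri)       # ceil
    fill = min(n_tris, tris_needed) * raster_cyc_per_tri
    return fill + shader_len

def launch_partial(n_tris, quads_per_tri, raster_cyc_per_tri, shader_len):
    """Launch with whatever one triangle provides; later waves overlap."""
    return raster_cyc_per_tri + shader_len              # per-wave latency

# 1-quad triangles, slow rasteriser (4 cyc/tri), short shadow-pass shader
print(launch_when_full(100, 1, 4, 8))   # 16*4 + 8 = 72 cycles
print(launch_partial(100, 1, 4, 8))     # 4 + 8 = 12 cycles
```

Under these assumed numbers the short-shader case (Z-prepass, shadow buffers) clearly favours launching early and eating the late-Z cost, which is the scenario described above.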
    I'm really puzzled why 1. didn't happen in Evergreen. Maybe that was purely workload, rather than 40nm-problems cut-back?

    As for the rest, they're all major changes. Then it's a matter of whether the architecture undergoes creeping-featurism or one final radical upgrade.

    I tend to think it's creeping-featurism.

    I suspect 2 and 3 are dependent upon 4, because data-paths/cache-hierarchy need to be made robust, something that NVidia did a good job of in Fermi. 3 doesn't seem particularly difficult (and I don't see anything wrong with making TS a kernel, for what it's worth). 4 could be done without doing 2 and 3, leaving them for later. Hell, 3 could be done even if 2 isn't (the presence of two distinct TS units screams kludge to me).

    Jawed
     
  6. neliz

    neliz GIGABYTE Man
    Veteran

    Joined:
    Mar 30, 2005
    Messages:
    4,904
    Likes Received:
    23
    Location:
    In the know
  7. AlexV

    AlexV Heteroscedasticitate
    Moderator Veteran

    Joined:
    Mar 15, 2005
    Messages:
    2,535
    Likes Received:
    144
What exactly are you basing this on? If it's Damien's numbers, perhaps you shouldn't. Cypress can hit near its theoretical rate with 1-pixel tris just fine, and it will be screwed by large triangles that straddle screen tiles, thus making raster the limiting factor. The story with tessellation is a bit more involved, and the really awful 1 tri per 3 clocks behaviour is not, in spite of what has been alluded to, necessarily the norm (albeit there seems to be a fixed cost attached to enabling tessellation, which is independent of tessellation factor/triangle size).
     
  8. BRiT

    BRiT (>• •)>⌐■-■ (⌐■-■)
    Moderator Legend Alpha

    Joined:
    Feb 7, 2002
    Messages:
    20,516
    Likes Received:
    24,424
    What if you laughed before visiting the link?
     
  9. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
    What a pointless waste of time. Prolly xxx is running this site. :grin:
     
  10. Novum

    Regular

    Joined:
    Jun 28, 2006
    Messages:
    335
    Likes Received:
    8
    Location:
    Germany
    It's very unlikely that existing FS do that, because you couldn't render with DP precision until very recently.
     
  11. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    Patents which I've linked and discussed + very strong recommendations from AMD devrel not to tessellate below 8-fragment triangles + the stunningly awful performance in non-game tests.

    The instant a triangle falls below 4 quads in size and occupies only one screen space tile, one rasteriser is idle. This kills performance on z-prepass and shadow buffer rendering.
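That claim can be sketched with a toy binning model. The tile size and the checkerboard assignment of tiles to rasterisers below are assumptions for illustration, not the documented Cypress scheme:

```python
# Toy sketch: which rasterisers a triangle's bounding box touches,
# assuming (hypothetically) 16-pixel screen tiles assigned to two
# rasterisers in a checkerboard pattern.

TILE = 16        # tile edge in pixels (assumed)
NUM_RAST = 2     # dual-rasteriser, Cypress-style

def rasterisers_touched(bbox):
    """Set of rasteriser ids whose tiles overlap bbox = (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = bbox
    hit = set()
    for ty in range(y0 // TILE, y1 // TILE + 1):
        for tx in range(x0 // TILE, x1 // TILE + 1):
            hit.add((tx + ty) % NUM_RAST)   # checkerboard assignment
    return hit

big   = rasterisers_touched((0, 0, 100, 100))  # straddles many tiles
small = rasterisers_touched((5, 5, 8, 8))      # fits in a single tile

print(len(big), len(small))   # 2 1 - the small tri leaves one rasteriser idle
```

However the tiles are actually assigned, a triangle inside one tile can only ever engage one rasteriser, which is the point being made for Z-prepass and shadow-buffer passes full of tiny triangles.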

    Tessellation performance?

    http://www.hitechlegion.com/reviews/video-cards/4742-evga-geforce-gtx-460-768mb-video-card?start=18
    http://www.hitechlegion.com/reviews/video-cards/3177?start=17

    See the SubD11 sample result at the bottom of those pages. This is something that AMD demonstrated running on Juniper over 1 year ago. I bet the guys at NVidia had a chuckle when they saw the performance.

    NVidia's water demo is comparatively kind :lol:

    Jawed
     
  12. CarstenS

    Legend Subscriber

    Joined:
    May 31, 2002
    Messages:
    5,800
    Likes Received:
    3,920
    Location:
    Germany
Tell me, did we leave "the early days" of DX10's Geometry Shader? (I don't have to spell out the analogies here, do I?)
Apparently, people associate bad-ass slowness with "in software", and that's what I was talking about - not whether or not Fermi might use transistors for other stuff than the tessellation stage. :)

    Frankly, I have no idea what you're saying here. Which difference? Whose raster perf is poor? And which z-rates are poorly utilized?


Since merely multiplying the setup/rasterizer doesn't get you anywhere if you do not also reinforce the necessary infrastructure… And to do it properly, you'll have to walk the painful way, I guess.
Anyways, I had the same question and they said: 10% more compared to an approach analogous to GT200/RV790.

    You forgot one very important key change: The number of units. Granted, it's a rather obvious thing, but if you have a performance, cost and yield target, you also have to factor in, exactly how many of the engineer's dreams you can incorporate into the new design in order to meet these goals.
     
  13. Drazick

    Newcomer

    Joined:
    Jan 27, 2010
    Messages:
    58
    Likes Received:
    0
It's a chicken-and-egg circle.
Once they deliver the performance and capabilities, there will be applications that take advantage of them.

I'm not really sure (I don't have any real knowledge in that area), but do video encoders use DP or SP?

I don't know; as a user who doesn't play any games, I'm eager to see some applications that would benefit from the GPU - something beyond gaming.
On a much larger scale.
     
  14. 3dcgi

    Veteran Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    2,493
    Likes Received:
    474
    Only the pixel shader thread generation is bound by rasterization. Compute is not.
     
  15. eastmen

    Legend Subscriber

    Joined:
    Mar 17, 2008
    Messages:
    13,878
    Likes Received:
    4,727
Thanks for all that information and work. Really cool. I'd personally love to have a discarded wafer with bad chips on it. I'd frame it and hang it up on my wall. They are really beautiful things. I've only seen two wafers that were actually used, though.
     
  16. Dave Baumann

    Dave Baumann Gamerscore Wh...
    Moderator Legend

    Joined:
    Jan 29, 2002
    Messages:
    14,090
    Likes Received:
    694
    Location:
    O Canada!
Not sure what NVIDIA would know about the "RV790" size/approach. The simple fact of the matter is that NVIDIA had to make a larger change to their architecture to support tessellation simply because they didn't have it before, whereas it has been ingrained in our designs for multiple generations.
     
  17. aaronspink

    Veteran

    Joined:
    Jun 20, 2003
    Messages:
    2,641
    Likes Received:
    64
    They primarily use int. They use FP but generally only as control parameters for things like quality metrics, rate control, etc. Most of the actual data stays in the fixed point domain.
     
  18. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
    Video decode/encode is almost exclusively integer ops based. And 8bit and 16bit ops dominate there.
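That integer dominance is easy to see in the hot loop of an encoder: motion estimation is essentially a sum of absolute differences over 8-bit samples. A minimal sketch (block contents are made-up illustrative values):

```python
# Why encoders live in the integer domain: the motion-estimation inner
# loop is a sum of absolute differences (SAD) over 8-bit pixel samples.

def sad(block_a, block_b):
    """Sum of absolute differences between two blocks of 8-bit samples."""
    return sum(abs(a - b) for a, b in zip(block_a, block_b))

cur = [16, 18, 20, 22]   # current-frame samples (0..255)
ref = [15, 19, 18, 25]   # reference-frame candidate block

print(sad(cur, ref))     # 1 + 1 + 2 + 3 = 7
```

Everything stays in small integers; the floating-point mentioned above only shows up in control logic like rate control and quality metrics.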
     
  19. AlexV

    AlexV Heteroscedasticitate
    Moderator Veteran

    Joined:
    Mar 15, 2005
    Messages:
    2,535
    Likes Received:
    144
Patents are nice and all, but their materialization in hardware is an unknown quantity. nVidia tends to recommend the same thing with regards to not going under 8-pixel triangles in general - does that mean they're small-triangle unfriendly? (Hint: Fermi's epic setup rate happens with really small triangles, so if that were the only consideration, that's what they'd want always.)

Do you base your definitive statement about what happens the instant a triangle falls under a particular area on actual experience with the hardware? If so, please detail it, because that's definitely not what I (and others) are seeing in practice. Tessellation means a shitload more than setup - raster, and there are multiple potential sticking points with regards to data flow that can and apparently do hamper Cypress performance, or rather, expose the parts where it has a less graceful performance decline compared to Fermi. I don't think that we should be using SDK samples that are written for readability rather than performance, and which do quite a few things, to underline specific architectural traits. If we want to talk about setup/raster, we need to use something that isolates those portions as well as possible, and start from there.
     
  20. CarstenS

    Legend Subscriber

    Joined:
    May 31, 2002
    Messages:
    5,800
    Likes Received:
    3,920
    Location:
    Germany
    Yes, of course. And of course their 0.5-triangles-drawn-per-clock-approach would have made them look REALLY bad in tessellated workloads. OTOH, their knowledge of your chips is - IMHO - second only to your own (and vice versa). Both companies should have tools for analyzing chips normal people like us can only dream of. :)
     
  • About Us

    Beyond3D has been around for over a decade and prides itself on being the best place on the web for in-depth, technically-driven discussion and analysis of 3D graphics hardware. If you love pixels and transistors, you've come to the right place!

    Beyond3D is proudly published by GPU Tools Ltd.