AMD: Southern Islands (7*** series) Speculation/ Rumour Thread

Discussion in 'Architecture and Products' started by UniversalTruth, Dec 17, 2010.

  1. GZ007

    Regular

    Joined:
    Jan 22, 2010
    Messages:
    416
    Likes Received:
    0
    With multiple primitive pipes and a R/W L2, the new Radeon will probably catch up with Fermi in tessellation too. :razz:
     
  2. liolio

    liolio Aquoiboniste
    Legend

    Joined:
    Jun 28, 2005
    Messages:
    5,724
    Likes Received:
    195
    Location:
    Stateless
  3. no-X

    Veteran

    Joined:
    May 28, 2005
    Messages:
    2,455
    Likes Received:
    471
    GZ007: nVidia will probably introduce a new marketing "theonlyimportantfeature" with their new product, so tessellation will move to the "dry cow" list, directly under HDR and PhysX. No one will care about it at that time...
     
  4. liolio

    liolio Aquoiboniste
    Legend

    Joined:
    Jun 28, 2005
    Messages:
    5,724
    Likes Received:
    195
    Location:
    Stateless
    What do they mean by "PRTs to drive virtual texturing"? What does PRT mean?
     
  5. Dave Baumann

    Dave Baumann Gamerscore Wh...
    Moderator Legend

    Joined:
    Jan 29, 2002
    Messages:
    14,090
    Likes Received:
    694
    Location:
    O Canada!
    Partially Resident Texture (think megatexture)
     
  6. Gipsel

    Veteran

    Joined:
    Jan 4, 2010
    Messages:
    1,620
    Likes Received:
    264
    Location:
    Hamburg, Germany
    Probably fewer than 10 wavefronts are needed (this number per SIMD is mentioned several times and is probably on the safe side).

    I just wonder what the execution latency of the vector ALUs themselves is. In the Evergreen and Cayman architecture manuals some dependent operations within a VLIW instruction appear, indicating the pure execution latency is <=4 cycles. One of the slides comparing Cayman's VLIW with the new CU architecture mentions "interleaved wavefronts required" for VLIW (resulting in the 8 cycles we know for the VLIWs) on the one hand, while it lists "vector back-to-back wavefront instruction issue" as an advantage of the new architecture. Does that mean the cleaned-up register files (operand collection is much easier) made it possible to reduce the latency to 4 cycles, so it can issue dependent vector ops every four cycles on one SIMD?
    That would really simplify the whole scheduling (okay, the VLIW stuff does almost nothing in the SIMD engines themselves, which is why it can be inefficient, as changing clauses costs quite some time), as you don't have to track those dependencies (as Fermi needs to do with its 18 to about 40 cycles of latency and instruction issue every 2 cycles). And it would also fit perfectly with the description of the instruction arbitration part. I would like it. But I don't know if it is very feasible when looking at the latencies of NVIDIA GPUs. Compared with CPUs, on the other hand, Fermi has a factor of 10 higher latency for floating point instructions (DP may be worse, integer is just ridiculous) and Cayman is not that much better. Reducing this gap a bit may be possible even when considering the die size and power budget of an ALU in a GPU. The CPU guys at AMD should have plenty of experience in designing fast register files, operand and result networks and such stuff, shouldn't they? Because, as said, the Cayman ALUs probably have just 4 cycles of latency (maybe except for FMA and double precision stuff).
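
    To put rough numbers on the latency argument above, here is a minimal sketch, not anything taken from AMD's or NVIDIA's documentation. It assumes a SIMD can keep issuing dependent instructions without stalling as long as it has ceil(latency / issue interval) independent wavefronts (or warps) to rotate through. The function name and the simple model are hypothetical; the cycle counts are the ones quoted in this post.

    ```python
    import math

    def wavefronts_to_hide_latency(alu_latency_cycles: int, issue_interval_cycles: int) -> int:
        """Independent wavefronts/warps needed so a dependent op never waits.

        Simplified model (an assumption): one instruction is issued per
        wavefront every `issue_interval_cycles`, and its result is ready
        `alu_latency_cycles` after issue.
        """
        return math.ceil(alu_latency_cycles / issue_interval_cycles)

    # Cayman-style VLIW: 8 cycles of latency, 4-cycle issue -> two interleaved wavefronts.
    vliw = wavefronts_to_hide_latency(8, 4)

    # Speculated new CU: 4 cycles of latency, 4-cycle issue -> back-to-back issue from one wavefront.
    new_cu = wavefronts_to_hide_latency(4, 4)

    # Fermi, using the 18 to ~40 cycle latency and 2-cycle issue mentioned above.
    fermi_best = wavefronts_to_hide_latency(18, 2)
    fermi_worst = wavefronts_to_hide_latency(40, 2)

    print(vliw, new_cu, fermi_best, fermi_worst)
    ```

    Under this model a 4-cycle ALU needs no dependency tracking at all for vector ops, which is the simplification speculated about above.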
     
  7. Alexko

    Veteran Subscriber

    Joined:
    Aug 31, 2009
    Messages:
    4,541
    Likes Received:
    964
    If it ends up better at it than Fermi, I'll bet NVIDIA will claim that tessellation is an unimportant gimmick… :D
     
  8. fellix

    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,552
    Likes Received:
    514
    Location:
    Varna, Bulgaria
    Looks like the TS stage will still be tied to the setup pipeline as in the current architecture, and not distributed among the SIMD multiprocessors like in Fermi. But with (at least) four of those primitive pipes and the coherent L2, I think AMD can catch up with NV in heavy tessellation performance.
     
  9. Tim

    Tim
    Regular

    Joined:
    Mar 28, 2003
    Messages:
    875
    Likes Received:
    5
    Location:
    Denmark
    No, in the webcast Eric said the next iteration of DX11.
     
  10. fellix

    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,552
    Likes Received:
    514
    Location:
    Varna, Bulgaria
    Don't you think the L1 bandwidth is a bit underwhelming? Fermi can fetch twice that, per SM.

    By the way, Eric let slip an aggregate bandwidth estimate of 1.5 TB/s for the cache. With a conservative estimate for the chip clock rate (~850 MHz), that would yield between 26 and 30 CUs for the flagship SKU.
     
  11. Gipsel

    Veteran

    Joined:
    Jan 4, 2010
    Messages:
    1,620
    Likes Received:
    264
    Location:
    Hamburg, Germany
    Not per clock. A GF100 can also only fetch 64 bytes/clock in the best case, can't it?
    Edit:
    And AMD's new CU architecture has separate access to the LDS with 128 bytes/clock of bandwidth. Fermi has to share the 64 bytes/clock between cache and local memory, afaik.

    Let's say 800 MHz (which would fit the 4-cycle vector pipeline latency proposed above ;)) and it means there are 32 CUs. At least if that is 1.5 TiB/s. Otherwise 28 CUs at 850 MHz is also quite close. The number of CUs needs to be divisible by 4.
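
    The CU estimate above can be reproduced with a quick sketch. It assumes 64 bytes/clock of L1 bandwidth per CU (inferred from the comparison with GF100 in this post, not a confirmed spec) and rounds to the nearest multiple of 4 CUs. The function and parameter names are hypothetical.

    ```python
    def cu_count(aggregate_bytes_per_s: float, clock_hz: float,
                 bytes_per_clock_per_cu: int = 64, granularity: int = 4):
        """Return (raw, rounded) CU count implied by an aggregate L1 bandwidth.

        bytes_per_clock_per_cu=64 is an assumption taken from the thread;
        granularity=4 reflects the "divisible by 4" remark.
        """
        raw = aggregate_bytes_per_s / (bytes_per_clock_per_cu * clock_hz)
        rounded = granularity * round(raw / granularity)
        return raw, rounded

    # 1.5 TiB/s at 800 MHz
    raw_a, cus_a = cu_count(1.5 * 2**40, 800e6)

    # 1.5 TB/s (decimal) at 850 MHz
    raw_b, cus_b = cu_count(1.5e12, 850e6)

    print(f"{raw_a:.1f} -> {cus_a} CUs, {raw_b:.1f} -> {cus_b} CUs")
    ```

    The raw values come out at roughly 32.2 and 27.6, which is where the 32 and 28 CU figures come from.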
     
    #251 Gipsel, Jun 16, 2011
    Last edited by a moderator: Jun 16, 2011
  12. 3dcgi

    Veteran Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    2,493
    Likes Received:
    474
    The poor choice is designing hardware assuming the maximum amount of memory is always needed.
     
  13. psurge

    Regular

    Joined:
    Feb 6, 2002
    Messages:
    955
    Likes Received:
    52
    Location:
    LA, California
    True, but a CU looks like it has about 4x the compute resources of a GF100 SM. So it had better have higher bandwidth to memory, or it will fall down relative to GF100 on kernels with low arithmetic-to-memory-op ratios.
     
  14. Gipsel

    Veteran

    Joined:
    Jan 4, 2010
    Messages:
    1,620
    Likes Received:
    264
    Location:
    Hamburg, Germany
    No, just twice the per-clock peak (64 FMA vs. 32 FMA per clock) vs. GF100/110 and only 33% more (64 vs. 48) vs. the SMs of the GF104 type.
     
    #254 Gipsel, Jun 17, 2011
    Last edited by a moderator: Jun 17, 2011
  15. psurge

    Regular

    Joined:
    Feb 6, 2002
    Messages:
    955
    Likes Received:
    52
    Location:
    LA, California
    Ah, sorry. For some reason I was thinking a GF100 SM did 16 FMAs per clock - should have double-checked. Still (GF104 excepted), it's not quite as big a difference as it looks at first.
     
  16. Alexko

    Veteran Subscriber

    Joined:
    Aug 31, 2009
    Messages:
    4,541
    Likes Received:
    964
    (Twice, as has already been pointed out.) And it should also be running at ~40% lower clocks.

    2×0.6 = 1.2, or 20% more compute power per CU versus Fermi's SM.
     
  17. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,059
    Likes Received:
    3,119
    Location:
    New York
    Fermi is also 64 FMAs per core clock. Its ALUs run at 2x the core clock. A Fermi SM and a CU have the same throughput per core clock. Register file bandwidth per SIMD per clock is equivalent as well.
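
    A small sketch of the arithmetic behind the per-clock comparison in the last few posts. The ALU counts (32 for a GF100 SM, 48 for GF104, 64 for a CU) and the 2x hot clock are the figures quoted in this thread; the helper name is hypothetical and the ~40% lower clock is the guess made above, not a known spec.

    ```python
    def fma_per_core_clock(alus: int, alu_clock_multiplier: int) -> int:
        """FMAs issued per *core* clock, given the ALU count and the ALU/core clock ratio."""
        return alus * alu_clock_multiplier

    gf100_sm = fma_per_core_clock(32, 2)  # Fermi ALUs run at 2x the core clock
    gf104_sm = fma_per_core_clock(48, 2)
    cu = fma_per_core_clock(64, 1)        # CU ALUs assumed to run at the core clock

    # Per-second ratio if the CU clock is ~40% lower than Fermi's ALU (hot) clock:
    # 64 FMAs at 0.6x the clock versus 32 FMAs at 1.0x the clock.
    cu_vs_gf100_per_second = (64 * 0.6) / (32 * 1.0)

    print(gf100_sm, gf104_sm, cu, cu_vs_gf100_per_second)
    ```

    So "twice per ALU clock", "20% more per second" and "equal per core clock" are the same numbers viewed against different clocks.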

    With respect to nVidia's response, there's no new API to shout about so they'll have to try harder.
     
  18. Kaotik

    Kaotik Drunk Member
    Legend

    Joined:
    Apr 16, 2003
    Messages:
    10,245
    Likes Received:
    4,465
    Location:
    Finland
    Unless you're Finnish, then no ;)
     
  19. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,059
    Likes Received:
    3,119
    Location:
    New York
    128.
     
  20. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
    In what sense does it take it another step further?
     