AMD: Southern Islands (7*** series) Speculation/ Rumour Thread

Discussion in 'Architecture and Products' started by UniversalTruth, Dec 17, 2010.

  1. Alexko

    Veteran Subscriber

    Joined:
    Aug 31, 2009
    Messages:
    4,541
    Likes Received:
    964
    If my calculations are right, a 32-CU chip would have:

    — 4×64kB (vec regs) + 64kB (LDS) + 16kB (R/W data L1) + 8 kB (scalar regs) = 344 kB per Compute Unit,
    — 16 kB (shared L1) + 32 kB (shared iL1) = 48 kB per CU Array,
    — Probably 512 kB of L2.

    That's a total of 32×344 + 8×48 + 512 = 11,904 kB or 11.6 MB of internal memory, i.e. registers + cache. That's quite a lot, and if I recall correctly, Fermi has about 4 MB.

    Edit:

    Actually, Fermi has:

    — 128 kB (vec regs) + 64 kB (L1) = 192 kB per SM,
    — 768 kB of L2.

    That's a total of 16×192 + 768 = 3,840 kB or 3.75 MB of internal memory.
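
    As a quick check on the arithmetic, both totals can be reproduced in a few lines of Python. This is only a sketch built from the per-unit figures quoted in this post; the helper name `total_kb` and the constant names are made up for illustration, and the 512 kB L2 for the 32-CU chip is the guess from above, not a confirmed number.

```python
# Per-unit on-chip storage in kB, taken from the figures in this post.
GCN_CU_KB = 4 * 64 + 64 + 16 + 8      # vec regs + LDS + R/W data L1 + scalar regs
GCN_ARRAY_KB = 16 + 32                # shared L1 + shared instruction L1
FERMI_SM_KB = 128 + 64                # vec regs + L1


def total_kb(units):
    """Sum count * size_kb over (count, size_kb) pairs."""
    return sum(count * size_kb for count, size_kb in units)


# 32 CUs, 8 CU arrays (the 8x48 term above), 512 kB L2 (a guess).
gcn_kb = total_kb([(32, GCN_CU_KB), (8, GCN_ARRAY_KB), (1, 512)])
# 16 SMs plus 768 kB of L2.
fermi_kb = total_kb([(16, FERMI_SM_KB), (1, 768)])

print(f"GCN 32-CU: {gcn_kb} kB = {gcn_kb / 1024:.3f} MB")
print(f"Fermi:     {fermi_kb} kB = {fermi_kb / 1024:.3f} MB")
```

    Keeping the unit counts and sizes as pairs makes it easy to swap in a different L2 size or CU count if the guesses turn out to be wrong.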
     
    #321 Alexko, Jun 18, 2011
    Last edited by a moderator: Jun 18, 2011
  2. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
    L2 is per mem channel.
     
  3. Gipsel

    Veteran

    Joined:
    Jan 4, 2010
    Messages:
    1,620
    Likes Received:
    264
    Location:
    Hamburg, Germany
    Right (edit: you forgot the Tex-L1 for Fermi, if you include that it's a bit more than 4 MB), and Cayman has:
    24 x (256 kB Regs + 32 kB LDS + 8 kB L1) = 24 x 296 kB
    512 kB L2 (and 64 kB GDS)

    Total: 7.5 MB
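
    The same sum written out in Python, as a sanity check only. The variable names are made up here; the figures are the ones listed above, and the total includes the 64 kB GDS.

```python
# Cayman on-chip storage in kB, using the figures listed above.
SIMDS = 24
PER_SIMD_KB = 256 + 32 + 8   # registers + LDS + L1
L2_KB = 512
GDS_KB = 64

cayman_kb = SIMDS * PER_SIMD_KB + L2_KB + GDS_KB
cayman_mb = cayman_kb / 1024

print(f"Cayman: {cayman_kb} kB = {cayman_mb} MB")
```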
     
    #323 Gipsel, Jun 18, 2011
    Last edited by a moderator: Jun 18, 2011
  4. Gipsel

    Veteran

    Joined:
    Jan 4, 2010
    Messages:
    1,620
    Likes Received:
    264
    Location:
    Hamburg, Germany
    Yes, 128 kB per channel, same as with AMD's higher end GPUs.
     
  5. Alexko

    Veteran Subscriber

    Joined:
    Aug 31, 2009
    Messages:
    4,541
    Likes Received:
    964
    Isn't it per memory controller, each controller driving two 32-bit channels? I was assuming this and a 256-bit interface.
     
  6. Gipsel

    Veteran

    Joined:
    Jan 4, 2010
    Messages:
    1,620
    Likes Received:
    264
    Location:
    Hamburg, Germany
    Okay, but do we really know whether a channel is made up of (at least) two parallel DRAM chips, or whether a memory controller controls two (32-bit) channels?
     
  7. fellix

    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,552
    Likes Received:
    514
    Location:
    Varna, Bulgaria
    And the 64KB of constant (uniform) cache per SM in Fermi, too! ;)
     
  8. Gipsel

    Veteran

    Joined:
    Jan 4, 2010
    Messages:
    1,620
    Likes Received:
    264
    Location:
    Hamburg, Germany
    The spec allows the use of up to 64 kB constant memory. But the cache is much smaller, only 8kB or so, if I remember correctly. And it's not like AMD GPUs do not also have a constant buffer :wink:

    Maybe we should start to count the bytes in the write combining buffers. :lol:
     
    #328 Gipsel, Jun 18, 2011
    Last edited by a moderator: Jun 18, 2011
  9. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,059
    Likes Received:
    3,119
    Location:
    New York
    I guess Fermi no longer has an instruction cache either :)

    We don't know what 28nm Fermi (Kepler) looks like yet and it could have 2MB L2 for all we know. I'm half expecting it. The first iteration of FSA could have 1-2MB of L2 as well.
     
  10. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
    At 128 kB per 64-bit memory channel, it only comes to 512 kB for SI.
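
    Spelled out as a small Python sketch: if each 64-bit memory channel carries its own 128 kB L2 slice, the L2 size follows directly from the bus width. The function name and parameters are hypothetical, and the 256-bit bus for SI is an assumption, not a known spec. The 384-bit case is Fermi's (GF100) bus width, which gives the 768 kB quoted earlier in the thread.

```python
def l2_size_kb(bus_width_bits, channel_width_bits=64, l2_per_channel_kb=128):
    """L2 capacity if every memory channel carries its own L2 slice."""
    if bus_width_bits % channel_width_bits != 0:
        raise ValueError("bus width must be a multiple of the channel width")
    channels = bus_width_bits // channel_width_bits
    return channels * l2_per_channel_kb


print(l2_size_kb(256))  # assumed 256-bit bus: 4 channels x 128 kB
print(l2_size_kb(384))  # 384-bit bus: 6 channels x 128 kB
```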
     
  11. Alexko

    Veteran Subscriber

    Joined:
    Aug 31, 2009
    Messages:
    4,541
    Likes Received:
    964
    Perhaps I'm misunderstanding what you mean, but would that really make any difference?

    Sounds plausible. So far, GPUs tend to have a buttload of registers, lots of small L1 cache, and a tiny L2. The current trend (visible on NVIDIA's architectures anyway) is that registers don't grow as quickly as SPs or L1 cache. L2 is very new so there's not much we can say about it.

    So yeah, maybe the future will see register_size/SP going down a bit with L1 and L2 caches growing significantly. GPUs integrated into APUs are quite likely to have access to a pretty large L3 cache as well (as is already the case in Sandy Bridge), so it's possible that future GPU memory hierarchies will look much more like those of CPUs.
     
  12. Gipsel

    Veteran

    Joined:
    Jan 4, 2010
    Messages:
    1,620
    Likes Received:
    264
    Location:
    Hamburg, Germany
    A memory channel is independent: it can read from a different address than other channels (at least I would define it that way). Using two 32-bit DRAM chips in parallel on a single channel reads 64 bits/cycle from a single address.
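
    A toy model of that distinction, in Python. Everything here is a simplifying assumption for illustration: requests are 32-bit word addresses, a ganged 64-bit channel returns one aligned pair of words per cycle, and two independent 32-bit channels are interleaved by word (even words on one, odd words on the other). Real DRAM bursts and controller scheduling are ignored.

```python
def cycles_ganged(word_addrs):
    """One 64-bit channel: each cycle returns one aligned pair of 32-bit words."""
    return len({addr // 2 for addr in word_addrs})


def cycles_independent(word_addrs):
    """Two 32-bit channels: even words on channel 0, odd words on channel 1."""
    words = set(word_addrs)
    even = sum(1 for addr in words if addr % 2 == 0)
    odd = len(words) - even
    # The channels work in parallel, so the busier one sets the cycle count.
    return max(even, odd)


sequential = [0, 1, 2, 3]
scattered = [0, 5, 10, 15]
for name, addrs in (("sequential", sequential), ("scattered", scattered)):
    print(name, cycles_ganged(addrs), cycles_independent(addrs))
```

    In this model the two organizations tie on sequential reads, and the independent channels only pull ahead when the requested words are scattered across different addresses.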

    Edit:
    Register files as used by GPUs are basically cheaper to implement than caches with the same bandwidth and comparable latency.
     
    #332 Gipsel, Jun 18, 2011
    Last edited by a moderator: Jun 18, 2011
  13. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,059
    Likes Received:
    3,119
    Location:
    New York
    Right, assuming a 256-bit bus :)

    Yeah, GPUs have solved the latency hiding problem for the most part but register files and thread counts can't grow indefinitely. There should be a lot more focus on reducing absolute memory latencies going forward.
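
    A back-of-the-envelope sketch of why thread counts and register files pull against each other, in Python. The 64 kB vector register file per SIMD is the figure quoted earlier in this thread; the 32-bit register size, the 4 cycles per wavefront instruction, the 10 ALU instructions between loads and the latency values are all assumptions picked for illustration, and both function names are hypothetical.

```python
import math

REGFILE_BYTES = 64 * 1024   # per-SIMD vector register file, figure from this thread
WAVEFRONT = 64              # work-items per wavefront
REG_BYTES = 4               # assumption: 32-bit registers


def wavefronts_to_hide(latency_cycles, alu_instrs_between_loads, cycles_per_instr=4):
    """Wavefronts per SIMD so that ALU work from the others covers one memory latency."""
    busy = alu_instrs_between_loads * cycles_per_instr
    return 1 + math.ceil(latency_cycles / busy)


def regs_per_work_item(wavefronts):
    """Vector registers available to each work-item when the file is split evenly."""
    return REGFILE_BYTES // (WAVEFRONT * REG_BYTES * wavefronts)


for latency in (200, 400, 800):      # hypothetical memory latencies in cycles
    n = wavefronts_to_hide(latency, alu_instrs_between_loads=10)
    print(latency, n, regs_per_work_item(n))
```

    Under these assumptions, every doubling of latency roughly doubles the wavefronts needed and halves the registers each work-item can use, which is the squeeze that lower absolute latency would relieve.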

    Does anyone know how Cayman's LDS coalescing works? I believe broadcast is supported but does it also support swizzled reads in a single request?
     
  14. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
    I don't think there's any real alternative to more registers and threads, although returns will diminish. Unless there's a revolution in package tech.

    It would help massively though if they could unify the registers, LDS and caches into a single pool. Just registers and LDS would be a big step forward as well.
     
  15. Gipsel

    Veteran

    Joined:
    Jan 4, 2010
    Messages:
    1,620
    Likes Received:
    264
    Location:
    Hamburg, Germany
    Thinking of Larrabee with a swizzle/permute unit between register file and vector ALU?
     
  16. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,400
    Likes Received:
    440
    Location:
    San Francisco
    I was expecting AMD to abandon 64-wide SIMD batches and switch to 32 or even 16 as they evolve their architecture to better run non-graphics and less regular workloads. Certainly that would cost something in terms of power and performance per area.
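
    To put a number on what narrower batches buy for irregular code, here is a minimal Python sketch of the usual divergence model: a batch has to issue every branch path that at least one of its work-items takes. The function name, the branch mask and the assumption that both paths are equally long are all made up for illustration; this says nothing about the area or power cost.

```python
def lane_utilization(takes_a, width):
    """Useful lane-slots / issued lane-slots for a two-way branch.

    takes_a: one bool per work-item (True = path A, False = path B).
    Each batch of `width` work-items issues a path only if some member needs it.
    Both paths are assumed to be the same length.
    """
    issued = 0
    for start in range(0, len(takes_a), width):
        batch = takes_a[start:start + width]
        paths = int(any(batch)) + int(not all(batch))
        issued += paths * width
    return len(takes_a) / issued


# Hypothetical mask: the first 16 of 64 work-items take path A, the rest path B.
mask = [i < 16 for i in range(64)]
for width in (64, 32, 16):
    print(width, round(lane_utilization(mask, width), 3))
```

    With this particular mask, the 64-wide batch runs both paths for everyone, the 32-wide split only pays for divergence in one of its two batches, and the 16-wide split does not diverge at all.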
     
  17. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,059
    Likes Received:
    3,119
    Location:
    New York
    I expected that as well but this is arguably a better overall decision. 32 NOPs on Cayman waste 100% of compute resources for 2 cycles. 32 NOPs on GCN waste only 25% of compute resources for 2 cycles. So you still get a significant benefit.
     
  18. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,400
    Likes Received:
    440
    Location:
    San Francisco
    Not really. From the point of view of a divergent wavefront things might be a bit better due to improved occupancy, but certainly not 4x better.
     
  19. chiadog

    Newcomer

    Joined:
    May 21, 2008
    Messages:
    21
    Likes Received:
    0
    Is this updated architecture going to be inside SI or something after? I thought NI was a hybrid between SI and Evergreen, or was that a smokescreen for the architectural change?
     
  20. LordEC911

    Regular

    Joined:
    Nov 25, 2007
    Messages:
    877
    Likes Received:
    208
    Location:
    'Zona
    Yes, this new CU architecture will be in SI. The full feature list of FSA will be implemented gradually over the next few generations.
    VLIW4 was a baby step towards this larger architectural overhaul.
     
  • About Us

    Beyond3D has been around for over a decade and prides itself on being the best place on the web for in-depth, technically-driven discussion and analysis of 3D graphics hardware. If you love pixels and transistors, you've come to the right place!

    Beyond3D is proudly published by GPU Tools Ltd.