AMD: Southern Islands (7*** series) Speculation/ Rumour Thread

Discussion in 'Architecture and Products' started by UniversalTruth, Dec 17, 2010.

  1. liolio

    liolio Aquoiboniste
    Legend

    Joined:
    Jun 28, 2005
    Messages:
    5,724
    Likes Received:
    195
    Location:
    Stateless
    As a side note, I find it interesting that AMD is doing what Nick has been advocating for future CPUs: running wide vectors on narrower SIMD (i.e. running 64-byte vectors on 16-wide SIMD).
     
  2. fellix

    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,552
    Likes Received:
    514
    Location:
    Varna, Bulgaria
    Hmm, how is a 64B vector any wider than 16-way SIMD?
     
  3. Gipsel

    Veteran

    Joined:
    Jan 4, 2010
    Messages:
    1,620
    Likes Received:
    264
    Location:
    Hamburg, Germany
    Corrected. ;)
    AMD and Nvidia (32-wide vectors though) have been doing this for ages already. That isn't exactly new.
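To make the idea concrete, here is a minimal Python sketch of a 64-element wavefront executed on a 16-lane SIMD in four passes. It is only an illustration, not a model of any vendor's actual hardware: the function name `execute_wavefront` and the element-wise add are assumptions chosen for the example.

```python
def execute_wavefront(a, b, simd_width=16):
    """Add two wavefront-sized vectors on a narrower SIMD, one slice per cycle."""
    assert len(a) == len(b)
    result = []
    cycles = 0
    for start in range(0, len(a), simd_width):
        # One "cycle": the SIMD processes simd_width work-items in lockstep.
        slice_a = a[start:start + simd_width]
        slice_b = b[start:start + simd_width]
        result.extend(x + y for x, y in zip(slice_a, slice_b))
        cycles += 1
    return result, cycles


wave_a = list(range(64))   # hypothetical 64-wide wavefront operands
wave_b = [1] * 64
out, cycles = execute_wavefront(wave_a, wave_b)
print(cycles)
```

With a 64-wide vector and 16 lanes the loop runs four times, which is the "wide vector on narrower SIMD" arrangement being discussed.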
     
  4. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,059
    Likes Received:
    3,119
    Location:
    New York
    Not 4x better in the general case but it goes a long way to reducing the waste due to divergence. An idle lane in Cayman wastes 1/16th of available compute resources per clock. With GCN it's only 1/64th.
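One way to read that arithmetic, as a rough sketch: assume a Cayman SIMD is 16 lanes of 4 ALUs each (VLIW4) and a GCN CU is 64 single-ALU lanes (4 x SIMD16). The helper `wasted_fraction` is a hypothetical name, and the model ignores everything except how many ALUs one idle lane takes out of the total.

```python
from fractions import Fraction


def wasted_fraction(alus_per_lane, lanes, idle_lanes=1):
    """Share of the unit's ALUs left idle when idle_lanes lanes are masked off."""
    total_alus = alus_per_lane * lanes
    return Fraction(alus_per_lane * idle_lanes, total_alus)


# Assumed organisation: Cayman SIMD = 16 lanes x 4 ALUs, GCN CU = 64 lanes x 1 ALU.
cayman = wasted_fraction(alus_per_lane=4, lanes=16)
gcn = wasted_fraction(alus_per_lane=1, lanes=64)
print(cayman, gcn)
```

Under these assumptions an idle lane costs four ALUs out of 64 on Cayman but only one out of 64 on GCN, which is where the 1/16 versus 1/64 figures come from.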
     
  5. GZ007

    Regular

    Joined:
    Jan 22, 2010
    Messages:
    416
    Likes Received:
    0
    The question is how much beefier a single CU is than a single Cayman SIMD. The 2x denser 28nm process will probably offset this, but I still think they will lose something in area.
    They could cram more Cayman SIMDs into the chip on 28nm. (The 6990 works fine for graphics.)
    In the best-case scenario a single Cayman SIMD should be equal to a single CU :?:
     
  6. itsmydamnation

    Veteran

    Joined:
    Apr 29, 2007
    Messages:
    1,349
    Likes Received:
    470
    Location:
    Australia
    At what point does scaling VLIW4 ALUs start costing you extra transistors? I don't know much about this kind of stuff, but I doubt it's linear.
     
  7. Ailuros

    Ailuros Epsilon plus three
    Legend Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    9,511
    Likes Received:
    224
    Location:
    Chania
    Although I don't think that's the case for the above, how about a hot-clocked ALUs hypothesis breaking your theory above? :razz:
     
  8. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,059
    Likes Received:
    3,119
    Location:
    New York
    It's too early to make that declaration I think. If Cayman does as well in BF3 compared to Fermi as it does in other titles then you would be correct. There will be a crossover point though where VLIW will be inefficient but that time may still be a few years out.
     
  9. Gipsel

    Veteran

    Joined:
    Jan 4, 2010
    Messages:
    1,620
    Likes Received:
    264
    Location:
    Hamburg, Germany
    I doubt that. They would need to hotclock everything, not just the ALUs. That's different from Fermi, where they can run the schedulers at half the clock.

    The changes to the ALUs and the register files are actually going in the direction of reducing complexity. That means either they shave off some cycles of latency (so back-to-back issue of dependent instructions works, as was almost mentioned on a slide [not in very clear wording]) or the clocks can be raised even without hot clocking (at least the ALUs/register files won't hold them back).
     
  10. swaaye

    swaaye Entirely Suboptimal
    Legend

    Joined:
    Mar 15, 2003
    Messages:
    9,045
    Likes Received:
    1,119
    Location:
    WI, USA
    I like to reflect on the temporal aspects to GPU development. ;)

    Not many enthusiasts consider that this stuff wasn't designed recently. In reality it's likely been in the works since the RV770 days. And yet ATI of course works hard to sell us on Cypress and Cayman being the best things since sliced bread. Less than a year ago VLIW4 was the hottest game in town.

    Undoubtedly this recent presentation was also in a way a smokescreen for the next next generation of GPU hardware. If this stuff is taped out it's definitely old news for them.
     
  11. chiadog

    Newcomer

    Joined:
    May 21, 2008
    Messages:
    21
    Likes Received:
    0
    ^I am really questioning the NI family right now. We do know that NI was supposedly introduced due to the delay of the 32nm node @ TSMC, but I am not so sure now. So I wonder if NI was exactly what was intended, just on a different node (and maybe with fewer SIMDs). Also, was SI always intended to have the new architecture that was to be launched on the 32nm node? If so, maybe the 32nm delay was a good thing for AMD, since they got more time to play around with the new SIMD structure. They may have seen the same growing pains as Fermi. By not stumbling at the same time NV did, they may have picked up a bigger customer base.
    I guess this sounded a little more like conspiracy theory than I intended :)
     
  12. Kaotik

    Kaotik Drunk Member
    Legend

    Joined:
    Apr 16, 2003
    Messages:
    10,245
    Likes Received:
    4,465
    Location:
    Finland
    According to Dave Baumann, Cayman is exactly the same as it would have been on 32nm, but whether it would have been the high-end chip on 32nm is another thing entirely (since it would have been around the size of Barts or so).
     
  13. chiadog

    Newcomer

    Joined:
    May 21, 2008
    Messages:
    21
    Likes Received:
    0
    Got it. That puts my world back together (with regard to AMD's timeline). I guess I may have read too many silly season posts and confused myself.
     
  14. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    The scalar unit's status is something of a second-class citizen, since it doesn't seem capable of writing to memory, and there are only certain ways it can gather data from the SIMD units.
    The CU itself is a multi-issue unit that could with some evolution become a 5-wide FP coprocessor that could allow a CPU/GPU combo where the CPU can issue to a CU like the shared FPU on Bulldozer as one kernel while the GPU shares resources.

    The memory subsystem and the interconnect are more significant changes than the right turn taken by the execution hardware.
    The caches are probably larger per unit of storage, given they take traffic from multiple directions.
    The L1/L2 crossbar would have been changed significantly, adding a write capability and writeback support. The crossbar has more clients than Fermi's and the coherency model sounds more intensive on AMD's chip, though it seems it can be optional.
     
  15. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,059
    Likes Received:
    3,119
    Location:
    New York
    I thought branches are already co-issued with VLIW instructions on existing architectures.

    I'm struggling to understand how this is any different or less dynamic than what Fermi does. GCN seems to be doing the same thing except that it only has 10 wavefronts to choose from per SIMD instead of 24. The issue logic actually seems to be a lot more complex than Fermi's, which can only dispatch one instruction per clock (or two in the case of GF10x).
     
  16. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
    GCN seems to guarantee that you won't stall for instruction or RAW latency. Fermi has no such guarantee, hence it has a more complex scoreboarding mechanism.
     
  17. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,059
    Likes Received:
    3,119
    Location:
    New York
    I don't think there's any such guarantee. If you don't give GCN enough wavefronts to process it will stall, just like Fermi. The difference seems to be that GCN tracks a single instruction per wavefront and is unable to take advantage of ILP. Fermi's scoreboard allows it to track multiple in-flight instructions per warp; there seems to be a maximum of ~4 (see linked paper).

    http://www.cs.berkeley.edu/~volkov/volkov10-GTC.pdf.
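As a back-of-the-envelope illustration of why tracking several in-flight instructions per warp matters, here is a simplified Little's-law style sketch. The function `warps_needed` and the 20-cycle latency are hypothetical values picked for the example, not measured figures for Fermi or GCN.

```python
import math


def warps_needed(pipeline_latency, issue_cycles_per_instr, ilp_per_warp):
    """Warps required to keep the pipeline full in a simplified model.

    pipeline_latency: cycles before a result can be consumed (assumed value)
    issue_cycles_per_instr: cycles one instruction occupies the issue port
    ilp_per_warp: independent instructions one warp may have in flight
    """
    # Independent instructions that must be in flight to cover the latency.
    in_flight = math.ceil(pipeline_latency / issue_cycles_per_instr)
    # Each warp contributes up to ilp_per_warp of them.
    return math.ceil(in_flight / ilp_per_warp)


tlp_only = warps_needed(pipeline_latency=20, issue_cycles_per_instr=1, ilp_per_warp=1)
with_ilp = warps_needed(pipeline_latency=20, issue_cycles_per_instr=1, ilp_per_warp=4)
print(tlp_only, with_ilp)
```

In this toy model, a scheduler that can only track one instruction per warp needs as many warps as there are cycles of latency to cover, while one that can keep four independent instructions in flight per warp needs a quarter as many.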
     
  18. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
    OK, GCN can hide all instruction and RAW latencies with 4 threads/CU, the bare minimum. Fermi needs more than the bare minimum to hide it all, hence a more complex scoreboard.
     
  19. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,059
    Likes Received:
    3,119
    Location:
    New York
    How are you defining bare minimum - pipeline depth? I haven't seen anything conclusive about GCN's ALU pipeline indicating that it's only 4 cycles. The "vector back-to-back instruction issue" in the slides could be referring to the round robin issue and not necessarily back-to-back issue from the same wave.

    You don't need a complex scoreboard if you're relying solely on TLP. Fermi's additional complexity comes from a few things:

    1. Instruction issue runs at warp execution speed; on GCN it runs at 4x wavefront execution speed, so Fermi needs 4x the number of dispatchers to feed an equivalent number of SIMDs.
    2. ALU pipeline is deeper so the "bare minimum" required warps for latency hiding is higher.
    3. Multiple instructions can be in-flight from the same warp.

    The scoreboarding is only necessary for #3. This actually lets Fermi get away with fewer warps than the "bare minimum" would otherwise suggest.
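The dependence of the "bare minimum" on pipeline depth can be sketched with a toy simulation. It assumes a SIMD gets an issue turn every 4 cycles (round robin across four SIMDs) and that each wavefront runs a chain of dependent instructions. The function name and the latency values are hypothetical, and the real GCN pipeline depth is not known from the slides.

```python
def simd_issue_utilization(alu_latency, waves_per_simd, slots=1000, issue_interval=4):
    """Fraction of one SIMD's issue slots that find a ready wavefront.

    Every wavefront executes dependent instructions only, so it may issue
    again only after alu_latency cycles have passed since its last issue.
    """
    ready_at = [0] * waves_per_simd  # cycle at which each wavefront may issue next
    issued = 0
    for slot in range(slots):
        cycle = slot * issue_interval  # this SIMD's turn in the round robin
        for w in range(waves_per_simd):
            if ready_at[w] <= cycle:
                # Issue from the first ready wavefront; no scoreboard needed.
                ready_at[w] = cycle + alu_latency
                issued += 1
                break
    return issued / slots


print(simd_issue_utilization(alu_latency=4, waves_per_simd=1))
print(simd_issue_utilization(alu_latency=8, waves_per_simd=1))
print(simd_issue_utilization(alu_latency=8, waves_per_simd=2))
```

If the latency fits inside the 4-cycle issue interval, one wavefront per SIMD keeps every slot busy. With a deeper pipeline, a single wavefront leaves slots empty and more wavefronts are needed, which is the point about pipeline depth setting the minimum.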
     