AMD: Navi Speculation, Rumours and Discussion [2017-2018]

Discussion in 'Architecture and Products' started by Jawed, Mar 23, 2016.

Thread Status:
Not open for further replies.
  1. Geeforcer

    Geeforcer Harmlessly Evil
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    2,297
    Likes Received:
    465
    This timeframe sweeps through Fiji, Polaris and Vega. In fact, "waiting for Raja" has become something of a trope: ever since Fury, in the run-up to each new chip's unveiling, part of the hype has been "This is his first design!"; then after the launch the notion dissipates, only to re-surface, verbatim, in the lead-up to the next release.
     
    ArkeoTP and pharma like this.
  2. ToTTenTranz

    Legend Veteran Subscriber

    Joined:
    Jul 7, 2008
    Messages:
    11,073
    Likes Received:
    5,617
    Raja is not in an engineering position AFAIK.
     
  3. Kaotik

    Kaotik Drunk Member
    Legend

    Joined:
    Apr 16, 2003
    Messages:
    9,124
    Likes Received:
    3,008
    Location:
    Finland
  4. BacBeyond

    Newcomer

    Joined:
    Jun 29, 2017
    Messages:
    73
    Likes Received:
    43
    Well, since it's "roughly the same time you'd see Jim Keller's CPUs (Ryzen)", wouldn't that make Vega the first? From the slides, Vega seems to be the biggest change to GCN yet, though it's not clear why it's not performing near where it should be.

    Vega is also supposed to support Infinity Fabric, though I'm not sure where it will come into play. It sounds like Navi will make heavy use of IF to work similarly to Ryzen: you have one smaller GPU die that you can stack together to make the bigger parts, without the usual downsides of mGPU designs that require game engine support. That would allow them to scale from the low end to the top end with a single GPU design, each additional "core" adding roughly 1x performance. That would simplify their design process and keep R&D costs low, similar to how the CPUs are handled. They are on top of their game on the CPU side versus Intel right now, with only a couple of Intel chips worth their cost.
     
    jacozz likes this.
  5. ToTTenTranz

    Legend Veteran Subscriber

    Joined:
    Jul 7, 2008
    Messages:
    11,073
    Likes Received:
    5,617
    I don't know if the infinity fabric is being used within Vega, but it's used for CPU-GPU connection in Zen+Vega APUs/SoCs.
     
  6. Kaotik

    Kaotik Drunk Member
    Legend

    Joined:
    Apr 16, 2003
    Messages:
    9,124
    Likes Received:
    3,008
    Location:
    Finland
    It's used inside the GPU too, apparently to connect the UVDs, VCEs and such
     
  7. Rootax

    Veteran Newcomer

    Joined:
    Jan 2, 2006
    Messages:
    1,511
    Likes Received:
    873
    Location:
    France
    What's the point of using IF in this configuration? I get the benefit of connecting "cores" or "full chips" together, like the multiple Zen configurations show, but parts of a chip?
     
  8. kalelovil

    Regular

    Joined:
    Sep 8, 2011
    Messages:
    558
    Likes Received:
    95
    http://www.eetimes.com/document.asp?doc_id=1330981&page_number=2
     
    jacozz likes this.
  9. Cat Merc

    Newcomer

    Joined:
    May 14, 2017
    Messages:
    124
    Likes Received:
    108
    Someone should tell that to Vega :lol:
     
    kalelovil and Rootax like this.
  10. Anarchist4000

    Veteran Regular

    Joined:
    May 8, 2004
    Messages:
    1,439
    Likes Received:
    359
    Raw vs effective bandwidth. Current results just mean they are scheduling accesses inefficiently as a result of drivers. Plus the overclocked memory is running much faster than Fiji's. Thermal throttling is another possibility for reduced bandwidth.

    In the Linux drivers it was mentioned that certain functions don't have to wait to "acquire memory", so some QoS is likely occurring.
     
  11. silent_guy

    Veteran Subscriber

    Joined:
    Mar 7, 2006
    Messages:
    3,754
    Likes Received:
    1,380
    The last time AMD saw the need to talk up something as irrelevant as an internal bus, things didn't exactly go well either. ;-)
     
    pharma and Lightman like this.
  12. kalelovil

    Regular

    Joined:
    Sep 8, 2011
    Messages:
    558
    Likes Received:
    95
  13. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,364
    Likes Received:
    3,950
    Location:
    Well within 3d
    I think it's been indicated that GPU design times have been getting longer than 2-3 years in the latest generations, not quite as long as for a high-end CPU core like Zen, but enough that Anandtech's estimate at the time may have been off.
    The level of involvement over time has been uncertain, but the amount of blame that could be shifted to prior leadership has been decreasing.
    There's enough time at this point to think he should have had a chance to significantly influence things. Some complication akin to a bubble in the development pipeline might mean he had a less than ideal start, if some of the projects in question actually did start before him and then stalled before a restart. A period of limited resources and corporate shuffling could cause the time frames to stretch.
    There's nothing public that would say either way. The initial patent for a hybrid tiled rasterizer that sounds similar to Vega's was, however, filed before Koduri came back to AMD.

    The rumor mill has Navi as the current "this time it's his project", after moving on from Vega.

    However, even if that is so, he had time to influence Vega; whether or not he had the chance to make every decision on it, the bulk of its development and project finalization happened fully under his watch.
    If we were to stipulate that Vega has a basis that was at least partly built before Koduri could change things, one possible scenario is that he did change some things; the disruption or increased risk introduced by a mid-course correction could lead to teething pains.

    What exactly the fabric does and where it plugs in would matter. The full bandwidth of Vega's HBM2 stacks is inferior to the internal crossbar arrangement between the L2 and CUs in any large GCN GPU, so a fabric that only offered DRAM bandwidth would be a significant regression. Dropping the ROPs into that fabric would make things even more problematic.

    On the other hand, if there's still a high-bandwidth crossbar between the L2 and CUs, then the data fabric's importance to the GPU is unclear. A mesh would be mostly fine if the bulk of its traffic is a direct line between a memory controller and its statically partitioned L2 slice.
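    A rough sketch of that bandwidth comparison. The L2 slice count, slice width, and clock below are assumptions for illustration, not confirmed Vega specifications:

```python
# Rough comparison behind the "a fabric at DRAM bandwidth would be a
# regression" point. Slice count, slice width and clock are assumptions.

HBM2_STACKS = 2
BW_PER_STACK_GBS = 242          # ~484 GB/s total, roughly Vega 10 launch spec

L2_SLICES = 16                  # assumed: one L2 slice per memory channel
BYTES_PER_CLK_PER_SLICE = 64    # assumed GCN-style L2 slice width
GPU_CLK_GHZ = 1.5

dram_bw = HBM2_STACKS * BW_PER_STACK_GBS                      # GB/s
xbar_bw = L2_SLICES * BYTES_PER_CLK_PER_SLICE * GPU_CLK_GHZ   # GB/s

print(f"DRAM: {dram_bw} GB/s, L2<->CU crossbar: {xbar_bw:.0f} GB/s")
```

    Under these assumed figures the internal crossbar offers roughly 3x the DRAM bandwidth, so routing L2-to-CU traffic over a fabric limited to DRAM rates would indeed be a large step backwards.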
     
    Alexko, pharma, jacozz and 1 other person like this.
  14. Pressure

    Veteran Regular

    Joined:
    Mar 30, 2004
    Messages:
    1,431
    Likes Received:
    359
    Ah yes, the good ol' R600 speech. Guess we will find out soon enough.
     
    jacozz likes this.
  15. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,876
    Likes Received:
    768
    Location:
    London
    One of the things that gives me concern with the concept of Navi being merely a set of compute dies with memory stacked atop them is the density ratios: compute versus memory. Ratios concerning both logic and power.

    Already we have 8GB single-stack HBM2 modules. Yes, a consumer GPU with 4 GPU chiplets carrying 8GB of memory each, 32GB in total, would be a dream come true. But in the real world, apart from being a bad ratio of compute to memory, there's also the power problem: high-end GPUs are very likely to stay in the region of 300W. Realistically, one or two stacks of HBMx sitting atop compute chips pumping out a hundred or more watts is not going to happen.

    So the processor in memory concept needs to be seen as a subset of GPU functionality, which I think goes back to my initial idea about PIM: that memory-heavy fixed function hardware would be located in PIM, with the rest of the GPU being somewhere else.

    So this patent application is kinda interesting:

    Interposer having a Pattern of Sites for Mounting Chiplets

    Effectively it's describing a "one-size fits all" interposer, upon which varying configurations of chiplets can be deployed. The varying configurations would amount to performance grades, i.e. the complete range of SKUs from mainstream to enthusiast. In this model of a GPU, the chiplets are not necessarily all modules that contain HBM. And power hungry chiplets can be manufactured without memory as part of their module. Thus solving both of the ratio problems I described: logic and power.

    This is similar to, though ultimately different from, the model seen in the Exascale APU paper:

    Design and Analysis of an APU for Exascale Computing


    where some of the functionality of the package does not feature memory stacked atop logic. On the other hand, the APU is not described as re-configurable to suit performance grades. It's worth noting that this paper couches GPU chiplets, with memory atop, as being in the low 10s of watts of power consumption, which is incompatible with a consumer GPU consisting of 2 or 4 stacks of HBMx memory...

    Another interpretation would be to disregard HBMx stacks of memory. As I've noted before, the HBM standard effectively describes a signalling layout for a stack of memory dies atop a base controller die, such that the base controller die can be doing anything, as well as controlling the memory. In theory there's no need to have multiple memory dies in a stack atop a logic die, when doing PIM. Instead each PIM chiplet could feature logic with a single die of memory on top. Now the chiplet count would be far higher, e.g. 16 for a high end GPU, with 1GB of memory in a single die as part of the dual-die PIM chiplet. Again, this would swing the ratios for logic and power back toward something realistic.

    This "thin chiplet PIM" model then complies with the desire to scale across performance grades. Also, not all chiplets would have memory atop them. e.g. PCI Express and output driver circuitry would be a common block found in all SKUs, so might be a memory-less chiplet that sits alongside a farm of PIM chiplets.
     
  16. eastmen

    Legend Subscriber

    Joined:
    Mar 17, 2008
    Messages:
    11,116
    Likes Received:
    2,208
    Couldn't they put the shaders on the HBM2 dies as the bottom of the stack, run them slower, and make up for it with higher shader counts?
     
  17. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,293
    Location:
    Helsinki, Finland
    Memory processors only work well if all the required data + code + execution units are in the same location. Perfect for highly local calculations, but not good for random memory accesses. Shader code can access arbitrary memory locations (not known at scheduling time). We would need a completely different programming model for computing devices like this.
     
    eastmen likes this.
  18. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,364
    Likes Received:
    3,950
    Location:
    Well within 3d
    It might not be in the cards for Navi. The AMD proposals are for a succession of computing projects whose timelines run past 2020.
    Navi does seem to be keyed to next-gen memory and scalability; however, it doesn't seem like the likely next-gen memory candidates, such as HBM3, GDDR6, or maybe cost-reduced HBM, are necessarily more suitable for the PIM model or well-timed for Navi's 2018-2019 time frame.
    AMD wants to apply the same GPU silicon to many markets, and in addition to the memory and power concerns brought up, there are other trade-offs in terms of density, communication between chips, and physical optimization that would compromise Navi if it tried to adopt a TOP-PIM model while still being the GPU that can slip into the current product spaces.

    The active interposer and chiplet scheme, however, is a significant jump in implementation and cost over passive interposers and MCM packaging. MCMs are well understood, but even AMD's best effort here, the links in EPYC, falls orders of magnitude short of the needs here. Passive interposers might get closer, but may be marginal even for Navi in terms of cost and complexity, and AMD would also need to do significantly better in terms of interconnect power, bandwidth, and density.
    AMD's active interposer assumption seems further out still, and may be necessary for what it intends to do.

    Nvidia's similar concept seems better documented: it references nearer-term products as possible contemporaries, its signalling figures are better substantiated than AMD's promises, and it relies on fan-out packaging rather than an interposer. It has at least one daughter die with some of the more miscellaneous IO and other hardware that would otherwise be uselessly replicated across the chips.

    AMD's stacked compute papers have assumed something like a maximum of 10W for under-stack logic, and with the memory stack perhaps ~5W for the rest, in order to keep DRAM temperatures at 85°C or less.
    The latest exascale concept has a 160W node power budget (200W minus internode and system infrastructure power), and with 8 GPU chiplets that is 20W each, assuming the CPU sections draw 0W.
    More realistically, AMD's modeling gives 40-70W for off-interposer system memory access across its workload profiling. Even assuming the CPU segments and other chiplets wouldn't be drawing power idling or supporting the GPUs, that means probably barely more than 10W can be sustained per GPU chiplet.
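    The node-power arithmetic, in sketch form. The headline figures are from the exascale paper as described; how the memory power distributes across chiplets is illustrative.

```python
# Sketch of the exascale node power budget described in the post.
# Headline numbers are from the cited paper; the split is illustrative.

NODE_POWER_W = 200
INFRA_POWER_W = 40       # internode links + system infrastructure (implied)
GPU_CHIPLETS = 8

compute_budget = NODE_POWER_W - INFRA_POWER_W   # 160 W for the node silicon
upper_bound = compute_budget / GPU_CHIPLETS     # 20 W each if CPUs drew 0 W

# AMD's workload modeling attributes 40-70 W to off-interposer memory
# access, which pushes the sustainable per-chiplet figure toward ~10 W:
for mem_w in (40, 70):
    per_chiplet = (compute_budget - mem_w) / GPU_CHIPLETS
    print(f"{mem_w} W memory access -> {per_chiplet:.2f} W per GPU chiplet")
```

    With 70W going to memory access, each GPU chiplet is left with just over 11W, matching the "barely more than 10W" conclusion above.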

    Some of the fixed function hardware still has relatively high-bandwidth connectivity with the programmable portions, and the GPU's latency-hiding isn't as strong for intra-block communication and synchronization as it is for texturing latency.
    Unfortunately, without something like AMD's active interposer concept, plus a next-gen fabric that somehow allays concerns that even the active interposer is insufficient, its efforts so far are unsuitable.


    My interpretation is that this covers various forms of interposer, one of which seems to generally describe the active interposer concept from AMD's latest exascale concept. The exascale paper briefly mentions the interposer providing miscellaneous functionality and networking as part of its duties for supporting the chiplet.
    What is more universal is the standardization of interface site formats, which chiplets can conform to and different interposer designs can combine. I think one rough interpretation is that it does for an interposer what various slots, package ball-outs, and sockets do for motherboards.

    It fits with AMD's dis-integration goals: using interposers to split functions and silicon processes apart and combine them in a 2.5D package. It seems as if most are on the same page, although others like Nvidia and Intel are looking at not needing an active interposer, and stand a good chance of soon having in practice what AMD has on paper.

    What AMD might be trying to do is dis-integrate to the point that various areas can be treated as a kind of Application-Specific Standard Product. It could sell chiplets with just subsets of the overall GPU and freely include other blocks or outside IP in a custom product. The silicon itself would be more generic as a result, although I still question how generic it can be, given what its exascale project is doing to the CU implementation. Among other things, clock rates are currently too high, and Vega is not moving in a promising direction. Unless AMD starts giving more concrete details on how it intends to improve the bandwidth, power, and interface pitch, the drop in perf/mm2 and perf/W could readily eclipse the benefit of splitting up the die.

    Perhaps Vega's implementation of the infinity fabric is a first step, and we can read its non-progress in perf/mm2 and perf/W as an indication of what even that step costs, before taking the further scaling hits from leaving the die. The CU area in its marketing picture is unambiguously much less dense than Polaris's, despite currently not offering much more in terms of what it delivers per CU.

    One of AMD's concepts had a big GPU die that then connected to TOP-PIM or similar HBM stacks with mini-GPUs under them.
    Workloads could try to leverage whichever silicon suited them best.
    Granted, I think that spans two or more of AMD's past aspirational compute concepts.
     
    xpea likes this.
  19. iMacmatician

    Regular

    Joined:
    Jul 24, 2010
    Messages:
    787
    Likes Received:
    215
    Also, the leaked slide with GPUs and server segments positions Navi 10 and 11 as the direct successors to Vega 10 and 11 respectively, at least where the slide is concerned. I don't see any sign of the speculated multi-die approach with Navi from this slide although I can't rule it out either. (Unless I'm missing something, this slide is the only mention of specific codenames and positioning for Navi so far.)

    (EDIT: Videocardz doesn't allow direct image linking.)

    It would be funny if GDDR6 is the "Nexgen memory" from the Capsaicin roadmap. The timing fits if Vega 11 doesn't use GDDR6.
     
    #59 iMacmatician, Jul 26, 2017
    Last edited: Jul 28, 2017
  20. silent_guy

    Veteran Subscriber

    Joined:
    Mar 7, 2006
    Messages:
    3,754
    Likes Received:
    1,380
    Glad to see that I'm not the only one on this.

    But even if that's the way Navi will go, I don't see how, say, two 400mm2 dies on an interposer could be cheaper than one 750mm2 die with a few shader cores disabled. Yield is so easy to recover with a bit of redundancy, especially as the size of an individual core gets smaller and smaller.

    And that's assuming that this 2 die solution would have the same performance as the one die solution, which I doubt as well.
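    The yield intuition can be sketched with a simple Poisson defect model. The defect density, wafer cost, and salvage rate below are all illustrative assumptions, not foundry figures.

```python
import math

# Poisson defect-yield sketch of the 2x400mm2 vs 1x750mm2 comparison.
# Defect density, wafer cost and salvage rate are illustrative assumptions.

DEFECTS_PER_CM2 = 0.2
WAFER_COST = 8000.0                     # USD per wafer, assumed
WAFER_AREA_MM2 = math.pi * 150.0 ** 2   # 300 mm wafer, ignoring edge loss

def yield_poisson(area_mm2):
    """Fraction of defect-free dice at a given die area (Poisson model)."""
    return math.exp(-DEFECTS_PER_CM2 * area_mm2 / 100.0)

def cost_per_sellable_die(area_mm2, salvage_frac=0.0):
    """Cost per sellable die; salvage_frac is the share of defective dice
    recoverable as cut-down SKUs via redundancy (assumed)."""
    dice_per_wafer = WAFER_AREA_MM2 / area_mm2   # crude, no edge effects
    y = yield_poisson(area_mm2)
    sellable = y + salvage_frac * (1.0 - y)
    return WAFER_COST / (dice_per_wafer * sellable)

two_small = 2 * cost_per_sellable_die(400)               # no salvage credit
one_big = cost_per_sellable_die(750, salvage_frac=0.8)   # redundancy recovers yield

print(f"2x400mm2: ${two_small:.0f} silicon, 1x750mm2 with salvage: ${one_big:.0f}")
```

    Under these assumptions the salvaged big die comes out cheaper per sellable part; even crediting the small dice with the same salvage rate, the silicon costs roughly equalize, and the two-die option still has to pay for the interposer and extra packaging.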
     